Explain the gradient descent algorithm in detail

1. What is the gradient descent algorithm

Gradient descent is a first-order optimization algorithm, also known as the method of steepest descent. To find a local minimum of a function with gradient descent, one iteratively takes steps of a specified size from the current point in the direction opposite to the gradient (or an approximate gradient) at that point. Stepping in the direction of the gradient instead moves toward a local maximum of the function; that procedure is called gradient ascent, and the opposite procedure is gradient descent.

1.1 Intuitive Understanding

Gradient descent can be pictured as standing somewhere on a mountain and wanting to get down. The fastest way down is to look around, find the steepest direction, take a step downhill that way, and keep repeating this strategy. After N iterations you reach the lowest point of the mountain. If the figure below shows a vertical cross-section of the mountain, then each iteration takes one small step downhill, and after N steps you arrive at the bottom.
[Figure: step-by-step descent along a 2D cross-section of a mountain]
[Figure: step-by-step descent on a 3D surface]
The picture is similar in three dimensions: following the same steps, the foot of the mountain is reached after N steps.

1.2 Mathematical Understanding - Differentiation

Having clarified what the gradient descent algorithm is, we need to express it as a mathematical formula so that a computer can solve it and produce a model that meets our needs. For a function of a single variable such as y = x^2, basic knowledge of quadratic functions tells us immediately that it has a minimum, attained at the point (0, 0). But for more complicated or multivariate functions, such as z(x, y) = x^2 + y^2, let alone the functions with thousands of dimensions that appear in neural networks, solving with closed-form formulas is impractical, so we turn to differentiation. The derivative captures how the function changes for a small increment of the input, which is exactly the direction of fastest descent that the gradient descent algorithm needs.
For example: $\frac{d(x^2)}{dx} = 2x$
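As a quick numerical check (an assumed example, not from the original post; the point x = 3 and step h are illustrative), the derivative above can be approximated with a central finite difference:

```python
# Approximate d(x^2)/dx at x = 3 with a central finite difference
# and compare it with the analytic derivative 2x.
def f(x):
    return x ** 2

x, h = 3.0, 1e-5
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(numeric)   # ~6.0
print(2 * x)     # analytic value: 6.0
```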
For a function of several variables, for example the bivariate function $z = f(x, y)$ shown in the figure below (such as $z = \sin(x) + \cos(y)$), we can still differentiate, but because there are two variables $x$ and $y$ the result is no longer a single derivative: differentiating with respect to each variable gives one component, and together these components form a vector.
[Figure: surface plot of a bivariate function $z = f(x, y)$]
Suppose $z = f(x, y)$ has continuous first-order partial derivatives $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ in each variable. Then the vector formed by these two partial derivatives, $\left[\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}\right]$, is the gradient vector of the bivariate function, usually written $\nabla f(x, y)$.
Therefore: for a univariate function the gradient is simply the slope of the graph, while for a multivariate function the gradient is a vector, and it points in the direction in which the function changes fastest, i.e. the steepest direction.
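To make the gradient concrete, here is a minimal sketch (an assumed example, not from the original post) that approximates the gradient of $z = x^2 + y^2$ with finite differences and compares it with the analytic gradient $(2x, 2y)$:

```python
import numpy as np

# Finite-difference approximation of the gradient of z = x^2 + y^2,
# compared with the analytic gradient (2x, 2y).
def z(p):
    x, y = p
    return x ** 2 + y ** 2

def numerical_gradient(f, p, h=1e-5):
    grad = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

p = np.array([1.0, 2.0])
print(numerical_gradient(z, p))  # approximately [2. 4.]
print(2 * p)                     # analytic gradient [2. 4.]
```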

1.3 Step Size (Learning Rate) α

So far we have discussed how to descend as fast as possible and how to find the direction of steepest descent mathematically, but we have overlooked one issue: the size of each step. If the steps are too large it is easy to overshoot; if they are too small, descending takes too long and the sun may have set before you reach the bottom. We therefore need to choose a step size α that lets us descend smoothly and quickly in a reasonable number of steps. You might also think the best strategy is to take big steps first and small steps near the lowest point, approaching it ever more closely. But chasing the last 0.000001 near the minimum is meaningless in practice, so we also fix a threshold: after each iteration we compare the progress measure (for example, the norm of the gradient) with this threshold, and if it falls below the threshold the loop stops.
Comparison of different step sizes:
- Small step size (small α): more iterations and a longer running time, but a more accurate result.
  [Figure: descent path with a small step size]
- Large step size (large α): the iterates oscillate and can easily overshoot the lowest point, but the amount of computation is relatively small.
  [Figure: descent path with a large step size]
Note: because of the convexity (or lack of it) of the function, gradient descent can approach the global optimum arbitrarily closely for a convex function, while for a non-convex function it may only reach a local optimum.
[Figure: two descent paths A and B on a non-convex surface]
As shown in the figure above, for different learning rates (step sizes α) there are different downhill paths (path A and path B) and therefore different solutions, called local optimal solutions.
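To make the effect of the step size concrete, here is a minimal sketch (an assumed example, not from the original post): gradient descent on $f(x) = x^2$, whose derivative is $2x$, run with a small and a large learning rate, using a threshold on the gradient magnitude as the stopping rule discussed above.

```python
# Gradient descent on f(x) = x^2 with different step sizes alpha.
# Iteration stops when |f'(x)| drops below a tolerance or max_iter is reached.
def gradient_descent_1d(alpha, x0=10.0, tol=1e-6, max_iter=100_000):
    x = x0
    for i in range(max_iter):
        grad = 2 * x              # derivative of x^2
        if abs(grad) < tol:       # convergence threshold discussed above
            return x, i           # minimizer estimate and iterations used
        x = x - alpha * grad      # update step
    return x, max_iter

print(gradient_descent_1d(alpha=0.01))  # small step: slow but steady convergence
print(gradient_descent_1d(alpha=0.9))   # large step: oscillates around 0 before settling
# alpha >= 1.0 would make the iterates diverge for this function.
```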

1.4 Implementation of Gradient Descent Algorithm

After determining the direction and the size of each step, the gradient descent algorithm can be implemented. Suppose we start from a point A (written in 2-dimensional coordinates (x, y) for ease of explanation; higher dimensions work the same way). Each iteration moves a small step downhill from A, i.e. from $A$ to $A - \alpha\Delta$. As discussed before, a plain increment $\Delta$ does not capture the direction of steepest descent; it should be the gradient $\nabla$, so the step becomes $A - \alpha\nabla$. Written in terms of the parameters, this gives the update rule
$\theta = \theta - \alpha \cdot \nabla J(\theta)$
Once the update formula is fixed, the general gradient descent algorithm proceeds in the following steps (a short Python sketch follows the list):

1. Specify the continuously differentiable objective function J(θ) to be optimized, a learning rate (step size) α, and a set of initial values.
2. Compute the gradient of the objective function.
3. Update the parameters with the iteration formula.
4. Compute the new gradient again.
5. Compute the norm of the gradient vector to decide whether to terminate the loop.
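The following is a minimal, generic Python sketch of these five steps (an assumed example, not the author's code; the function name `gradient_descent` and its parameters are illustrative):

```python
import numpy as np

# Generic gradient descent following the five steps above: it minimizes a
# differentiable J given a function grad_J that returns its gradient.
def gradient_descent(grad_J, theta0, alpha=0.1, tol=1e-6, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)    # step 1: initial values
    for _ in range(max_iter):
        grad = grad_J(theta)                   # steps 2 and 4: compute the gradient
        if np.linalg.norm(grad) < tol:         # step 5: stop when the gradient norm is small
            break
        theta = theta - alpha * grad           # step 3: update the parameters
    return theta

# Example: J(theta) = theta_1^2 + theta_2^2, whose gradient is 2*theta.
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))   # approximately [0, 0]
```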

In Python, one-dimensional numbers and formulas should be converted into matrix (array) form, which significantly improves the running efficiency and computation time of the algorithm.
Suppose we want to fit a simple linear regression. From high-school or university mathematics we know that simple linear regression simply looks for a straight line y = kx + b that passes through as many of the points as possible, as shown below:
[Figure: scattered data points with several candidate fitting lines]
Clearly, line C is the optimal fitting line. The corresponding cost function is:
$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}$
where the factor $\frac{1}{2m}$ is included only to make differentiation convenient and has no effect on the result. The gradient is computed as:
$\frac{\partial J(\Theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}$
Iteration formula:
$\theta = \theta - \alpha \cdot \nabla J(\theta)$
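Putting the formulas into matrix form, a minimal batch-gradient-descent sketch for the line y = kx + b might look like this (an assumed example with synthetic data; the variable names and the true parameters k = 3, b = 2 are illustrative, not from the original post):

```python
import numpy as np

# Fit y = k*x + b by batch gradient descent in matrix form.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)   # noisy points around y = 3x + 2

X = np.column_stack([np.ones_like(x), x])        # design matrix: column of ones and x
theta = np.zeros(2)                              # parameters [b, k]
alpha, m = 0.01, len(y)

for _ in range(5000):
    residual = X @ theta - y                     # h_theta(x^(i)) - y^(i) for all samples
    grad = X.T @ residual / m                    # gradient of J(theta), averaged over m samples
    theta -= alpha * grad                        # update: theta = theta - alpha * grad
print(theta)                                     # close to [2, 3], i.e. b ~ 2, k ~ 3
```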

1.5 Types of Gradient Descent Algorithms

1.5.1 Batch Gradient Descent Algorithm

The gradient formula used in the discussion above is:
$\frac{\partial J(\Theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}$
As this formula shows, every iteration computes the gradient over all of the data and averages it to obtain the gradient for that step. For high-dimensional data this is a considerable amount of computation. This variant is therefore called the batch gradient descent algorithm.

1.5.2 Stochastic Gradient Descent Algorithm

The stochastic gradient descent algorithm addresses the drawback of batch gradient descent, namely that every iteration computes over all of the data: instead, it randomly selects a single sample and uses its gradient as the gradient of the current iteration. The gradient formula is:
$\frac{\partial J(\Theta)}{\partial \theta_{j}} = \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}$
The update rule is:
$\theta = \theta - \alpha \cdot \nabla_{\theta} J\left(\theta; x^{(i)}; y^{(i)}\right)$
Because a single sample is chosen at random, the summation and averaging over all samples are avoided, which reduces the computational cost and speeds up each update; but, also because of the random selection, the iterates oscillate considerably.
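For illustration, here is one possible SGD loop (an assumed sketch, not the author's code) for the same illustrative linear model as above; the function name `sgd`, its parameters, and the data layout (`X`, `y` from the batch sketch) are assumptions of this example.

```python
import numpy as np

# Stochastic gradient descent: each update uses a single randomly chosen sample.
# X and y are the design matrix and targets from the batch sketch above;
# theta0 is an initial parameter vector such as np.zeros(2).
def sgd(X, y, theta0, alpha=0.005, n_updates=20_000, seed=1):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_updates):
        i = rng.integers(len(y))              # pick one sample at random
        residual = X[i] @ theta - y[i]        # h_theta(x^(i)) - y^(i) for that sample
        theta -= alpha * residual * X[i]      # noisy single-sample update
    return theta                              # hovers near the optimum but keeps jittering
```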

1.5.3 Mini-batch Gradient Descent Algorithm

The mini-batch gradient descent algorithm combines the advantages and disadvantages of batch gradient descent and stochastic gradient descent: each iteration randomly selects a small subset of the samples. The gradient formula is:
$\frac{\partial J(\Theta)}{\partial \theta_{j}} = \frac{1}{k}\sum_{i}^{i+k}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}$
Iteration formula:
$\theta = \theta - \alpha \cdot \nabla_{\theta} J\left(\theta; x^{(i:i+k)}; y^{(i:i+k)}\right)$
In practice, mini-batch gradient descent is usually the most commonly used variant: it is fast to compute and converges stably.
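As with SGD, here is one possible mini-batch loop (an assumed sketch, not the author's code), reusing the illustrative `X` and `y` from the batch example; the function name, the batch size k = 16, and the other parameters are assumptions of this example.

```python
import numpy as np

# Mini-batch gradient descent: each update averages the gradient over a random
# batch of k samples (X, y as in the batch sketch above).
def minibatch_gd(X, y, theta0, alpha=0.01, k=16, n_updates=3_000, seed=2):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_updates):
        idx = rng.choice(len(y), size=k, replace=False)   # random mini-batch of k indices
        residual = X[idx] @ theta - y[idx]                # errors on the batch
        theta -= alpha * (X[idx].T @ residual) / k        # averaged gradient over the batch
    return theta
```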

Source: blog.csdn.net/JaysonWong/article/details/119818497