Numerical Optimization in Robots (6) - Line Search Steepest Descent Method

   This series of articles consists of my notes and related thoughts from studying numerical optimization. The main learning materials are the course "Numerical Optimization in Robots" from Deep Blue Academy and the book "Numerical Optimization Methods" edited by Gao Li, among others. The series contains many articles and is updated from time to time. The first half covers unconstrained optimization and the second half covers constrained optimization, with some path-planning application examples interspersed along the way.



   Eight. Line Search Steepest Descent Method

   1. Introduction to the steepest descent method

   Gradient descent is an iterative method that can be used to solve least squares problems (both linear and nonlinear). When estimating the model parameters of machine learning algorithms, i.e., when solving unconstrained optimization problems, gradient descent is one of the most commonly used methods; another common choice is the least squares method. To minimize a loss function, gradient descent iterates step by step toward the minimizer, yielding the minimized loss and the corresponding parameter values; conversely, to maximize a function one uses gradient ascent. In machine learning, two variants have been developed from the basic method: stochastic gradient descent and batch gradient descent.

   The steepest descent method uses the first-order information of the function to find, locally, the direction in which the function decreases fastest, and then keeps approaching a local minimum along this direction.

   For a function whose gradient exists, the direction of steepest descent is the opposite of the gradient direction (shown by the blue arrow in the figure below).

   If the gradient exists, updating x along the negative gradient direction (with a suitable step size) moves it closer to a local minimum. The iteration takes the following form, where τ is the step size and ∇f(x^k) is the gradient, or, in the nonsmooth case, the minimum-norm subgradient (the element of the subdifferential with the smallest norm, taken in the opposite direction):

   $$x^{k+1} = x^{k} - \tau \nabla f(x^{k})$$



   2. Process of the steepest descent method
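   The overall process can be sketched in a few lines of code. The sketch below is only a minimal illustration with a fixed step size, assuming a generic differentiable objective whose gradient is provided by a function grad_f (the function names and the test objective are placeholders, not from the course material):

```python
import numpy as np

def steepest_descent(grad_f, x0, tau=0.1, tol=1e-6, max_iter=10000):
    """Fixed-step steepest descent: x_{k+1} = x_k - tau * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # stop when the gradient norm is small enough
            return x, k
        x = x - tau * g                  # step along the negative gradient
    return x, max_iter

# Hypothetical test objective f(x) = x1^2 + 4*x2^2, whose gradient is (2*x1, 8*x2).
grad_f = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
x_min, iters = steepest_descent(grad_f, x0=[2.0, 1.0], tau=0.1)
print(x_min, iters)   # converges toward the minimizer (0, 0)
```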


   3. Selection of step size τ

   ① Strategy 1: τ is a fixed constant, such as 1, 0.1, 0.01, etc.

   ② Strategy 2: τ is a diminishing quantity that decreases as the number of iterations increases

   ③ Strategy 3: exact line search. Ideally, each step size should take the one-dimensional cross-section of the multivariate function to its lowest point along the search direction; this is called the optimal step size, i.e., the step that achieves the largest decrease along the search direction. However, finding the optimal step size is itself an optimization problem.

   ④ Strategy 4: inexact line search. The condition of strategy 3 is weakened so that choosing the step size no longer requires solving a sub-optimization problem, yet the search can still proceed quickly.


   Supplementary note: the first-order directional derivative represents the rate of change of the function value at a point along the direction d, and can be written in the following form:

   $$\frac{\partial f(x)}{\partial d} = \frac{1}{\|d\|}\nabla f(x)^{\mathrm{T}} d$$
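   As a quick sanity check of this formula, the short script below compares the analytic expression with a finite-difference estimate of the directional derivative, for an arbitrarily chosen function and direction (both are illustrative assumptions, not taken from the text):

```python
import numpy as np

# Illustrative function f(x) = x1^2 + 3*x1*x2 and its gradient.
f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
grad_f = lambda x: np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([3.0, -1.0])            # an arbitrary (not normalized) direction

# Analytic directional derivative: (1 / ||d||) * grad_f(x)^T d
analytic = grad_f(x).dot(d) / np.linalg.norm(d)

# Finite-difference estimate along the unit vector d / ||d||
h = 1e-6
u = d / np.linalg.norm(d)
numeric = (f(x + h * u) - f(x)) / h

print(analytic, numeric)             # the two values agree to about 1e-5
```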


   (1) Strategy ①: when τ is a fixed constant, a step size that is too large may cause oscillation or even divergence, while a step size that is too small converges very slowly; with an appropriate step size, convergence is fast. A fixed step size therefore has to be tuned by experience, as shown in the figure below:
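   The following tiny experiment (an illustrative 1-D example, not the case shown in the original figure) makes this concrete on f(x) = x², whose gradient is 2x, so the update is x ← (1 − 2τ)x: a large τ overshoots and diverges, a tiny τ crawls, and a moderate τ converges quickly.

```python
def run_fixed_step(tau, x0=1.0, iters=20):
    """Apply x <- x - tau * f'(x) with f(x) = x^2, i.e. x <- (1 - 2*tau) * x."""
    x = x0
    for _ in range(iters):
        x = x - tau * 2.0 * x
    return x

for tau in (1.1, 0.01, 0.4):
    print(tau, run_fixed_step(tau))
# tau = 1.1  -> |1 - 2*tau| = 1.2 > 1: the iterates oscillate and grow without bound
# tau = 0.01 -> |1 - 2*tau| = 0.98:    still far from 0 after 20 steps
# tau = 0.4  -> |1 - 2*tau| = 0.2:     essentially 0 after 20 steps
```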


   (2) Strategy ② is very stable but converges slowly. It is generally used when the function is badly behaved and there is no requirement on solution speed or time.


   (3) Strategy ④: along the search direction d, the function around the current point can be viewed as a one-dimensional function of the step size, φ(α) = f(x^k + αd). When the step size is α, the corresponding function value is the height of the curve in the figure, and φ(0) is the value of f at x^k.

   If we only require the function to decrease, any step size for which the curve lies below the initial value φ(0) is acceptable, i.e., the whole region from 0 to α₂ shown in the figure below. To descend faster, however, a stricter condition is needed, and this condition is tied to the gradient: for example, if the local minimum value is 1 and the current value is 1.001, the decrease can never exceed 0.001, so instead of demanding a fixed amount of decrease we use the current gradient (slope) to define a sufficient-decrease line. Its slope is that of φ at 0, i.e., the dot product of the search direction d with the gradient at x^k, dᵀ∇f(x^k), relaxed by a coefficient c between 0 and 1. This yields a smaller interval, 0 to α₁. In general we want a step size that is not close to 0, i.e., a step toward the right end of this Armijo-condition region.
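   Written out explicitly, this sufficient-decrease requirement is the standard Armijo condition on the step size α:

   $$f(x^{k} + \alpha d) \leq f(x^{k}) + c\,\alpha\, d^{\mathrm{T}} \nabla f(x^{k}), \qquad c \in (0,1),$$

   where d is the search direction (here d = −∇f(x^k)), so dᵀ∇f(x^k) < 0 and the right-hand side is exactly the relaxed sufficient-decrease line described above.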

   For a non-convex function, the acceptable region is shown in the figure below:


   4. Flow of the steepest descent method and comparison of strategies ③ and ④

   Given an initial point x⁰, first compute its gradient and take the negative gradient as the search direction; then repeatedly bisect (halve) the interval of α to find a step size α that satisfies the Armijo condition, accept it, and update to the next x. Keep looping, and stop when the norm of the gradient of f at x^k is small enough. (If f is not differentiable, replace the gradient test with a subdifferential test: the loop can stop once the subdifferential contains the zero vector.)
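   A minimal sketch of this loop in code, assuming a differentiable objective f with gradient grad_f (placeholder names); the halving of α mirrors the bisection described above, and the Armijo coefficient c = 1e-4 and initial step alpha0 = 1.0 are common illustrative choices rather than values from the text:

```python
import numpy as np

def steepest_descent_armijo(f, grad_f, x0, c=1e-4, alpha0=1.0, tol=1e-6, max_iter=1000):
    """Steepest descent with a backtracking (step-halving) Armijo line search."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:          # gradient small enough: stop
            return x, k
        d = -g                               # search direction = negative gradient
        alpha = alpha0
        fx = f(x)
        # Halve alpha until the Armijo sufficient-decrease condition holds:
        # f(x + alpha*d) <= f(x) + c * alpha * d^T grad_f(x)
        while f(x + alpha * d) > fx + c * alpha * d.dot(g):
            alpha *= 0.5
        x = x + alpha * d                    # accept the step and update x
    return x, max_iter

# Hypothetical ill-conditioned quadratic: f(x) = 0.5 * (x1^2 + 10*x2^2)
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
x_min, iters = steepest_descent_armijo(f, grad_f, x0=[10.0, 1.0])
print(x_min, iters)
```

   With c close to 0 the condition is easy to satisfy and almost any descent step is accepted; a larger c demands more decrease per unit of step length.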


   Strategy ③ performs an update only after the lowest point in the figure above has been found, whereas strategy ④ updates as soon as a step size inside the Armijo-condition region is found. This saves time and is simpler, so strategy ④ is more commonly used in engineering practice.

   As the figure below shows, with exact line search (strategy ③) only a handful of updates are needed to reach a fairly good state, while sufficient-decrease line search (strategy ④) may require many more iterations. However, each iteration of exact line search costs much more computation and time, whereas each sufficient-decrease iteration is cheap. Since the total time ≈ time per iteration × number of iterations, the total time of the two strategies is roughly comparable.


   In the example below of a 100-dimensional convex function, when the required accuracy is high, e.g., 0.0001, the two strategies need a similar number of iterations, while each iteration of strategy ③ takes longer than one of strategy ④.


   5. Convergence rate of the steepest descent method

   The squared norm of u in the G-metric, ‖u‖_G², is defined as follows (where G is the Hessian matrix):

   $$\|u\|_G^2 = u^{\mathrm{T}} G u.$$

   For a positive definite quadratic function, the convergence rate of the steepest descent method satisfies

   $$\frac{\|x_{k+1}-x^*\|_G^2}{\|x_k-x^*\|_G^2} \leqslant \left(\frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}\right)^2.$$

   In the above we have (where cond(G) = ‖G‖‖G⁻¹‖ is called the condition number of the matrix G):

   $$\frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}} = \frac{\mathrm{cond}(G)-1}{\mathrm{cond}(G)+1} \triangleq \mu.$$

   The above shows that the convergence rate of the steepest descent method depends on the condition number of G. When the condition number of G is close to 1, μ is close to zero and the convergence rate of steepest descent approaches superlinear; the larger the condition number of G, the closer μ is to 1 and the slower the method converges.
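   To make this dependence concrete, the snippet below computes μ and the per-step contraction bound μ² for a hypothetical diagonal Hessian G = diag(1, 100) (an illustrative choice, not a matrix from the text):

```python
import numpy as np

G = np.diag([1.0, 100.0])                  # hypothetical positive definite Hessian
eigvals = np.linalg.eigvalsh(G)
lam_min, lam_max = eigvals[0], eigvals[-1]

cond = lam_max / lam_min                   # condition number of G (2-norm, SPD case)
mu = (cond - 1.0) / (cond + 1.0)           # = (lam_max - lam_min) / (lam_max + lam_min)

print(cond)        # 100.0
print(mu)          # ~0.980: close to 1, so convergence is slow
print(mu**2)       # ~0.961: the squared G-norm error shrinks by at most this factor per step
```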

   Differences in the condition number of the Hessian G account for the difference in the convergence speed of steepest descent on the two problems shown in the figure below. As the figure shows, the search directions of two consecutive steepest descent steps are mutually orthogonal, and the larger the condition number of the Hessian, the flatter the family of elliptical contour lines of the quadratic function. One can imagine that when the contours of the objective form a family of very flat ellipses, the iteration alternates between two mutually orthogonal directions; if neither direction points toward the minimizer, the iteration is very slow and may even fail to converge to the minimizer.


   6. Advantages and disadvantages of the steepest descent method

   (1) Disadvantages

   When the condition number of a convex function equals 2, the contour lines are a family of ellipses, and the gradient is perpendicular to the contour line (the boundary of the ellipse). If the condition number is large, the ellipses become very flat, and steepest descent iterations begin to oscillate.


   When the condition number is even larger, say 100, the ellipses become flatter still. Because the gradient direction is perpendicular to the contour lines, successive gradient directions become nearly parallel, and the iterates must zigzag for a long time before converging to the local minimum. Therefore, when the curvature or the condition number of the function is large, gradient descent may require a very large number of iterations.


   The figure below shows an example of a two-dimensional quadratic function; it can be seen that as the condition number increases, the number of iterations required for convergence also increases.
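   This trend can be reproduced numerically. The sketch below runs steepest descent with the exact (closed-form) step size for a quadratic, α = gᵀg / (gᵀGg), on hypothetical 2-D quadratics f(x) = ½xᵀGx with G = diag(1, κ), and counts iterations for several condition numbers κ (the specific matrices, starting point, and tolerance are illustrative assumptions):

```python
import numpy as np

def iterations_exact_line_search(kappa, tol=1e-8, max_iter=100000):
    """Steepest descent on f(x) = 0.5 * x^T G x with G = diag(1, kappa),
    using the exact step size alpha = g^T g / (g^T G g)."""
    G = np.diag([1.0, kappa])
    x = np.array([kappa, 1.0], dtype=float)   # start with energy in both eigendirections
    for k in range(max_iter):
        g = G @ x                              # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return k
        alpha = g.dot(g) / g.dot(G @ g)        # exact minimizer along -g
        x = x - alpha * g
    return max_iter

for kappa in (2, 10, 100, 1000):
    print(kappa, iterations_exact_line_search(kappa))
# The iteration count grows as the condition number kappa increases.
```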


   (2) Advantages

   The advantages of the steepest descent method are that each iteration requires little computation and little storage, and that even when started from a rather poor initial point, the iterates produced by the algorithm can still approach the minimizer.



   References:

   1. Numerical Optimization in Robots (Deep Blue Academy course)

   2. Gradient descent

   3. Numerical Optimization Methods (edited by Gao Li)

