Unconstrained Optimization - Steepest Descent Algorithm and Matlab Implementation

0 Preface

  I've been a bit busy recently, so updates have been slow. This article covers a commonly used algorithm, the steepest descent method, which is also mentioned in numerical analysis courses. The article first reviews the concept of the gradient, then describes the steepest descent method, and finally works through an example to understand the algorithm and implement it in code.

I hope you gain something from reading this article!
Next article: a comparison of a self-written Matlab Jacobian function (jacobi) with the official jacobian function, with some improvements

The following Matlab functions will be used in this article (a short sketch demonstrating them follows the list):

Matlab version: 2020a, 2022a (on 2022-6-5 I updated my Matlab to 2022a)
suptitle is not available in Matlab 2022b; simply replace suptitle with sgtitle

  1. Handle functions (function_handle) [1] — f = @(variables) expression
  2. matlabFunction [2]
  3. The norm function norm
  4. Plotting: 3-D surface: mesh, contour lines: contour, grid generation: meshgrid
  5. Capturing frames: getframe, refreshing the figure: drawnow, pausing: pause (these are only needed for the GIF animation)
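
Here is a minimal sketch of my own (not part of the article's code; it requires the Symbolic Math Toolbox for the symbolic part) showing these functions in one place:

f = @(x1,x2) x1.^2 + x2.^2;                           % 1. handle function: f = @(variables) expression
syms x1 x2
g = matlabFunction(gradient(x1^2 + x2^2, [x1,x2]));   % 2. matlabFunction: symbolic expression -> handle
n = norm([3 4]);                                      % 3. norm of a vector, returns 5
[X,Y] = meshgrid(-2:0.1:2);                           % 4. grid generation
figure; mesh(X,Y,f(X,Y));                             %    3-D surface
figure; contour(X,Y,f(X,Y),20);                       %    contour lines
drawnow; pause(0.5)                                   % 5. refresh the figure and pause briefly
% frame = getframe(gcf);                              %    capture the current frame (used later for the GIF)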

1 Mathematical background

  Understanding the steepest descent method requires some mathematical background, so let's review it first.

Must read: if you do not have a solid grasp of this mathematical background, I suggest reading all of the first section carefully; otherwise you will be confused by the derivations later.

1.1 The concept of gradient

Reference [3] explains this concept in detail; here is the concept of the gradient:

  Gradient: a vector whose direction is the direction along which the directional derivative of the function at a point attains its maximum value. (1) It has a magnitude: the maximum rate of change (the modulus of the gradient); (2) It has a direction: the direction along which the function changes fastest at that point [4].

  Definition: Let $z = f(x,y)$ have partial derivatives $f'_x(x_0,y_0)$ and $f'_y(x_0,y_0)$ at the point $P_0(x_0,y_0)$. Then the vector $\{f'_x(x_0,y_0),\, f'_y(x_0,y_0)\}$ is called the gradient of $f(x,y)$ at $P_0(x_0,y_0)$, denoted $\left.\nabla f\right|_{P_0}$, $\left.\nabla z\right|_{P_0}$, $\left.\operatorname{grad} f\right|_{P_0}$ or $\left.\operatorname{grad} z\right|_{P_0}$.
$$\therefore \left.\nabla f\right|_{P_{0}}=\left.\operatorname{grad} f\right|_{P_{0}}=\{f'_x(x_0,y_0),\, f'_y(x_0,y_0)\}$$

where $\nabla$ (nabla) is the operator $\nabla = \frac{\partial}{\partial x}\mathbf{i} + \frac{\partial}{\partial y}\mathbf{j}$

1.2 Gradient function

  Definition: If $f(x,y)$ has partial derivatives everywhere in $D$, then $\nabla f= \{f'_x(x,y),\, f'_y(x,y)\}$ is called the gradient function of $f(x,y)$ in $D$.

(1) The magnitude of the gradient: $|\nabla f|=\sqrt{\left[f_{x}^{\prime}(x, y)\right]^{2}+\left[f_{y}^{\prime}(x, y)\right]^{2}}$
(2) The direction of the gradient:

  Let $v=\{v_1,v_2\}$ with $|v|=1$ be an arbitrary given direction, and let $\theta$ be the angle between $\nabla f$ and $v$. Then:
$$\left.\frac{\partial f}{\partial v}\right|_{P_{0}} =f_{x}^{\prime}\left(x_{0}, y_{0}\right) v_{1}+f_{y}^{\prime}\left(x_{0}, y_{0}\right) v_{2} =\left\{f_{x}^{\prime}\left(x_{0}, y_{0}\right), f_{y}^{\prime}\left(x_{0}, y_{0}\right)\right\} \bullet\left\{v_{1}, v_{2}\right\} =\left.\nabla f\right|_{P_{0}} \bullet v=\left|\left.\nabla f\right|_{P_{0}}\right| \cdot|v| \cos \theta$$
Since $|v|=1$, the above becomes $\left.\frac{\partial f}{\partial v}\right|_{P_{0}} =\left|\left.\nabla f\right|_{P_0}\right|\cos \theta$.

Supplementary knowledge (high-school content): given two vectors $\overrightarrow{v_1}=(a,b)$ and $\overrightarrow{v_2}=(c,d)$, their inner product is $\overrightarrow{v_1}\cdot\overrightarrow{v_2}=ac+bd=|\overrightarrow{v_1}||\overrightarrow{v_2}|\cos \theta$, hence $\cos \theta=\frac{ac+bd}{|\overrightarrow{v_1}||\overrightarrow{v_2}|}$.

$\therefore$ When the angle $\theta$ between a direction $v$ at the point $P_0$ and the gradient direction $\left.\nabla f\right|_{P_0}$ is $0$, the directional derivative attains its maximum value, which proves that the function value changes fastest along the gradient direction. The same formula also shows [5] that the gradient direction is the direction of fastest increase of the function at the given point, and the direction opposite to the gradient is naturally the direction of fastest decrease [6]:

  • If we move from the point $P_0$ along the gradient direction, the function value increases
  • $\theta = 180°$ means moving in the direction opposite to the gradient, and the function value decreases
  • Moving perpendicular to the gradient direction, the function value stays unchanged, i.e. we move along a contour line

These three conclusions will be used later. Now let's move on to the explanation of the steepest descent algorithm.

1.3 Hessian Matrix

  The gradient describes, as a vector field, how a function changes, so the gradient and the derivative are close relatives. Recall the quadratic approximation of a one-variable function:
$$Q(x)=f\left(x_{0}\right)+f^{\prime}\left(x_{0}\right)\left(x-x_{0}\right)+\frac{1}{2} f^{\prime \prime}\left(x_{0}\right)\left(x-x_{0}\right)^{2}$$
If we want to extend it to multivariate functions, the situation should be similar: in addition to the linear part $L(x,y)=a+b\,(x-x_0)+c\,(y-y_0)$, we want to add a quadratic part $d(x-x_0)^2+e(x-x_0)(y-y_0)+f(y-y_0)^2$:
$$Q(x, y)=f\left(x_{0}, y_{0}\right)+f_{x}\left(x_{0}, y_{0}\right)\left(x-x_{0}\right)+f_{y}\left(x_{0}, y_{0}\right)\left(y-y_{0}\right)+\frac{1}{2} f_{x x}\left(x_{0}, y_{0}\right)\left(x-x_{0}\right)^{2}+f_{x y}\left(x_{0}, y_{0}\right)\left(x-x_{0}\right)\left(y-y_{0}\right)+\frac{1}{2} f_{y y}\left(x_{0}, y_{0}\right)\left(y-y_{0}\right)^{2}$$
Through analogy (or direct calculation) we obtain the quadratic approximation above. We can then write it in vector form:
$$Q(\mathbf{x})=f\left(\mathbf{x}_{0}\right)+\nabla f\left(\mathbf{x}_{0}\right) \cdot\left(\mathbf{x}-\mathbf{x}_{0}\right)+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}_{0}\right)^{T} H\left(\mathbf{x}_{0}\right)\left(\mathbf{x}-\mathbf{x}_{0}\right)$$
where $H$ is the Hessian matrix:
$$\nabla(\nabla f)=H=\left[\begin{array}{cc} \frac{\partial^{2} f}{\partial x^{2}} & \frac{\partial^{2} f}{\partial x \partial y} \\ \frac{\partial^{2} f}{\partial y \partial x} & \frac{\partial^{2} f}{\partial y^{2}} \end{array}\right]$$
Of course, we can also generalize it to higher dimensions:
$$\mathbf{H}=\left[\begin{array}{cccc} \frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}} \end{array}\right]$$
When the second-order partial derivatives are continuous, the Hessian matrix is symmetric.
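
In Matlab's Symbolic Math Toolbox the Hessian can be obtained directly with the built-in hessian function; a small sketch for illustration only (the code in Section 4 builds the matrix itself with diff):

syms x1 x2
f = x1^2 + 2*x2^2 - 2*x1*x2 - 2*x2;    % the example function used later in Section 3
H = hessian(f, [x1, x2])               % returns the constant symbolic matrix [2 -2; -2 4]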

2 Steepest descent method

  With the review of the gradient above, we will not feel lost when looking at this algorithm. The steepest descent method solves an unconstrained optimization problem, i.e. minimizing an objective function without any constraints, for example:
$$\min f(x)$$

where the function $f:\mathbb{R}^n \to \mathbb{R}$.

Such problems can be solved by two classes of methods: optimality-condition methods and iterative methods.

2.1 Concept [7]

The steepest descent method is a specific implementation of the gradient descent method. The idea is to choose, in each iteration, a step size $\alpha_k$ that minimizes the objective function value along the search direction.

In each iteration, we move along the negative gradient direction, taking $x^{(k+1)} = x^{(k)} - \alpha_k \cdot \nabla f(x^{(k)})$ so that $f(x^{(k+1)})$ is as small as possible in this direction, i.e. $\alpha_{k}=\operatorname{argmin}_{\alpha} f\left(x^{(k)}-\alpha \cdot \nabla f\left(x^{(k)}\right)\right)$

An interesting property worth pointing out now: each update trajectory of the steepest descent method is perpendicular to the previous one (the proof is given in Section 2.3).

2.2 Algorithm steps [7]

The calculation steps of the steepest descent method are given below (a minimal Matlab sketch of these steps follows the list):

Step 1: Pick an initial point $x^0$, set a termination tolerance $\varepsilon > 0$, and let $k=0$;

Step 2: Compute $\nabla f(x^k)$. If $\|\nabla f(x^k)\| \le \varepsilon$, stop the iteration and output $x^k$; otherwise go to Step 3;

Step 3: Take $p^k = -\nabla f(x^k)$;

Step 4: Perform a one-dimensional search to find $\lambda_k$ such that $f\left(x^{k}+\lambda_{k} p^{k}\right)=\min_{\lambda \geq 0} f\left(x^{k}+\lambda p^{k}\right)$. Let $x^{k+1} = x^k + \lambda_k p^k$, set $k = k+1$, and go back to Step 2.
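
To make the four steps concrete, here is a minimal, self-contained Matlab sketch for the quadratic example solved later in Section 3 (variable names are my own; the step size uses the closed form derived in Section 2.3; it requires the Symbolic Math Toolbox; the reusable implementation is in Section 4):

syms x1 x2
fs = x1^2 + 2*x2^2 - 2*x1*x2 - 2*x2;         % objective (the example from Section 3)
g  = matlabFunction(gradient(fs,[x1,x2]));   % gradient handle, returns a 2-by-1 vector
H  = double(hessian(fs,[x1,x2]));            % constant Hessian of this quadratic
xk = [0;0];                                  % Step 1: initial point x^0
epsilon = 1e-3;                              % termination tolerance
while norm(g(xk(1),xk(2))) > epsilon         % Step 2: stop when the gradient norm <= epsilon
    p      = -g(xk(1),xk(2));                % Step 3: p^k = -grad f(x^k)
    lambda = (p'*p)/(p'*H*p);                % Step 4: exact line-search step for a quadratic
    xk     = xk + lambda*p;                  % x^{k+1} = x^k + lambda_k p^k, back to Step 2
end
disp(xk')                                    % approaches the minimizer (1, 1)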

2.3 Detailed explanation of the principle [8]

   Take $x=x^{(k)}+\lambda p^{(k)}$ and expand $f$ to first order at the point $x^{(k)}$:
$$f(x)=f(x^{(k)}+\lambda p^{(k)}) = f(x^{(k)}) +\lambda \nabla f(x^{(k)})^{T} p^{(k)}+O(\|\lambda p^{(k)}\|)$$
where $O(\|\lambda p^{(k)}\|)=O(\|\lambda\|)$ is a higher-order infinitesimal in $\lambda$, and $\|p^{(k)}\|=1$.

Hence
$$f(x^{(k+1)})-f(x^{(k)}) \approx \lambda \nabla f(x^{(k)})^{T} p^{(k)}$$
$$\nabla f(x^{(k)})^{T} p^{(k)}=\|\nabla f(x^{(k)})\|\cdot \|p^{(k)}\| \cos \theta=\|\nabla f(x^{(k)})\|\cdot\cos \theta$$

Search direction $p^{(k)}$: Section 1 explained why we take $\theta = 180°$, i.e. the negative gradient direction. $\therefore p^{(k)}=-\nabla f(x^{(k)})$

Compute the magnitude of the gradient vector $\|\nabla f(x^{(k)})\|$:

  • If $\|\nabla f(x^{(k)})\| < \varepsilon$, stop the computation and output $x^{(k)}$ as an approximation of the minimum point.
  • If $\|\nabla f(x^{(k)})\| > \varepsilon$, go to the next step.

Many explanations do not cover this point in detail. At first I did not quite understand why it is enough to check that the modulus of the gradient vector satisfies $\|\nabla f(x^{(k)})\| < \varepsilon$.

Proof:
  In fact, the formulas above can be combined (using $\|p^{(k)}\|=1$): $f(x^{(k+1)})-f(x^{(k)}) \approx \lambda \nabla f(x^{(k)})^{T} p^{(k)}=\lambda\,\|\nabla f(x^{(k)})\|\cdot\cos\theta$, where $f(x^{(k+1)})$ is the next value and $f(x^{(k)})$ is the current iterate:
Meaning of the left-hand side: the difference between the current value and the next value keeps getting smaller, which means the function value is changing more and more slowly, i.e. we are approaching the minimum point.
Meaning of the right-hand side: that difference equals $\lambda$ times the magnitude of the gradient $\|\nabla f(x^{(k)})\|$ times $\cos\theta$, and $|\cos\theta| \le 1$. $\therefore$ When $\|\nabla f(x^{(k)})\| < \varepsilon$, the change in the function value is negligible, so the current value can be regarded approximately as a minimum value, and $x^{(k)}$ as the minimum point.

Optimal step size $\lambda_k$:

  Assume $f(x)$ has continuous second-order partial derivatives, and expand it to second order at the point $x^{(k)}$:
$$\begin{aligned} f\left(x^{(k)}-\lambda \nabla f\left(x^{(k)}\right)\right) &=f\left(x^{(k)}\right)+\nabla f\left(x^{(k)}\right)^{T}\left(-\lambda \nabla f\left(x^{(k)}\right)\right) \\ &+\frac{1}{2}\left(-\lambda \nabla f\left(x^{(k)}\right)\right)^{T} H\left(x^{(k)}\right)\left(-\lambda \nabla f\left(x^{(k)}\right)\right) +O\left(\|\lambda \nabla f(x^{(k)})\|^{2}\right) \end{aligned}$$
Denote the main part of the above formula by $\varphi(\lambda)$:
$$\varphi(\lambda)= f\left(x^{(k)}\right)+\nabla f\left(x^{(k)}\right)^{T}\left(-\lambda \nabla f\left(x^{(k)}\right)\right)+\frac{1}{2}\left(-\lambda \nabla f\left(x^{(k)}\right)\right)^{T} H\left(x^{(k)}\right)\left(-\lambda \nabla f\left(x^{(k)}\right)\right)$$
The unique stationary point of $\varphi(\lambda)$ (set $\varphi'(\lambda)=0$ and solve for $\lambda$) is:
$$\lambda_{k}=\frac{\nabla f\left(x^{(k)}\right)^{T} \nabla f\left(x^{(k)}\right)}{\nabla f\left(x^{(k)}\right)^{T} H\left(x^{(k)}\right) \nabla f\left(x^{(k)}\right)}$$
where $H(x^{(k)})$ is the Hessian matrix introduced in Section 1.
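
For completeness, the derivation is one line; writing $g = \nabla f(x^{(k)})$ and $H = H(x^{(k)})$ for brevity:
$$\varphi(\lambda)=f(x^{(k)})-\lambda\, g^{T} g+\frac{1}{2} \lambda^{2}\, g^{T} H g, \qquad \varphi'(\lambda)=-g^{T} g+\lambda\, g^{T} H g=0 \;\Longrightarrow\; \lambda_{k}=\frac{g^{T} g}{g^{T} H g}$$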

Now let us prove the perpendicularity mentioned earlier. Proof:

Substitute the point $x^{(k+1)}=x^{(k)}+\lambda p^{(k)}$ into the function $f(x)$ and set the derivative with respect to $\lambda$ to zero (this is exactly the exact line-search condition), which gives:
$$\frac{d}{d \lambda} f(x^{(k)}+\lambda p^{(k)})=\nabla f(x^{(k)}+\lambda p^{(k)})^{T}p^{(k)}=0$$
Substituting $p^{(k)}=-\nabla f(x^{(k)})$ gives:
$$-\nabla f(x^{(k)}+\lambda p^{(k)})^{T}\nabla f(x^{(k)})=-\nabla f(x^{(k+1)})^{T}\nabla f(x^{(k)})=0$$
This proves exactly that the gradients at the current point and the next point are perpendicular.
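
As a concrete check (using the example solved in Section 3): at $x^{(0)}=(0,0)^T$ the gradient is $(0,-2)^T$, while at the next iterate $x^{(1)}=(0,\tfrac{1}{2})^T$ it is $(2\cdot 0-2\cdot\tfrac{1}{2},\;4\cdot\tfrac{1}{2}-2\cdot 0-2)^T=(-1,0)^T$; their inner product is indeed $0$.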

Let us also discuss the influence of choosing a large or small step size $\lambda$, taking a one-variable quadratic function as an example, as shown in Figure 1 (I drew this figure with Matlab; the code is given at the end of the article):

[Figure 1: effect of the step size $\lambda$ on the iterations for a one-variable quadratic function]

As can be seen from the figure, the optimal step size is $\lambda_{op}$. When $\lambda \in(0,\lambda_{op}) \cup (\lambda_{op},2\lambda_{op})$ many iterations are needed and convergence is slow. Interestingly, when $\lambda = 2\lambda_{op}$ the iteration does not converge at all but bounces back and forth on the same contour line. When $\lambda \in(2\lambda_{op},\infty)$ the iteration moves in the wrong direction, away from the minimum point.
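
A concrete one-variable check of these regimes (my own illustration): take $f(x)=x^2$, for which the optimal step is $\lambda_{op}=\tfrac{1}{2}$. One steepest-descent update gives
$$x_{k+1}=x_{k}-\lambda f'(x_{k})=(1-2\lambda)\,x_{k},$$
so the iteration converges when $|1-2\lambda|<1$, i.e. $\lambda\in(0,2\lambda_{op})$, converges in a single step when $\lambda=\lambda_{op}$, bounces back and forth between $\pm x_k$ (the same level set) when $\lambda=2\lambda_{op}=1$, and diverges when $\lambda>2\lambda_{op}$.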

2.4 Disadvantages [7]

The negative gradient direction at a point usually has the steepest-descent property only in a neighborhood of that point.

The steepest descent method uses only the first-order gradient of the objective function, so it easily produces a zigzag pattern (see the figure below): in the first few steps the objective function decreases quickly, but near the minimum point the convergence rate becomes unsatisfactory. In particular, when the level curves of the objective function are very flat ellipses, convergence is even slower.

[Figure: zigzag iteration path of the steepest descent method near the minimum point]

Therefore, in practice the steepest descent method is often combined with other methods: it is used in the early stage, and near the minimum point one switches to a method with faster convergence.

3 Worked example

  This example is taken from the article in reference [9]. Although I use the same example, my code is completely different from that author's; interested readers can also look at his code. Without further ado, here is the problem:
$$\min f(x) = x_1^{2}+2x_2^2-2x_1x_2-2x_2$$
where $x=(x_1,x_2)^T$ and $x^{0}=(0,0)^T$.

(1) Compute the gradient function $\nabla f(x)=\begin{pmatrix} 2x_1-2x_2\\ 4x_2-2x_1-2\end{pmatrix}$ and the Hessian matrix $H(x)=\nabla(\nabla f(x))=\begin{pmatrix} 2&-2\\ -2&4\end{pmatrix}$

(2) Substituting $x^{(0)}=(0,0)^T$ into (1) gives $\nabla f(x^{(0)})=\begin{pmatrix} 0\\ -2\end{pmatrix}$, i.e. $p^{(0)}=-\nabla f(x^{(0)})=\begin{pmatrix} 0\\ 2\end{pmatrix}$

(3) Substituting $H(x^{(0)})=\begin{pmatrix} 2&-2\\ -2&4\end{pmatrix}$ and $\nabla f(x^{(0)})$ into $\lambda_{k}=\frac{\nabla f\left(x^{(k)}\right)^{T} \nabla f\left(x^{(k)}\right)}{\nabla f\left(x^{(k)}\right)^{T} H\left(x^{(k)}\right) \nabla f\left(x^{(k)}\right)}$ gives
$$\lambda=\frac{\begin{pmatrix} 0& -2\end{pmatrix} \begin{pmatrix} 0\\ -2\end{pmatrix}}{\begin{pmatrix} 0& -2\end{pmatrix}\begin{pmatrix} 2&-2\\ -2&4\end{pmatrix} \begin{pmatrix} 0\\ -2\end{pmatrix}}=\frac{1}{4}$$
(4) $x^{(1)}=x^{(0)}+\lambda p^{(0)}=\begin{pmatrix} 0\\ \frac{1}{2}\end{pmatrix}$

In the same way, go back to (2) and keep iterating until the stopping condition is satisfied; the optimal solution is $x^*=(1,1)^T$, $y^*=-1$.
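
A quick way to check these hand calculations in Matlab (variable names chosen just for this check):

% Verify the first iteration of the worked example
g0 = [2*0-2*0; 4*0-2*0-2];            % gradient at x^(0) = (0,0): (0,-2)'
H  = [2 -2; -2 4];                    % constant Hessian of this quadratic
p0 = -g0;                             % search direction (0,2)'
lambda0 = (g0'*g0)/(g0'*H*g0);        % = 4/16 = 1/4
x1 = [0;0] + lambda0*p0;              % = (0, 0.5)'
% The analytic minimizer solves grad f = 0, i.e. 2x1-2x2 = 0 and 4x2-2x1-2 = 0:
xstar = [2 -2; -2 4]\[0; 2];          % = (1, 1)'
fstar = xstar(1)^2 + 2*xstar(2)^2 - 2*xstar(1)*xstar(2) - 2*xstar(2);   % = -1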

4 Code implementation

  Looking at the method steps above, only a few modules are needed to write the code: (1) a function that computes $\nabla f(x)$; (2) a module that computes the Hessian matrix $H(x)$; (3) a steepest-descent function that combines the previous two; (4) for visualization, I also added a plotting function that generates a GIF animation of the steepest descent.

Declarations

  • df = nabla_f(fun,x): computes $\nabla f(x)$
  • H = Hesse(df,x,x0,n): computes the Hessian matrix $H(x)$
  • [x0,i] = GD(f,x,x0,epsilon): the steepest descent function
  • GDplot(f,x0,i,x1,x2,GIFname): the steepest descent plotting function
  • My code was originally limited to polynomial functions of two variables. (As of 2022-5-30 this restriction has been removed, so functions of $n$ variables are supported; the GDplot() plotting function, however, can only draw three-dimensional figures, i.e. the two-variable case.)

Note: the code blocks below make use of the Matlab functions listed at the beginning of the article.

Here is the final result first:
[Figure: the final result of running the code]
and the generated GIF:
[Animation: the steepest-descent iterations shown on the 3-D surface (top panel) and on the contour plot (bottom panel)]

4.1 Code implementation of the $\nabla f(x)$ function module

This function has two inputs: (1) fun: the function expression, generally written in Matlab as a handle function (the handle-function syntax was given at the beginning of the article; see also the first reference). (2) x: a string of the variable names, here generally $x_1, x_2$; an example is given below.
Output: df: a handle function.

The code is as follows:

function df=nabla_f(fun,x)
%  ∇f  compute the gradient
    df=[];
    x=str2sym(x);
    for i=1:length(x)
        df1 = diff(fun,x(i));
        df=[df;df1];
    end
    df= matlabFunction(df);
end

For example, running the following code returns a handle function df:

f=@(x1,x2) x1.^2+2*x2.^2-2*x1.*x2-2*x2;
x ='[x1,x2]';
df=nabla_f(f,x)

output:
[Screenshot: df is returned as a function handle that evaluates the 2-by-1 gradient vector]

4.2 Code implementation of the Hessian matrix $H(x)$ module

This function has 4 inputs: (1) df: $\nabla f$, i.e. the result of the function nabla_f. (2) x: a string of the variable names, here generally $x_1, x_2$. (3) x0: the cell array of iterates starting from $x^{(0)}$. (4) n: the current iteration number.
Output: H: the Hessian matrix evaluated at the current iterate (a numeric matrix).

The code is as follows:

function H = Hesse(df,x,x0,n)
%  Hesse matrix H(x) = ∇(∇f)
%  df is ∇f, i.e. the result of the function nabla_f
    H=[];
    x=str2sym(x);
    for i=1:length(x)
        df1 = diff(df,x(i));
        H=[H,df1];
    end
%     H = matlabFunction(H);
    s=char(H);
    if find(s=='x')
        H = matlabFunction(H);
        H = H(x0{n}(1),x0{n}(2));
    else
        H = double(H);
    end
end
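
A call that mirrors how GD uses this module below (df is the handle returned by nabla_f in Section 4.1, and x0 is the cell array of iterates) would look roughly like this; for the example function the result should be the constant matrix computed by hand in Section 3:

f  = @(x1,x2) x1.^2+2*x2.^2-2*x1.*x2-2*x2;
x  = '[x1,x2]';
df = nabla_f(f,x);       % gradient handle from Section 4.1
x0{1} = [0;0];           % starting point stored in a cell array, as GD does
H  = Hesse(df,x,x0,1)    % expected: [2 -2; -2 4]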

4.3 Code Implementation of the Steepest Descent Method Function Module

Inputs: (1) f: the handle function defined above, for example f = @(x1,x2) 2*x1.^2+2*x2.^2+2*x1.*x2+x1-x2. (2) x: a string of the variable names, here generally $x_1, x_2$. (3) x0: the starting point $x^{(0)}$. (4) epsilon: the tolerance $\varepsilon$.

The code is as follows:

function [x0,i] = GD(f,x,x0,epsilon)
% Gradient descent, also called the steepest descent method
% input:  f       handle function defined above, e.g. f = @(x1,x2) 2*x1.^2+2*x2.^2+2*x1.*x2+x1-x2;
%         x       string of the variable names, here generally x1,x2
%         x0      starting point (stored in a cell array)
%         epsilon tolerance ε
% output: x0      cell array of all points visited by GD from the starting point to the minimum point
%         i       number of points produced by GD, i.e. i == length(x0)
%         the minimum value min{f} and the minimum point x are also printed
%

    i=1;
    df = nabla_f(f,x);
    H = Hesse(df,x,x0,i);
    dfx=df(x0{1}(1),x0{1}(2));
   
    er=norm(dfx);
    while er > epsilon                
            p = -dfx;
            lambda = dfx'*dfx/(dfx'*H*dfx);
            i=i+1;
            x0{i} = x0{i-1}+lambda*p;
            dfx = df(x0{i}(1),x0{i}(2));
            H = Hesse(df,x,x0,i);
            er = norm(dfx);     
    end
    fmin = f(x0{i}(1),x0{i}(2));
    disp('Minimum point:');
    disp(['x1 = ',num2str(x0{i}(1))]);
    disp(['x2 = ',num2str(x0{i}(2))]);
    fprintf('\nMinimum of f:\n min{f}=%f\n',fmin);
end
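
A standalone call (the same inputs as the main script in Section 4.5) should report the minimizer found by hand in Section 3:

f  = @(x1,x2) x1.^2+2*x2.^2-2*x1.*x2-2*x2;
x  = '[x1,x2]';
x0{1} = [0;0];
[x0,i] = GD(f,x,x0,0.001);   % prints x1 = 1, x2 = 1 (approximately) and min{f} = -1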

*4.4 Code implementation of the steepest-descent plotting function module

Students who are interested in this plotting code can take a look; I just put the code here without explaining it in detail. Honestly, writing this much is tiring, so allow me to be a little lazy here!

Highlight: the figure title is generated automatically from the expression of $f$, so there is no need to type the title by hand; any $f$ will have its title displayed. This is thanks to the f2s() function, which converts a handle function (function_handle) into a TeX-formatted string.

Update 2022-11-6: suptitle has been replaced with sgtitle, and the f2s function has been uploaded.

The code is as follows:

function GDplot(f,x0,i,x1,x2,GIFname)
% input:  f       handle function defined above, e.g. f = @(x1,x2) 2*x1.^2+2*x2.^2+2*x1.*x2+x1-x2;
%         x0      cell array of all points visited by GD from the starting point to the minimum point
%         i       number of points produced by GD, i.e. i == length(x0)
%         x1      range of the first variable of f
%         x2      range of the second variable of f
% 
% output: generates a GIF

        [x1,x2]=meshgrid(x1,x2);
        z=f(x1,x2);
        figure('color','w')
        sgtitle(['\it f=',f2s(f)])
        subplot(211) 
        mesh(x1,x2,z)
        axis off
        view([-35,45])
        hold on
        subplot(212)
        contour(x1,x2,z,20)
        zlim([0,0.5])
        set(gca,'ZTick',[],'zcolor','w')
        axis off
        view([-35,45])
        hold on

        pic_num = 1;

        for j =1:i-1    
            a=[x0{j}(1),x0{j}(2),f(x0{j}(1),x0{j}(2))];
            b=[x0{j+1}(1),x0{j+1}(2),f(x0{j+1}(1),x0{j+1}(2))];
            c=[a',b'];
            a1=[x0{j}(1),x0{j}(2)];
            b1=[x0{j+1}(1),x0{j+1}(2)];
            c1=[a1',b1'];
    
            subplot(211)
            plot3(x0{j}(1),x0{j}(2),f(x0{j}(1),x0{j}(2)),'r.','MarkerSize',10)
            subplot(212)
            plot(x0{j}(1),x0{j}(2),'r.','MarkerSize',10)
            drawnow
            F(j)=getframe(gcf);
            pause(0.5)

            subplot(211)
            plot3(c(1,:),c(2,:),c(3,:),'r--')    
            subplot(212)
            plot(c1(1,:),c1(2,:),'r--')
            drawnow 
            F(2*j)=getframe(gcf);
            pause(0.5)
            
            % draw and save the GIF
            I=frame2im(F(j));
            [I,map]=rgb2ind(I,256);
            I1=frame2im(F(2*j));
            [I1,map1]=rgb2ind(I1,256);
            if pic_num == 1
                imwrite(I,map, GIFname ,'gif', 'Loopcount',inf,'DelayTime',0.5);
            else
                imwrite(I,map, GIFname ,'gif','WriteMode','append','DelayTime',0.5);
                imwrite(I1,map1, GIFname ,'gif','WriteMode','append','DelayTime',0.5);
            end
            pic_num = pic_num + 1;
        end
        subplot(211)
        plot3(x0{i}(1),x0{i}(2),f(x0{i}(1),x0{i}(2)),'r.','MarkerSize',9)
        subplot(212)
        plot(x0{i}(1),x0{i}(2),'r.','MarkerSize',9)
        F(2*i-1)=getframe(gcf);
        I=frame2im(F(2*i-1));
        [I,map]=rgb2ind(I,256);
        imwrite(I,map, GIFname ,'gif','WriteMode','append','DelayTime',0.5);
end

function s = f2s(fun)
% Convert a handle function to a string
% Mainly used for the figure title: it automatically converts the handle function into a string
% so the function expression can be shown in the title (in TeX form), avoiding typing the formula by hand
    s = func2str(fun);
    s = char(s);
    c = strfind(s,')');
    s(1:c(1))=[];
    c1 = strfind(s,'.');
    s(c1)=[];
    c2 = strfind(s,'*');
    s(c2)=[];
    c3 = strfind(s,'x');
    for  i = 1:length(c3)
       s = insertAfter(s,c3(i)+i-1,'_'); 
    end   
end
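
For example (tracing the string operations above), f2s should convert the example handle into a TeX-friendly string, roughly:

f = @(x1,x2) x1.^2+2*x2.^2-2*x1.*x2-2*x2;
s = f2s(f)    % expected: 'x_1^2+2x_2^2-2x_1x_2-2x_2'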

4.5 Main function

The code is as follows:

clc
clear all
close all
% f= @(x1,x2) 2*x1.^2+2*x2.^2+2*x1.*x2+x1-x2;
f= @(x1,x2) x1.^2+2*x2.^2-2*x1.*x2-2*x2;      % handle function expression
% f= @(x1,x2) (x1-1).^2+(x2-1).^2;
% f=@(x1,x2) x1.^4+3*x1.^2*x2-x2.^4;
x0{1}=[0;0];
x ='[x1,x2]';      % string of the function variables
epsilon=0.001;     % error tolerance ε
[x0,i] = GD(f,x,x0,epsilon);
x1=0:0.01:2;
x2=x1;
GIFname = 'f1.gif';
GDplot(f,x0,i,x1,x2,GIFname)

4.6 Code for Figure 1 above

The Figure 1 code can be downloaded directly from this file link: click here to download the code file

4.7 Complete code files

Beginners can download the complete code (the code in this article plus the Figure 1 code):
The code has been placed in my personal GitHub: https://github.com/cug-auto-zp/CSDN/tree/main/gradient
For readers who cannot access GitHub, a resource download link is also provided: resource link

5 Summary

I would like to thank the bloggers whose articles are referenced; I personally learned a great deal from them. Sharing the code is partly to make it easier for everyone to learn, and partly because, through these articles and their open code, I discovered many Matlab functions I did not know and learned them while reading. Many people ask me how I learned Matlab this way: there is no need to study it deliberately; I keep expanding my Matlab knowledge by reading and accumulating as I go. That is why I intend to keep sharing code openly like this.

6 References


  1. Handle functions (function_handle)

  2. Matlab official documentation — matlabFunction: https://ww2.mathworks.cn/help/symbolic/matlabfunction.html

  3. Directional derivative and gradient: https://blog.csdn.net/myarrow/article/details/51332421

  4. What exactly is a gradient? What are its physical and mathematical meanings?: https://www.zhihu.com/question/29151564

  5. Gradient, divergence, curl: https://zhuanlan.zhihu.com/p/97545154

  6. Lecture 8: the gradient descent method: https://zhuanlan.zhihu.com/p/335191534

  7. Unconstrained optimization — the steepest descent method: https://zhuanlan.zhihu.com/p/445223282

  8. Steepest descent method (part 1): https://www.bilibili.com/video/BV1RK4y1a75C/
     Steepest descent method (part 2): https://www.bilibili.com/video/BV1ey4y1k7jY/?spm_id_from=333.788.recommend_more_video.-1

  9. Optimization algorithms — the steepest descent method: https://blog.csdn.net/m0_37570854/article/details/88559619
