Newton's method, gradient descent, the Gauss-Newton method, and the Levenberg-Marquardt algorithm

What is a gradient?

General explanation:

The gradient of f(x) at a point x0 is the direction in which f(x) changes the fastest at that point.

For example, think of f() as a mountain, and imagine standing halfway up its slope.

[Figure: the graph of a strictly concave quadratic function shown in blue, with its unique maximum marked by a red dot; below the graph, the contours of the function appear as nested ellipses (its level sets).]

Walking 1 meter in the x direction, the height rises by 0.4 meters, which means that the partial derivative in the x direction is 0.4

Walking 1 meter in the y direction, the height rises by 0.3 meters, which means that the partial derivative in the y direction is 0.3

The gradient is therefore (0.4, 0.3): among all directions, walking 1 meter along this one gives the largest rise in height.

Rise per meter in the gradient direction: (1*0.4/0.5)*0.4 + (1*0.3/0.5)*0.3 = (0.4^2 + 0.3^2)/0.5 = 0.5, which is exactly the magnitude of the gradient.

The premise for using the Pythagorean theorem here is that the distance traveled is small enough: over a sufficiently small distance the continuous surface can be treated as a plane. The 1-meter distance is used only for simplicity.
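As a quick numeric check, here is a small Python sketch (using the assumed slope values 0.4 and 0.3 from this example) that sweeps over walking directions and confirms that the rise per meter is largest, about 0.5, exactly in the gradient direction:

```python
# A quick numeric check of the example above (slope 0.4 in x, 0.3 in y):
# walking 1 meter in direction (cos t, sin t), the rise is approximately
# 0.4*cos(t) + 0.3*sin(t); the sweep below shows the maximum rise, about 0.5,
# occurs in the gradient direction (0.4, 0.3), normalized to (0.8, 0.6).
import math

grad = (0.4, 0.3)                      # partial derivatives in x and y

best_t, best_rise = 0.0, -float("inf")
for step in range(3600):               # sweep all directions in 0.1-degree steps
    t = math.radians(step / 10.0)
    rise = grad[0] * math.cos(t) + grad[1] * math.sin(t)
    if rise > best_rise:
        best_t, best_rise = t, rise

print((round(math.cos(best_t), 3), round(math.sin(best_t), 3)))  # (0.8, 0.6)
print(round(best_rise, 3))                                       # 0.5 = |gradient|
```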

Walking 1 meter in the x direction, the height rises by 0.4 meters, which means that the partial derivative in the x direction is 0.4.

Walking 1 meter in the y direction, the height drops by 0.3 meters, so the partial derivative in the y direction is -0.3; in other words, the height rises along the -y direction.

The gradient direction is then (0.4, -0.3). Picture it as a vector in the fourth quadrant of the coordinate plane: the components along the +x and -y axes both contribute to the rise.

As in the case above, the rise per meter is still 0.5.

Therefore, the gradient at a point is not only the direction in which f(x) changes the fastest, but specifically the direction in which it rises the fastest, just as in the indoor-temperature example it is the direction in which the temperature increases the fastest.


Gradient descent method:

From the discussion above, the gradient is the direction of fastest ascent. If we want to go down the mountain as quickly as possible, the fastest descent direction is of course the opposite of the gradient (again approximating the surface near the point as a plane). This is the gradient descent method; because it moves against the gradient, the descent is the fastest, so it is also known as the steepest descent method.

Iterative formula:

\mathbf{b} = \mathbf{a}-\gamma\nabla F(\mathbf{a})

γ is the step size.
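A minimal sketch of this update rule in Python, assuming a fixed step size γ and a toy quadratic objective F(x, y) = x^2 + 2*y^2 (both are assumptions, not from the text):

```python
# A minimal sketch of the update b = a - gamma * grad F(a), assuming a fixed
# step size and the toy objective F(x, y) = x^2 + 2*y^2 (both are assumptions).
def grad_F(x, y):
    return (2.0 * x, 4.0 * y)          # gradient of F(x, y) = x^2 + 2*y^2

gamma = 0.1                            # step size
x, y = 4.0, -3.0                       # starting point
for _ in range(100):
    gx, gy = grad_F(x, y)
    x, y = x - gamma * gx, y - gamma * gy   # step against the gradient

print(x, y)                            # approaches the minimizer (0, 0)
```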


Newton's method:

(Partly borrowed from: http://blog.csdn.net/luoleicn/article/details/6527049 )

The equation-solving (root-finding) problem:

Newton's method was originally used to find a root of the equation f(x) = 0.

First, take an initial guess x0.

First-order expansion: f(x) ≈ f(x0) + (x - x0)f'(x0)

Set f(x0) + (x - x0)f'(x0) = 0.

Solving gives x; compared with x0, the new point is closer to the root, i.e. |f(x)| < |f(x0)| (for a rigorous proof, see a numerical analysis textbook).
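Solving the linearization for x gives the familiar update x = x0 - f(x0)/f'(x0). A minimal sketch, with f(x) = x^2 - 2 as an assumed example (positive root sqrt(2)):

```python
# A minimal sketch of Newton's root-finding iteration x <- x - f(x)/f'(x),
# using the assumed example f(x) = x^2 - 2, whose positive root is sqrt(2).
def f(x):
    return x * x - 2.0

def f_prime(x):
    return 2.0 * x

x = 1.0                                # initial guess x0
for _ in range(6):
    x = x - f(x) / f_prime(x)          # solve f(x0) + (x - x0) f'(x0) = 0 for x

print(x)                               # ~1.4142135..., and f(x) is now ~0
```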


In optimization, Newton's method first turns the problem into finding a root of the equation f'(x) = 0.

First, take an initial guess x0.

First-order expansion: f'(x) ≈ f'(x0) + (x - x0)f''(x0)

Set f'(x0) + (x - x0)f''(x0) = 0.

Solving gives x; compared with x0, f'(x) is closer to 0, i.e. |f'(x)| < |f'(x0)|.

You can also use the approach described in the blog post cited above (which follows the wiki): treat Δx as the variable and set the derivative of f with respect to it to zero, and so on; I can never remember that version anyway.
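A minimal sketch of the one-dimensional optimization case: the same root-finding iteration applied to f'(x) = 0, i.e. x ← x - f'(x)/f''(x). The objective f(x) = x^4 - 3x^3 + 2 is an assumed example whose derivative vanishes at x = 9/4:

```python
# A minimal sketch of Newton's method for 1-D minimization: the same root-finding
# iteration applied to f'(x) = 0, i.e. x <- x - f'(x)/f''(x).
# The objective f(x) = x^4 - 3*x^3 + 2 is an assumed example with f'(9/4) = 0.
def f_prime(x):
    return 4.0 * x**3 - 9.0 * x**2     # f'(x)

def f_double_prime(x):
    return 12.0 * x**2 - 18.0 * x      # f''(x)

x = 6.0                                # initial guess x0
for _ in range(20):
    x = x - f_prime(x) / f_double_prime(x)   # Newton step on f'(x) = 0

print(x)                               # converges to 2.25, where f'(x) = 0
```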


Minimizing f(x) in the high-dimensional case:

\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma[H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n).


The gradient replaces the first derivative of the one-dimensional case,

the Hessian matrix replaces the second derivative,

and matrix inversion replaces division.
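A minimal sketch of this high-dimensional update with numpy, on an assumed quadratic toy objective; in practice one solves the linear system H d = ∇f rather than forming the inverse explicitly:

```python
# A minimal sketch of the high-dimensional Newton update
#   x <- x - gamma * H(x)^{-1} * grad f(x)
# assuming numpy and the toy objective f(x, y) = (x - 1)^2 + 10*(y + 2)^2
# (the objective and step size are assumptions, not from the text).
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * (x - 1.0), 20.0 * (y + 2.0)])

def hessian(p):
    # constant Hessian because the toy objective is quadratic
    return np.array([[2.0, 0.0],
                     [0.0, 20.0]])

p = np.array([5.0, 5.0])      # starting point
gamma = 1.0                   # full Newton step
for _ in range(10):
    # solve H d = grad f instead of explicitly inverting H
    d = np.linalg.solve(hessian(p), grad(p))
    p = p - gamma * d

print(p)                      # converges to the minimizer (1, -2)
```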


A figure from the wiki shows the difference between the two:

Gradient descent (green) always follows the locally steepest descent direction at the current point (almost perpendicular to the contour lines), which amounts to a greedy algorithm.

Newton's method uses information about the surface itself, so it converges more directly and more quickly.

[Figure: a comparison of gradient descent (green) and Newton's method (red) for minimizing a function, with small step sizes; Newton's method uses curvature information to take a more direct route.]

Gauss-Newton method:

The Gauss-Newton method is in fact a special case of Newton's method for solving nonlinear least-squares problems.

Objective function:

S(\boldsymbol \beta)= \sum_{i=1}^m r_i^2(\boldsymbol \beta).

This objective tends toward 0, so the high-dimensional form of Newton's method can be applied to it directly.

Iteration formula:


\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} - \mathbf H^{-1} \mathbf g \,

This is the same as the high-dimensional iteration formula from the optimization form of Newton's method, with g the gradient and H the Hessian of S.


The objective function can be abbreviated as:

 S = \sum_{i=1}^m r_i^2,


The component of the gradient vector along the β_j direction:

g_j = 2\sum_{i=1}^m r_i\frac{\partial r_i}{\partial \beta_j}. \quad (1)


The elements of the Hessian matrix are then obtained by differentiating the gradient components once more:

H_{jk}=2\sum_{i=1}^m \left(\frac{\partial r_i}{\partial \beta_j}\frac{\partial r_i}{\partial \beta_k}+r_i\frac{\partial^2 r_i}{\partial \beta_j \partial \beta_k} \right).


A small trick of the Gauss-Newton method is to drop the second-order partial derivative term, so that:

H_{jk} \approx 2\sum_{i=1}^m J_{ij}J_{ik}, \quad (2)

where J_{ij} = \partial r_i / \partial \beta_j.


Rewriting (1) and (2) in matrix form:

\mathbf g=2\mathbf{J_r}^\top \mathbf{r}, \quad \mathbf{H} \approx 2 \mathbf{J_r}^\top \mathbf{J_r}.\,


where J_r is the Jacobian matrix of r and r is the column vector formed by the r_i.

Substituting into the basic form of the high-dimensional Newton iteration gives the Gauss-Newton iteration:

\boldsymbol{\beta}^{(s+1)} = \boldsymbol\beta^{(s)}+\Delta;\quad \Delta = -\left( \mathbf{J_r}^\top \mathbf{J_r} \right)^{-1} \mathbf{J_r}^\top \mathbf{r}.

\boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} - \left(\mathbf{J_r}^\top \mathbf{J_r} \right)^{-1} \mathbf{J_r}^\top \mathbf{r}(\boldsymbol\beta^{(s)}), \quad \text{where } (\mathbf{J_r})_{ij} = \frac{\partial r_i (\boldsymbol\beta^{(s)})}{\partial \beta_j}

(this is the formula as given on the wiki; I think the denominator there is written incorrectly)

The detailed derivation can also be found on the wiki:

http://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm#Derivation_from_Newton.27s_method
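A minimal sketch of the Gauss-Newton iteration in Python/numpy. The exponential model f(x, β) = β0·exp(β1·x), the toy data, and the initial guess are all assumptions chosen for illustration:

```python
# A minimal sketch of the Gauss-Newton iteration
#   beta <- beta - (Jr^T Jr)^{-1} Jr^T r(beta)
# assuming numpy, an assumed model f(x, beta) = beta0 * exp(beta1 * x),
# toy data generated from beta = (2, 0.5), and an assumed initial guess.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)                      # toy data, no noise

def residuals(beta):
    return y - beta[0] * np.exp(beta[1] * x)   # r_i = y_i - f(x_i, beta)

def jacobian_r(beta):
    # (Jr)_{ij} = d r_i / d beta_j  (note the minus sign, since r = y - f)
    e = np.exp(beta[1] * x)
    return np.column_stack([-e, -beta[0] * x * e])

beta = np.array([1.8, 0.4])                    # initial guess
for _ in range(20):
    r = residuals(beta)
    J = jacobian_r(beta)
    delta = -np.linalg.solve(J.T @ J, J.T @ r) # Gauss-Newton step
    beta = beta + delta

print(beta)                                    # approaches (2.0, 0.5)
```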


m ≥ n is necessary, because otherwise J_r^T J_r is certainly not invertible (using rank(A^T A) = rank(A)).

If m = n:

\boldsymbol \beta^{(s+1)} = \boldsymbol \beta^{(s)} - \left( \mathbf{J_r} \right)^{-1} \mathbf{r}(\boldsymbol \beta^{(s)})


In fitting problems:

r_i(\boldsymbol \beta)= y_i - f(x_i, \boldsymbol \beta).

Since J_f = -J_r,

the iteration formula is usually written in terms of J_f instead of J_r:

\boldsymbol \beta^{(s+1)} = \boldsymbol \beta^{(s)} + \left(\mathbf{J_f}^\top \mathbf{J_f} \right)^{-1} \mathbf{ J_f} ^\top \mathbf{r}(\boldsymbol \beta^{(s)}).


Normal equations:

The equation determining Δ is actually a normal equation, and a normal equation can be derived directly from a system of equations; so what is the significance of this?

In fact, the Gauss-Newton method can be derived directly from a first-order Taylor expansion (treating it as an equation-solving problem rather than an optimization problem, which bypasses computing H):

The residual value after the step:

\mathbf{r}(\boldsymbol \beta)\approx \mathbf{r}(\boldsymbol \beta^s)+\mathbf{J_r}(\boldsymbol \beta^s)\Delta

The objective approximately becomes:

\mathbf{min}\|\mathbf{r}(\boldsymbol \beta^s)+\mathbf{J_r}(\boldsymbol \beta^s)\Delta\|_2^2,

Solving this problem for Δ is exactly the derivation of the normal equations.

The reason it can be treated directly as an equation-solving problem is that the optimal value of the objective function should tend toward 0.

The Gauss-Newton method can only be used to minimize sums of squares, but its advantage is that second derivatives are not required.
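A minimal sketch of this equivalence with numpy: solving the normal equations (J_r^T J_r)Δ = -J_r^T r and solving the linearized least-squares problem min ||r + J_r Δ||^2 give the same step. The values of J_r and r below are arbitrary assumptions, used only to check the equivalence:

```python
# A minimal sketch of the normal-equations view: the Gauss-Newton step Delta
# is exactly the linear least-squares solution of  min || r + Jr * Delta ||^2.
# Jr and r below are arbitrary assumed values, used only to check the equivalence.
import numpy as np

Jr = np.array([[1.0, 0.0],
               [1.0, 1.0],
               [1.0, 2.0],
               [1.0, 3.0]])            # Jacobian of the residuals (m = 4, n = 2)
r = np.array([0.5, -0.2, 0.3, -0.4])   # residual vector at the current beta

# Normal equations: (Jr^T Jr) Delta = -Jr^T r
delta_normal = np.linalg.solve(Jr.T @ Jr, -Jr.T @ r)

# The same step from the linearized least-squares problem min ||Jr*Delta + r||^2
delta_lstsq, *_ = np.linalg.lstsq(Jr, -r, rcond=None)

print(delta_normal)
print(delta_lstsq)                     # identical up to rounding
```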


Levenberg-Marquardt method:

To avoid divergence of the Gauss-Newton method, there are two remedies:

1. Adjust the step size: \boldsymbol\beta^{(s+1)} = \boldsymbol\beta^{(s)} + \alpha\Delta, \quad 0 < \alpha < 1

2. Adjust the descent direction: \left(\mathbf{J}^\top\mathbf{J} + \lambda\mathbf{D}\right)\Delta = \mathbf{J}^\top \mathbf{r}

When \lambda\to+\infty: \lambda\Delta \to \mathbf{J}^\top \mathbf{r} (taking \mathbf{D} = \mathbf{I}), that is, Δ points along the steepest-descent direction with a very small step, so the method reduces to gradient descent.

Conversely, when λ is 0, it reduces to the Gauss-Newton method.

The advantage of the Levenberg-Marquardt method is that λ can be adjusted during the iterations:

If S is decreasing rapidly, use a smaller λ, bringing the algorithm closer to the Gauss-Newton method.

If the descent is too slow, use a larger λ, bringing it closer to gradient descent.

As the wiki puts it: "If reduction of S is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss-Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, λ can be increased, giving a step closer to the gradient descent direction."
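A minimal sketch of a Levenberg-Marquardt loop with this simple adaptation rule. D = I and the multiply/divide-by-10 schedule are assumptions, and the model and data are the same assumed exponential fit as in the Gauss-Newton sketch above:

```python
# A minimal sketch of a Levenberg-Marquardt loop with the simple adaptation rule
# described above.  D = I and the multiply/divide-by-10 schedule are assumptions,
# and the model/data are the same assumed exponential fit as in the Gauss-Newton
# sketch (written with Jr, so the right-hand side is -Jr^T r, i.e. Jf^T r).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)

def residuals(beta):
    return y - beta[0] * np.exp(beta[1] * x)

def jacobian_r(beta):
    e = np.exp(beta[1] * x)
    return np.column_stack([-e, -beta[0] * x * e])

beta = np.array([1.0, 0.1])            # a worse initial guess than before
lam = 1e-2                             # damping parameter lambda
for _ in range(100):
    r = residuals(beta)
    J = jacobian_r(beta)
    # (Jr^T Jr + lambda * I) Delta = -Jr^T r
    delta = np.linalg.solve(J.T @ J + lam * np.eye(len(beta)), -J.T @ r)
    if np.sum(residuals(beta + delta) ** 2) < np.sum(r ** 2):
        beta = beta + delta            # good step: accept, move toward Gauss-Newton
        lam /= 10.0
    else:
        lam *= 10.0                    # bad step: reject, move toward gradient descent

print(beta)                            # should approach (2.0, 0.5)
```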
