What is a gradient?
General explanation:
The gradient of f(x) at a point x0 is the direction in which f(x) changes the fastest.
For example, suppose f() is a mountain and you are standing halfway up it.
Walking 1 meter in the x direction, the height rises by 0.4 meters, which means the partial derivative in the x direction is 0.4.
Walking 1 meter in the y direction, the height rises by 0.3 meters, which means the partial derivative in the y direction is 0.3.
The gradient direction is then (0.4, 0.3); that is, walking 1 meter in this direction gives the greatest rise in height:
(1*0.4/0.5)*0.4 + (1*0.3/0.5)*0.3 = (0.4^2 + 0.3^2)/0.5 = 0.5
The premise for using the Pythagorean theorem here is that the distance traveled is small enough: when it is, the continuous surface can be treated as a plane, and the theorem applies. The distance of 1 meter is used only for simplicity.
Another case:
Walking 1 meter in the x direction, the height rises by 0.4 meters, so the partial derivative in the x direction is 0.4.
Walking 1 meter in the y direction, the height drops by 0.3 meters, so the partial derivative in the y direction is -0.3; that is, the height rises along the -y axis.
The gradient direction is then (0.4, -0.3). Picture it in the fourth quadrant of the coordinate plane: the components along the x and -y axes both rise.
As in the case above, the rise over 1 meter is still 0.5 meters.
Therefore, the gradient at a point is not only the direction in which f(x) changes the fastest, but also the direction in which it rises the fastest; in the indoor-temperature example, it is the direction in which the temperature rises the fastest.
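The arithmetic above can be checked numerically. This is a minimal sketch: the `rise` helper and the tangent-plane approximation are illustrative, not from the text.

```python
import numpy as np

# Hillside example from the text: partial derivatives 0.4 (x) and 0.3 (y),
# so near the standing point the surface looks like the plane
# h(x, y) = 0.4*x + 0.3*y.
grad = np.array([0.4, 0.3])

def rise(direction):
    """Height gained walking 1 meter in the given direction on the tangent plane."""
    d = np.asarray(direction, dtype=float)
    return grad @ (d / np.linalg.norm(d))

print(rise([1.0, 0.0]))  # 0.4, the x partial derivative
print(rise([0.0, 1.0]))  # 0.3, the y partial derivative
print(rise(grad))        # ~0.5, the largest possible rise: |grad|
```

Walking along any other direction gives a smaller rise than along the gradient itself, which is the claim the section makes.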
Gradient descent method:
From the discussion above, the gradient is the direction of fastest ascent. To go down the mountain as fast as possible, walk against the gradient (again approximating the surface near a point as a plane). This is the gradient descent method; because it moves against the gradient, it descends the fastest, and it is also known as the steepest descent method.
Iterative formula:
x_{k+1} = x_k - r * ∇f(x_k), where r is the step size.
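The iteration can be sketched on a hypothetical quadratic test function f(x, y) = x^2 + 10*y^2; the function, step size, and iteration count here are all illustrative.

```python
import numpy as np

# Gradient of the hypothetical test function f(x, y) = x**2 + 10*y**2.
def grad_f(p):
    x, y = p
    return np.array([2 * x, 20 * y])

p = np.array([5.0, 2.0])   # starting point
r = 0.05                   # step size ("r" in the text)
for _ in range(200):
    p = p - r * grad_f(p)  # step against the gradient

print(p)  # approaches the minimum at (0, 0)
```

Note that the step size matters: with r much larger than 0.1 the iterates on this function would diverge instead of converging.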
Newton's method:
(Partly borrowed from: http://blog.csdn.net/luoleicn/article/details/6527049 )
Solving equations:
Newton's method was originally used to find a root of f(x) = 0.
First take an initial guess x0.
First-order expansion: f(x) ≈ f(x0) + (x - x0)f'(x0)
Set f(x0) + (x - x0)f'(x0) = 0
Solving gives x = x0 - f(x0)/f'(x0); compared with x0, the new x satisfies |f(x)| < |f(x0)| (for a proof, see a numerical analysis textbook).
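The root-finding iteration can be sketched as follows, using f(x) = x^2 - 2 as a hypothetical example whose root is sqrt(2).

```python
# Newton's method for a root of f(x) = 0, on the example f(x) = x**2 - 2.
f  = lambda x: x**2 - 2
df = lambda x: 2 * x

x = 1.0  # initial guess x0
for _ in range(6):
    # Setting the first-order expansion f(x0) + (x - x0)*f'(x0) to zero
    # and solving gives the update below.
    x = x - f(x) / df(x)

print(x)  # converges to sqrt(2) = 1.41421356...
```

Convergence is quadratic near the root: the number of correct digits roughly doubles each iteration, so a handful of steps reaches machine precision.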
In optimization problems, Newton's method first turns the problem into solving the equation f'(x) = 0.
First take an initial guess x0.
First-order expansion: f'(x) ≈ f'(x0) + (x - x0)f''(x0)
Set f'(x0) + (x - x0)f''(x0) = 0
Solving gives x = x0 - f'(x0)/f''(x0); compared with x0, the new x satisfies |f'(x)| < |f'(x0)|.
You can also use the approach described in the blog post cited above (which is also on the wiki): treat delta x as a variable and set the partial derivative of f with respect to it to zero, and so on; anyway, I can never remember it.
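The one-dimensional optimization iteration x_{k+1} = x_k - f'(x_k)/f''(x_k) can be sketched as follows; the function f(x) = x^2 - ln(x) is a hypothetical example whose minimum is at x = sqrt(1/2).

```python
# Newton's method as an optimizer: find where f'(x) = 0 for the
# hypothetical function f(x) = x**2 - ln(x).
df  = lambda x: 2 * x - 1 / x   # first derivative
d2f = lambda x: 2 + 1 / x**2    # second derivative

x = 1.0  # initial guess
for _ in range(8):
    x = x - df(x) / d2f(x)      # x_{k+1} = x_k - f'(x_k) / f''(x_k)

print(x)  # converges to sqrt(1/2) = 0.70710678..., where f'(x) = 0
```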
Minimizing f(x) in the high-dimensional case:
The gradient replaces the first derivative of the low-dimensional case,
the Hessian matrix replaces the second derivative,
and matrix inversion replaces division: x_{k+1} = x_k - H^{-1} ∇f(x_k).
A figure on the wiki page shows the difference between the two:
gradient descent (green) always follows the locally steepest descent direction (almost perpendicular to the contour lines), which makes it a greedy algorithm;
Newton's method uses information about the surface itself, so it converges more directly and faster.
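One high-dimensional Newton step can be sketched as follows, assuming a hypothetical quadratic f(x) = 0.5 x^T A x - b^T x, for which a single step lands exactly on the minimum.

```python
import numpy as np

# High-dimensional Newton step: the gradient replaces f', the Hessian
# replaces f'', and a matrix solve replaces the division.
# Hypothetical quadratic f(x) = 0.5 * x^T A x - b^T x; its minimum
# solves A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

grad    = lambda x: A @ x - b   # gradient of the quadratic
hessian = lambda x: A           # Hessian (constant for a quadratic)

x = np.zeros(2)                                # starting point
x = x - np.linalg.solve(hessian(x), grad(x))   # one Newton step

print(x)  # equals A^{-1} b, the exact minimizer
```

For non-quadratic functions the Hessian changes from point to point and several steps are needed, but the update has exactly this form.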
Gauss-Newton method:
The Gauss-Newton method is actually a special case of Newton's method for solving nonlinear least-squares problems.
Objective function: S(β) = sum_{i=1}^{m} r_i(β)^2
Since this function tends to 0, the high-dimensional Newton's method can be applied to it directly.
Iterative formula:
the same as the high-dimensional Newton iteration for optimization problems above.
The objective function can be written compactly as S = sum_i r_i^2.
The component of the gradient vector in the β_j direction:
g_j = 2 sum_{i=1}^{m} r_i * ∂r_i/∂β_j    (1)
The elements of the Hessian matrix are then obtained by differentiating the gradient components directly:
H_jk = 2 sum_{i=1}^{m} (∂r_i/∂β_j * ∂r_i/∂β_k + r_i * ∂²r_i/∂β_j∂β_k)
A small trick of the Gauss-Newton method is to drop the second-order partial derivative term, giving:
H_jk ≈ 2 sum_{i=1}^{m} J_ij * J_ik    (2)
where J_ij = ∂r_i/∂β_j.
Rewriting (1) and (2) in matrix form:
g = 2 Jr^T r,  H ≈ 2 Jr^T Jr
where Jr is the Jacobian matrix of the residuals and r is the column vector of the r_i.
Substituting into the basic form of the high-dimensional Newton iteration gives the Gauss-Newton iteration:
β^(s+1) = β^(s) - H^{-1} g
that is,
β^(s+1) = β^(s) - (Jr^T Jr)^{-1} Jr^T r    (the formula on the wiki; I think the denominator there is written wrong)
The detailed derivation can also be found on the wiki:
http://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm#Derivation_from_Newton.27s_method
m ≥ n is necessary, because otherwise Jr^T Jr is certainly not invertible (using rank(A^T A) = rank(A)).
If m = n, the iteration simplifies to β^(s+1) = β^(s) - Jr^{-1} r, i.e. Newton's method applied directly to the equations r(β) = 0.
In fitting problems, the residuals are r_i = y_i - f(x_i, β).
Since Jf = -Jr,
the iteration can be expressed with Jf instead of Jr:
β^(s+1) = β^(s) + (Jf^T Jf)^{-1} Jf^T r
Normal equations:
The equation for delta is actually a normal equation, and a normal equation can be derived directly from a system of equations; so what is the significance of this process?
In fact, the Gauss-Newton method can be derived directly from a first-order Taylor expansion (treating the problem as solving equations rather than as optimization, which bypasses computing H):
Value after an iteration: f(x_i, β + δ) ≈ f(x_i, β) + J_i δ
The objective approximately becomes:
min_δ sum_i (y_i - f(x_i, β) - J_i δ)^2
Solving for delta here is exactly the derivation of the normal equations.
The reason the problem can be treated directly as equation solving is that the optimal value of the objective tends to 0.
The Gauss-Newton method can only be used to minimize sums of squares, but its advantage is that it does not require computing second derivatives.
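The full Gauss-Newton loop can be sketched on a hypothetical curve-fitting problem; the data, the model y = b0*exp(b1*t), and the starting guess are all made up, and the start is chosen reasonably close to the answer since plain Gauss-Newton can diverge from a poor start.

```python
import numpy as np

# Hypothetical fitting problem: fit y = b0 * exp(b1 * t) to synthetic,
# noise-free data generated with b0 = 2.0, b1 = 0.5.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * t)

beta = np.array([1.5, 0.4])            # initial guess, close to (2.0, 0.5)
for _ in range(30):
    f = beta[0] * np.exp(beta[1] * t)
    r = y - f                          # residuals r_i = y_i - f(t_i, beta)
    # Jacobian of the model f with respect to beta (note Jf = -Jr).
    Jf = np.column_stack([np.exp(beta[1] * t),
                          beta[0] * t * np.exp(beta[1] * t)])
    # Normal equations (Jf^T Jf) delta = Jf^T r: first derivatives only.
    delta = np.linalg.solve(Jf.T @ Jf, Jf.T @ r)
    beta = beta + delta

print(beta)  # converges to [2.0, 0.5]
```

Only the Jacobian of the model appears, which is exactly the advantage stated above: no second derivatives are ever computed.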
Levenberg-Marquardt method:
In the Gauss-Newton method, there are two ways to avoid divergence:
1. Adjust the step size: β^(s+1) = β^(s) + α δ, with 0 < α ≤ 1.
2. Adjust the descent direction: solve (Jf^T Jf + λ I) δ = Jf^T r instead of the plain normal equations.
When λ → ∞, the direction of δ tends to the gradient direction, and the method becomes gradient descent (it seems the wiki gets this wrong again).
Conversely, when λ is 0, it becomes the Gauss-Newton method.
The advantage of the Levenberg-Marquardt method is that λ can be adjusted:
if the objective is decreasing rapidly, use a smaller λ, which brings the method closer to Gauss-Newton;
if the descent is too slow, use a larger λ, which brings it closer to gradient descent.
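The λ-adjustment policy can be sketched with the same kind of hypothetical exponential fit; the accept/reject rule, factors of 10, and starting values are all illustrative choices, not a definitive implementation.

```python
import numpy as np

# Levenberg-Marquardt sketch: small lam behaves like Gauss-Newton,
# large lam behaves like gradient descent.  Hypothetical data and model.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * t)

def residual_and_jac(beta):
    f = beta[0] * np.exp(beta[1] * t)
    Jf = np.column_stack([np.exp(beta[1] * t),
                          beta[0] * t * np.exp(beta[1] * t)])
    return y - f, Jf

beta = np.array([1.0, 0.1])            # a deliberately rough start
lam = 1.0
cost = np.sum(residual_and_jac(beta)[0] ** 2)
for _ in range(100):
    r, Jf = residual_and_jac(beta)
    # Damped normal equations: (Jf^T Jf + lam * I) delta = Jf^T r
    delta = np.linalg.solve(Jf.T @ Jf + lam * np.eye(2), Jf.T @ r)
    trial = beta + delta
    trial_cost = np.sum(residual_and_jac(trial)[0] ** 2)
    if trial_cost < cost:
        beta, cost = trial, trial_cost
        lam /= 10    # good step: move toward Gauss-Newton
    else:
        lam *= 10    # bad step: reject it, move toward gradient descent

print(beta)  # converges to [2.0, 0.5]
```

From this rough starting point plain Gauss-Newton would overshoot badly on the first step; the damping rejects such steps and raises λ until progress resumes, which is exactly the adjustability described above.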