Optimizer Learning



Foreword

Optimization is a branch of applied mathematics that studies how to find the maximum and minimum values of functions under given conditions.
Optimization targets generally fall into two types: convex functions and non-convex functions. For a convex function, any local minimum is also the global minimum. A non-convex function, by contrast, can have many local minima that do not coincide with the global minimum.
In deep learning, because of the nonlinearity of the activation functions and the complexity of the network, the objective to be optimized is a highly complex non-convex function.
Optimization methods in deep learning can generally be divided into two categories:
the first category improves the update direction (SGD, Momentum),
and the second category chooses a more suitable learning rate (AdaGrad, RMSprop, Adam).


1. SGD (Stochastic Gradient Descent Algorithm)

Suppose the objective function is the one shown below.
[Formula: the objective function]
Below are the surface plot of the function and its contour map.
[Figure: surface plot and contour map of the objective function]
First, at any point the gradient of the objective function is perpendicular to the contour line through that point, and it points in the direction in which the function value rises fastest.

The stochastic gradient descent algorithm therefore repeatedly updates the parameters along the negative gradient direction, the direction in which the objective function decreases fastest, with the size of each step controlled by the learning rate (step size):

W = W - step size * gradient

The update path using the stochastic gradient descent algorithm is as follows:
[Figure: SGD update path on the contour map]
The characteristic of the stochastic gradient descent algorithm is that it is simple to implement, but it is not very efficient: its update path zig-zags toward the minimum.
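
To make the update rule concrete, here is a minimal NumPy sketch of an SGD optimizer. The `update(params, grads)` interface over dictionaries of arrays and the toy objective f(x, y) = x^2/20 + y^2 are assumptions chosen for illustration; they are not from the original article.

```python
import numpy as np

class SGD:
    """Vanilla stochastic gradient descent: W <- W - lr * gradient."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params:
            params[key] -= self.lr * grads[key]

# Assumed toy objective: f(x, y) = x**2 / 20 + y**2
def grad_f(x, y):
    return np.array([x / 10.0, 2.0 * y])

params = {"w": np.array([-7.0, 2.0])}
opt = SGD(lr=0.9)
for _ in range(30):
    grads = {"w": grad_f(*params["w"])}
    opt.update(params, grads)
print(params["w"])  # approaches the minimum at (0, 0), but along a zig-zag path
```

The same dictionary-based interface is reused in the sketches below, so switching optimizers only means swapping the optimizer instance.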

2. Momentum

Momentum's update rule is as follows, where α is the momentum coefficient (typically around 0.9):

v = α * v - step size * gradient
W = W + v
How does the velocity vector v affect the update? Consider the figure below.
[Figure: gradient direction at point B versus the previous update direction v carried over from point A]
At point B, v points along the update direction taken at the previous point, A. The gradient at B points in a different direction, as shown in the figure. Plain SGD would move purely along B's gradient, whereas with the v term the actual step falls between A's update direction and B's gradient direction. The benefit is that updates along the consistent direction build up speed, so progress is faster and the path does not swing back and forth the way SGD does.

The update path using Momentum is as follows:
[Figure: Momentum update path on the contour map]
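
A minimal sketch of a Momentum optimizer in the same dictionary-based style as the SGD class above; the default momentum coefficient of 0.9 and the lazy initialization of v are illustrative assumptions.

```python
import numpy as np

class Momentum:
    """Momentum update: v <- momentum * v - lr * gradient; W <- W + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # Lazily create one velocity array per parameter, initialized to zero.
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```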


3. AdaGrad

The methods introduced below all belong to the second category: they mainly adjust the learning rate (step size). One simple technique is learning rate decay, which gradually reduces the learning rate as training progresses.

AdaGrad adapts the learning rate appropriately for each parameter element. Its update rule is as follows:

h = h + gradient * gradient   (element-wise square, accumulated)
W = W - step size * gradient / sqrt(h)

Here h accumulates the sum of squared gradients from all previous updates. When a parameter is updated, its step is scaled by 1 / sqrt(h), so a parameter whose gradient has been large sees its effective learning rate shrink more quickly.
The update path is as follows:
[Figure: AdaGrad update path on the contour map]
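
A minimal sketch of AdaGrad following the update rule above; the small constant 1e-7 added before dividing (to avoid division by zero on the first step) is a common implementation detail assumed here.

```python
import numpy as np

class AdaGrad:
    """AdaGrad: h accumulates squared gradients; each step is scaled by 1/sqrt(h)."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.h[key] += grads[key] * grads[key]  # accumulate squared gradient
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```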
This algorithm has a drawback: if training runs for many iterations, h keeps accumulating and grows without bound, so 1 / sqrt(h) tends to 0 and the parameters effectively stop being updated. The RMSprop algorithm was proposed to address this problem. The idea of RMSprop is to add a decay factor so that the current gradient has a larger influence on the step size, while previously accumulated gradients have a smaller one. The formula is as follows:

h = decay rate * h + (1 - decay rate) * gradient * gradient
W = W - step size * gradient / sqrt(h)
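
A minimal sketch of RMSprop with the decay factor described above; the parameter name decay_rate and its default value of 0.99 are assumptions.

```python
import numpy as np

class RMSprop:
    """RMSprop: h is an exponential moving average of squared gradients."""
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            # Old squared gradients decay; the current gradient gets the larger weight.
            self.h[key] = self.decay_rate * self.h[key] \
                + (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```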

4. Adam Algorithm

The Adam algorithm is a fusion of the Momentum and AdaGrad approaches: it optimizes both the direction of the parameter update and the per-parameter learning rate.
The update path using Adam is as follows:
[Figure: Adam update path on the contour map]
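
A minimal sketch of Adam combining the two ideas: a Momentum-style moving average of the gradient for the update direction, and an RMSprop-style moving average of the squared gradient for the per-parameter step size. The defaults beta1 = 0.9, beta2 = 0.999 and the bias-correction step are the usual textbook choices, assumed here rather than taken from the original post.

```python
import numpy as np

class Adam:
    """Adam: first-moment (direction) and second-moment (step size) estimates with bias correction."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.t = 0
        self.m = None  # moving average of gradients (Momentum-like)
        self.v = None  # moving average of squared gradients (RMSprop-like)

    def update(self, params, grads):
        if self.m is None:
            self.m = {key: np.zeros_like(val) for key, val in params.items()}
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        self.t += 1
        for key in params:
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-7)
```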
