Gradient descent in deep learning

Gradient descent is the main approach to solving the minimization problem. The basic idea is to keep moving toward the optimum: at every step, the optimization proceeds in the direction of the (negative) gradient.

(1) Stochastic gradient descent:
Each update is performed with a single sample drawn at random from the sample set.

To traverse the entire sample set, many iterations are needed, and each individual update is not necessarily in the optimal direction, so every step must be taken "carefully": the learning rate α of stochastic gradient descent cannot be set too large, otherwise the parameters tend to oscillate back and forth near the optimal solution without ever getting closer to it.
From another point of view, when the loss function has many local minima, this "back and forth oscillation" in the optimization path can help the model avoid getting stuck in a local optimum.
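To make the update rule concrete, here is a minimal sketch of stochastic gradient descent on a toy linear-regression loss; the synthetic data, variable names, and the learning rate alpha are illustrative assumptions, not values from the original post.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features (toy data)
y = X @ np.array([1.0, -2.0, 0.5])       # targets from a known weight vector

w = np.zeros(3)                          # parameters to learn
alpha = 0.01                             # kept small to limit oscillation near the optimum

for epoch in range(20):
    for i in rng.permutation(len(X)):    # visit samples in random order
        xi, yi = X[i], y[i]
        grad = 2 * (xi @ w - yi) * xi    # gradient of the squared error on one sample
        w -= alpha * grad                # update from this single sample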

(2) Standard gradient descent:
The parameters are updated only after the loss function has been summed over the entire sample set.

Because a parameter update is performed only after traversing the entire sample set, the descent direction is the best available estimate of the optimal direction, so each step can be taken confidently.
The learning rate of this algorithm can therefore generally be larger than that of stochastic gradient descent. The disadvantage of this optimization method is that every update requires traversing the entire sample set, which is inefficient, since in many cases the gradient computed on the whole sample set differs little from the gradient computed on a portion of it.
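A comparable sketch of the full-batch update, under the same toy linear-regression assumptions as the stochastic version above: one parameter update per traversal of the whole sample set, with a larger learning rate.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
alpha = 0.1                              # can be larger than in the stochastic version

for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)   # gradient averaged over the entire sample set
    w -= alpha * grad                        # one update per full traversal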
(3) Batch gradient descent:
Each iteration draws M (batch_size) samples at random from the sample set.

Compared with the first two methods, this both improves the accuracy of the model and improves the speed of the algorithm.
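A sketch of the mini-batch variant under the same toy assumptions; batch_size = 16 is an arbitrary illustrative value for M.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
alpha = 0.05
batch_size = 16                          # M samples per update

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        Xb, yb = X[b], y[b]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(b)   # gradient over the mini-batch only
        w -= alpha * grad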
(4) Momentum gradient descent:
Also known as the momentum method. The basic idea: solving for the optimal solution of the loss function can be viewed as a ball dropped somewhere on a surface (the surface of loss function values in the coordinate system) and rolling along that surface until it reaches the lowest point. The gradient of the loss function can be regarded as a force applied to the ball; the force changes the ball's velocity, and the velocity in turn changes the ball's position.

The momentum coefficient can be determined by trial and error; in practice it is often set to 0.9.
The gradient does not immediately change the optimization direction. Instead, the direction used at each step is a weighted accumulation of the previously computed direction and the gradient computed this time, so each step changes the direction only a little, but the changes accumulate. The benefit of this approach is that although the gradients are obtained from different training samples, their component along the optimal direction keeps accumulating, which reduces the amount of oscillation.
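A sketch of the momentum update on the same toy problem; gamma = 0.9 is the momentum coefficient mentioned above, and the remaining values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
v = np.zeros(3)                          # accumulated "velocity"
alpha, gamma = 0.05, 0.9                 # learning rate and momentum coefficient

for step in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    v = gamma * v + alpha * grad         # blend the previous direction with the new gradient
    w -= v                               # the position changes through the velocity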
(5) Nesterov momentum gradient descent:
An improvement on momentum gradient descent.

Since the velocity term has already been obtained, we can take a "look-ahead" step: instead of computing the gradient at the current position θ, we compute it at the position reached by first moving along the accumulated velocity. Although this look-ahead position is not exactly where the parameters will end up, it is a better place to evaluate the gradient than the current position θ.
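A sketch of the Nesterov variant under the same assumptions; the only change from the momentum sketch is that the gradient is evaluated at the look-ahead point w - gamma * v rather than at w.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
v = np.zeros(3)
alpha, gamma = 0.05, 0.9

def grad_at(theta):
    return 2 * X.T @ (X @ theta - y) / len(X)

for step in range(100):
    lookahead = w - gamma * v            # "look ahead" along the accumulated velocity
    v = gamma * v + alpha * grad_at(lookahead)   # gradient at the look-ahead point, not at w
    w -= v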

(6) AdaGrad gradient descent:
AdaGrad can adaptively assign a different learning rate to each parameter.
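A sketch of the AdaGrad update on the same toy problem; each parameter divides its step by the square root of its own accumulated squared gradients, which is how the per-parameter learning rates arise. The hyperparameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
cache = np.zeros(3)                      # per-parameter sum of squared gradients
alpha, eps = 0.1, 1e-8

for step in range(300):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    cache += grad ** 2                               # grows monotonically over training
    w -= alpha * grad / (np.sqrt(cache) + eps)       # effective rate differs per parameter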

RMSProp is an improvement on the AdaGrad optimization algorithm; its core idea is to use an exponentially decaying moving average of the squared gradients so that the distant past history is gradually discarded.
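RMSProp changes only how that accumulator is maintained: an exponentially decaying average replaces the ever-growing sum, so old gradients fade out. A sketch under the same toy assumptions (rho and alpha are illustrative defaults):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
cache = np.zeros(3)
alpha, rho, eps = 0.01, 0.9, 1e-8        # rho is the decay rate of the moving average

for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    cache = rho * cache + (1 - rho) * grad ** 2      # decaying average: distant history is forgotten
    w -= alpha * grad / (np.sqrt(cache) + eps)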
(7) Adam gradient descent:
Adam takes both the gradient and the square of the gradient into account, combining the advantages of AdaGrad and RMSProp. Using first-order and second-order moment estimates of the gradient, Adam dynamically adjusts the learning rate of each parameter.

The first moment is the mean of the gradients, and the second moment is the uncentered variance of the gradients. The decay rate β1 is generally set to 0.9, β2 to 0.999, and ε is usually set to 10^-8.

Like AdaDelta, this method stores an exponentially decaying average of past squared gradients, but it also maintains an exponentially decaying average of past gradients M(t), which plays a role similar to momentum.
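A sketch of the Adam update on the same toy problem, using the default values quoted above (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); the bias-correction terms compensate for m and v being initialised at zero.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
m = np.zeros(3)                          # first-moment estimate (mean of the gradients)
v = np.zeros(3)                          # second-moment estimate (uncentered variance)
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 501):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)         # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)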

Learning rate: controls the pace at which the model learns.

At the beginning of training: a learning rate of 0.01 to 0.001 is appropriate.
After a certain number of epochs: gradually reduce it.
Near the end of training: the learning rate should have decayed by a factor of 100 or more (a step-decay sketch follows this list).
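One way to realise this guidance is a simple step-decay schedule; the function below is a hypothetical sketch (the drop factor and interval are illustrative, not values from the original post).

def step_decay_lr(epoch, base_lr=0.01, drop=0.1, epochs_per_drop=30):
    # Start around 1e-2 and shrink the rate by 10x every epochs_per_drop epochs.
    return base_lr * (drop ** (epoch // epochs_per_drop))

# Epochs 0-29 -> 0.01, 30-59 -> 0.001, 60-89 -> 0.0001: a 100x decay by the end.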
