Optimization - gradient descent

Gradient descent variants and how they differ

Gradient descent comes in three variants: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. The classification is based on how much data each training update uses.

 

Batch gradient descent (BGD) uses all samples in every gradient computation: at each step, the gradient of the error over the entire training set is computed and used to update the parameters.

Advantages: every iteration is computed over all samples, so the computation can be expressed as matrix operations and parallelized. Moreover, the gradient direction over the whole dataset better reflects the overall trend of the data, so updates move more accurately toward the extremum. When the objective function is convex, BGD is guaranteed to find the global optimum.

Drawback: each step costs a lot of time and memory, so BGD cannot be applied to large-scale problems.
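As a minimal sketch of BGD (the linear model, data, and learning rate below are illustrative assumptions, not from the original post), every update computes the gradient over all samples at once:

```python
import numpy as np

# Illustrative setup: fit y = 2x + 1 by minimizing mean squared error.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2 * X + 1

w, b = 0.0, 0.0
eta = 0.1  # learning rate (assumed value)
for _ in range(500):
    err = w * X + b - y
    # The defining trait of BGD: gradients averaged over ALL samples.
    grad_w = 2 * np.mean(err * X)
    grad_b = 2 * np.mean(err)
    w -= eta * grad_w
    b -= eta * grad_b
```

Because the loss is convex (quadratic), the iterates converge to the global optimum (w, b) = (2, 1).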

 

The stochastic gradient descent method (SGD) takes the opposite approach: it gives up some gradient accuracy and, at each iteration, computes the gradient direction from a single randomly selected sample to update the weights.

Advantages: each iteration computes the gradient from only one sample, which greatly reduces time and memory consumption, so iterations are much faster.

Drawbacks: accuracy drops, and even when the objective function is strongly convex, SGD struggles to achieve linear convergence. Because the gradient of a single sample often does not represent the trend of the whole dataset, SGD easily falls into local optima; in worse cases it gets stuck at saddle points or in valleys. It is also hard to parallelize.
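A minimal SGD sketch under a similarly illustrative setup (a one-parameter least-squares fit; all data and constants are assumptions chosen so the example converges cleanly):

```python
import numpy as np

# Illustrative setup: noise-free labels y = 3x, one parameter w.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
y = 3 * X

w = 0.0
eta = 0.05  # learning rate (assumed value)
for t in range(2000):
    i = rng.integers(len(X))          # one random sample per iteration
    g = 2 * (w * X[i] - y[i]) * X[i]  # gradient from that single sample
    w -= eta * g
```

With noise-free labels every sample agrees on the optimum, so w converges to 3; with noisy labels the single-sample gradient would keep oscillating around the optimum instead.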

 

Mini-batch gradient descent (Mini-batch GD) is a compromise between BGD and SGD: at each iteration it updates the parameters using a mini-batch of batch-size samples.

Advantages: it allows parallel matrix operations, so optimizing the network parameters on a mini-batch is not much slower than on a single sample. It also accelerates convergence, reducing the number of iterations needed, and its accuracy is not much worse than BGD's.

Drawback: the choice of batch size affects training quality; for example, a batch size that is too small can still cause oscillation.
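A mini-batch sketch in the same illustrative style (data, batch size, and learning rate are assumptions): each epoch shuffles the data and steps through it in chunks of `batch_size` samples.

```python
import numpy as np

# Illustrative setup: noise-free labels y = 4x, one weight.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(256, 1))
y = (4 * X).ravel()

w = np.zeros(1)
eta, batch_size = 0.1, 32  # assumed values
for epoch in range(100):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w - y[batch]
        # Gradient averaged over the mini-batch only.
        grad = 2 * X[batch].T @ err / batch_size
        w -= eta * grad
```

The inner gradient is a matrix product over 32 samples, which is where the parallelism over BGD's full-dataset product and SGD's single sample comes from.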

Improvements over SGD

Because SGD picks a random sample for each training step, gradients of different magnitudes from different samples cause oscillation during training: the iterates do not approach the optimum steadily, and they easily fall into local optima, valleys, or saddle points. In a valley, the correct direction is to descend along the valley floor, but SGD oscillates back and forth between the two walls, reducing both accuracy and convergence speed. At a saddle point, SGD enters a flat region: the current point is still far from the minimum, but the gradient there is almost zero, so it can no longer guide the search.

The stochastic gradient descent update is $\theta_{t+1} = \theta_{t} - \eta g_{t}$, where $g_{t}$ is the gradient at time $t$ and $\eta$ is the learning rate.

A series of improvements therefore developed along two directions: reducing the high-variance oscillation, and handling learning-rate decay and sparse features.

Momentum method (Momentum)

It is analogous to making the ball heavier: in a valley, gravity lets it descend quickly, and at a saddle point its inertia carries it across the flat region and out. The method takes the history of past steps into account when computing the current step.

Its iterative formula is:

$v_{t} = \gamma v_{t-1} + \eta g_{t}$

$\theta_{t+1} = \theta_{t}-v_{t}$

The update incorporates momentum: the step $-v_{t}$ is determined by both the current gradient $g_{t}$ and the previous step $v_{t-1}$, which reflects inertia, i.e. the reuse of past information. The current velocity is the result of the previous velocity interacting with the current acceleration, and $\gamma$ plays the role of friction. When the momentum term $\gamma v_{t-1}$ points in the same direction as the current gradient, the step is amplified; when it points in the opposite direction, the step is damped. Momentum thus strengthens updates along directions where gradients stay consistent and suppresses unnecessary updates along directions where they keep flipping.
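The two momentum equations can be sketched on a simple one-dimensional quadratic $f(\theta) = \theta^2$ (the function, starting point, and coefficients are illustrative assumptions), using the convention $v_{t} = \gamma v_{t-1} + \eta g_{t}$, $\theta_{t+1} = \theta_{t} - v_{t}$:

```python
theta, v = 5.0, 0.0
gamma, eta = 0.9, 0.05  # gamma: momentum/friction coefficient; eta: learning rate
for _ in range(300):
    g = 2 * theta            # gradient of f(theta) = theta**2
    v = gamma * v + eta * g  # v_t = gamma * v_{t-1} + eta * g_t
    theta = theta - v        # theta_{t+1} = theta_t - v_t
```

Early on, v accumulates gradient in a consistent direction and the steps grow; near the minimum the gradient flips sign, v is damped by gamma, and theta settles at 0.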

 


Source: www.cnblogs.com/wzhao-cn/p/11279379.html