Commonly used optimization methods in deep learning

The following methods are summarized from Andrew Ng's deep learning course.

(1) Gradient descent

Batch gradient descent (size = m): slow, but every step moves in the best direction for the full training set;
Stochastic gradient descent (size = 1): cannot take advantage of vectorization, so it is relatively slow, and it ends up hovering around the optimum;
Mini-batch gradient descent (size = 16, 32, 64, 128, ...): faster; it also hovers around the optimum, but the learning rate can be decayed so that it settles at the optimum. A minimal sketch of the mini-batch loop is given below.
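
As an illustration only, the short loop below sketches mini-batch gradient descent on a toy linear-regression problem; the data, the squared-error loss, and the hyperparameter values are assumptions made for this sketch, not part of the course notes.

```python
import numpy as np

# Toy data: m = 1000 examples, 3 features (assumed for illustration).
np.random.seed(0)
X = np.random.randn(1000, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(1000)

w, b = np.zeros(3), 0.0
alpha = 0.1          # learning rate
batch_size = 64      # size = 1 gives SGD, size = m gives batch gradient descent

for epoch in range(50):
    perm = np.random.permutation(len(X))      # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b - yb                 # prediction error on the mini-batch
        dw = Xb.T @ err / len(idx)            # gradient of the mean squared error
        db = err.mean()
        w -= alpha * dw                       # plain gradient descent update
        b -= alpha * db
```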

(2) Gradient descent with momentum (Momentum)

The momentum method takes an exponentially weighted average of past gradients, which makes the update process smoother.
Algorithm: initialize v_dW and v_db to zero; on each iteration compute

v_dW = β·v_dW + (1 − β)·dW,  v_db = β·v_db + (1 − β)·db

and then update

W = W − α·v_dW,  b = b − α·v_db.

The value of β is commonly 0.9 (it is the weight given to the past gradients).
When we run gradient descent with momentum, the exponentially weighted average makes the oscillations in the vertical direction largely cancel out, so the movement along the vertical axis becomes very small; in the horizontal direction, however, all the derivatives point the same way, so the average stays large. The resulting gradient descent path is the smoother one shown as the red line in the course figure.
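
A minimal sketch of the momentum update described above, with v_dW and v_db starting at zero and β = 0.9; the function name and calling convention are assumptions made here.

```python
import numpy as np

def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step.

    v_dw and v_db hold the exponentially weighted averages of past gradients.
    """
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    w = w - alpha * v_dw        # update with the smoothed gradient, not the raw one
    b = b - alpha * v_db
    return w, b, v_dw, v_db

# v_dw and v_db are initialized to zero, as in the notes.
w, b = np.zeros(3), 0.0
v_dw, v_db = np.zeros(3), 0.0
```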

(3) RMSprop

Besides gradient descent with momentum, RMSprop (root mean square prop) is another way to speed up gradient descent. The algorithm works as follows.
Suppose the parameter b drives the gradient in the vertical direction and the parameter w drives the gradient in the horizontal direction (in practice the parameters are of course high-dimensional). Using RMSprop, the large oscillations of the updates along a dimension can be damped: we want the w direction, i.e. the horizontal axis, to move a little faster and the b direction, i.e. the vertical axis, to move a little slower, as shown by the blue line in the course figure; gradient descent then becomes faster, as shown by the green line.
In the implementation, RMSprop keeps an exponentially weighted average of the squared derivatives and divides each gradient by the square root of that average when updating, which smooths the path. To make sure the algorithm never divides by zero, a very small value ε = 10^-8 is added to the denominator in practice.
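
The sketch below implements the RMSprop update just described: an exponentially weighted average of the squared derivatives, a square root in the denominator, and the small ε that prevents division by zero. The function signature and the default hyperparameter values are assumptions.

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop step; s_dw and s_db average the squared gradients."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)   # eps keeps the denominator away from zero
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return w, b, s_dw, s_db
```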

(4) Adam (Adaptive Moment Estimation)

The basic idea of the Adam optimization algorithm is to combine Momentum and RMSprop into a single optimization algorithm that applies widely and effectively to many different deep learning architectures.
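Since the update formulas appeared only in the figure of the original post, here is a sketch of the standard Adam update the text refers to, combining the momentum term (first moment v) with the RMSprop term (second moment s) and applying bias correction; the function name and the default values (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^-8) are the commonly used ones, stated here as assumptions.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter array w; t is the 1-based iteration count."""
    v = beta1 * v + (1 - beta1) * dw          # momentum-style first moment
    s = beta2 * s + (1 - beta2) * dw ** 2     # RMSprop-style second moment
    v_hat = v / (1 - beta1 ** t)              # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```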

(5) Learning rate decay

When we use mini-batch gradient descent to minimize the cost function with a fixed learning rate α, then once the algorithm reaches the neighborhood of the minimum, the noise across different mini-batches keeps it from converging exactly; instead it wanders within a fairly large region around the minimum, as shown by the blue line in the course figure.
If instead we use learning rate decay and gradually reduce α, the learning rate is still relatively large at the start of training, so the descent toward the minimum is fast. As α shrinks, the step size shrinks with it, and the algorithm eventually fluctuates only within a small region around the minimum, as shown by the green line. A sketch of a few decay schedules follows.
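Below is a short sketch of a few commonly used decay schedules, including the inverse-time form α = α0 / (1 + decay_rate · epoch); the function names and the example numbers are illustrative assumptions.

```python
import numpy as np

def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    """alpha = base**epoch * alpha0."""
    return alpha0 * base ** epoch

def sqrt_decay(alpha0, k, epoch):
    """alpha = k / sqrt(epoch) * alpha0; epoch + 1 avoids dividing by zero at epoch 0."""
    return alpha0 * k / np.sqrt(epoch + 1)

# Example with alpha0 = 0.2 and decay_rate = 1.0:
# epoch 0 -> 0.200, epoch 1 -> 0.100, epoch 2 -> 0.067, epoch 3 -> 0.050
```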

(6) The problem of local optima

In low dimensions we might imagine a cost function like the one on the left of the course figure, with several local minima; if the initial parameters are chosen badly, the algorithm can get stuck in a local optimum.
But when we train a neural network, most points where the gradient is zero are not local optima like the ones on the left; they are saddle points like the one on the right (so called because the surface is shaped like a saddle).
For a function in a high-dimensional space, a point with zero gradient may curve upward in some directions and downward in others. If the parameters have, say, 20,000 dimensions, then for the point to be a local optimum it would have to curve the same way in every one of those dimensions, which happens with probability on the order of 2^-20000, an extremely small chance. In other words, the intuition about local optima from low-dimensional pictures does not carry over to high dimensions: a point with zero gradient is far more likely to be a saddle point.
In high-dimensional settings:

  • It is almost impossible to get stuck at a local minimum;
  • Plateaus around saddle points slow down learning, which can be mitigated by using algorithms such as Adam.

Source: blog.csdn.net/qq_41332469/article/details/89811197