Deep learning optimizers (GD and SGD)

Reference:

Why is stochastic gradient descent (SGD) a good method? https://www.leiphone.com/news/201709/c7nM342MTsWgau9f.html

 

1. GD

So "ä¹è¯'éæºæéä¸éæ³ (SGD) æ¯ä¸ä¸ªå¾å ¥ ½çæ¹æ³ï¼

Here x_t is the position at step t, ∇f(x_t) is the derivative (gradient) at that point, and η is the step size. The algorithm is very simple: it just repeats this update over and over.
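To make the update concrete, here is a minimal NumPy sketch (my own illustration, not code from the referenced article); the quadratic objective and the step size of 0.1 are arbitrary choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, num_steps=100):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_f(x)   # one full-gradient step per iteration
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
print(x_min)   # very close to [0, 0]
```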

Disadvantages:

(1) In practice, especially in machine learning applications, we face very large data sets. If we insist on computing the exact derivative ∇f(x_t) at each step (never mind exactly what f is; every machine learning algorithm has some objective like this), that often means spending several hours scanning the entire data set, only to take one small step. GD generally needs tens of thousands of steps to converge, so it simply cannot finish running.

(2) If we are unlucky enough to land at a saddle point, or at a relatively poor local optimum, GD cannot get out, because the derivative at such points is 0.
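To see disadvantage (2) concretely, here is a tiny demo (my own example, not from the article): on f(x, y) = x² − y², the origin is a saddle point, and a GD run started exactly on the x-axis converges to it and never leaves, because the gradient there is zero.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at (0, 0): its gradient (2x, -2y) vanishes there.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])

p = np.array([1.0, 0.0])       # start exactly on the x-axis (zero y-component)
for _ in range(200):
    p = p - 0.1 * grad(p)      # plain GD steps
print(p)                       # ends at [~0, 0]: stuck at the saddle, where the gradient is 0
```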

 

2. SGD

The update rule of SGD looks similar to that of GD:

x_{t+1} = x_t − η · g_t

Here g_t is called the stochastic gradient, and it satisfies E[g_t] = ∇f(x_t).

That is, although g_t contains some randomness, in expectation it equals the true gradient. Pictured as a diagram, SGD is like a drunk version of GD: it vaguely knows the way and eventually makes it home, but it walks a crooked path. (In the original figure, the red line is the GD path and the pink line is the SGD path.)

So "ä¹è¯'éæºæéä¸éæ³ (SGD) æ¯ä¸ä¸ªå¾å ¥ ½çæ¹æ³ï¼

In fact SGD needs more steps to converge; after all, it is drunk. However, because its requirement on the gradient is so loose (the estimate may contain a lot of noise, as long as it is right in expectation, and sometimes not even that is necessary), each gradient is very cheap to compute. In the machine learning example just mentioned, e.g. today's neural networks, each training step takes 128 or 256 data points out of a million, computes an inexact gradient from them, and takes one SGD step. Think about it: each step is roughly ten thousand times cheaper to compute, so even if you have to walk several times as far, it is still an excellent deal.
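Here is a minimal mini-batch SGD sketch of that scenario (my own illustration; the least-squares objective, the synthetic data, and the batch size of 128 are assumptions, not details from the article). The mini-batch gradient is noisy but unbiased, which is exactly the E[g_t] = ∇f(x_t) property above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem over one million points: minimize mean_i (a_i . w - b_i)^2.
n, d = 1_000_000, 10
A = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
b = A @ w_true + 0.01 * rng.normal(size=n)

def sgd(A, b, batch_size=128, eta=0.01, num_steps=2000):
    w = np.zeros(A.shape[1])
    for _ in range(num_steps):
        idx = rng.integers(0, len(b), size=batch_size)    # sample a small mini-batch
        residual = A[idx] @ w - b[idx]
        g = 2 * A[idx].T @ residual / batch_size          # noisy but unbiased gradient estimate
        w = w - eta * g                                   # same update rule as GD
    return w

w_hat = sgd(A, b)
print(np.linalg.norm(w_hat - w_true))   # small: a good solution without ever scanning all the data
```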

Advantages:

In practice, it has been found that besides being fast to compute, SGD has many other excellent properties. It automatically escapes saddle points, it automatically escapes relatively poor local optima, and the answer it finally finds generalizes well; that is, it also performs well on data from the same distribution that it has never seen before!

Why can SGD escape saddle points?

First, consider the points where the derivative is zero. These are called stationary points. A stationary point may be a (local) minimum, a (local) maximum, or a saddle point. How do we tell them apart? We can compute the Hessian matrix H (a small numerical check is sketched right after the list below).

  • If H is negative definite, all of its eigenvalues are negative. In that case, no matter which direction you move in, the function value decreases. So the point is a (local) maximum.

  • If H is positive definite, all of its eigenvalues are positive. In that case, no matter which direction you move in, the function value increases. So the point is a (local) minimum.

  • If H has both positive and negative eigenvalues, the stationary point is a saddle point: the function value rises along some directions and falls along others.

  • Although it may look as if the cases above cover everything, they do not! There is one more important case: H may have eigenvalues equal to 0. In that situation we cannot tell which category the stationary point belongs to, and we usually need higher-order derivatives. Think about it: if an eigenvalue is 0, the function looks flat along that direction and the value stays the same, so of course we cannot tell what happens. :)
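As mentioned above, here is a small numerical version of this classification (my own sketch; the tolerance used to treat an eigenvalue as zero is an arbitrary choice):

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    """Classify a stationary point from the eigenvalue signs of its Hessian H."""
    eig = np.linalg.eigvalsh(hessian)   # H is symmetric, so the eigenvalues are real
    if np.any(np.abs(eig) < tol):
        return "degenerate (a zero eigenvalue: need higher-order information)"
    if np.all(eig > 0):
        return "(local) minimum"        # H positive definite
    if np.all(eig < 0):
        return "(local) maximum"        # H negative definite
    return "saddle point"               # both positive and negative eigenvalues

# f(x, y) = x^2 - y^2 at (0, 0): the Hessian is diag(2, -2), a saddle.
print(classify_stationary_point(np.diag([2.0, -2.0])))   # saddle point
print(classify_stationary_point(np.diag([2.0,  2.0])))   # (local) minimum
```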

Today we only discuss the first three cases, not the fourth. The fourth case is called degenerate, so what we consider is the non-degenerate case.

In this non-degenerate setting, we consider an important class of functions, namely strict saddle functions. Such a function has the following property: for every point x,

  • either the gradient at x is relatively large,

  • or the Hessian matrix at x has a negative eigenvalue,

  • or x is already very close to a (local) minimum.

Why should x satisfy at least one of these three conditions? Because of the following (a toy numerical check is sketched after this list):

  • If the gradient at x is large, then moving along the gradient direction can reduce the function value a great deal (we do have to assume the function is smooth).

  • If the Hessian at x has a negative eigenvalue, then after adding a small random perturbation the iterate may pick up a component in that direction and slide down it like a slide, again reducing the function value a great deal.

  • If x is already close to a (local) minimum, then we are done; nothing in this world is perfect, and being this close is as good as landing on the exact point.
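As a toy numerical check of the three cases (entirely my own sketch; the thresholds stand in for the constants in the formal definition, and the distance to the nearest local minimum is assumed to be given):

```python
import numpy as np

def strict_saddle_case(grad, hessian, dist_to_local_min,
                       eps_g=1e-3, eps_h=1e-3, eps_x=1e-3):
    """Report which of the three strict-saddle conditions holds at a point."""
    if np.linalg.norm(grad) >= eps_g:
        return "large gradient: an ordinary gradient step makes good progress"
    if np.min(np.linalg.eigvalsh(hessian)) <= -eps_h:
        return "negative curvature: a small random perturbation slides down that direction"
    if dist_to_local_min <= eps_x:
        return "already close to a (local) minimum: good enough, stop here"
    return "none of the three: the function is not strict saddle at this point"

# At the saddle (0, 0) of f(x, y) = x^2 - y^2: zero gradient, but negative curvature along y.
print(strict_saddle_case(np.zeros(2), np.diag([2.0, -2.0]), dist_to_local_min=np.inf))
```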

So, if the function we consider satisfies the strict saddle property, then SGD will in fact not get trapped at saddle points. But is the strict saddle property a reasonable assumption?

In fact, a large number of functions arising in machine learning problems satisfy it, for example orthogonal tensor decomposition, dictionary learning, matrix completion, and so on. Moreover, there is no need to worry about ending up at a local optimum instead of the global optimum, because for a large number of machine learning problems it has been found that almost all local optima are almost equally good; that is, finding a local optimum essentially means finding the global optimum. Orthogonal tensor decomposition satisfies this property, and Tengyu Ma's NIPS16 best student paper proves that matrix completion satisfies it as well. I think neural networks, from a certain point of view, also (almost) satisfy it; we just do not yet know how to prove it.

Next, let us talk about the proofs, mainly the second one. The first paper essentially states, in mathematical language, that "adding a perturbation at the saddle point lets the iterate slide down along the direction of the negative eigenvalue." The second is very interesting, and I think its idea is worth introducing.

First, the algorithm is changed slightly. It is no longer pure SGD; instead it runs several steps of GD and then one step of SGD. Of course, nobody actually uses it this way in practice, but for theoretical analysis this is fine. When is the SGD step run? Only when the derivative is relatively small and no SGD step has been taken for a long time. In other words, a random perturbation is added only when the iterate is truly stuck at a saddle point.
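A rough sketch of that modified procedure (my own rendering of the idea described above, not the authors' pseudocode; the thresholds, the noise radius, and the test function are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_gd(grad_f, x0, eta=0.1, num_steps=500,
                 grad_tol=1e-3, noise_radius=1e-2, cooldown=50):
    """GD steps, plus a rare random perturbation when the iterate looks stuck."""
    x = np.asarray(x0, dtype=float)
    last_kick = -cooldown
    for t in range(num_steps):
        g = grad_f(x)
        if np.linalg.norm(g) < grad_tol and t - last_kick >= cooldown:
            x = x + noise_radius * rng.normal(size=x.shape)   # perturb only when really stuck
            last_kick = t
        else:
            x = x - eta * g                                   # ordinary GD step
    return x

# f(x, y) = x^2 + y^4/4 - y^2/2 has a saddle at (0, 0) and minima at (0, +1) and (0, -1).
grad = lambda p: np.array([2 * p[0], p[1] ** 3 - p[1]])

# Started exactly on the x-axis, plain GD stalls at the saddle; the random perturbation gives
# the iterate a small y-component, and it then slides down to one of the two minima.
print(perturbed_gd(grad, x0=[1.0, 0.0]))   # approximately [0, 1] or [0, -1]
```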

Because the saddle point has a negative eigenvalue, as long as the perturbation has even a small component along that direction, the iterate can slide all the way down. Only when that component is extremely small might it remain trapped near the saddle point. In other words, after adding a random perturbation, the iterate escapes the saddle point with high probability!

Although this idea is straightforward, proving it rigorously is not easy, because the specific function may be very complex and the Hessian matrix keeps changing, so showing that "after the perturbation, the probability of staying stuck near the saddle point is small" is nontrivial.

The authors take a very clever approach: along the direction of the negative eigenvalue, if the projected distance between any two points in that direction is greater than u/2, then at least one of them can escape the saddle point after a few more GD steps. In other words, the region of points that keep getting stuck near the saddle point is at most u wide along that direction! By computing this width, one can bound the probability, showing that the SGD + GD algorithm escapes the saddle point with high probability.

 

 


Origin: blog.csdn.net/bl128ve900/article/details/94293284