Machine Learning & Deep Learning - The Stochastic Gradient Descent Algorithm and Its Optimizations

When there is no way to obtain an analytical solution, we can optimize with gradient descent, which applies to almost all deep learning models.
Having previously studied intelligent scheduling algorithms and optimization, I already had some basic intuition about how to find local minima and how to escape them, so the stochastic gradient algorithm and its refinements were easy to pick up. Not difficult at all.

Gradient Descent

Gradient descent is a first-order optimization algorithm, often called the method of steepest descent. Starting from the current point, it iteratively steps a specified distance in the direction opposite to the gradient, searching for a local minimum of the function. In the best case, the local minimum found is also the global minimum.
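As a concrete illustration, here is a minimal sketch of gradient descent on the one-dimensional function f(x) = (x - 3)^2, whose minimum is at x = 3; the step size and iteration count are illustrative choices, not from the original:

```python
# Minimal sketch of gradient descent on f(x) = (x - 3)^2, whose minimum
# is at x = 3. The step size (eta) and iteration count are illustrative.
def gradient_descent(grad, x0, eta=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)  # step opposite to the gradient
    return x

grad_f = lambda x: 2 * (x - 3)   # derivative of (x - 3)^2
x_min = gradient_descent(grad_f, x0=0.0)
```

With a fixed step size of 0.1, each iteration shrinks the distance to the minimum by a constant factor, so x_min converges close to 3.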

Stochastic Gradient Descent

But applying gradient descent directly requires using all the samples for every parameter update. When the total number of samples is very large, this severely slows the algorithm down, which is why the stochastic gradient descent algorithm exists.
It improves on gradient descent by randomly selecting only a subset of the samples for each update. The batch size is generally an integer power of 2, typically in the range 32-256, which preserves computational accuracy while improving speed. It is the most commonly used class of algorithms for optimizing deep neural networks.
During training it usually uses a fixed learning rate, namely:
$$g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1})$$
$$\Delta\theta_t = -\eta \cdot g_t$$
where $g_t$ is the gradient at step $t$ and $\eta$ is the learning rate.
During optimization, the stochastic gradient descent algorithm depends entirely on the gradient computed from the current batch of data, and the learning rate is the parameter that adjusts the gradient's influence. By controlling the size of the learning rate η, the network's training speed can be controlled to some extent.
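The mini-batch update described above can be sketched for a simple linear-regression mean-squared-error loss. The batch size of 32 falls in the 32-256 range mentioned; the data, learning rate, and epoch count are hypothetical choices for illustration:

```python
import numpy as np

# Mini-batch SGD sketch on a linear-regression mean-squared-error loss.
# The batch size of 32 is in the 32-256 range mentioned above; the data,
# learning rate, and epoch count are hypothetical illustrative choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # g_t: MSE gradient computed only on the current mini-batch
        g = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= eta * g                            # fixed learning rate eta
```

Each update touches only 32 samples rather than all 1000, which is exactly the speed/accuracy trade-off described above.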

Problems with Stochastic Gradient Descent Algorithm

Stochastic gradient descent is very effective in most situations, but it still has defects:
1. It is difficult to choose an appropriate η, and using the same learning rate for all parameters may not work well. In this case, a training scheme with a changing learning rate can be used: for example, have the network update its parameters with a large learning rate early in training and a small one later. (This is similar in spirit to the crossover and mutation probabilities in genetic algorithms; the idea behind the adaptive genetic algorithm is the same.)
2. It converges easily to a local optimum, and once trapped there it is hard to escape. (This is also similar to a problem genetic algorithms can face, which was addressed at the time by combining them with simulated annealing to solve premature convergence; the underlying idea is to increase the mutation probability, since a mutation is likely to jump out of the local optimum.)
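The "large learning rate early, small learning rate later" scheme from point 1 can be sketched as a simple step-decay schedule; the base rate, decay factor, and interval here are hypothetical values for illustration:

```python
# Step-decay schedule sketch: a large learning rate early in training,
# shrunk by a fixed factor every `decay_every` epochs. The base rate,
# factor, and interval are hypothetical values for illustration.
def step_decay(epoch, base_lr=0.1, factor=0.5, decay_every=10):
    return base_lr * (factor ** (epoch // decay_every))

lrs = [step_decay(e) for e in range(30)]
```

Epochs 0-9 would train at 0.1, epochs 10-19 at 0.05, and epochs 20-29 at 0.025, so parameter updates shrink as training approaches a minimum.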

Standard Momentum Optimization

Momentum updates the network's parameters by simulating the inertia of a moving object: each update takes into account, to some extent, the direction of the previous parameter update and combines it with the gradient computed on the current batch to determine the final size and direction of the update.
The idea of momentum is introduced into optimization to accelerate learning, especially in the face of small, consistent, noisy gradients. Using momentum not only increases the stability of the parameter updates but also reaches converged parameters faster.
After momentum is introduced, the network's parameter update becomes:
$$g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1})$$
$$m_t = \mu \cdot m_{t-1} + g_t$$
$$\Delta\theta_t = -\eta \cdot m_t$$
where $m_t$ is the accumulated momentum at the current step and $\mu$ is the momentum factor, which adjusts how much the previous step's momentum influences the parameters.
In the early stage of network updates, the previous and current descent directions agree, so multiplying by a larger μ gives good acceleration; in the later stage, as the gradient gradually tends to 0 and the parameters oscillate back and forth around a local minimum, momentum enlarges the update step and helps jump out of the trap of a local optimal solution.
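A minimal sketch of the momentum update above, applied to the same kind of toy quadratic f(x) = (x - 3)^2; μ = 0.9 and η = 0.01 are common but illustrative choices:

```python
# Momentum update sketch, following the formulas above:
#   m_t = mu * m_{t-1} + g_t,   theta_t = theta_{t-1} - eta * m_t
# Applied to the toy quadratic f(x) = (x - 3)^2; mu = 0.9 and eta = 0.01
# are common but illustrative choices.
def momentum_sgd(grad, x0, eta=0.01, mu=0.9, steps=500):
    x, m = x0, 0.0
    for _ in range(steps):
        m = mu * m + grad(x)   # accumulate momentum from past gradients
        x = x - eta * m        # update along the accumulated direction
    return x

x_min = momentum_sgd(lambda x: 2 * (x - 3), x0=0.0)
```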

Nesterov momentum optimization

The Nesterov term (Nesterov momentum) is a correction applied when computing the gradient; it avoids updating the parameters too fast while improving sensitivity. In standard momentum, the previously accumulated momentum does not affect the current gradient, so Nesterov's improvement is to let the previous momentum directly influence the current gradient, that is:
$$g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1} - \eta \cdot \mu \cdot m_{t-1})$$
$$m_t = \mu \cdot m_{t-1} + g_t$$
$$\Delta\theta_t = -\eta \cdot m_t$$
The difference between Nesterov momentum and standard momentum lies in how the gradient for the current batch is computed: with Nesterov momentum, the gradient is evaluated after the current velocity has been applied. It can therefore be seen as adding a correction factor to the standard momentum method, improving the algorithm's update behavior.
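The look-ahead gradient that distinguishes Nesterov momentum can be sketched by evaluating the gradient at θ - η·μ·m before accumulating; hyperparameters are again illustrative:

```python
# Nesterov momentum sketch: the gradient is evaluated at the look-ahead
# point theta - eta * mu * m (after the current velocity is applied),
# matching the update formulas above. Hyperparameters are illustrative.
def nesterov_sgd(grad, x0, eta=0.01, mu=0.9, steps=500):
    x, m = x0, 0.0
    for _ in range(steps):
        g = grad(x - eta * mu * m)  # look-ahead gradient
        m = mu * m + g
        x = x - eta * m
    return x

x_min = nesterov_sgd(lambda x: 2 * (x - 3), x0=0.0)
```

The only change from the standard momentum code is the point at which `grad` is called; this is the "correction factor" described above.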
At the beginning of training, the parameters may be far from the optimum, so a larger learning rate is needed; after several rounds of training, the learning rate should be reduced (again similar in spirit to the adaptive genetic algorithm). For this reason, many adaptive-learning-rate algorithms such as Adadelta, RMSProp, and Adam have been proposed.
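As a taste of that adaptive-learning-rate family, here is a minimal RMSProp-style sketch (a simplification, not the full published algorithm; ρ and ε are conventional values used purely for illustration) in which each step is scaled by a running average of squared gradients:

```python
# Minimal RMSProp-style sketch of an adaptive learning rate: each step is
# scaled by a running average of squared gradients, so parameters with
# persistently large gradients take smaller effective steps. rho and eps
# are conventional values, used here purely for illustration.
def rmsprop(grad, x0, eta=0.01, rho=0.9, eps=1e-8, steps=1000):
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        v = rho * v + (1 - rho) * g * g     # running average of g^2
        x = x - eta * g / (v ** 0.5 + eps)  # per-step adaptive scaling
    return x

x_min = rmsprop(lambda x: 2 * (x - 3), x0=0.0)
```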


Reprinted from: blog.csdn.net/m0_52380556/article/details/131864650