Optimizer (1)

What is an optimizer?

Explanation

If we define a machine learning model, such as a three-layer neural network, we need to make it fit the provided training data as closely as possible. But how do we evaluate whether the model fits the data well enough? We need a corresponding metric to measure the degree of fit, and the function used for this is called the loss function. When the value of the loss function decreases, we consider the model to have taken another step forward on the road of fitting. The model fits the training set best when the loss function reaches its minimum, that is, when the average value of the loss function over the specified data set is smallest.

Since it is generally difficult to directly and exactly compute the parameter values that minimize the loss function, we instead move the parameters through the "landscape" of the loss function in a direction that decreases its value; when the process converges, we obtain an approximate minimizer. Reducing the value of the loss function in this way requires an optimization algorithm. The direction in which the loss decreases fastest is the negative gradient direction, and the algorithm that follows it is called gradient descent, also known as the method of steepest descent. Currently, almost all machine learning optimization algorithms are based on gradient descent.

To sum up, during backpropagation in deep learning, the optimizer (such as gradient descent) guides the parameters of the loss function (objective function) to update by an appropriate amount in the right direction, so that the updated parameters keep pushing the value of the loss (objective) function toward the global minimum.
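To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional loss $J(\theta) = (\theta - 3)^2$; the toy loss, starting point, and learning rate are illustrative assumptions, not something from the original text:

```python
def loss(theta):
    return (theta - 3.0) ** 2        # J(theta), minimized at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)       # dJ/dtheta

theta = 0.0                          # initial parameter value
eta = 0.1                            # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)    # step along the negative gradient

print(theta)  # ~3.0: an approximate minimizer of J
```

Each iteration moves $\theta$ a little in the direction that decreases $J$, exactly the "step forward on the road of fitting" described above.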

Principle

The optimization problem can be pictured as standing somewhere on a mountain (the current parameter values) and wanting to take the best route to the bottom (the optimal point). The intuitive approach is to look around, find the direction that descends fastest, take a step, then look around again, and repeat until we reach the bottom. This is plain gradient descent: the current altitude is the value of the objective (loss) function, and the direction we choose at each step is the opposite of the function's gradient (the gradient is the direction in which the function increases fastest, so its opposite is the direction in which the function decreases fastest).

Optimization by gradient descent is the core idea of almost all optimizers. When going down the mountain, there are two aspects we care about most:

  • The first is the optimization direction, which determines whether each step heads the right way; in an optimizer this appears as the gradient or the momentum.

  • The second is the step size, which determines how far each step goes; in an optimizer this appears as the learning rate.

All optimizers focus on these two aspects, but there are other issues as well, such as where to start and how to correct a wrong route; these are the directions some of the newest optimizers concentrate on.
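To see these two aspects in code, the sketch below separates a single update into a direction and a step size, and contrasts the plain gradient direction with a momentum-style direction; the toy gradient and the coefficient 0.9 are common conventions assumed here only for illustration:

```python
def grad(theta):
    return 2.0 * (theta - 3.0)       # gradient of the toy loss (theta - 3)^2

eta = 0.1                            # step size: the learning rate

# Plain gradient descent: the direction is the negative gradient.
theta = 0.0
for _ in range(100):
    direction = -grad(theta)
    theta = theta + eta * direction

# Momentum: the direction is a running average of past gradients,
# which smooths the path when successive gradients agree.
theta, velocity, beta = 0.0, 0.0, 0.9
for _ in range(100):
    velocity = beta * velocity - eta * grad(theta)
    theta = theta + velocity
```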

Role

Gradient descent is one of the most common optimization algorithms in machine learning. It serves the following purposes:

  • Gradient descent is an iterative method that can be used to solve least squares problems (both linear and nonlinear). It can also be applied to any other problem whose loss function is differentiable, such as the cross-entropy loss.

  • For solving the model parameters of machine learning algorithms, that is, unconstrained optimization problems, the main methods are gradient descent, Newton's method, and so on.

  • When minimizing a loss function, gradient descent can be used to solve the problem iteratively, step by step, arriving at the minimized loss function and the corresponding model parameter values.

  • If we instead need to find the maximum of a function, we can iterate with the gradient ascent method. Gradient descent and gradient ascent can be converted into each other, as the sketch after this list shows.
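A one-line change illustrates the conversion mentioned in the last point: gradient ascent on $f$ is exactly gradient descent on $-f$. The concave toy function below is an assumption used only for illustration:

```python
def f_grad(theta):
    return -2.0 * (theta - 1.0)      # gradient of f(theta) = -(theta - 1)^2

theta, eta = 5.0, 0.1
for _ in range(100):
    theta = theta + eta * f_grad(theta)   # ascent: move WITH the gradient
    # equivalently, descent on -f: theta = theta - eta * (-f_grad(theta))

print(theta)  # ~1.0: the maximizer of f
```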

Gradient descent variants

Depending on how much data is used to compute the gradient of the objective function, there are three variants of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Depending on the amount of data, each makes a trade-off between the accuracy of the parameter update and the time required to perform it.

Batch gradient descent

Standard gradient descent, batch gradient descent (BGD), computes the gradient of the loss function with respect to the parameters $\theta$ over the entire training set and updates

$\theta = \theta - \eta \nabla_\theta J(\theta)$

where $\theta$ denotes the model parameters, $\eta$ is the learning rate, and $\nabla_\theta J(\theta)$ is the gradient of the loss function with respect to $\theta$. BGD can be very slow, since we must compute gradients over the entire training set to make a single parameter update, and it becomes awkward when the training set is too large to fit into memory. BGD also does not allow us to update the model online, that is, to add new training samples in real time.
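As a sketch of what one BGD run looks like, the NumPy example below computes the mean-squared-error gradient over the entire training set before every parameter change; the synthetic linear-regression data is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # the whole training set
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

theta = np.zeros(3)
eta = 0.1                                      # learning rate

for epoch in range(200):
    # gradient of the mean squared error over ALL 1000 samples
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta = theta - eta * grad                 # theta <- theta - eta * grad

print(theta)  # close to true_w
```

Note that every single update touches all 1000 samples, which is exactly why BGD becomes slow on large training sets.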

For convex error surfaces, BGD is guaranteed to converge to the global optimum; for non-convex surfaces, it converges to a local optimum.

Stochastic gradient descent

Stochastic gradient descent (SGD) instead performs one parameter update per training sample $x^{(i)}$ and label $y^{(i)}$:

$\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$

where $\theta$ denotes the model parameters, $\eta$ is the learning rate, and $\nabla_\theta J(\theta; x^{(i)}; y^{(i)})$ is the gradient of the loss on that sample with respect to $\theta$. On large data sets BGD performs many redundant computations, recomputing gradients for many similar samples before each parameter update. SGD removes this redundancy by updating after every sample, so it is usually much faster and can also be used for online learning. SGD updates have high variance, and the continual parameter updates make the objective function oscillate heavily.
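The same problem solved with SGD, one update per sample with reshuffling each epoch; this sketch reuses the X, y, true_w, and rng from the BGD example above:

```python
theta = np.zeros(3)
eta = 0.01                       # smaller step: single-sample gradients are noisy

for epoch in range(10):
    for i in rng.permutation(len(y)):            # visit samples in random order
        x_i, y_i = X[i], y[i]
        grad = 2.0 * x_i * (x_i @ theta - y_i)   # gradient on ONE sample
        theta = theta - eta * grad

print(theta)  # noisy trajectory, but ends up close to true_w
```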


BGD converges to one (local) optimum, whereas the oscillation of SGD lets it jump to new, potentially better local optima. Studies have shown that when the learning rate is decreased slowly, SGD shows the same convergence behavior as BGD, almost certainly reaching a local optimum on non-convex surfaces and the global optimum on convex ones.

Mini-batch gradient descent

Mini-batch gradient descent (MBGD) strikes a compromise between the two methods above: each step draws batchsize samples from the training set as a mini-batch and performs one parameter update:

$\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$

where $\theta$ denotes the model parameters, $\eta$ is the learning rate, $\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$ is the gradient of the loss on the mini-batch with respect to $\theta$, and $n$ is the mini-batch size (batch size). The larger the batch size, the fewer batches there are and the faster training runs, but more data may be wasted; the smaller the batch size, the more fully the data is utilized and the less of it is wasted, but there are more batches, so training takes more time.
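And the mini-batch version, again reusing X, y, true_w, and rng from the BGD sketch; the batch size of 64 follows the power-of-2 convention discussed below:

```python
theta = np.zeros(3)
eta, batch_size = 0.1, 64

for epoch in range(30):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                  # one mini-batch of samples
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)
        theta = theta - eta * grad

print(theta)  # close to true_w, with much less noise than pure SGD
```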

MBGD reduces the variance of the parameter updates, which leads to more stable convergence. Moreover, state-of-the-art deep learning libraries are highly optimized for matrix operations, which makes computing the gradient of a mini-batch very efficient.

If the number of samples is large, a typical mini-batch size is 64 to 512. Because of the way computer memory is laid out and accessed, code runs faster when the mini-batch size is a power of 2: 64 is $2^6$, 128 is $2^7$, 256 is $2^8$, and 512 is $2^9$. So I often set the mini-batch size to a power of 2.

MBGD is the usual method for training neural networks, and the term SGD is often used even when MBGD is what is actually being run.

The problems we face

Learning rate choice

Choosing a good learning rate is very difficult. A learning rate that is too small leads to very slow convergence, while one that is too large hinders convergence, causing the loss function to oscillate around the optimum or even diverge. In addition, the same learning rate is applied to all parameter updates. If our data is sparse and the features occur with very different frequencies, we may not want to update them all to the same extent, but rather give larger updates to features that occur rarely. To adjust the learning rate automatically during training, it can be decreased according to a previously defined rule, or whenever the change in the objective between two iterations falls below a threshold. However, these rules and thresholds must be defined before training, so they cannot adapt to the characteristics of the data.
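Here is a minimal sketch of one such predefined rule, step decay, in which the learning rate is cut by a fixed factor every few epochs; all the constants are illustrative assumptions:

```python
eta0 = 0.1             # initial learning rate
decay_factor = 0.5     # halve the learning rate ...
decay_every = 10       # ... every 10 epochs

for epoch in range(30):
    eta = eta0 * decay_factor ** (epoch // decay_every)
    # ... run one epoch of parameter updates using this eta ...
    print(epoch, eta)  # 0.1 for epochs 0-9, 0.05 for 10-19, 0.025 for 20-29
```

The schedule is fixed before training starts, which is precisely the limitation noted above: it cannot react to what the data actually does.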

[Figure: with the learning rate set too large, the objective value bounces back and forth across the walls of the "valley" and may never reach the minimum.]

Saddle points

Another key challenge in minimizing the highly non-convex error functions of neural networks is avoiding getting trapped in their numerous suboptimal points. In practice the difficulty comes from saddle points rather than local optima: points where the loss function slopes up in one dimension and down in another. Saddle points are usually surrounded by a plateau of nearly identical error, which makes it very hard for SGD to escape, because the gradient is close to 0 in all dimensions.

[Figure: a saddle-shaped loss surface with a saddle point at its center.]

As the figure suggests, the saddle point gets its name from its saddle-like shape: it is a minimum along the x-direction but a local maximum along the other direction. If the surface is flatter along the x-direction, gradient descent oscillates back and forth along the x-axis while making no progress down the y-axis, giving us the illusion that it has converged to a minimum.
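The classic surface $f(x, y) = x^2 - y^2$ makes this concrete: the origin is a minimum along x and a maximum along y. The starting point and iteration count below are assumptions chosen so the stall is visible:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])   # gradient of f(x, y) = x**2 - y**2

p = np.array([1.0, 1e-8])                  # start almost exactly on the x-axis
eta = 0.1
for _ in range(50):
    p = p - eta * grad(p)

print(p, np.linalg.norm(grad(p)))
# Both coordinates and the gradient norm are now tiny: the iterate stalls
# near the saddle at (0, 0) and looks converged, even though (0, 0) is not
# a minimum; only after many more steps does y grow enough to escape.
```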
