Deep Learning: Introduction to Common Optimizers

Stochastic Gradient Descent SGD

The gradient descent algorithm moves the weight parameters along the negative gradient direction computed over the entire training set. However, training sets in deep learning are often very large, and computing the gradient over the whole set is expensive. To reduce computation and accelerate training, the stochastic gradient descent (SGD) algorithm evolved from this idea: it descends along the gradient direction computed on a randomly selected mini-batch of data.
Suppose the weight is denoted $w$, the learning rate is $\alpha$, and the gradient computed on a randomly selected mini-batch of samples is $dw$. The formula for updating the model's weights is:
$$w_{t+1} = w_t - \alpha \times dw_t$$
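As an illustration, here is a minimal NumPy sketch of this update rule; the function name `sgd_step` and the toy gradient values are illustrative, not taken from any particular library.

```python
import numpy as np

def sgd_step(w, dw, lr=0.1):
    """One SGD update: w_{t+1} = w_t - lr * dw_t."""
    return w - lr * dw

# Toy usage: a single step, with dw standing in for a mini-batch gradient.
w = np.zeros(3)
dw = np.array([0.2, -0.1, 0.05])
w = sgd_step(w, dw, lr=0.1)
print(w)   # -> [-0.02   0.01  -0.005]
```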

Stochastic Gradient Descent with Momentum SGD-Momentum

Although stochastic gradient descent is a popular optimization method, its learning process can be slow. Momentum is introduced to improve convergence speed and convergence accuracy, especially in the presence of high curvature, small but consistent gradients, or noisy gradients.
Momentum is a hyperparameter used when updating model parameters during training. Denoting it $\mu$ (mu), the update formulas of stochastic gradient descent with momentum are:
$$v_t = \mu \times v_{t-1} - \alpha_t \times dw_t$$
$$w_{t+1} = w_t + v_t$$
Here $v$ is initialized to 0, and typical values of $\mu$ are 0.5, 0.9, 0.99, etc.
If the gradient direction at the current step is similar to the historical direction, that trend is strengthened; if it differs, the current gradient direction is weakened. The former speeds up convergence, while the latter reduces oscillation and improves convergence accuracy.
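A minimal sketch of the momentum update, using the same NumPy style as above (function and variable names are illustrative):

```python
import numpy as np

def sgd_momentum_step(w, v, dw, lr=0.1, mu=0.9):
    """v_t = mu * v_{t-1} - lr * dw_t ;  w_{t+1} = w_t + v_t."""
    v = mu * v - lr * dw
    w = w + v
    return w, v

w, v = np.zeros(3), np.zeros(3)     # velocity v initialized to 0
dw = np.array([0.2, -0.1, 0.05])
for _ in range(3):                  # repeated, similar gradients build up speed
    w, v = sgd_momentum_step(w, v, dw)
```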

SGDW

Weight decay is used neither to improve convergence accuracy nor to improve convergence speed; its ultimate purpose is to prevent overfitting. In the loss function, weight decay is the coefficient placed in front of the regularization term. The regularization term generally reflects the complexity of the model, so the role of weight decay is to control model complexity and thereby prevent overfitting. If the weight decay is large, a complex model also incurs a large loss.
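For example, with an L2 penalty the regularized loss can be written as follows, where $\lambda$ is the weight decay coefficient (the symbol $\lambda$ and the 1/2 factor are one common convention, introduced here for illustration):
$$L_{total}(w) = L(w) + \frac{\lambda}{2} \|w\|^2$$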

SGDW is SGD + weight decay. Instead of adding a regularization term to the loss function, SGDW adds the gradient of the regularization term directly in the backpropagation (weight update) formula.
For the detailed algorithm, please refer to:
[Figure: SGDW algorithm pseudocode]
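A simplified sketch of this idea, with the decay applied in the update step rather than in the loss; details such as the learning-rate schedule in the paper's pseudocode are omitted, and the names are illustrative:

```python
import numpy as np

def sgdw_step(w, v, dw, lr=0.1, mu=0.9, weight_decay=1e-4):
    """SGD with momentum plus decoupled weight decay."""
    v = mu * v - lr * dw
    # The decay is applied directly to the weights here,
    # instead of adding lambda/2 * ||w||^2 to the loss.
    w = w + v - lr * weight_decay * w
    return w, v
```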

Adam

Adam is an adaptive optimizer that is robust to the choice of hyperparameters. SGD-Momentum adds first-order momentum to SGD, and AdaGrad and AdaDelta add second-order momentum to SGD. Adam uses both the first-order momentum and the second-order momentum.
First-order momentum:
$$m_t = \beta_1 \times m_{t-1} + (1-\beta_1) \times dw_t$$
Second-order momentum:
$$v_t = \beta_2 \times v_{t-1} + (1-\beta_2) \times dw_t^2$$
$\beta_1$ and $\beta_2$ are the two hyperparameters of Adam.

For the detailed algorithm, please refer to Adam's original paper:
[Figure: Adam algorithm pseudocode from the original paper]
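A minimal sketch of one Adam step; it includes the bias-correction terms from the original paper, which the formulas above omit (names and default values are illustrative):

```python
import numpy as np

def adam_step(w, m, v, dw, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t is the step count, starting at 1)."""
    m = beta1 * m + (1 - beta1) * dw          # first-order momentum
    v = beta2 * v + (1 - beta2) * dw ** 2     # second-order momentum
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```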

AdamW

AdamW is an adaptive optimizer built on top of Adam. AdamW is Adam + weight decay: it plays the same role as Adam + L2 regularization but is more computationally efficient, because L2 regularization requires adding a regularization term to the loss, computing its gradient, and then backpropagating, whereas AdamW adds the weight decay term directly in the backpropagation (weight update) formula, saving the step of manually adding a regularization term to the loss.

For the detailed algorithm, please refer to AdamW's original paper:
[Figure: AdamW algorithm pseudocode from the original paper]
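A simplified sketch of one AdamW step, with the weight decay decoupled from the gradient; the learning-rate schedule multiplier from the paper is omitted, and the names are illustrative:

```python
import numpy as np

def adamw_step(w, m, v, dw, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """Adam update plus decoupled weight decay (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Weight decay is applied directly in the update step,
    # not through a regularization term in the loss.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```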


Source: blog.csdn.net/weixin_43603658/article/details/131917406