Neural Networks: Optimizers and Fully Connected Layers

SGD (Stochastic Gradient Descent)

Stochastic gradient descent (SGD) is one of the most widely used optimization algorithms in both scientific research and industry.

Many theoretical and engineering problems can be transformed into mathematical problems of minimizing an objective function.

An intuitive analogy: gradient descent is like a person high on a mountain who wants to reach the lowest point of the valley and, at every step, moves in the direction of steepest descent so as to get there as quickly as possible.

The SGD update rule, where $\alpha$ is the learning rate and $J(\theta)$ is the objective function evaluated on a mini-batch:

$$\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta J(\theta_t)$$

The Momentum variant accumulates a velocity term $v_t$ with momentum coefficient $\gamma$ (typically 0.9):

$$v_t = \gamma v_{t-1} + \alpha \, \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t$$
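To make the two update rules above concrete, here is a minimal NumPy sketch of vanilla SGD and SGD with momentum. The function names and the toy quadratic loss are illustrative choices, not part of the original post:

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD: theta <- theta - lr * grad."""
    return params - lr * grads

def momentum_step(params, grads, velocity, lr=0.01, gamma=0.9):
    """SGD with momentum: v_t = gamma * v_{t-1} + lr * g_t, then theta <- theta - v_t."""
    velocity = gamma * velocity + lr * grads
    params = params - velocity
    return params, velocity

# Toy example: loss J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(50):
    theta, v = momentum_step(theta, grads=theta, velocity=v)
```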

The basic mini-batch SGD optimization algorithm has achieved many good results in deep learning. However, there are also some problems that need to be solved:

  1. Choosing an appropriate initial learning rate is difficult.
  2. The learning rate adjustment strategy is subject to pre-specified adjustment rules.
  3. The same learning rate is applied to each parameter.
  4. It is hard to avoid getting trapped in the many local suboptimal solutions or saddle points of highly non-convex error functions during optimization.

AdaGrad (adaptive gradient)

The AdaGrad (Adaptive Gradient) optimization algorithm adapts the learning rate to each individual parameter: frequently updated parameters are updated with smaller step sizes, while sparse (rarely updated) parameters are updated with larger step sizes.

AdaGrad update rule (with a small constant $\varepsilon$ to avoid division by zero):

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii}} + \varepsilon} \, g_{t,i}$$

$g_{t,i}$ denotes the gradient of parameter $\theta_i$ at time step $t$.

$G_{t,ii}$ denotes the accumulated sum of squared gradients of parameter $\theta_i$ up to time step $t$.

The core difference from SGD is the denominator added to the update step: the square root of the accumulated sum of squared gradients. This term accumulates the squared historical gradients of each parameter $\theta_i$. If a parameter's gradient is updated frequently, its accumulated denominator gradually grows and its effective step size becomes relatively small; a sparse gradient keeps the corresponding accumulated term small, so the effective step size stays relatively large.

AdaGrad thus automatically adapts a different learning rate to each parameter (the square-root denominator effectively rescales the global learning rate $\alpha$ before it is multiplied by the gradient). Most framework implementations use the default learning rate $\alpha = 0.01$, which is usually enough to achieve good convergence.

Advantages: In scenarios with sparse data distribution, it can better utilize sparse gradient information and converge more effectively than the standard SGD algorithm.

Disadvantages: The main flaw comes from the continuous accumulation of squared gradients in the denominator term. As training goes on, the denominator keeps growing, eventually shrinking the effective learning rate so much that the parameters can no longer be updated effectively.
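The following is a minimal NumPy sketch of the AdaGrad update described above, assuming the accumulator `G` holds the per-parameter sum of squared gradients; the toy quadratic loss is only for illustration:

```python
import numpy as np

def adagrad_step(params, grads, G, lr=0.01, eps=1e-8):
    """AdaGrad: per-parameter step size lr / (sqrt(G) + eps)."""
    G = G + grads ** 2                                 # accumulate squared gradients
    params = params - lr * grads / (np.sqrt(G) + eps)  # larger G -> smaller step
    return params, G

# Toy example: gradient of J(theta) = 0.5 * ||theta||^2 is theta itself.
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)
for _ in range(100):
    theta, G = adagrad_step(theta, theta, G)
```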

RMSProp

RMSProp uses an exponential moving average of the squared gradient to adapt the learning rate. It can achieve good convergence even when the objective function is non-stationary.

Calculate the gradient at time step $t$:

$$g_t = \nabla_\theta J(\theta_t)$$

Calculate the exponential moving average (EMA) of the squared gradient, where $\gamma$ is the forgetting factor (exponential decay rate), set to 0.9 by default based on experience:

$$E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1-\gamma) \, g_t^2$$

The parameter update is similar to AdaGrad, except that the accumulated sum of squared gradients is replaced by their expectation (the exponential moving average); $\varepsilon = 10^{-8}$ avoids division by zero, and the default learning rate is $\alpha = 0.001$:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \varepsilon} \, g_t$$

Advantages: It overcomes AdaGrad's problem of the learning rate shrinking too aggressively, and has demonstrated excellent learning-rate adaptation in many applications. It performs better than vanilla SGD, Momentum, and AdaGrad, especially on non-stationary objective functions.
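A minimal NumPy sketch of the RMSProp update, assuming the variable `sq_avg` stores the exponential moving average $E[g^2]_t$; the names and the toy loss are illustrative:

```python
import numpy as np

def rmsprop_step(params, grads, sq_avg, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: replace AdaGrad's cumulative sum with an exponential moving average."""
    sq_avg = gamma * sq_avg + (1 - gamma) * grads ** 2     # E[g^2]_t
    params = params - lr * grads / (np.sqrt(sq_avg) + eps)
    return params, sq_avg

# Toy example: gradient of J(theta) = 0.5 * ||theta||^2 is theta itself.
theta = np.array([1.0, -2.0])
sq_avg = np.zeros_like(theta)
for _ in range(100):
    theta, sq_avg = rmsprop_step(theta, theta, sq_avg)
```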

Adam

The Adam optimizer combines the advantages of the two optimization algorithms AdaGrad and RMSProp. The update step size is calculated by comprehensively considering the first moment estimation (i.e., the mean value of the gradient) and the second moment estimation (i.e., the uncentered variance of the gradient) of the gradient.

Adam’s advantages:

  1. The implementation is simple, the calculation is efficient, and it requires little memory.
  2. Parameter updates are invariant to rescaling of the gradient.
  3. Hyperparameters are very interpretable and often require little or no tuning.
  4. The update step size is roughly bounded by the initial learning rate.
  5. A form of step-size annealing (automatic learning-rate adjustment) arises naturally.
  6. It is very suitable for scenarios with large-scale data and parameters.
  7. Suitable for non-stationary objective functions.
  8. It is suitable for problems where the gradient is sparse or the gradient has a lot of noise.

Adam’s implementation principle:

Calculate the gradient at time step $t$:

$$g_t = \nabla_\theta J(\theta_t)$$

Then calculate the exponential moving average of the gradient, with $m_0$ initialized to 0:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) \, g_t$$

As in the Momentum algorithm, the previously accumulated gradient momentum is taken into account.

The coefficient $\beta_1$ is an exponential decay rate that controls how the weight is split between the accumulated momentum and the current gradient; it usually takes a value close to 1, with a default of 0.9.

Next, calculate the exponential moving average of the squared gradient, with $v_0$ initialized to 0:

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) \, g_t^2$$

The coefficient $\beta_2$ is an exponential decay rate that controls the influence of the previous squared gradients; the default is 0.999.

Similar to the RMSProp algorithm, a weighted average of the squared gradients is performed.

Since $m_0$ is initialized to 0, $m_t$ is biased toward 0, especially in the early stages of training.

Therefore, a bias correction is applied to the gradient mean $m_t$ to reduce the impact of this bias on the initial stage of training:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

The same bias correction is applied to $v_t$:

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The final update rule is:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \varepsilon} \, \hat{m}_t$$

where the default learning rate is $\alpha = 0.001$ and $\varepsilon = 10^{-8}$ prevents division by zero.

As the expression shows, the update step size is adapted based on both the gradient mean and the squared gradient, rather than being determined directly by the current gradient alone.
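Putting the steps above together, here is a minimal NumPy sketch of one Adam update (moment estimates, bias correction, and the final parameter step); the helper name `adam_step` and the toy quadratic loss are illustrative, not from the original post:

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: moment EMAs, bias correction, then the parameter update."""
    m = beta1 * m + (1 - beta1) * grads          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grads ** 2     # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected v_t
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy example: gradient of J(theta) = 0.5 * ||theta||^2 is theta itself.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):                          # t starts at 1 for bias correction
    theta, m, v = adam_step(theta, theta, m, v, t)
```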

Adam's shortcomings:

Although the Adam algorithm has become a mainstream optimization algorithm, the best results in many fields (such as image recognition in computer vision and machine translation in NLP) are still obtained by using SGD with momentum.

The role of the fully connected layer

The fully connected layer maps the high-dimensional features learned by convolution to the label space and can be used as the classifier module of the entire network.

Although the parameters of the fully connected layer are largely redundant, they preserve a large model capacity, which is useful when the model is used for transfer learning.

Currently, many models use global average pooling (GAP) to replace the fully connected layer to reduce model parameters and still achieve SOTA performance.
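As a rough PyTorch-style illustration (not code from the original post), the sketch below contrasts a classifier head built from a flattened feature map plus a fully connected layer with one that uses global average pooling; the channel count, spatial size, and class count are made-up values:

```python
import torch
import torch.nn as nn

num_classes = 10             # assumed number of labels
channels, h, w = 512, 7, 7   # assumed feature-map shape from the conv backbone

# Classic head: flatten the feature map and map it to the label space.
fc_head = nn.Sequential(
    nn.Flatten(),                              # (N, 512*7*7)
    nn.Linear(channels * h * w, num_classes),  # ~250k parameters
)

# GAP head: average each channel to a single value, then a small linear layer.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),                   # (N, 512, 1, 1)
    nn.Flatten(),                              # (N, 512)
    nn.Linear(channels, num_classes),          # far fewer parameters
)

features = torch.randn(4, channels, h, w)      # dummy conv features
print(fc_head(features).shape, gap_head(features).shape)  # both (4, 10)
```

The GAP head keeps only one linear mapping from channels to classes, which is why it cuts the parameter count so sharply while still producing logits of the same shape.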
