Deep Learning - Optimizer


1 Introduction


Optimization algorithms can be divided into first-order and second-order methods. First-order optimization refers to gradient descent and its variants, while second-order optimization uses second-order derivatives (the Hessian matrix), as in Newton's method. Because computing the Hessian matrix and its inverse is very expensive, second-order methods have not become popular. Here we mainly summarize the various gradient descent methods used in first-order optimization.
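
As a quick sketch, with loss $L(\theta)$ and learning rate $\eta$, the two families can be written as:

$$
\text{first-order: } \theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t)
\qquad
\text{second-order (Newton): } \theta_{t+1} = \theta_t - \big(\nabla^2 L(\theta_t)\big)^{-1}\nabla L(\theta_t)
$$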

Deep learning optimization algorithms have evolved along the path
SGD -> SGDM -> NAG -> AdaGrad -> AdaDelta -> Adam -> Nadam. TensorFlow exposes most of these through the tf.keras.optimizers module:
class Adadelta: Optimizer that implements the Adadelta algorithm.

class Adagrad: Optimizer that implements the Adagrad algorithm.

class Adam: Optimizer that implements the Adam algorithm.

class Adamax: Optimizer that implements the Adamax algorithm.

class Ftrl: Optimizer that implements the FTRL algorithm.

class Nadam: Optimizer that implements the Nadam algorithm.

class Optimizer: Abstract optimizer base class.

class RMSprop: Optimizer that implements the RMSprop algorithm.

class SGD: Gradient descent (with momentum) optimizer.
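
As a minimal usage sketch (the toy model, loss, and hyperparameters below are placeholder choices), any of these classes can be passed to model.compile:

```python
import tensorflow as tf

# A toy model; the architecture here is only a placeholder.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Any optimizer class from tf.keras.optimizers can be used here.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```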

1. SGD


1.1 vanilla SGD

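Vanilla SGD simply steps along the negative gradient of the loss computed on the current mini-batch. A sketch of the standard update, with learning rate $\eta$ and gradient $g_t = \nabla_\theta L(\theta_t)$:

$$
\theta_{t+1} = \theta_t - \eta\, g_t
$$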

1.2 SGD with Momentum

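SGDM adds an exponentially decaying moving average (momentum) of past gradients, which damps oscillations and speeds up progress along consistent directions. One common way to write it (momentum coefficient $\beta$, typically about 0.9):

$$
v_t = \beta v_{t-1} + g_t, \qquad
\theta_{t+1} = \theta_t - \eta\, v_t
$$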

1.3 SGD with Nesterov Acceleration
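
NAG evaluates the gradient at a look-ahead position rather than at the current parameters, which gives the momentum term some foresight. One common formulation (conventions vary slightly across libraries):

$$
v_t = \beta v_{t-1} + \nabla_\theta L\!\big(\theta_t - \eta\,\beta v_{t-1}\big), \qquad
\theta_{t+1} = \theta_t - \eta\, v_t
$$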

2. AdaGrad

TensorFlow API: tf.keras.optimizers.Adagrad
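
AdaGrad keeps a running sum of squared gradients per parameter and divides the learning rate by its square root, so frequently updated parameters take smaller steps. A sketch of the standard update ($\epsilon$ is a small constant for numerical stability):

$$
r_t = r_{t-1} + g_t \odot g_t, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t
$$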

3. RMSProp

TensorFlow API: tf.keras.optimizers.RMSprop
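
RMSProp replaces AdaGrad's ever-growing sum with an exponential moving average of squared gradients (decay rate $\rho$, typically 0.9), so the effective learning rate no longer decays monotonically. A sketch of the standard update:

$$
r_t = \rho r_{t-1} + (1-\rho)\, g_t \odot g_t, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t + \epsilon}}\, g_t
$$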

4. AdaDelta

TensorFlow API: tf.keras.optimizers.Adadelta
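
AdaDelta removes the global learning rate entirely: the numerator is the RMS of recent parameter updates rather than a fixed $\eta$. A sketch of the standard formulation:

$$
r_t = \rho r_{t-1} + (1-\rho)\, g_t \odot g_t, \qquad
\Delta\theta_t = -\frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}}\, g_t
$$
$$
s_t = \rho s_{t-1} + (1-\rho)\, \Delta\theta_t \odot \Delta\theta_t, \qquad
\theta_{t+1} = \theta_t + \Delta\theta_t
$$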

5. Adam

TensorFlow API: tf.keras.optimizers.Adam
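
Adam combines RMSProp's second-moment estimate with a first-moment (momentum) estimate and corrects both for their initialization bias. A sketch of the standard update (defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t
$$
$$
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t} + \epsilon}\, \hat m_t
$$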

optimizer selection

It is hard to say that any single optimizer performs well in all situations; the optimizer needs to be chosen for the specific task. Some optimizers perform well on computer vision tasks, others work well when RNNs are involved, and still others do better on sparse data.

To summarize the above: SGDM and NAG extend vanilla SGD with momentum and Nesterov momentum, and RMSProp is an improvement that addresses AdaGrad's rapidly decaying learning rate. RMSProp is very similar to AdaDelta; the difference is that AdaDelta uses the root mean square (RMS) of the parameter updates as the numerator. Adam adds momentum and bias correction to RMSProp. If the data is sparse, the adaptive methods are recommended, namely Adagrad, RMSprop, Adadelta, and Adam. RMSprop, Adadelta, and Adam behave similarly in many cases, and Adam tends to outperform RMSprop as gradients become sparser, so overall Adam is usually the best pick.

However, many papers use only vanilla SGD without momentum together with a simple learning rate decay strategy. SGD can usually reach a minimum, but it may take longer than other optimizers; with a proper initialization method and learning rate schedule it is more reliable, although it can still get trapped in saddle points or local minima. Therefore, when training a large and complex deep neural network where fast convergence matters, an optimizer with an adaptive learning rate should be used.

If you are just getting started, give priority to Adam or SGD + Nesterov momentum. A sketch of the latter recipe is given below.

There is no universally good or bad algorithm; the one that best fits the data is the best. Always remember: the No Free Lunch theorem.
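
As a minimal sketch of the "SGD + Nesterov momentum with a learning rate schedule" recipe mentioned above (the decay values are placeholders, not tuned recommendations):

```python
import tensorflow as tf

# Simple exponential learning rate decay; the numbers are placeholders.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=10_000,
    decay_rate=0.96,
)

# SGD with Nesterov momentum, as suggested for getting started.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=lr_schedule,
    momentum=0.9,
    nesterov=True,
)
```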

source

SGD (1952): https://projecteuclid.org/euclid.aoms/1177729392
SGD with Momentum (1999): https://www.sciencedirect.com/science/article/abs/pii/S0893608098001166
SGD with Nesterov Acceleration (1983): by Yurii Nesterov
AdaGrad (2011): http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
RMSProp (2012): http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
AdaDelta (2012): https://arxiv.org/abs/1212.5701
Adam (2014): https://arxiv.org/abs/1412.6980
(A very nice visualization of the above algorithms: https://imgur.com/a/Hqolp)
