The concept: Adam is a first-order optimization algorithm that can be used in place of the classical stochastic gradient descent procedure to iteratively update neural network weights based on training data. Adam was proposed by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in their ICLR 2015 paper (Adam: A Method for Stochastic Optimization). "Adam" is neither an acronym nor a person's name; the name is derived from adaptive moment estimation.
Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term. It uses first and second moment estimates of the gradient to dynamically adjust the learning rate of each parameter. Its main advantage is that, after bias correction, the learning rate in each iteration stays within a definite range, which keeps the parameter updates relatively stable. The formulas are as follows:
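As a sketch, the five update rules can be written in the notation of the Adam paper. The symbols are assumptions made here, since this post does not define them: g_t is the gradient at step t, m_t and v_t are the first and second moment estimates (both starting at 0), β1 and β2 are the decay coefficients, η is the learning rate, ε is a small stability constant, and θ_t is the parameter.

```latex
\begin{align}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}
```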
The first two formulas are the first and second moment estimates of the gradient, which can be seen as estimates of the expectations E[g_t] and E[g_t^2];
formulas 3 and 4 apply bias correction to the first and second moment estimates, so that they approximate unbiased estimates of those expectations. The moments are estimated directly from the gradient, so little extra memory is needed (only the two running averages per parameter), and the estimates adjust dynamically as the gradient changes. Finally, in the last formula, the factor that multiplies the learning rate η acts as a dynamic constraint on it, with a definite range.
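The update described above (moment estimates, bias correction, then the parameter step) can be sketched for a single scalar parameter in plain Python. The function name adam_step and its argument names are illustrative choices, not part of any library. The loop at the end illustrates the "definite range" remark: on the first step the bias-corrected ratio is close to the sign of the gradient, so the step has magnitude close to lr whatever the gradient scale.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch).

    t is the 1-based step count. Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from theta = 0 with gradients that differ by six orders of magnitude.
for g in (1e-3, 1.0, 1e3):
    theta, _, _ = adam_step(0.0, g, m=0.0, v=0.0, t=1, lr=0.01)
    print(g, theta)
```

Each printed parameter value is close to -0.01, that is, one step of size lr in the direction opposite to the gradient.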
Advantages:
1. combines the strengths of Adagrad (which handles sparse gradients well) and RMSprop (which handles non-stationary objectives well);
2. low memory requirements;
3. computes a different adaptive learning rate for each parameter;
4. also suitable for most non-convex optimization problems, large data sets, and high-dimensional spaces.
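The per-parameter learning rate in point 3 can be illustrated with a small plain-Python sketch. Everything here is an assumption made for illustration: the helper names adam_minimize and sgd_minimize, and the toy objective f(a, b) = 50*a^2 + 0.005*b^2, whose two gradients differ in scale by a factor of 10^4.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a list of scalar parameters, one moment pair per parameter."""
    theta = list(theta)
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for t in range(1, steps + 1):
        for i, g in enumerate(grad_fn(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def sgd_minimize(grad_fn, theta, steps, lr=0.01):
    """Plain gradient descent with one shared learning rate, for comparison."""
    theta = list(theta)
    for _ in range(steps):
        theta = [x - lr * g for x, g in zip(theta, grad_fn(theta))]
    return theta


def grad_fn(theta):
    """Gradient of the hypothetical objective f(a, b) = 50*a**2 + 0.005*b**2."""
    a, b = theta
    return [100.0 * a, 0.01 * b]


adam_a, adam_b = adam_minimize(grad_fn, [1.0, 1.0], steps=50)
sgd_a, sgd_b = sgd_minimize(grad_fn, [1.0, 1.0], steps=50)
print("Adam:", adam_a, adam_b)
print("SGD: ", sgd_a, sgd_b)
```

With a shared learning rate, gradient descent moves a quickly but barely changes b in 50 steps. Adam divides each gradient by its own second-moment estimate, so both parameters follow nearly the same trajectory despite the difference in gradient scale.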
Applications and source code:
Constructor signature:
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameter definitions:
params (iterable): an iterable of parameters to optimize, or of dicts defining parameter groups.
lr (float, optional): learning rate (default: 1e-3).
betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999)).
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8).
weight_decay (float, optional): weight decay (L2 penalty) (default: 0).
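As a minimal sketch of how these arguments fit together, the snippet below passes params as a list of parameter-group dicts, so that one group overrides the learning rate while the other uses the keyword defaults. The tiny model and the specific values (1e-2, 1e-4) are arbitrary choices for illustration.

```python
import torch

# A small throwaway model, used only to have two sets of parameters.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)

optimizer = torch.optim.Adam(
    [
        {"params": model[0].parameters()},              # uses the defaults below
        {"params": model[2].parameters(), "lr": 1e-2},  # overrides lr for this group
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,
)

# Each group carries its own copy of the hyperparameters.
for i, group in enumerate(optimizer.param_groups):
    print(i, group["lr"], group["betas"], group["eps"], group["weight_decay"])
```

Options that a group dict does not set (here betas, eps, and weight_decay for both groups, and lr for the first) fall back to the keyword arguments passed to the constructor.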
torch.optim.adam source (adapted from an early version):
import math

import torch
from torch.optim import Optimizer

# Adapted from an early torch.optim.adam: scalar arguments are passed by
# keyword (alpha=, value=), the form current PyTorch expects, and the
# package-relative import is replaced so the class can run on its own.


class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(grad)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(grad)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                if group['weight_decay'] != 0:
                    # L2 penalty: add weight_decay * p to the gradient
                    grad = grad.add(p.data, alpha=group['weight_decay'])

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                # Both bias corrections are folded into the step size
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)

        return loss
Usage example:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
With this, I believe the vast majority of applications can be handled, so my goal here is basically accomplished. The next step is to deepen my understanding by applying it in practice.
Reference documents:
1 https://blog.csdn.net/kgzhang/article/details/77479737
2 https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html