PyTorch Adam optimization algorithm: principle, formulas, and application

 The concept: Adam is an alternative to the classical stochastic gradient descent procedure, a first-order optimization algorithm that iteratively updates neural network weights based on the training data. Adam was proposed by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in their 2015 ICLR paper "Adam: A Method for Stochastic Optimization". The name "Adam" is not an acronym and is not a person's name; it is derived from adaptive moment estimation.

  Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term: it uses first- and second-moment estimates of the gradient to dynamically adjust the learning rate of each parameter. Its main advantage is that, after bias correction, the learning rate of every iteration falls within a determined range, which keeps the parameter updates relatively stable. The formulas are as follows (g_t is the gradient at step t, η the learning rate):

m_t = β1·m_{t-1} + (1 − β1)·g_t
v_t = β2·v_{t-1} + (1 − β2)·g_t²
m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)
θ_{t+1} = θ_t − η·m̂_t / (√v̂_t + ε)

  The first two formulas are the first-moment and second-moment estimates of the gradient, which can be viewed as estimates of the expectations E[g_t] and E[g_t²].
Formulas 3 and 4 are the bias corrections of the first- and second-order moment estimates, so that they approximate unbiased estimates of those expectations. The moment estimates are computed directly from the gradients, with no extra memory requirement, and adjust dynamically as the gradients change. The final formula uses the corrected moments to put a dynamic constraint on the learning rate η, giving the step size a clear range.
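  To make the formulas concrete, here is a minimal sketch of a single Adam update for one scalar parameter, written directly from the equations above. It is plain Python for illustration only; the function name adam_step and the constant-gradient loop are assumptions of this sketch, not part of torch.optim.

import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for scalar parameter theta with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate  m_t
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)           # bias-corrected m̂_t
    v_hat = v / (1 - beta2 ** t)           # bias-corrected v̂_t
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: repeated updates with a constant gradient of 0.5.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 0.5, m, v, t)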

  Advantages:

1. Combines the advantages of Adagrad, which handles sparse gradients well, and RMSprop, which handles non-stationary objectives well;
2. Has modest memory requirements;
3. Computes an individual adaptive learning rate for each parameter (see the sketch after this list);
4. Is also suitable for most non-convex optimization problems, as well as for large data sets and high-dimensional spaces.
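
  As a small illustration of advantage 3, the toy sketch below (my own example, not from the paper) applies one Adam step to two parameters whose gradients differ by five orders of magnitude; because each parameter is normalized by its own second-moment estimate, both receive a step of roughly the learning rate.

import torch

params = torch.nn.Parameter(torch.tensor([1.0, 1.0]))
opt = torch.optim.Adam([params], lr=0.1)

params.grad = torch.tensor([0.001, 100.0])   # gradients of very different scales
before = params.detach().clone()
opt.step()
print(before - params.detach())              # both steps are roughly 0.1 (= lr) in magnitude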

  Applications and source code:

  Constructor signature:

class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

  Parameter definitions:

  params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups.

  lr (float, optional): learning rate (default: 1e-3).

  betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999)).

  eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8).

  weight_decay (float, optional): weight decay (L2 penalty) (default: 0).
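
  A short sketch of how these arguments are typically passed is shown below; the two-layer Sequential model and the per-group overrides are illustrative assumptions, not requirements of the API.

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 2))

# All parameters in one group, with the defaults written out explicitly:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                             eps=1e-8, weight_decay=0)

# Parameter groups: each dict can override the defaults, e.g. a larger learning
# rate and an L2 penalty for the second layer only.
optimizer = torch.optim.Adam([
    {'params': model[0].parameters()},
    {'params': model[1].parameters(), 'lr': 1e-2, 'weight_decay': 1e-4},
], lr=1e-3)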

  Source code of torch.optim.Adam:

import math
from .optimizer import Optimizer

class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = grad.new().resize_as_(grad).zero_()
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = grad.new().resize_as_(grad).zero_()

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                if group['weight_decay'] != 0:
                    grad = grad.add(group['weight_decay'], p.data)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(-step_size, exp_avg, denom)

        return loss
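
  Note that this implementation folds the bias corrections into step_size instead of dividing the moment estimates directly; the result matches the formulas above, up to where eps enters the denominator. As a quick sanity check, the sketch below (with made-up numbers) compares one torch.optim.Adam step against the first step computed by hand from the formulas.

import torch

p = torch.nn.Parameter(torch.tensor([1.0, -2.0, 3.0]))
optimizer = torch.optim.Adam([p], lr=0.1, betas=(0.9, 0.999), eps=1e-8)

p.grad = torch.tensor([0.5, -1.0, 2.0])
optimizer.step()

# Hand-computed first step: at t = 1, m̂ = g and v̂ = g², so Δθ = lr · g / (|g| + eps).
g = torch.tensor([0.5, -1.0, 2.0])
expected = torch.tensor([1.0, -2.0, 3.0]) - 0.1 * g / (g.abs() + 1e-8)
print(torch.allclose(p.data, expected, atol=1e-6))  # expected: True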

  Usage example:

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters.
    optimizer.step()
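
  Continuing the example above (this snippet reuses the optimizer object from the loop), the per-parameter state that step() accumulates can be inspected through optimizer.state, which ties the training loop back to the exp_avg and exp_avg_sq buffers seen in the source code. This is an optional peek, not something training requires.

# Run this after the loop above has called optimizer.step() at least once.
for group in optimizer.param_groups:
    for p in group['params']:
        state = optimizer.state[p]
        if state:  # populated after the first step()
            print(state['exp_avg'].shape, state['exp_avg_sq'].shape)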

  This should be enough to handle the vast majority of applications, so my goal here is basically complete. The next step is to deepen understanding through application.

  

Reference documents:

1 https://blog.csdn.net/kgzhang/article/details/77479737

2 https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html

