PyTorch optimizers and parameter tuning

torch.optim

  1. How to use an optimizer
  2. Algorithms
  3. How to adjust the learning rate

torch.optim is a package implementing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated methods can be easily integrated in the future.

How to use an optimizer

To use torch.optim, you have to construct an optimizer object. This object holds the current parameter state and updates the parameters based on the computed gradients.

Constructing it

To construct an Optimizer, you have to give it an iterable containing the parameters to optimize (these must all be Variable objects). Then you can specify optimizer-specific options such as the learning rate, weight decay, etc.

Example:

optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr = 0.0001)

Per-parameter options

Optimizer also supports specifying per-parameter options. To do this, instead of passing an iterable of Variables, pass in an iterable of dicts. Each dict defines a separate parameter group and should contain a 'params' key holding the list of parameters that belong to it. The other keys should match the keyword arguments accepted by the optimizer and will be used as the optimization options for that group.

Note:

You can still pass options as keyword arguments. They will be used as defaults in the groups that don't override them. This is useful when you only want to vary a single option while keeping all the others consistent between parameter groups.

For example, this is very useful when you want to specify a separate learning rate for each layer:

optim.SGD([
            {'params': model.base.parameters()},
            {'params': model.classifier.parameters(), 'lr': 1e-3}
            ], lr=1e-2, momentum=0.9)

This means that model.base's parameters will use the default learning rate of 1e-2, model.classifier's parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.
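
Since each parameter group is just a dict, its options can also be inspected (or changed) after construction. A small sketch, repeating the construction above but binding it to a name (the variable name optimizer is only for illustration):

optimizer = optim.SGD([
            {'params': model.base.parameters()},
            {'params': model.classifier.parameters(), 'lr': 1e-3}
            ], lr=1e-2, momentum=0.9)

# every entry of optimizer.param_groups is a dict holding that group's options;
# defaults given as keyword arguments are filled in for groups that did not set them
for group in optimizer.param_groups:
    print(group['lr'], group['momentum'])   # 0.01 0.9 for base, 0.001 0.9 for classifier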

Taking an optimization step

All optimizers implement a step() method that updates the parameters. It can be used in two ways:

optimizer.step()

This is a simplified version supported by most optimizers. The function can be called once the gradients have been computed, e.g. by backward().

Example:

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()

optimizer.step(closure)

Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times, so you have to pass in a closure that allows them to recompute your model. The closure should clear the gradients, compute the loss, and return it.

Example:

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)

Algorithms

class torch.optim.Optimizer(params, defaults)

Base class for all optimizers.

Parameters:

  1. params (iterable) - an iterable of Variables or dicts. Specifies which variables should be optimized.
  2. defaults (dict) - a dict containing default values of optimization options (used when a parameter group doesn't specify them).
load_state_dict(state_dict)

Loads the optimizer state.

Parameters:

  1. state_dict (dict) - optimizer state. Should be an object returned from a call to state_dict().
state_dict()

Returns the state of the optimizer as a dict.

It contains two elements:

  1. state - a dict holding the current optimization state. Its content differs between optimizer classes.
  2. param_groups - a dict containing all parameter groups.
step(closure)

Perform a single optimization step (parameter update).

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
zero_grad()

Clears the gradients of all optimized Variables.
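
A minimal checkpointing sketch using state_dict() and load_state_dict(); the file name checkpoint.pth and the small Linear model are assumptions for illustration only:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                       # any model; a placeholder here
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# ... train for a while, then save both model and optimizer state
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pth')

# later: rebuild the same model/optimizer and restore their states
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])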

class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)

Implements the Adadelta algorithm.

It has been proposed in ADADELTA: An Adaptive Learning Rate Method.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. rho (float, optional) - coefficient used for computing a running average of squared gradients (default: 0.9)
  3. eps (float, optional) - term added to the denominator to improve numerical stability (default: 1e-6)
  4. lr (float, optional) - coefficient that scales delta before it is applied to the parameters (default: 1.0)
  5. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
step(closure)

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0)

Implements the Adagrad algorithm.

It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float, optional) - learning rate (default: 1e-2)
  3. lr_decay (float, optional) - learning rate decay (default: 0)
  4. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
step(closure)

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

Implements the Adam algorithm.

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float, optional) - learning rate (default: 1e-3)
  3. betas (Tuple[float, float], optional) - coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
  4. eps (float, optional) - term added to the denominator to improve numerical stability (default: 1e-8)
  5. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
step(closure) 

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

Implements the Adamax algorithm (a variant of Adam based on the infinity norm).

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float, optional) - learning rate (default: 2e-3)
  3. betas (Tuple[float, float], optional) - coefficients used for computing running averages of the gradient and its square
  4. eps (float, optional) - term added to the denominator to improve numerical stability (default: 1e-8)
  5. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
step(closure=None)

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)

Implements Averaged Stochastic Gradient Descent.

It has been proposed in Acceleration of stochastic approximation by averaging.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float, optional) - learning rate (default: 1e-2)
  3. lambd (float, optional) - decay term (default: 1e-4)
  4. alpha (float, optional) - power for eta update (default: 0.75)
  5. t0 (float, optional) - point at which to start averaging (default: 1e6)
  6. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
step(closure)

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)

Implements the L-BFGS algorithm.

Warning: This optimizer doesn't support per-parameter options and parameter groups (there can be only one), and right now all parameters have to be on a single device. This will be improved in the future.

Note: This is a very memory-intensive optimizer (it requires an additional param_bytes * (history_size + 1) bytes). If it doesn't fit in memory, try reducing the history size, or use a different algorithm.

Parameters:

  1. lr (float) - learning rate (default: 1)
  2. max_iter (int) - maximum number of iterations per optimization step (default: 20)
  3. max_eval (int) - maximum number of function evaluations per optimization step (default: max_iter * 1.25)
  4. tolerance_grad (float) - termination tolerance on first-order optimality (default: 1e-5)
  5. tolerance_change (float) - termination tolerance on function value/parameter changes (default: 1e-9)
  6. history_size (int) - update history size (default: 100)
step(closure)

Perform a single optimization step.

Parameters:

  1. closure (callable) - a closure that re-evaluates the model and returns the loss (see the sketch below).
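
Because L-BFGS re-evaluates the objective several times per step, it is always called with a closure. A minimal sketch, reusing the model / loss_fn / dataset placeholders from the earlier examples; the hyperparameter values are illustrative, not recommendations:

import torch.optim as optim

optimizer = optim.LBFGS(model.parameters(), lr=1, max_iter=20, history_size=100)

for input, target in dataset:
    def closure():
        optimizer.zero_grad()           # clear old gradients
        output = model(input)
        loss = loss_fn(output, target)  # compute the loss
        loss.backward()                 # populate gradients for this evaluation
        return loss                     # L-BFGS needs the loss value returned
    optimizer.step(closure)
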
class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)

Implements the RMSprop algorithm.

It has been proposed by G. Hinton in his course.

The centered version first appears in Generating Sequences With Recurrent Neural Networks.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float, optional) - learning rate (default: 1e-2)
  3. momentum (float, optional) - momentum factor (default: 0)
  4. alpha (float, optional) - smoothing constant (default: 0.99)
  5. eps (float, optional) - term added to the denominator to improve numerical stability (default: 1e-8)
  6. centered (bool, optional) - if True, compute the centered RMSprop, where the gradient is normalized by an estimate of its variance
  7. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
step(closure)

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))

Implements the resilient backpropagation algorithm.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float, optional) - learning rate (default: 1e-2)
  3. etas (Tuple[float, float], optional) - a pair (etaminus, etaplus) of multiplicative increase and decrease factors (default: (0.5, 1.2))
  4. step_sizes (Tuple[float, float], optional) - a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))
step(closure) 

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.
class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters:

  1. params (iterable) - iterable of parameters to optimize or dicts defining parameter groups
  2. lr (float) - learning rate
  3. momentum (float, optional) - momentum factor (default: 0)
  4. weight_decay (float, optional) - weight decay (L2 norm) (default: 0)
  5. dampening (float, optional) - momentum inhibitory factor (default: 0)
  6. nesterov (bool, optional) - Use Nesterov momentum (default: False)

Example:

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()

Note:

The implementation of SGD with momentum/Nesterov subtly differs from Sutskever et al. and from implementations in some other frameworks. Considering the specific case of momentum, the update can be written as v = ρ * v + g, p = p - lr * v, where p, g, v and ρ denote the parameters, gradient, velocity and momentum respectively. This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form v = ρ * v + lr * g, p = p - v. The Nesterov version is analogously modified.
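
To make the difference concrete, here is a tiny illustrative sketch (plain Python, not the library's internal code) applying one momentum update to a single scalar parameter under each convention; the default lr and rho values are arbitrary:

# p: parameter, v: velocity, g: gradient, lr: learning rate, rho: momentum
def step_pytorch_form(p, v, g, lr=0.1, rho=0.9):
    v = rho * v + g        # velocity accumulates the raw gradient
    p = p - lr * v         # the learning rate scales the whole velocity
    return p, v

def step_sutskever_form(p, v, g, lr=0.1, rho=0.9):
    v = rho * v + lr * g   # the learning rate scales the gradient before accumulation
    p = p - v
    return p, v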

step(closure) 

Perform a single optimization step.

Parameters:

  1. closure (callable, optional) - a closure that re-evaluates the model and returns the loss.

How to adjust the learning rate

torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. torch.optim.lr_scheduler.ReduceLROnPlateau allows dynamic learning rate reduction based on some validation measurement.

class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)

Sets the learning rate of each parameter group to the initial lr times a given function. When last_epoch=-1, sets the initial lr as lr.

Parameters:

  1. optimizer (Optimizer) - Wrapped optimizer.
  2. lr_lambda (function or list) - a function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.
  3. last_epoch (int) - the index of the last epoch. Default: -1.

Example:

>>> # Assuming optimizer has two groups.
>>> lambda1 = lambda epoch: epoch // 30
>>> lambda2 = lambda epoch: 0.95 ** epoch
>>> scheduler = LambdaLR(optimizer, lr_lambda=[lambda1, lambda2])
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)

Sets the learning rate of each parameter group to the initial lr decayed by gamma every step_size epochs. When last_epoch=-1, sets the initial lr as lr.

Parameters:

  1. optimizer (Optimizer) - Wrapped optimizer.
  2. step_size (int) - period of learning rate decay.
  3. gamma (float) - multiplicative factor of learning rate decay. Default: 0.1.
  4. last_epoch (int) - the index of the last epoch. Default: -1.

Example:

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 60
>>> # lr = 0.0005   if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)

Sets the learning rate of each parameter group to the initial lr decayed by gamma once the number of epochs reaches one of the milestones. When last_epoch=-1, sets the initial lr as lr.

Parameters:

  1. optimizer (Optimizer) - Wrapped optimizer.
  2. milestones (list) - list of epoch indices. Must be increasing.
  3. gamma (float) - multiplicative factor of learning rate decay. Default: 0.1.
  4. last_epoch (int) - the index of the last epoch. Default: -1.

Example:

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 80
>>> # lr = 0.0005   if epoch >= 80
>>> scheduler = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
class torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)

Sets the learning rate of each parameter group to the initial lr decayed by gamma every epoch. When last_epoch=-1, sets the initial lr as lr.

Parameters:

  1. optimizer (Optimizer) - Wrapped optimizer.
  2. gamma (float) - multiplicative factor of learning rate decay.
  3. last_epoch (int) - the index of the last epoch. Default: -1.
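
Example (a sketch in the style of the StepLR example above; no example is given in the original, and the gamma value 0.9 is illustrative):

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr is multiplied by 0.9 after every epoch
>>> scheduler = ExponentialLR(optimizer, gamma=0.9)
>>> for epoch in range(100):
>>>     scheduler.step()
>>>     train(...)
>>>     validate(...)
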
class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)

Reduces the learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metric quantity and, if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.

Parameters:

  1. optimizer (Optimizer) - Wrapped optimizer.
  2. mode (str) - one of 'min', 'max'. In 'min' mode, lr will be reduced when the monitored quantity has stopped decreasing; in 'max' mode it will be reduced when the monitored quantity has stopped increasing. Default: 'min'.
  3. factor (float) - factor by which the learning rate will be reduced: new_lr = lr * factor. Default: 0.1.
  4. patience (int) - number of epochs with no improvement after which the learning rate will be reduced. Default: 10.
  5. verbose (bool) - if True, prints a message to stdout for each update. Default: False.
  6. threshold (float) - threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.
  7. threshold_mode (str) - one of 'rel', 'abs'. In 'rel' mode, dynamic_threshold = best * (1 + threshold) in 'max' mode or best * (1 - threshold) in 'min' mode. In 'abs' mode, dynamic_threshold = best + threshold in 'max' mode or best - threshold in 'min' mode. Default: 'rel'.
  8. cooldown (int) - number of epochs to wait before resuming normal operation after lr has been reduced. Default: 0.
  9. min_lr (float or list) - a scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.
  10. eps (float) - minimal decay applied to lr. If the difference between new and old lr is smaller than eps, the update is ignored. Default: 1e-8.

Example:

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
>>> for epoch in range(10):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     # Note that step should be called after validate()
>>>     scheduler.step(val_loss)

Source: www.cnblogs.com/jfdwd/p/11240967.html