[Accelerated Model Deployment] - PyTorch Automatic Mixed Precision Training

Automatic Mixed Precision

torch.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype while others use a lower-precision floating point datatype (lower_precision_fp): torch.float16 (half) or torch.bfloat16. Some operations, like linear layers and convolutions, are much faster in lower_precision_fp. Other operations, like reductions, often require the dynamic range of float32. Mixed precision tries to match each operation to its appropriate datatype.

Typically, "Automatic Mixed Precision Training" with the torch.float16 datatype uses torch.autocast and torch.cuda.amp.GradScaler together. Both are modular and can be used separately if needed.
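A minimal sketch of this combined usage (Net, optim, loss_fn, and data are placeholders, and the optimizer settings are illustrative):

# Creates the model, optimizer, and a GradScaler in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler()

for input, target in data:
    optimizer.zero_grad()

    # Runs the forward pass (model + loss) in mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(input)
        loss = loss_fn(output, target)

    # Scales the loss and backpropagates on the scaled loss
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if they contain infs/NaNs
    scaler.step(optimizer)
    # Updates the scale factor for the next iteration
    scaler.update()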

For CUDA and CPU, the API is also provided separately:

torch.autocast("cuda", args...) is equivalent to torch.cuda.amp.autocast(args...) .

torch.autocast("cpu", args...) is equivalent to torch.cpu.amp.autocast(args...) . For CPU, currently only the lower precision floating point data type of torch.bfloat16 is supported.

Automatic conversion (autocast)

torch.autocast(device_type, dtype=None, enabled=True, cache_enabled=None)

Parameters:
device_type (str, required) - device type to use: 'cuda' or 'cpu'
enabled (bool, optional) - whether autocasting should be enabled in the region. Default: True
dtype (torch.dtype, optional) - whether to use torch.float16 or torch.bfloat16
cache_enabled (bool, optional) - whether the weight cache inside autocast should be enabled. Default: True

autocast can be used as a context manager or decorator, allowing script scopes to run in mixed precision.

Within these regions, operations run in a dtype chosen by autocast for each op, to improve performance while maintaining accuracy.

Tensors may be of any type when entering an autocast-enabled region. When using autocast, you should not call half() or bfloat16() on the model or the inputs. autocast should wrap only the forward pass(es) of the network, including the loss computation(s). Running backward passes under autocast is not recommended; backward ops run in the same type that autocast used for the corresponding forward ops.

CUDA example:

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with torch.autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()
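autocast can also be applied as a decorator, for example on a module's forward method. A minimal sketch (TinyNet is a hypothetical module used only for illustration):

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    @torch.autocast(device_type="cuda")
    def forward(self, input):
        # The decorated forward runs in mixed precision; linear autocasts to float16
        return self.fc(input)

model = TinyNet().cuda()
output = model(torch.rand(4, 8, device="cuda"))
print(output.dtype)  # torch.float16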

Floating point tensors produced in an autocast-enabled region may be float16. After returning to a region where autocast is disabled, using them together with floating point tensors of different dtypes may cause type mismatch errors. If so, cast the tensors produced in the autocast region back to float32 (or another dtype, as needed). If a tensor from the autocast region is already float32, the cast is a no-op and incurs no additional overhead.

CUDA example:

# Creates some tensors in default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    # torch.mm is on autocast's list of ops that should run in float16.
    # Inputs are float32, but the op runs in float16 and produces float16 output.
    # No manual casts are required.
    e_float16 = torch.mm(a_float32, b_float32)
    # Also handles mixed input types
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, call f_float16.float() to use it with d_float32
g_float32 = torch.mm(d_float32, f_float16.float())

CPU training example:

# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        loss.backward()
        optimizer.step()

CPU inference example:

# Creates model in default precision
model = Net().eval()

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    for input in data:
        # Runs the forward pass with autocasting.
        output = model(input)

CPU inference example with JIT trace:

class TestModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, num_classes)
    def forward(self, x):
        return self.fc1(x)

input_size = 2
num_classes = 2
model = TestModel(input_size, num_classes).eval()

# For now, we suggest disabling the JIT autocast pass; see the issue:
# https://github.com/pytorch/pytorch/issues/75956
torch._C._jit_set_autocast_mode(False)

with torch.cpu.amp.autocast(cache_enabled=False):
    model = torch.jit.trace(model, torch.randn(1, input_size))
model = torch.jit.freeze(model)
# Run the model
for _ in range(3):
    model(torch.randn(1, input_size))

autocast(enabled=False) subregions can be nested within autocast-enabled regions. Locally disabling autocast is useful when you want to force a subregion to run in a particular dtype; it gives explicit control over the execution type. In the subregion, inputs from the surrounding region should be cast to the specified dtype before use.

# Creates some tensors in the default dtype (here assumed to be float32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    e_float16 = torch.mm(a_float32, b_float32)
    with torch.autocast(device_type="cuda", enabled=False):
        # Call e_float16.float() to ensure the op executes in float32
        # (required because e_float16 was created in an autocast region)
        f_float32 = torch.mm(c_float32, e_float16.float())

    # No manual casts are needed when re-entering the autocast-enabled region.
    # torch.mm again runs in float16 and produces float16 output, regardless of input types.
    g_float16 = torch.mm(d_float32, f_float32)

The autocast state is thread-local. If you want to enable it in a new thread, you must call the context manager or decorator in that thread.
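For example, a minimal sketch of entering autocast inside worker threads (assuming a CUDA device is available):

import threading

import torch

def worker(x, results, idx):
    # The main thread's autocast state does not propagate here;
    # each thread must enter its own autocast region.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        results[idx] = torch.mm(x, x)

x = torch.rand((8, 8), device="cuda")
results = [None, None]
threads = [threading.Thread(target=worker, args=(x, results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0].dtype)  # torch.float16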

torch.cuda.amp.autocast(enabled=True, dtype=torch.float16, cache_enabled=True)
torch.cuda.amp.autocast(args...) is equivalent to torch.autocast("cuda", args...).

torch.cuda.amp.custom_fwd(fwd=None, *, cast_inputs=None)
Helper decorator for the forward method of a custom autograd function (a subclass of torch.autograd.Function).

Parameters:
cast_inputs (torch.dtype or None, optional, default=None) - If not None, when forward runs in an autocast-enabled region, incoming floating point CUDA tensors are cast to the target dtype (non-floating-point tensors are unaffected) and forward then executes with autocast disabled. If None, forward's internal ops execute according to the current autocast state.
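A minimal sketch of a custom autograd function using custom_fwd, together with its companion decorator torch.cuda.amp.custom_bwd for the backward method; here cast_inputs=torch.float32 forces the op to execute in float32 even when called from an autocast region:

class MyMM(torch.autograd.Function):
    @staticmethod
    @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        # Runs in float32 with autocast disabled, even inside an autocast region
        return a.mm(b)

    @staticmethod
    @torch.cuda.amp.custom_bwd
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

# Usage inside an autocast region: floating point CUDA inputs are cast to float32
mymm = MyMM.apply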

Gradient Scaling

If the forward pass of an op has float16 inputs, the backward pass of that op will produce float16 gradients. Gradient values with small magnitudes may not be representable in float16. These values will flush to zero ("underflow"), so the updates for the corresponding parameters are lost.

To prevent underflow, "gradient scaling" multiplies the network's loss by a scale factor and runs the backward pass on the scaled loss. Gradients propagated back through the network are then scaled by the same factor; in other words, the gradient values have larger magnitudes, so they do not flush to zero.

Each parameter's gradient (.grad attribute) should be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.

torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)
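As a small illustration of the constructor arguments and the accessor methods described below (the values shown assume defaults except where overridden):

scaler = torch.cuda.amp.GradScaler(init_scale=2.**14, growth_interval=1000)
print(scaler.is_enabled())           # True
print(scaler.get_growth_factor())    # 2.0
print(scaler.get_backoff_factor())   # 0.5
print(scaler.get_growth_interval())  # 1000
scaler.set_growth_interval(2000)     # settings can also be adjusted later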

get_backoff_factor() returns a Python float containing the scaling backoff factor.

get_growth_factor() returns a Python float containing the scaling growth factor.

get_growth_interval() returns a Python integer containing the growth interval.

get_scale() returns a Python float containing the current scaling factor, or 1.0 if scaling is disabled.

**WARNING:** get_scale() will generate CPU-GPU synchronization.

is_enabled() returns a boolean indicating whether this instance is enabled.

load_state_dict(state_dict) loads the scaler state. If this instance is disabled, load_state_dict() is a no-op.

Parameters: state_dict (dict) – state of the scaler. Should be the object returned by calling state_dict().

scale(outputs) scales a tensor or list of tensors by a scale factor.

Returns the scaled outputs. If this GradScaler instance is not enabled, the outputs are returned unmodified.

Parameters: outputs (Tensor or iterable of Tensors) – outputs to scale.

set_backoff_factor(new_factor) Parameters: new_factor (float) – value to use as new scaling backoff factor.

set_growth_factor(new_factor) Parameters: new_factor (float) – value to use as new scaling growth factor.

set_growth_interval(new_interval) Parameters: new_interval (int) – value to use as new growth interval.

state_dict() returns the state of the scaler as a dictionary. It contains five entries:

"scale" - a Python float containing the current scale

"growth_factor" - a Python float containing the current growth factor

"backoff_factor" - a Python float containing the current backoff factor

"growth_interval" - a Python integer containing the current growth interval

"_growth_tracker" - a Python integer containing the number of the most recent consecutive unskipped steps.

Returns an empty dictionary if this instance is not enabled.

Note: If you want to checkpoint the scaler's state after a particular iteration, call state_dict() after update().
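For example, a minimal checkpointing sketch following that ordering (PATH is a placeholder, and model/optimizer/scaler come from the surrounding training loop):

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),   # called after scaler.update() for this iteration
}
torch.save(checkpoint, PATH)

# Later, to resume with the same scaling state:
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])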

step(optimizer, *args, **kwargs)

step() performs the following two operations:

1. Internally calls unscale_(optimizer) (unless unscale_() was called explicitly for that optimizer earlier in the iteration). As part of unscale_(), the gradients are checked for infs/NaNs.

2. If no inf/NaN gradients are found, optimizer.step() is called with the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the parameters.

*args and **kwargs will be passed to optimizer.step().

Returns the return value of optimizer.step(*args, **kwargs).

Parameters: optimizer (torch.optim.Optimizer) – the optimizer to apply gradients to.

args – any arguments.

kwargs – Any keyword arguments.

**WARNING:** Closure use is currently not supported.

unscale_(optimizer) divides ("unscales") the optimizer's gradient tensors by the scale factor.

unscale_() is optional, intended for cases where gradients need to be modified or inspected between the backward pass and step(). If unscale_() is not called explicitly, gradients are unscaled automatically during step().

Simple example, using unscale_() to enable clipping of unscaled gradients:

...
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optimizer)
scaler.update()

Parameters: optimizer (torch.optim.Optimizer) – the optimizer with gradients to unscale.

Note: unscale_() does not incur a CPU-GPU synchronization.

**WARNING:** unscale_() should be called only once per optimizer per step() call, and only after all gradients for that optimizer's assigned parameters have been accumulated. Calling unscale_() twice for a given optimizer between two step() calls raises a RuntimeError.

**WARNING:** unscale_() may unscale sparse gradients in an irreversible way, replacing the .grad attribute.

update(new_scale=None) updates the scale factor.

If any optimizer steps were skipped during the iteration, the scale is multiplied by backoff_factor to reduce it. If growth_interval unskipped iterations occur in a row, the scale is multiplied by growth_factor to increase it.

Passing new_scale sets the scale value manually. (new_scale is not used directly; it is used to fill GradScaler's internal scale tensor. Therefore, if new_scale is a tensor, later in-place changes to that tensor will not further affect the scale GradScaler uses internally.)

Parameters: new_scale (float or torch.cuda.FloatTensor, optional, default None) – new scaling factor.

**WARNING:** update() should only be called at the end of an iteration, after scaler.step(optimizer) has been invoked for all optimizers used in that iteration.

Autocast Op related reference

Autocast Op Eligibility

Ops that run in float64 or in non-floating-point dtypes are not eligible for autocasting and run in their original type, regardless of whether autocast is enabled. autocast only affects out-of-place ops and Tensor methods. In-place ops and calls that explicitly supply an out=... tensor are allowed in autocast-enabled regions, but they do not go through autocasting: for example, in an autocast-enabled region, a.addmm(b, c) can autocast, but a.addmm_(b, c) and a.addmm(b, c, out=d) cannot. For best performance and stability, prefer out-of-place ops in autocast-enabled regions. Ops called with an explicit dtype=... argument are also not eligible for autocasting and produce output of the requested dtype.
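To illustrate these eligibility rules, a small sketch (shapes and values are arbitrary, and a CUDA device is assumed):

a = torch.rand((8, 8), device="cuda")
b = torch.rand((8, 8), device="cuda")
c = torch.rand((8, 8), device="cuda")
d = torch.empty((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    out1 = a.addmm(b, c)          # out-of-place and eligible: runs in float16
    a.addmm_(b, c)                # in-place: allowed here, but does not autocast (stays float32)
    torch.addmm(a, b, c, out=d)   # explicit out=: allowed, but does not autocast
    out2 = torch.softmax(out1, dim=1, dtype=torch.float32)  # explicit dtype=: output is float32

print(out1.dtype)  # torch.float16
print(d.dtype)     # torch.float32
print(out2.dtype)  # torch.float32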

CUDA Op-specific behavior

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting, whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they are downstream of autocasted ops.

If an op is unlisted, we assume it is numerically stable in float16. If you believe an unlisted op is numerically unstable in float16, please file an issue.

CUDA Ops that can automatically convert to float16

__matmul__, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell

CUDA Ops that can automatically convert to float32

__pow__, __rdiv__, __rpow__, __rtruediv__, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss

CUDA Ops that can promote to the widest input type

These ops do not require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are float16, the op runs in float16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

addcdiv, addcmul, atan2, bilinear, cross, dot, grid_sample, index_put, scatter_add, tensordot

Some ops not listed here (e.g., binary ops such as add) natively promote inputs without autocast's intervention. If the inputs are a mixture of float16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.
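As a small sketch of this promotion behavior (atan2 is taken from the list above, and a CUDA device is assumed):

a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")

with torch.autocast(device_type="cuda"):
    e_float16 = torch.mm(a_float32, b_float32)   # runs in float16
    # atan2 receives one float16 and one float32 input, so autocast
    # promotes both inputs to float32 and the result is float32.
    f_float32 = torch.atan2(e_float16, a_float32)

print(f_float32.dtype)  # torch.float32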

Prefer binary_cross_entropy_with_logits over binary_cross_entropy

The backward pass of torch.nn.functional.binary_cross_entropy() (and of torch.nn.BCELoss, which wraps it) can produce gradients that are not representable in float16. In autocast-enabled regions, the forward input may be float16, which would mean the backward gradients must be representable in float16 (autocasting a float16 forward input to float32 does not help, because that cast has to be reversed in the backward pass). Therefore, binary_cross_entropy and BCELoss raise an error in autocast-enabled regions.

Many models use a sigmoid layer right before the binary cross-entropy layer. In that case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits() or torch.nn.BCEWithLogitsLoss. binary_cross_entropy_with_logits and BCEWithLogitsLoss are safe to autocast.
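For instance, a minimal sketch passing raw logits to the combined op instead of applying sigmoid and then binary_cross_entropy (shapes are arbitrary):

logits = torch.randn(16, 1, device="cuda")
target = torch.rand(16, 1, device="cuda")

with torch.autocast(device_type="cuda"):
    # loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(logits), target)  # raises an error under autocast
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)

print(loss.dtype)  # torch.float32 (binary_cross_entropy_with_logits autocasts to float32)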

CPU Op-specific behavior

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting, whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they are downstream of autocasted ops.

If an op is unlisted, we assume it is numerically stable in bfloat16. If you believe an unlisted op is numerically unstable in bfloat16, please file an issue.

CPU Ops that can be automatically converted to bfloat16

conv1d, conv2d, conv3d, bmm, mm, baddbmm, addmm, addbmm, linear, matmul, _convolution

CPU Ops that can automatically convert to float32

conv_transpose1d, conv_transpose2d, conv_transpose3d, avg_pool3d, binary_cross_entropy, grid_sampler, grid_sampler_2d, _grid_sampler_2d_cpu_fallback, grid_sampler_3d, polar, prod, quantile, nanquantile, stft, cdist, trace, view_as_complex, cholesky, cholesky_inverse, cholesky_solve, inverse, lu_solve, orgqr, inverse, ormqr, pinverse, max_pool3d, max_unpool2d, max_unpool3d, adaptive_avg_pool3d, reflection_pad1d, reflection_pad2d, replication_pad1d, replication_pad2d, replication_pad3d, mse_loss, ctc_loss, kl_div, multilabel_margin_loss, fft_fft, fft_ifft, fft_fft2, fft_ifft2, fft_fftn, fft_ifftn, fft_rfft, fft_irfft, fft_rfft2, fft_irfft2, fft_rfftn, fft_irfftn, fft_hfft, fft_ihfft, linalg_matrix_norm, linalg_cond, linalg_matrix_rank, linalg_solve, linalg_cholesky, linalg_svdvals, linalg_eigvals, linalg_eigvalsh, linalg_inv, linalg_householder_product, linalg_tensorinv, linalg_tensorsolve, fake_quantize_per_tensor_affine, eig, geqrf, lstsq, _lu_with_info, qr, solve, svd, symeig, triangular_solve, fractional_max_pool2d, fractional_max_pool3d, adaptive_max_pool3d, multilabel_margin_loss_forward, linalg_qr, linalg_cholesky_ex, linalg_svd, linalg_eig, linalg_eigh, linalg_lstsq, linalg_inv_ex

CPU Ops that can promote to the widest input type

These ops do not require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are bfloat16, the op runs in bfloat16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

cat, stack, index_copy

Some ops not listed here (e.g., binary ops such as add) natively promote inputs without autocast's intervention. If the inputs are a mixture of bfloat16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.

Supplement: understanding in-place vs. out-of-place operations

inplace=True means the operation is performed in place, overwriting the original value. For example, x += 1 operates on the original x and directly overwrites it with the result; y = x + 5 followed by x = y is not an in-place operation on x.
The advantage of inplace=True is that it saves memory by avoiding extra intermediate variables.
Note: with inplace=True, the tensor passed down from the preceding layer is modified directly, which changes the input data. The following example shows what this means in practice:

import torch
import torch.nn as nn

relu = nn.ReLU(inplace=True)
input = torch.randn(7)

print("输入数据:",input)

output = relu(input)
print("ReLU输出:", output)

print("ReLU处理后,输入数据:")
print(input)

torch.autograd.grad is one of the functions PyTorch provides for computing gradients. It computes the gradients of one or more scalar functions with respect to a set of variables.

The function signature is as follows:

torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)

Parameter Description:

  • outputs: tensor or list of tensors containing the scalar functions whose gradients are to be computed.
  • inputs: tensor or list of tensors of the variables with respect to which gradients are computed.
  • grad_outputs: tensor or list of tensors with the same shape as outputs, used to specify the outer gradient in the computation. Defaults to None, which means a unit gradient (i.e., 1) is used.
  • retain_graph: boolean specifying whether to keep the computation graph after the gradients are computed, for subsequent computations. Defaults to None, which means it is determined automatically.
  • create_graph: boolean specifying whether to build a graph of the gradient computation itself, allowing higher-order derivatives. Defaults to False.
  • only_inputs: boolean specifying whether to compute gradients only for the given inputs. Defaults to True.
  • allow_unused: boolean specifying whether inputs that were not used when computing outputs are allowed. Defaults to False, meaning unused inputs raise an error.

The function returns a tuple of tensors with the same shapes as inputs, representing the gradients of outputs with respect to inputs. If an input does not require a gradient, the value at the corresponding position is None.

Here is an example usage:

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
grads = torch.autograd.grad(y, x)

print(grads)  # prints (tensor([4.]),)

In the example above, we computed the gradient of y = x ** 2 with respect to x via torch.autograd.grad. The value 4.0 in grads means that the gradient of y with respect to x at x = 2 is 4.0.
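create_graph (mentioned in the parameter list above) allows higher-order derivatives; a small sketch:

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 3

# First derivative: dy/dx = 3*x^2, which is 12 at x = 2.
# create_graph=True keeps building the graph so the gradient can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Second derivative: d2y/dx2 = 6*x, which is 12 at x = 2.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx)    # first derivative: 12
print(d2y_dx2)  # second derivative: 12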

Origin blog.csdn.net/qq_43456016/article/details/132168036