Sparsity in INT8: Accelerated Training Workflows and NVIDIA TensorRT Best Practices

The training phase of a deep learning (DL) model involves learning a large number of dense floating-point weight matrices, which results in a large number of floating-point calculations during inference. Research has shown that many of these calculations can be skipped by forcing some weights to zero with little impact on the final accuracy.

In parallel, previous posts have shown that lower precision (for example, INT8) is often sufficient to achieve accuracy similar to FP32 during inference. Sparsity and quantization are popular optimization techniques that address both issues, improving inference time and reducing memory footprint.

NVIDIA TensorRT has provided quantization support for some time (since version 2.1), while sparsity acceleration is built into NVIDIA Ampere architecture Tensor Cores and was introduced in TensorRT 8.0.

This post is a step-by-step guide on how to accelerate DL models with TensorRT using sparsity and quantization techniques. Although each of these optimizations has been discussed individually, it is still necessary to show the end-to-end workflow from training to deployment with TensorRT, considering both optimizations.

In this post, we aim to bridge this gap and help you understand what a sparse quantization training workflow looks like, recommend sparsity best practices for TensorRT acceleration, and demonstrate an end-to-end case study with ResNet-34.

Structured sparsity

NVIDIA Sparse Tensor Cores use a 2:4 pattern, meaning that two out of each contiguous block of four values must be zero. In other words, we follow a 50% fine-grained structured sparsity recipe, and no computation is performed on the zero values thanks to the support available directly in the Tensor Cores. This results in more work being computed in the same amount of time. In this post, we refer to this process as pruning.

For more information, see Accelerating Inference Through Sparsity with NVIDIA Ampere Architecture and NVIDIA TensorRT.
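To make the 2:4 pattern concrete, here is a small standalone PyTorch snippet (an illustration only, not the APEX implementation used later in this post) that enforces the pattern on one weight row by zeroing the two smallest-magnitude values in every block of four:

import torch

# Illustration of 2:4 fine-grained structured sparsity: keep the two largest
# magnitudes in each consecutive block of four values and zero out the rest.
row = torch.tensor([0.3, -1.2, 0.05, 0.7, 0.9, -0.1, 0.2, -0.8])
groups = row.view(-1, 4)                      # blocks of four consecutive values
_, keep = groups.abs().topk(2, dim=1)         # indices of the two largest magnitudes
mask = torch.zeros_like(groups).scatter_(1, keep, 1.0)
print(groups * mask)                          # two zeros per block of four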

Quantization

Quantization is the process of mapping continuous, infinite values onto a finite set of discrete values (for example, FP32 to INT8); a minimal numeric sketch of this mapping follows the list below. There are two main quantization techniques discussed in this post:

  • Post-Training Quantization (PTQ): Uses an implicit quantization workflow. In implicitly quantized networks, each quantized tensor has an associated scale, determined through calibration, that is used to implicitly quantize and dequantize values. TensorRT then checks in which precision the layer runs faster and executes it accordingly.
  • Quantization-Aware Training (QAT): Uses an explicit quantization workflow. Explicitly quantized networks use quantize and dequantize (Q/DQ) nodes to explicitly indicate which layers must be quantized. This means you have more control over which layers run in INT8. See Q/DQ Layer Layout Recommendations for details.
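Here is the minimal numeric sketch promised above: symmetric, per-tensor INT8 quantization where the scale comes from the tensor's maximum absolute value, which is the quantity calibration estimates. This is for intuition only and is not TensorRT code:

import numpy as np

# Symmetric per-tensor quantization: map FP32 values onto the INT8 grid with a
# single scale, then map back (dequantize) to see the rounding error.
x = np.array([-1.8, -0.4, 0.0, 0.9, 2.5], dtype=np.float32)
amax = np.abs(x).max()                          # dynamic range found by calibration
scale = amax / 127.0
x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)   # quantize
x_fp32 = x_int8.astype(np.float32) * scale                         # dequantize
print(x_int8, x_fp32)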

For quantization basics, a comparison between PTQ and QAT quantization techniques, insights on when to choose which, and more information on quantization in TensorRT, see Achieving FP32 Accuracy for INT8 Inference Using Quantization-Aware Training with NVIDIA TensorRT.

Workflow for deploying sparsely quantized models in TensorRT

The workflow for deploying sparse quantized models in TensorRT, with PyTorch as the DL framework, consists of the following steps:

  • Sparsify and fine-tune the pretrained dense model in PyTorch.
  • Quantize the sparsified model through the PTQ or QAT workflow.
  • Deploy the resulting sparse INT8 engine in TensorRT.

The figure below shows all three steps. The one difference in step 2 is that Q/DQ nodes are present in the ONNX graph generated by QAT but absent from the ONNX graph generated by PTQ. See Using INT8 in the TensorRT documentation for details.

With that in mind, here's the full workflow for QAT:

  • Sparsify and fine-tune the pretrained dense model in PyTorch.
  • Quantize, calibrate, and fine-tune the sparsified model in PyTorch.
  • Export the PyTorch model to ONNX.
  • Generate a TensorRT engine from the ONNX model.
  • Deploy the resulting sparse INT8 engine in TensorRT.

On the other hand, here is the complete workflow of PTQ:

  • Sparsify and fine-tune the pretrained dense model in PyTorch.
  • Export the PyTorch model to ONNX.
  • Calibrate and quantize the sparsified ONNX model through the TensorRT builder to generate a TensorRT engine.
  • Deploy the resulting sparse INT8 engine in TensorRT.

[Figure: End-to-end QAT and PTQ workflows for deploying sparse quantized models in TensorRT]

Case Study: ResNet-34

This section demonstrates a case study of a sparse quantization workflow using ResNet-34. For more information, see the full code example in the /SparsityINT8 GitHub repository.

Requirements

Here is the basic configuration needed to complete this case study:

  • Python 3.8
  • PyTorch 1.11 (also tested with 2.0.0)
  • torchvision
  • apex sparsity toolkit
  • pytorch-quantization toolkit
  • TensorRT 8.6
  • Polygraphy
  • ONNX opset >= 13
  • NVIDIA Ampere architecture GPU for Tensor Core support

This case study uses the ImageNet 2012 dataset for image classification. For more information on downloading the dataset and converting it to the required format, see the readme in the GitHub repository.

This dataset is required for sparse training, sparse QAT model fine-tuning, and sparse PTQ model calibration. It is also used to evaluate the model.
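The code snippets in the following steps refer to data_loader, data_loader_calib, and data_loader_test. One possible setup with torchvision is sketched below; the dataset paths, batch size, calibration subset size, and preprocessing values are assumptions, so check the repository readme for the reference implementation:

import torch
from torchvision import datasets, transforms

# Standard ImageNet-style preprocessing (assumed values, matching common
# torchvision examples rather than the exact reference scripts).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("imagenet/train", transform=preprocess)
val_set = datasets.ImageFolder("imagenet/val", transform=preprocess)

data_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=8)
data_loader_test = torch.utils.data.DataLoader(val_set, batch_size=128, shuffle=False, num_workers=8)
# A small slice of the training set is usually enough for calibration.
data_loader_calib = torch.utils.data.DataLoader(
    torch.utils.data.Subset(train_set, list(range(1024))), batch_size=128, shuffle=False)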

Step 1: Sparsify and fine-tune the pretrained dense model

Load the pretrained dense model and augment the model and optimizer for sparse training. For details, see the NVIDIA/apex/tree/master/apex/contrib/sparsity folder.

import copy
import torch
from torchvision import models
from apex.contrib.sparsity import ASP
 
# Load dense model
model_dense = models.__dict__["resnet34"](pretrained=True)
 
# Initialize sparsity mode before starting sparse training.
# `optimizer`, `criterion`, `data_loader`, `device`, and `epoch` are assumed to be
# defined as in a standard ImageNet training script, with the optimizer built from
# model_sparse.parameters().
model_sparse = copy.deepcopy(model_dense)
ASP.prune_trained_model(model_sparse, optimizer)
 
# Re-train model
for e in range(0, epoch):
    for i, (image, target) in enumerate(data_loader):
        image, target = image.to(device), target.to(device)
        output = model_sparse(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
 
# Save model
torch.save(model_sparse.state_dict(), "sparse_finetuned.pth")
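As an optional sanity check (an addition to this post, not part of the reference script), you can confirm that roughly half of the weights in the prunable layers are now zero before moving on:

import torch

# Count zeros in the layers ASP prunes by default (Conv2d and Linear weights);
# after 2:4 pruning the ratio should be close to 50%.
total, zeros = 0, 0
for module in model_sparse.modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        w = module.weight.detach()
        total += w.numel()
        zeros += int((w == 0).sum())
print(f"Sparsity in prunable layers: {zeros / total:.2%}")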

Step 2: Quantize the PyTorch model

You can choose between two quantization methods for this step: PTQ or QAT.

PTQ with TensorRT calibration

This option exports the PyTorch model to ONNX and calibrates it via the TensorRT Python API. This generates a calibration cache and a TensorRT engine ready to deploy.

Export a sparse PyTorch model to ONNX:

dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")
torch.onnx.export(model_sparse, dummy_input, "sparse_finetuned.onnx", opset_version=13, do_constant_folding=True)

Calibrate the ONNX model exported in the previous step using the calibration dataset. The following code examples assume an ONNX model with a static input shape and batch size.

from infer_engine import infer
from polygraphy.backend.trt import Calibrator, CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, TrtRunner, SaveEngine
from polygraphy.logger import G_LOGGER
 
 
# Data loader argument to `Calibrator`
def calib_data(val_batches, input_name):
    for iteration, (images, labels) in enumerate(val_batches):
        yield {input_name: images.numpy()}
 
# Set path to ONNX model
onnx_path = "sparse_finetuned.onnx"
 
# Set calibrator
calibration_cache_path = onnx_path.replace(".onnx", "_calibration.cache")
calibrator = Calibrator(
    data_loader=calib_data(data_loader_calib, args.onnx_input_name), 
    cache=calibration_cache_path
)
 
# Build engine from ONNX model by enabling INT8 and sparsity weights, and providing the calibrator
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath(onnx_path),
    config=CreateConfig(
        int8=True,
        calibrator=calibrator,
        sparse_weights=True
    )
)
 
# Trigger engine saving
engine_path = onnx_path.replace(".onnx", ".engine")
build_engine = SaveEngine(build_engine, path=engine_path)
 
# Calibrate engine (activated by the runner)
with G_LOGGER.verbosity(G_LOGGER.VERBOSE), TrtRunner(build_engine) as runner:
    print("Calibrated engine!")
 
    # Infer PTQ engine and evaluate its accuracy
    log_file = engine_path.split("/")[-1].replace(".engine", "_accuracy.txt")
    infer(
        engine_path,
        data_loader_test,
        batch_size=args.batch_size,
        log_file=log_file
    )

QAT through the pytorch-quantization toolkit

This option uses the pytorch-quantization toolkit to add Q/DQ nodes to the sparse PyTorch model, calibrate it, and fine-tune it for a few epochs. The fine-tuned model is then exported to ONNX and converted to a TensorRT engine for deployment.

To ensure that the previously computed sparse floating-point weights are not overwritten and that the QAT weights also end up with the 2:4 structure, you have to prepare the model for pruning again.

Initialize the QAT model and optimizer for pruning before loading the fine-tuned sparse weights. Recomputation of the sparse masks must also be disabled, since the masks were already computed in step 1. This requires a custom function that is a slight modification of the APEX toolkit's prune_trained_model function; the modification is the optional compute_sparse_masks flag shown in the code sample:

import torch
from apex.contrib.sparsity import ASP
 
def prune_trained_model_custom(model, optimizer, compute_sparse_masks=False):
    # Identical to ASP.prune_trained_model, except that recomputing the sparse
    # masks is optional so the masks computed in step 1 are preserved.
    asp = ASP()
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[torch.nn.Linear, torch.nn.Conv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()

To optimize Q/DQ node placement, you have to modify the definition of the model to quantize the residual branch, as shown in the pytorch-quantization toolkit example. For ResNet, the modifications required to add Q/DQ nodes in the residual branch are shown below:


from torch import nn, Tensor
from pytorch_quantization import nn as quant_nn
 
class BasicBlock(nn.Module):
 
    def __init__(self, ..., quantize: bool = False) -> None:
        super().__init__()
        ...
        self._quantize = quantize   # store the flag checked below
        if self._quantize:
            self.residual_quantizer = quant_nn.TensorQuantizer(quant_nn.QuantConv2d.default_quant_desc_input)
 
    def forward(self, x: Tensor) -> Tensor:
        identity = x
        ...
        if self._quantize:
            out += self.residual_quantizer(identity)
        else:
            out += identity
        out = self.relu(out)
        return out

The same modification must be repeated for the Bottleneck class, and the quantize bool parameter must be propagated through the ResNet, _resnet, and resnet34 functions, as sketched below. After making these modifications, instantiate the model with quantize=True. See line 734 in resnet.py for details.
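The exact edits depend on your copy of resnet.py, but the shape of the change is a simple flag pass-through. The sketch below uses hypothetical, simplified signatures rather than the torchvision source; BasicBlock refers to the quantization-aware block in the previous listing, and the ResNet internals are elided:

from torch import nn

# Simplified sketch: the real constructors take more arguments (weights, progress,
# norm layers, and so on); only the quantize pass-through is shown here.
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000, quantize=False):
        super().__init__()
        self._quantize = quantize
        # ... stem layers and _make_layer calls go here, each forwarding the flag
        # so BasicBlock/Bottleneck can build their residual-branch TensorQuantizer ...

def _resnet(block, layers, quantize=False, **kwargs):
    return ResNet(block, layers, quantize=quantize, **kwargs)

def resnet34(pretrained=False, quantize=False, **kwargs):
    # BasicBlock is the quantization-aware block sketched in the previous listing.
    return _resnet(BasicBlock, [3, 4, 6, 3], quantize=quantize, **kwargs)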

The first step in quantizing a sparse model with QAT is to enable quantization and pruning in the model. The second step is to load the fine-tuned sparse checkpoint, calibrate it, and then fine-tune the model for a few epochs. For more information on the collect_stats and compute_amax functions, see calibrate_quant_resnet50.ipynb.

# Add Q/DQ nodes to the dense model
from pytorch_quantization import quant_modules
quant_modules.initialize()
model_qat = models.__dict__["resnet34"](pretrained=True, quantize=True)
 
# Initialize sparsity mode before starting Sparse-QAT fine-tuning
prune_trained_model_custom(model_qat, optimizer, compute_sparse_masks=False)
 
# Load Sparse weights
# Load sparse weights (step 1 saved the state_dict directly)
load_dict = torch.load("sparse_finetuned.pth")
model_qat.load_state_dict(load_dict)
 
# Calibrate model
collect_stats(model_qat, data_loader_calib, num_batches=len(data_loader_calib))
compute_amax(model_qat, method="entropy")
 
# Fine-tune model
for e in range(0, epoch):
    for i, (image, target) in enumerate(data_loader):
        image, target = image.to(device), target.to(device)
        output = model_qat(image)
        ...
 
# Save model
torch.save(model_qat.state_dict(), "quant_finetuned.pth")

To prepare the TensorRT engine for deployment, you must export the sparse quantized PyTorch model to ONNX. TensorRT expects a QAT ONNX model to indicate which layers should be quantized through a set of QuantizeLinear and DequantizeLinear ONNX operations. This requirement is met by enabling fake quantization when exporting the quantized PyTorch model to ONNX.

from pytorch_quantization import nn as quant_nn
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")
torch.onnx.export(model_qat, dummy_input, "quant_finetuned.onnx", opset_version=13, do_constant_folding=True)
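Optionally (this check is an addition to this post, not part of the original workflow), you can verify that the exported graph actually contains the QuantizeLinear/DequantizeLinear pairs TensorRT expects for explicit quantization:

import onnx

# Inspect the exported graph and confirm Q/DQ ONNX ops are present.
onnx_model = onnx.load("quant_finetuned.onnx")
op_types = {node.op_type for node in onnx_model.graph.node}
assert "QuantizeLinear" in op_types and "DequantizeLinear" in op_types, \
    "No Q/DQ nodes found; check that use_fb_fake_quant was enabled before export"
print("Q/DQ nodes:", sum(n.op_type in ("QuantizeLinear", "DequantizeLinear") for n in onnx_model.graph.node))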

Finally, build the TensorRT engine:

$ trtexec --onnx=quant_finetuned.onnx --int8 --sparsity=enable --saveEngine=quant_finetuned.engine --skipInference

Step 3: Deploy the TensorRT engine

Load the sparse quantized engine with trtexec to run and benchmark inference:

$ trtexec --loadEngine=quant_finetuned.engine -v
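If you prefer to run the engine from Python rather than trtexec, a minimal sketch using Polygraphy (the same API used in the PTQ section above) is shown below. The args.batch_size and args.onnx_input_name values mirror the variables used earlier and must match your exported ONNX model:

import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
 
# Deserialize the saved engine and run one batch of dummy data through it.
load_engine = EngineFromBytes(BytesFromPath("quant_finetuned.engine"))
with TrtRunner(load_engine) as runner:
    dummy_batch = np.random.rand(args.batch_size, 3, 224, 224).astype(np.float32)
    outputs = runner.infer(feed_dict={args.onnx_input_name: dummy_batch})
    print({name: out.shape for name, out in outputs.items()})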

Results

Below are performance measurements in terms of classification accuracy and runtime for ResNet-34 densely quantized and sparsely quantized models on an NVIDIA A40 GPU with TensorRT 8.6-GA (8.6.1.6). To reproduce these results, follow the workflow described in the previous section.

The figure below shows the dense vs. sparse accuracy of ResNet-34 in TensorRT under three settings:

  • Dense vs. sparse in FP32
  • Dense PTQ vs. sparse PTQ in INT8
  • Dense QAT vs. sparse QAT in INT8

As you can see, the sparse variant largely preserves the accuracy of its dense counterpart in all three settings.
