Deep learning model deployment TensorRT acceleration (10): TensorRT deployment analysis and optimization plan (1)

Chapter 10: TensorRT deployment analysis and optimization plan

Table of contents

Foreword: 

1. Analysis of model deployment indicators

1.1 FLOPS and TOPS

1.2 Roofline model and calculation density

1.3 FP32/FP16/INT8/INT4/FP8 parameters

2. Several major misunderstandings in model deployment

2.1 FLOPS does not measure model performance

2.2 TensorRT is not everything

2.3 CUDA Core vs TensorRT Core

2.4 Deployment disadvantages of 1x1 and depthwise conv

3. Model deployment optimization - quantization

3.1 Basic concepts, advantages and disadvantages of quantization

3.2 QAT and PTQ

3.3 Per-tensor and Per-channel quantization

4. Model deployment optimization - pruning

4.1 Basic concepts of pruning

4.2 Common channel pruning and Filter pruning

4.3 Common techniques for model pruning

4.4 Overhead during model pruning

Summary:

PS: This is shared purely for learning and experience exchange and is not used for any commercial purposes. If there is any infringement, please contact us promptly!

Preview of the next content:

Deep learning model deployment TensorRT acceleration (11): TensorRT deployment analysis and optimization plan (2)


Foreword: 

        Model inference performance analysis: Use tools such as TensorRT Profiler, PyTorch Profiler, TensorFlow Profiler, etc. to conduct detailed analysis of the model's inference performance, including inference time, memory usage, throughput and other indicators. These tools can help identify bottlenecks in the model to optimize the model and system configuration.
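
As a hedged, minimal example of this kind of analysis, the sketch below uses the PyTorch Profiler to break down inference time and memory by operator; the ResNet-18 model and the 1x3x224x224 input are placeholders for whatever model is actually being deployed, and a CUDA-capable GPU is assumed.

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and input; substitute the model being deployed
model = models.resnet18().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True, profile_memory=True) as prof:
        with record_function("inference"):
            model(x)

# Per-operator breakdown of GPU time and memory, useful for spotting bottlenecks
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))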

1. Analysis of model deployment indicators

1.1 FLOPS and TOPS

        Classic model performance analysis relies on two different computing performance indicators: TOPS (Tera Operations Per Second, i.e. trillions of operations per second) and FLOPS (Floating-point Operations Per Second).

1) FLOPS is the number of floating-point operations per second and is usually used to measure the general floating-point computing performance of a computer. For example, if a processor can perform 1 billion floating-point operations per second, its floating-point throughput is 1 GFLOPS. In deep learning, the related count FLOPs (total operations, not a rate) is also used to measure the computational complexity of neural network models.

2) TOPS is the number of integer and/or fixed-point operations performed per second. It is mainly used to measure the performance of a processor on specific workloads, such as convolution calculations in artificial intelligence. The TOPS rating of a deep learning processor is generally much higher than the FLOPS rating of an ordinary processor.

        When selecting a processor, choose TOPS or FLOPS as the evaluation metric according to the specific application. If general floating-point calculations are required, FLOPS is more relevant; if specific types of calculations, such as neural network convolutions, are required, TOPS matters more.
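
As a rough worked example of how these indicators are used, the sketch below counts the multiply-add operations of one convolution layer and converts them into a lower bound on compute time at an assumed peak throughput; the layer shape and the 10-TFLOPS figure are illustrative assumptions, not measurements.

# Theoretical FLOPs of a convolution layer: 2 * K*K*Cin * Cout * Hout * Wout
def conv2d_flops(c_in, c_out, k, h_out, w_out):
    # the factor 2 counts a multiply and an add as two floating-point operations
    return 2 * k * k * c_in * c_out * h_out * w_out

flops = conv2d_flops(c_in=64, c_out=128, k=3, h_out=56, w_out=56)
print(f"FLOPs per forward pass of this layer: {flops / 1e9:.2f} GFLOPs")

peak_flops = 10e12  # assume a 10 TFLOPS (FP32) device, purely illustrative
print(f"Lower bound on compute time at peak throughput: {flops / peak_flops * 1e6:.1f} us")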

 

1.2 Roofline model and calculation density

How well a model matches the computing hardware determines the actual performance the model achieves.

Roofline Model

        The Roofline model introduces the concept of operational intensity (computational density) and uses it to derive the theoretical upper bound on the performance a model can achieve on a given computing platform.

Reference link: Performance analysis of Roofline Model and deep learning model - Zhihu (zhihu.com) [Recommended]

· Definition: the maximum floating-point performance a model can achieve under the constraints of a given computing platform.

· Form: the "rooftop" shape determined by two parameters of the computing platform, its peak computing power and its memory bandwidth.

  • The peak computing power determines the height of the "roof" (the flat segment)

  • The memory bandwidth determines the slope of the "eaves" (the slanted segment)

         Through computing performance analysis we can determine which models are suitable to run and deploy on a given platform. For example, VGG-16 and MobileNet can be analyzed and compared on a platform such as a 1080Ti, and the model can then be chosen according to the task. Especially when the detection accuracies of candidate models are similar, computing performance becomes an important optimization entry point.
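
A minimal numeric sketch of the Roofline bound follows; the peak-compute and bandwidth numbers are assumptions in the rough range of a 1080Ti-class card, used only to show how the "roof" and "eaves" interact.

# Roofline: attainable performance = min(peak_compute, bandwidth * operational_intensity)
def roofline(peak_flops, bandwidth_bytes_per_s, operational_intensity):
    # operational_intensity = FLOPs performed per byte moved to/from memory
    return min(peak_flops, bandwidth_bytes_per_s * operational_intensity)

peak = 11.3e12   # assumed FP32 peak, FLOPS
bw = 484e9       # assumed memory bandwidth, bytes/s

for intensity in [1, 10, 23, 100]:   # FLOPs per byte
    perf = roofline(peak, bw, intensity)
    bound = "memory-bound" if perf < peak else "compute-bound"
    print(f"intensity={intensity:>4} FLOP/B -> {perf / 1e12:5.2f} TFLOPS ({bound})")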
 

1.3 FP32/FP16/INT8/INT4/FP8 parameters

· Parameter introduction

  1. FP32 (single precision floating point):

    • FP32 uses 32-bit floating point representation, providing the highest precision and numerical range.

    • It is one of the most commonly used data types in deep learning model training and inference.

    • FP32 models have the highest accuracy, but are more demanding in terms of storage and computing resources.

  2. FP16 (half-precision floating point):

    • FP16 uses 16-bit floating point representation, providing higher computing performance and lower storage requirements.

    • FP16 is suitable for the inference stage of deep learning models and can accelerate the inference process to a certain extent.

    • FP16 may cause some accuracy loss due to its reduced precision and range, especially for tasks with complex features or very small gradients.

  3. INT8 (8-bit integer):

    • INT8 uses 8-bit integer representation, providing lower storage requirements and computational complexity.

    • INT8 is suitable for the inference stage of deep learning models and reduces the need for computing resources to a certain extent.

    • Due to the lower precision, INT8 models usually require quantization-aware training or calibration, and quantized inference, which may result in a certain loss of accuracy.

  4. INT4 (4-bit integer):

    • INT4 uses 4-bit integer representation, providing lower storage requirements and computational complexity.

    • INT4 is typically used with highly optimized hardware accelerators, such as NVIDIA Tensor Cores, etc.

    • INT4 models require quantization-aware training and quantized inference, and the accuracy loss is generally larger than with INT8.

  5. FP8 (8-bit floating point number):

    • FP8 uses 8-bit floating point representation, providing storage requirements and computational complexity between FP16 and INT8.

    • FP8 is usually used on some dedicated hardware to balance computing performance and model accuracy.

    • FP8 models require specific hardware and software support to perform calculations and inference on FP8 data types.

In a deep learning deployment, choosing the appropriate data type depends on requirements and hardware platform support. High-precision data types (such as FP32) provide the highest accuracy, but require more computing and storage resources. Lower precision data types such as INT8 can improve inference performance to some extent, but may result in a loss of precision. Therefore, it is necessary to weigh the accuracy, performance and resource requirements, and select the data type suitable for the application scenario.
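
To make the range/precision trade-off concrete, the small sketch below prints the representable range of the dtypes that have native PyTorch support; INT4 and FP8 are only noted in comments because their availability depends on specific hardware and library versions.

import torch

# Dynamic range and precision of the common deployment dtypes
for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

info = torch.iinfo(torch.int8)
print(f"{'torch.int8':15} range=[{info.min}, {info.max}]")

# INT4 has no native torch dtype; a signed 4-bit integer covers [-8, 7].
# FP8 formats (e.g. E4M3/E5M2) require specific hardware and library support.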

        Among them, low precision (trading precision for speed) is one of the advantages of TensorRT deployment. In the training stage, gradient updates are often very small and require relatively high precision, generally FP32 or above. In the inference stage, however, when the precision requirements are lower or the impact is not significant, FP16 (half precision) is generally sufficient, and even INT8 (8-bit integer) can be used without greatly affecting accuracy. At the same time, low-precision models take up less space, making them easier to deploy on embedded devices.

Reference: TensorRT model conversion and deployment, FP32/FP16/INT8 precision distinction (BourneA's blog, CSDN)

NVIDIA GPU support for reduced precision (examples; a short FP16 inference sketch follows this list):

  • FP16: Pascal P100, Volta V100 (Tensor Cores)

  • INT8: P4 / P40
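
As a hedged illustration of FP16 inference on such GPUs, the sketch below simply casts a small placeholder PyTorch model and its input to half precision; real deployments would more commonly rely on TensorRT or torch.cuda.amp to pick FP16 kernels where they are safe.

import torch
import torch.nn as nn

# Placeholder network; FP16 pays off mainly on GPUs with fast half-precision / Tensor Cores
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 64, 3, padding=1)).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

model_fp16 = model.half()          # cast weights to FP16
with torch.no_grad():
    y = model_fp16(x.half())       # cast the input as well
print(y.dtype)                     # torch.float16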

2. Several major misunderstandings in model deployment

2.1 FLOPS does not measure model performance

        FLOPS is often used as an indicator to measure the computing power or performance of computing devices (such as CPU, GPU) or deep learning models. A higher value of FLOPS means that the device or model has higher computing power.

        However, although FLOPS can be used as a reference metric, it does not fully measure the performance of the model or the actual inference speed. When actually evaluating model performance, factors such as computing, memory access, data transmission, algorithm complexity, and hardware optimization should be comprehensively considered, and a comprehensive evaluation should be conducted based on specific application scenarios.
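
A hedged sketch of why FLOPs alone can mislead: the two convolution layers below have identical theoretical FLOPs (9*64*64 = 1*192*192 multiply-adds per output pixel), yet their measured latencies typically differ because the 1x1 layer moves roughly three times as much activation data. The shapes, batch size, and timing loop are illustrative, and a CUDA-capable GPU is assumed.

import time
import torch
import torch.nn as nn

# Two layers with identical theoretical FLOPs but different memory traffic
conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False).cuda().eval()
conv1x1 = nn.Conv2d(192, 192, kernel_size=1, bias=False).cuda().eval()
x3 = torch.randn(8, 64, 56, 56, device="cuda")
x1 = torch.randn(8, 192, 56, 56, device="cuda")

def bench(layer, x, iters=100):
    with torch.no_grad():
        for _ in range(10):                      # warm-up
            layer(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters * 1e3   # ms per iteration

print(f"3x3 conv: {bench(conv3x3, x3):.3f} ms   1x1 conv: {bench(conv1x1, x1):.3f} ms")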

2.2 TensorRT is not everything

        TensorRT is an optimization engine that can significantly improve the inference performance of deep learning models. However, TensorRT alone does not guarantee good results: the final performance of a deployed model is affected by multiple factors, including the hardware platform, model characteristics, dataset characteristics, and the application scenario.

2.3 CUDA Core vs TensorRT Core
  1. CUDA Core (Compute Unified Device Architecture core):

    • CUDA Core is the basic computing unit in NVIDIA GPU architecture. It is a GPU hardware unit used to perform parallel computing tasks.

    • Each CUDA Core executes a single floating-point (or integer) operation at a time; overall throughput comes from many cores processing data elements in parallel.

    • CUDA Core is the basic building block for performing general computing tasks and can be used to write CUDA programs for parallel computing.

  2. TensorRT Core:

    • TensorRT Core is a component in the TensorRT inference engine that is used to accelerate and optimize the inference of deep learning models.

    • TensorRT Core utilizes the characteristics of GPU hardware and deep learning models for inference optimization to improve inference performance and efficiency.

    • TensorRT Core implements various optimization technologies, such as layer fusion, memory optimization, accuracy adjustment, etc., to improve the speed of model inference and resource utilization.

        In summary, CUDA Core is a computing unit in the NVIDIA GPU architecture used to perform general parallel computing tasks, while TensorRT Core is a component of the TensorRT inference engine used to accelerate and optimize the inference of deep learning models. CUDA Core is a concept at the hardware level, while TensorRT Core is a concept at the software level. They play different roles at different levels, but both have an important impact on the inference performance of deep learning models.

2.4 Deployment disadvantages of 1x1 and depthwise conv

1x1 convolution and depthwise separable convolution (Depthwise Convolution) are often used in deep learning models, and they have some specific deployment shortcomings. Here are some common disadvantages:

  1. Deployment disadvantages of 1x1 convolution:

    • Low computation density: although the FLOP count of a 1x1 convolution is relatively small, it performs few multiply-add operations per byte of feature-map data it reads and writes, so on GPUs it is often memory-bandwidth bound rather than compute bound (see the Roofline model in 1.2) and can still become a performance bottleneck.

    • Memory usage: 1x1 convolution needs to store and process a large number of intermediate feature maps in memory, especially when the number of input channels and output channels is large, which may lead to high memory usage.

    • Number of parameters: 1x1 convolution involves a certain number of weight and bias parameters, especially when there are more input channels and output channels, which will increase the number of parameters of the model.

  2. Deployment disadvantages of depthwise separable convolution:

    • Memory footprint: Depthwise separable convolution usually requires storing and processing multiple intermediate feature maps in memory, including the outputs of the depthwise convolution and the pointwise (1x1) convolution, which may result in higher memory usage.

    • Network depth: Depthwise separable convolutions are often used to reduce the number of parameters and computational effort, but in some cases they make the network deeper, thereby increasing the computational and memory overhead of the model.

    • Feature representation ability: Compared with standard convolution, the feature representation ability of depthwise separable convolution may be weaker. It is better suited to tasks with simpler features and may be limited in capturing complex features.

In summary, 1x1 convolution and depthwise separable convolution have some shortcomings in deep learning models. They may increase computational and memory overhead, cause the network to become more complex, or limit feature representation capabilities. When using these convolution operations, you need to weigh performance, memory, and model expression capabilities, and select and optimize based on the needs of specific tasks.
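
To put numbers on the computation-density issue behind these disadvantages, the sketch below estimates FLOPs, memory traffic, and arithmetic intensity (FLOPs per byte, FP16 assumed) for a standard 3x3 convolution, its depthwise 3x3 part, and a pointwise 1x1 convolution; the shapes are illustrative and the memory model (inputs + outputs + weights, no cache effects) is deliberately simplified.

def conv_stats(c_in, c_out, k, h, w, groups=1, bytes_per_elem=2):
    # multiply-adds counted as 2 FLOPs; memory = input + output + weights, FP16 elements
    flops = 2 * (k * k * c_in // groups) * c_out * h * w
    mem = bytes_per_elem * (c_in * h * w + c_out * h * w + k * k * (c_in // groups) * c_out)
    return flops, mem, flops / mem

c_in, c_out, h, w = 128, 128, 56, 56
cases = {
    "standard 3x3": conv_stats(c_in, c_out, 3, h, w),
    "depthwise 3x3": conv_stats(c_in, c_in, 3, h, w, groups=c_in),
    "pointwise 1x1": conv_stats(c_in, c_out, 1, h, w),
}
for name, (f, m, ai) in cases.items():
    print(f"{name:14} {f / 1e6:9.1f} MFLOPs  {m / 1e6:6.2f} MB  intensity = {ai:6.1f} FLOP/B")

On the Roofline model from section 1.2, the depthwise and pointwise layers sit far to the left of the ridge point, which is why they tend to be memory-bandwidth bound even though their FLOP counts are small.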

    

3. Model deployment optimization - quantization

3.1 Basic concepts, advantages and disadvantages of quantization:

        Quantization in TensorRT is an optimization technique used to reduce the storage requirements and computational overhead of a model. It works by reducing the number of bits used to represent the model's parameters and activation values, thereby reducing the model's memory footprint and computational complexity.

Reference link: TensorRT development tutorial (Chinese), Part 7 - Detailed explanation of INT8 quantization in TensorRT (Xiao Heshang's blog, CSDN)

        TensorRT supports the use of 8-bit integers to represent quantized floating-point values. The quantization scheme is symmetric uniform quantization: quantized values are represented as signed INT8, and the conversion from quantized to unquantized values is just a multiplication by the scale. In the opposite direction, quantization multiplies by the reciprocal of the scale, followed by rounding and clamping.
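
A minimal numeric sketch of this symmetric INT8 scheme is shown below; deriving the scale from the tensor's absolute maximum is one common choice (TensorRT can also calibrate scales from data), and the 8-element tensor is purely illustrative.

import numpy as np

def int8_quantize(x, scale):
    # quantize: scale by 1/scale, round, then clamp to the signed INT8 range
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_dequantize(q, scale):
    # dequantize: just a multiplication by the scale
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
scale = np.abs(x).max() / 127.0        # per-tensor symmetric scale from the absolute maximum
q = int8_quantize(x, scale)
x_hat = int8_dequantize(q, scale)
print("max abs quantization error:", np.abs(x - x_hat).max())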

To enable any quantization operations, the INT8 flag must be set in the builder configuration.
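
A hedged sketch of setting this flag with the TensorRT Python API (TensorRT 8.x style) follows; the ONNX path is a placeholder, and INT8 additionally requires either a calibrator or a network that already contains Q/DQ nodes.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:        # placeholder ONNX model
    parser.parse(f.read())

config = builder.create_builder_config()
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 kernels where beneficial
config.set_flag(trt.BuilderFlag.INT8)       # required before any INT8 quantization is applied
# config.int8_calibrator = my_calibrator    # hypothetical calibrator, needed for implicit quantization
engine_bytes = builder.build_serialized_network(network, config)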

· There are two common quantization scale granularities:

  • Per-tensor quantization: a single scale value (a scalar) is used to scale the entire tensor.

  • Per-channel quantization: a scale tensor is broadcast along a given axis - for convolutional neural networks, this is usually the channel axis.

With explicit quantization, weights can be quantized using either per-tensor or per-channel quantization; in both cases the scale precision is FP32. Activations can only be quantized using per-tensor quantization.

Advantages:

  1. Reduced storage requirements: Quantization significantly reduces the number of bits used to store parameters and activation values, which shrinks the model's storage footprint and saves memory.

  2. Reduced computational overhead: Low-bit integer operations are more efficient than floating-point operations, which reduces the computational cost of the model and improves inference speed and efficiency.

  3. Hardware acceleration support: Some hardware platforms (such as NVIDIA's Tensor Cores) have special acceleration support for integer calculations; quantization makes it possible to exploit these hardware features to improve inference performance.

Disadvantages:

  1. Accuracy loss: The accuracy of the model may be reduced due to the reduction of bits during quantization. Particularly at lower bit quantization, the precision loss may be more noticeable.

  2. Quantization-aware training requirements: To quantize parameters without a large accuracy drop, quantization-aware training may need to be performed during the training process to preserve the performance of the quantized model. This adds extra computation and training cost.

  3. Hardware support limitations: Quantization techniques may require specific hardware support, such as accelerators that support integer arithmetic. Without the corresponding hardware support, the benefits of quantization may be reduced.

        To sum up, the quantization technology in TensorRT reduces the storage requirements and computational overhead of the model by reducing the number of bits in parameters and activation values. It can improve inference performance and efficiency, but may lose a certain model accuracy and impose certain requirements on hardware support and training process. When using quantization technology, you need to weigh factors such as accuracy, storage, computing, and hardware support, and choose an appropriate quantization strategy based on specific scenarios.

3.2 QAT and PTQ

        QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) are two common deep learning model quantization technologies.

  1. QAT (Quantization-Aware Training):

    • QAT is a technology that performs quantization-aware training during training , aiming to consider the impact of quantization during training in order to better adapt to the quantized model.

    • In QAT, weights and activations are still stored and updated as floating-point numbers during training, but fake-quantization (quantize-dequantize) operations with the corresponding scaling factors are inserted into the forward pass, so the quantization error is simulated while the model learns.

    • By simulating the effect of quantization during training, QAT enables the model to better adapt to the quantized numerical range, thereby achieving better accuracy and performance during quantitative inference.

import torch
import torch.nn as nn
import torch.quantization

# Define and train the floating-point model (MyModel is assumed to be defined elsewhere
# and, for eager-mode quantization, to wrap its quantized region in QuantStub/DeQuantStub)
float_model = MyModel()
# ...

# Attach a QAT configuration and insert fake-quantization modules
float_model.train()
float_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
qat_model = torch.quantization.prepare_qat(float_model)

# Quantization-aware training: run the normal training loop on qat_model,
# whose forward pass now simulates INT8 quantization error
# ...

# Stop QAT, switch to inference mode, and convert to a truly quantized model
qat_model.eval()
quantized_model = torch.quantization.convert(qat_model)
# ...

  2. PTQ (Post-Training Quantization):

  • PTQ is a technique that quantizes the model after training is completed. It achieves quantization by converting the trained floating-point model into a fixed-point (integer) model.

  • In PTQ, the model is trained entirely in floating point. After training is completed, the weights and activation values of the model are quantized and converted into low-bit integer representations.

  • PTQ is simpler than QAT because it does not require special quantization-aware training. However, since the influence of quantization is not considered during training, PTQ may require some fine-tuning or calibration to recover the performance and accuracy of the quantized model.

import torch
import torch.nn as nn
import torch.quantization

# Define and train the floating-point model (MyModel as in the QAT example)
float_model = MyModel()
# ...

# Post-training static quantization: attach a qconfig, insert observers,
# calibrate on representative data, then convert to an INT8 model
float_model.eval()
float_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
prepared_model = torch.quantization.prepare(float_model)
# ... feed a calibration dataset through prepared_model to collect activation statistics ...
quantized_model = torch.quantization.convert(prepared_model)

# Run inference with the quantized model
quantized_model.eval()
# ...

QAT and PTQ are both techniques for deep learning model quantization. Both can reduce the storage requirements and computational overhead of the model to a certain extent and improve inference performance and efficiency. QAT adapts the model to quantization by taking its impact into account during training, while PTQ converts the model to a fixed-point representation after training is complete. Which technique to choose depends on the specific needs and scenario, as well as the requirements on accuracy and the training process.

3.3 Per-tensor and Per-channel quantization

        Per-tensor and Per-channel quantization are two different quantization granularities, used to specify how finely the scales for parameters and activation values are chosen.

Per-tensor quantization:

  • Per-tensor quantization refers to a unified quantization operation on the entire tensor (such as weight parameters or activation values), that is, all elements use the same scaling factor and quantization parameters.

  • In per-tensor quantization, all elements share the same quantization range and precision, which is suitable for those situations where the elements on each channel or position are not very different.

  • Per-tensor quantization can simplify quantization operations and reduce computing and storage overhead.

Per-channel quantization:

  • Per-channel quantization refers to quantizing each channel in the tensor (such as the input channel or output channel of the convolutional layer) separately, that is, each channel has an independent scaling factor and quantization parameter.

  • In Per-channel quantization, elements of different channels can have different quantization ranges and precisions, which suits situations where the value distributions differ significantly between channels.

  • Per-channel quantization can provide better accuracy preservation, especially for those models with channel correlation.

        Choosing Per-tensor or Per-channel quantization depends on the specific model and task requirements. In general, if the elements at various channels or locations in the model have similar numerical distributions and ranges, Per-tensor quantization is a simpler and more efficient choice. For models whose channels differ significantly, Per-channel quantization provides better accuracy preservation but may add some extra computing and storage overhead.

        When implementing quantization, choose the appropriate granularity according to the model structure and requirements, and use the corresponding tools and libraries to implement it. For example, TensorRT supports both Per-tensor and Per-channel quantization and selects a suitable scheme based on the model structure.
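
To make the granularity difference concrete, the short sketch below computes symmetric INT8 scales for one convolution weight per-tensor and per-output-channel; the weight shape is illustrative, and one channel is deliberately scaled up to mimic the channel imbalance that makes per-channel quantization worthwhile.

import torch

w = torch.randn(64, 32, 3, 3)    # conv weight laid out as (out_channels, in_channels, kH, kW)
w[0] *= 10.0                     # make one output channel's range much larger than the others

# Per-tensor: a single scale for the whole weight tensor
scale_t = w.abs().max() / 127.0
# Per-channel: one scale per output channel (axis 0), the usual choice for conv weights
scale_c = w.abs().amax(dim=(1, 2, 3)) / 127.0

q_t = torch.clamp(torch.round(w / scale_t), -128, 127)
q_c = torch.clamp(torch.round(w / scale_c.view(-1, 1, 1, 1)), -128, 127)

err_t = (w - q_t * scale_t).abs().mean()
err_c = (w - q_c * scale_c.view(-1, 1, 1, 1)).abs().mean()
print(f"mean abs reconstruction error  per-tensor: {err_t:.5f}   per-channel: {err_c:.5f}")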

 

4. Model deployment optimization - pruning

4.1 Basic concepts of pruning

Model Pruning is a commonly used model deployment optimization technology that reduces model size, accelerates inference speed, and reduces computing resource requirements by reducing redundant parameters or structures in the model. Pruning technology can usually be divided into the following steps:

  1. Select a pruning strategy: Determine the target to be pruned, such as pruning weights, pruning channels (for convolutional layers), pruning structures (for connections in the network structure), etc. Choosing an appropriate pruning strategy is a critical decision that can be determined based on specific tasks and needs.

  2. Training and evaluation: Train and evaluate using the original model to obtain baseline performance and accuracy. This is for comparison with the pruned model and to ensure that the performance of the pruned model does not drop significantly.

  3. Pruning operation: According to the selected pruning strategy, prune the redundant parameters or structures. Common methods include pruning by threshold (for example, removing weights whose magnitude is below a certain threshold, as sketched after this list) and pruning by ratio (removing a fixed proportion of the weights).

  4. Fine-tuning: After pruning, performance may decrease due to changes in model parameters and structure. Therefore, fine-tuning is required, that is, further training on the pruned model to restore and improve performance. Typically, fine-tuning is performed using the original training data set or a smaller data set.

  5. Evaluation and verification: After completing fine-tuning, evaluate and verify the pruning model to ensure that the performance and accuracy meet the requirements. According to requirements, the inference performance and resource usage can be measured on the pruned model to verify the pruning effect.
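
A minimal sketch of the threshold-based pruning mentioned in step 3, written against a single weight tensor without any pruning library; the layer shape and the 0.05 threshold are illustrative assumptions.

import torch
import torch.nn as nn

layer = nn.Linear(256, 128)
threshold = 0.05   # illustrative magnitude threshold

with torch.no_grad():
    mask = layer.weight.abs() >= threshold              # keep weights with magnitude >= threshold
    layer.weight.data *= mask.to(layer.weight.dtype)    # zero out the rest in place

sparsity = 1.0 - mask.float().mean().item()
print(f"pruned {sparsity:.1%} of the weights in this layer")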

Pruning can significantly reduce model size, accelerate inference, and reduce computing resource requirements. However, note that pruning introduces sparsity, which may hurt the efficiency of some hardware accelerators. Therefore, when performing pruning optimization, factors such as model size, inference speed, accuracy, and hardware support need to be considered together to obtain the best deployment result. Pruning also remains an active research topic in its own right.

Reference for specific operation examples: YOLOv5 channel pruning [with code] (Meat-loving Peng's blog, CSDN)

"Model Lightweight: Pruning, Distillation, Quantization Series" - YOLOv5 lossless pruning (with source code) (cvjun's blog, CSDN)

4.2 Common channel pruning and Filter pruning

Common model pruning methods include channel pruning (Channel Pruning) and filter pruning (Filter Pruning). Both prune channels or filters in convolutional neural networks to reduce the number of parameters and the computational overhead of the model.

Channel Pruning: Channel pruning refers to pruning the channels (also called feature maps or output channels) of a convolutional layer, that is, reducing the number of output channels of that layer. Channels to prune can be selected by a threshold or by an importance metric (such as the L1 norm of the weights or the L2 norm of the gradients). Channel pruning reduces computational overhead and, to a certain extent, the storage requirements of the model, because the pruned model computes and stores fewer output channels.

Filter Pruning: Filter pruning refers to pruning the filters (also called convolution kernels) of a convolutional layer, that is, reducing the number of filters in that layer. Filter pruning is usually based on an importance metric for each filter, such as the L1 norm of the filter weights or the L2 norm of the gradients. Filter pruning reduces the number of parameters and computations of the model and helps improve inference speed and efficiency.

Channel pruning and filter pruning can be applied individually or combined to further compress and optimize the model. These pruning techniques can help reduce model size, increase inference speed, and reduce computing resource requirements. When performing pruning operations, factors such as model performance, accuracy loss, and hardware support need to be comprehensively considered to achieve the best pruning effect.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Define a simple convolutional neural network model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3)
        # assumes 14x14 spatial input so that the flattened feature map is 128 * 10 * 10
        self.fc = nn.Linear(128 * 10 * 10, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Create a model instance
model = Net()

# Train the model
# ...

# Define the pruning ratio
prune_ratio = 0.5

# Filter (channel) pruning of one layer: L1 structured pruning along dim=0 removes
# entire output filters of conv1 (element-wise pruning would use prune.l1_unstructured instead)
module_to_prune = model.conv1
prune.ln_structured(module_to_prune, name="weight", amount=prune_ratio, n=1, dim=0)

# Print the pruned model
print(model)

# Evaluate the pruned model
# ...

4.3 Common techniques for model pruning

Reference operation link: Model Optimization: Model Pruning (Little Sapling M's blog, CSDN)

Official tutorial:

Pruning Tutorial — PyTorch Tutorials 2.0.1+cu117 documentation

Model pruning is divided according to structure, mainly including structured pruning and unstructured pruning:

(1) Unstructured pruning: prune unimportant individual connections between neuron nodes, which is equivalent to setting single weight values in the weight matrix to 0; the resulting sparsity pattern is irregular.

(2) Structured pruning: remove an entire neuron node, so all synapses connected to that neuron are removed as well; this is equivalent to removing the corresponding row and column of the weight matrix at the same time. How do we judge the importance of a neuron node? One common approach is to rank neurons by the square root of the sum of squares (the L2 norm) of the weights in the corresponding rows and columns, and remove a certain proportion of the lowest-ranked neurons.

The following uses a simple case to analyze model pruning:

import torch
from torch import nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120) 
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, int(x.nelement() / x.shape[0]))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

(1) Local pruning

def part_cut(model):
    '''
    ################################### Local pruning ###################################
    Pruning creates a weight_mask buffer for the pruned parameter.
    Pruning API: prune.random_unstructured(layer1, name="weight", amount=0.3)
            amount: fraction of connections to prune
            layer1: the layer object to prune
            name:   whether to prune the weight or the bias
    Finalization API: prune.remove(layer1, 'weight')
            Makes the pruning permanent by folding the mask into the parameter.
    '''
    layer1 = model.conv1
    print("-------------------------------------- before pruning --------------------------------------")
    # print(list(layer1.named_parameters()))
    # print(list(layer1.named_buffers()))
    prune.random_unstructured(layer1, name="weight", amount=0.3)
    print("-------------------------------------- after pruning ---------------------------------------")
    # print(list(layer1.named_parameters()))
    # print(list(layer1.named_buffers()))
    prune.remove(layer1, 'weight')
    print("------------------------------------ after finalization ------------------------------------")
    # print(list(layer1.named_parameters()))
    # print(list(layer1.named_buffers()))


    '''------------------------ pruning multiple parameters across the network ------------------------'''
    for name, module in model.named_modules():
        print(name, module)
        # prune 20% of connections in all 2D-conv layers
        if isinstance(module, torch.nn.Conv2d):
            prune.l1_unstructured(module, name='weight', amount=0.2)
            prune.remove(module, 'weight')
        # prune 40% of connections in all linear layers
        elif isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=0.4)
            prune.remove(module, 'weight')

    # prune.remove has folded the masks back into the weights, so no mask buffers remain here
    print(dict(model.named_buffers()).keys())
    return 0

(2) Global pruning:

def glob_cut(model):
    '''
    Global pruning: prune 60% of the listed weights together,
    ranked by L1 magnitude across the whole model.
    '''
    parameters_to_prune = (
        (model.conv1, 'weight'),
        (model.conv2, 'weight'),
        (model.fc1, 'weight'),
        (model.fc2, 'weight'),
        (model.fc3, 'weight'),
    )

    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=0.6,
    )
    print(list(model.named_parameters()))
    print(list(model.named_buffers()))

4.4 Overhead during model pruning

During the pruning process, some additional computing and storage overhead is introduced. The following are some of the overheads that may occur:

  1. Additional computational overhead: The pruning process usually involves calculating importance evaluation metrics for model parameters, such as the L2 norm of gradients or the L1 norm of weights. These evaluation metrics require additional computational overhead, especially in large models.

  2. Additional storage overhead: During the pruning process, additional information needs to be stored, such as pruning proportion, pruning strategy, and model structure after pruning. This information may increase the storage overhead of the model, especially when the pruning ratio is large.

  3. Retraining or fine-tuning overhead: After pruning, in order to restore model performance or improve accuracy, retraining or fine-tuning is usually required. This introduces additional training iterations and computational overhead to re-tune the parameters of the pruned model.

  4. Changes in inference performance: pruned models may introduce sparsity, which may have an impact on the performance of some hardware accelerators (such as GPUs). Sparsity may lead to a decrease in parallel computing efficiency or be unfavorable to the application of certain optimization techniques, thereby affecting inference performance.

Although there is some overhead in the pruning process, its impact can be reduced through appropriate strategies and optimization. For example, efficient pruning algorithms and index evaluation methods can be selected to reduce additional computing overhead. In addition, pruned models can be optimized for inference performance through quantization, compression, or support from specific hardware accelerators.
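
The following minimal sketch illustrates point 4 above: unstructured pruning with PyTorch's pruning utilities stores zeros explicitly in a dense tensor, so memory and dense-kernel compute are unchanged until a sparse format, hardware sparsity support, or structured pruning is applied. The layer size and the 90% ratio are illustrative.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)
dense_bytes = layer.weight.numel() * layer.weight.element_size()

prune.l1_unstructured(layer, name="weight", amount=0.9)   # zero out 90% of the weights
prune.remove(layer, "weight")                             # fold the mask into the parameter

zeros = (layer.weight == 0).float().mean().item()
pruned_bytes = layer.weight.numel() * layer.weight.element_size()
print(f"sparsity: {zeros:.0%}   dense storage before/after: {dense_bytes} / {pruned_bytes} bytes")
# The dense tensor is the same size; realizing the savings requires sparse storage,
# hardware that exploits sparsity, or structured pruning that actually shrinks the layer.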

When performing pruning optimization, it is necessary to comprehensively consider the overhead and pruning effects caused by pruning, as well as the requirements of specific tasks and hardware platforms. By weighing these factors, the best pruning strategies and methods can be selected to obtain the best model compression and optimization effects.

 

Summary:

        After working through the above, we have covered TensorRT's optimization options and the metrics used to analyze them. Different optimization schemes can be adopted to improve a model in a targeted way. The next chapter will introduce how to use the APIs provided by TensorRT to obtain performance data and improve the deployment solution!

PS: This is shared purely for learning and experience exchange and is not used for any commercial purposes. If there is any infringement, please contact us promptly!

Preview of the next content:

  • Deep learning model deployment TensorRT acceleration (11): TensorRT deployment analysis and optimization plan (2)
