Deep Learning Practice - Model Inference Optimization Exercise

Series of experiments
Deep learning practice - Convolutional neural network practice: Crack identification
Deep learning practice - Recurrent neural network practice
Deep learning practice - Model deployment optimization practice
Deep learning practice - Model inference optimization practice


Source address: https://pan.baidu.com/s/1PuWZF2DkG0-F5pQLMIkTcQ?pwd=c24s

Model inference optimization exercise

Architecture Design Exercise

By modifying the code, explore the impact of each StudentNet parameter on the model's parameter count.

Compression through architecture design mainly works by reducing the number of parameters in the neural network; here the model can be compressed and optimized by adjusting or pruning the number of channels. In the source code provided by the course website, the model exposes two parameters for adjusting channels. The first is base, which directly defines the initial number of channels. The second is width_mult, a pruning control factor: when it is 1, no pruning is performed, and in general, the number of channels after pruning = the number of channels before pruning * width_mult.

From this understanding of the parameters, the smaller the base, the smaller the model should be, and likewise the smaller the width_mult, the smaller the model. Let's verify this hypothesis by modifying the code.
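A quick sketch of how base and width_mult together determine the per-layer channel counts, based on the StudentNet definition reproduced later in the static-quantization section (the helper name channel_widths is only for illustration):

def channel_widths(base=16, width_mult=1.0):
    multiplier = [1, 2, 4, 8, 16, 16, 16, 16]
    bandwidth = [base * m for m in multiplier]   # output channels of each block
    for i in range(3, 7):                        # only blocks 3..6 are pruned
        bandwidth[i] = int(bandwidth[i] * width_mult)
    return bandwidth

print(channel_widths())                # [16, 32, 64, 128, 256, 256, 256, 256]
print(channel_widths(base=12))         # [12, 24, 48, 96, 192, 192, 192, 192]
print(channel_widths(width_mult=0.8))  # [16, 32, 64, 102, 204, 204, 204, 256]

These values match the layer summaries shown below for the default, base=12, and width_mult=0.8 configurations.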

  • Default parameter output

    First, output the network layers and parameter sizes for the default parameter values:

    The main code is as follows; the complete code can be found in 架构设计练习.py:

    model_default = StudentNet()
    model_default.eval()
    summary(model_default.to('cuda:0'), input_size=(3, 128, 128))
    

    The output corresponding to the above code is as follows,

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1         [-1, 16, 128, 128]             448
           BatchNorm2d-2         [-1, 16, 128, 128]              32
                 ReLU6-3         [-1, 16, 128, 128]               0
             MaxPool2d-4           [-1, 16, 64, 64]               0
                Conv2d-5           [-1, 16, 64, 64]             160
           BatchNorm2d-6           [-1, 16, 64, 64]              32
                 ReLU6-7           [-1, 16, 64, 64]               0
                Conv2d-8           [-1, 32, 64, 64]             544
             MaxPool2d-9           [-1, 32, 32, 32]               0
               Conv2d-10           [-1, 32, 32, 32]             320
          BatchNorm2d-11           [-1, 32, 32, 32]              64
                ReLU6-12           [-1, 32, 32, 32]               0
               Conv2d-13           [-1, 64, 32, 32]           2,112
            MaxPool2d-14           [-1, 64, 16, 16]               0
               Conv2d-15           [-1, 64, 16, 16]             640
          BatchNorm2d-16           [-1, 64, 16, 16]             128
                ReLU6-17           [-1, 64, 16, 16]               0
               Conv2d-18          [-1, 128, 16, 16]           8,320
            MaxPool2d-19            [-1, 128, 8, 8]               0
               Conv2d-20            [-1, 128, 8, 8]           1,280
          BatchNorm2d-21            [-1, 128, 8, 8]             256
                ReLU6-22            [-1, 128, 8, 8]               0
               Conv2d-23            [-1, 256, 8, 8]          33,024
               Conv2d-24            [-1, 256, 8, 8]           2,560
          BatchNorm2d-25            [-1, 256, 8, 8]             512
                ReLU6-26            [-1, 256, 8, 8]               0
               Conv2d-27            [-1, 256, 8, 8]          65,792
               Conv2d-28            [-1, 256, 8, 8]           2,560
          BatchNorm2d-29            [-1, 256, 8, 8]             512
                ReLU6-30            [-1, 256, 8, 8]               0
               Conv2d-31            [-1, 256, 8, 8]          65,792
               Conv2d-32            [-1, 256, 8, 8]           2,560
          BatchNorm2d-33            [-1, 256, 8, 8]             512
                ReLU6-34            [-1, 256, 8, 8]               0
               Conv2d-35            [-1, 256, 8, 8]          65,792
    AdaptiveAvgPool2d-36            [-1, 256, 1, 1]               0
               Linear-37                   [-1, 11]           2,827
    ================================================================
    Total params: 256,779
    Trainable params: 256,779
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.19
    Forward/backward pass size (MB): 13.13
    Params size (MB): 0.98
    Estimated Total Size (MB): 14.29
    ----------------------------------------------------------------
    
  • The result of lowering the base value

    model_base12 = StudentNet(base=12)
    model_base12.eval()
    summary(model_base12.to('cuda:0'), input_size=(3, 128, 128))
    

    The result is as follows:

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1         [-1, 12, 128, 128]             336
           BatchNorm2d-2         [-1, 12, 128, 128]              24
                 ReLU6-3         [-1, 12, 128, 128]               0
             MaxPool2d-4           [-1, 12, 64, 64]               0
                Conv2d-5           [-1, 12, 64, 64]             120
           BatchNorm2d-6           [-1, 12, 64, 64]              24
                 ReLU6-7           [-1, 12, 64, 64]               0
                Conv2d-8           [-1, 24, 64, 64]             312
             MaxPool2d-9           [-1, 24, 32, 32]               0
               Conv2d-10           [-1, 24, 32, 32]             240
          BatchNorm2d-11           [-1, 24, 32, 32]              48
                ReLU6-12           [-1, 24, 32, 32]               0
               Conv2d-13           [-1, 48, 32, 32]           1,200
            MaxPool2d-14           [-1, 48, 16, 16]               0
               Conv2d-15           [-1, 48, 16, 16]             480
          BatchNorm2d-16           [-1, 48, 16, 16]              96
                ReLU6-17           [-1, 48, 16, 16]               0
               Conv2d-18           [-1, 96, 16, 16]           4,704
            MaxPool2d-19             [-1, 96, 8, 8]               0
               Conv2d-20             [-1, 96, 8, 8]             960
          BatchNorm2d-21             [-1, 96, 8, 8]             192
                ReLU6-22             [-1, 96, 8, 8]               0
               Conv2d-23            [-1, 192, 8, 8]          18,624
               Conv2d-24            [-1, 192, 8, 8]           1,920
          BatchNorm2d-25            [-1, 192, 8, 8]             384
                ReLU6-26            [-1, 192, 8, 8]               0
               Conv2d-27            [-1, 192, 8, 8]          37,056
               Conv2d-28            [-1, 192, 8, 8]           1,920
          BatchNorm2d-29            [-1, 192, 8, 8]             384
                ReLU6-30            [-1, 192, 8, 8]               0
               Conv2d-31            [-1, 192, 8, 8]          37,056
               Conv2d-32            [-1, 192, 8, 8]           1,920
          BatchNorm2d-33            [-1, 192, 8, 8]             384
                ReLU6-34            [-1, 192, 8, 8]               0
               Conv2d-35            [-1, 192, 8, 8]          37,056
    AdaptiveAvgPool2d-36            [-1, 192, 1, 1]               0
               Linear-37                   [-1, 11]           2,123
    ================================================================
    Total params: 147,563
    Trainable params: 147,563
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.19
    Forward/backward pass size (MB): 9.85
    Params size (MB): 0.56
    Estimated Total Size (MB): 10.60
    ----------------------------------------------------------------
    

    Compared with the default value, the number of parameters in each layer is reduced, the layer shapes change, and the model is compressed. Reducing the base value step by step and plotting model size (dependent variable) against the base value (independent variable) gives the following figure.

    [Figure: model size versus base value (image not available)]

    It can be seen that the size of the model is basically proportional to the base value.

  • Result of lowering width_mult value

    model_mul0_8 = StudentNet(width_mult=0.8)
    model_mul0_8.eval()
    summary(model_mul0_8.to('cuda:0'), input_size=(3, 128, 128))
    

    The result is as follows:

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1         [-1, 16, 128, 128]             448
           BatchNorm2d-2         [-1, 16, 128, 128]              32
                 ReLU6-3         [-1, 16, 128, 128]               0
             MaxPool2d-4           [-1, 16, 64, 64]               0
                Conv2d-5           [-1, 16, 64, 64]             160
           BatchNorm2d-6           [-1, 16, 64, 64]              32
                 ReLU6-7           [-1, 16, 64, 64]               0
                Conv2d-8           [-1, 32, 64, 64]             544
             MaxPool2d-9           [-1, 32, 32, 32]               0
               Conv2d-10           [-1, 32, 32, 32]             320
          BatchNorm2d-11           [-1, 32, 32, 32]              64
                ReLU6-12           [-1, 32, 32, 32]               0
               Conv2d-13           [-1, 64, 32, 32]           2,112
            MaxPool2d-14           [-1, 64, 16, 16]               0
               Conv2d-15           [-1, 64, 16, 16]             640
          BatchNorm2d-16           [-1, 64, 16, 16]             128
                ReLU6-17           [-1, 64, 16, 16]               0
               Conv2d-18          [-1, 102, 16, 16]           6,630
            MaxPool2d-19            [-1, 102, 8, 8]               0
               Conv2d-20            [-1, 102, 8, 8]           1,020
          BatchNorm2d-21            [-1, 102, 8, 8]             204
                ReLU6-22            [-1, 102, 8, 8]               0
               Conv2d-23            [-1, 204, 8, 8]          21,012
               Conv2d-24            [-1, 204, 8, 8]           2,040
          BatchNorm2d-25            [-1, 204, 8, 8]             408
                ReLU6-26            [-1, 204, 8, 8]               0
               Conv2d-27            [-1, 204, 8, 8]          41,820
               Conv2d-28            [-1, 204, 8, 8]           2,040
          BatchNorm2d-29            [-1, 204, 8, 8]             408
                ReLU6-30            [-1, 204, 8, 8]               0
               Conv2d-31            [-1, 204, 8, 8]          41,820
               Conv2d-32            [-1, 204, 8, 8]           2,040
          BatchNorm2d-33            [-1, 204, 8, 8]             408
                ReLU6-34            [-1, 204, 8, 8]               0
               Conv2d-35            [-1, 256, 8, 8]          52,480
    AdaptiveAvgPool2d-36            [-1, 256, 1, 1]               0
               Linear-37                   [-1, 11]           2,827
    ================================================================
    Total params: 179,637
    Trainable params: 179,637
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.19
    Forward/backward pass size (MB): 12.72
    Params size (MB): 0.69
    Estimated Total Size (MB): 13.59
    ----------------------------------------------------------------
    

    [Figure: model size versus width_mult value (image not available)]

    It can be seen that the size of the model is basically proportional to the width_mult value, but the range of compression it offers is more limited than that of the base value. (A parameter-counting sketch for both sweeps follows this list.)
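A minimal sketch for reproducing both sweeps by counting parameters directly, assuming StudentNet is imported as in 架构设计练习.py; the swept values below are illustrative, not necessarily the ones used for the original figures:

def count_params(model):
    return sum(p.numel() for p in model.parameters())

for base in (16, 12, 8, 4):
    print(f"base={base}: {count_params(StudentNet(base=base)):,} parameters")

for w in (1.0, 0.8, 0.6, 0.4):
    print(f"width_mult={w}: {count_params(StudentNet(width_mult=w)):,} parameters")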

Knowledge distillation exercise

It can be seen from the case that the performance of the distilled student model is much lower than that of the pre-trained teacher model. Please analyze the reasons and explore ways to further improve the performance of the student model.

Reason:

From the case on the website, the student network has been trained for many epochs and in theory should approach the accuracy of the teacher network, but the results show it is still much worse. There are two major differences between the student network and the teacher network: first, the teacher network is fully trained while the student network starts untrained; second, the two networks have different structures.

The first difference can be eliminated by sufficient training during knowledge distillation, but the second cannot. Therefore, one reason why the student's performance falls short of the teacher's should be its network structure. To compare, print the structures of the teacher network and the student network with the following code (see 知识蒸馏.py for the complete code).

teacher_net = models.resnet18(pretrained=False, num_classes=11)
teacher_net.load_state_dict(torch.load(f'./teacher_resnet18.bin'))
student_net = StudentNet(base=16)
print("teacher Net")
summary(teacher_net.to('cuda:0'), input_size=(3, 128, 128))
print("\n\n\nstudent Net")
summary(student_net.to('cuda:0'), input_size=(3, 128, 128))
  • teacher network

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1           [-1, 64, 64, 64]           9,408
           BatchNorm2d-2           [-1, 64, 64, 64]             128
                  ReLU-3           [-1, 64, 64, 64]               0
             MaxPool2d-4           [-1, 64, 32, 32]               0
                Conv2d-5           [-1, 64, 32, 32]          36,864
           BatchNorm2d-6           [-1, 64, 32, 32]             128
                  ReLU-7           [-1, 64, 32, 32]               0
                Conv2d-8           [-1, 64, 32, 32]          36,864
           BatchNorm2d-9           [-1, 64, 32, 32]             128
                 ReLU-10           [-1, 64, 32, 32]               0
           BasicBlock-11           [-1, 64, 32, 32]               0
               Conv2d-12           [-1, 64, 32, 32]          36,864
          BatchNorm2d-13           [-1, 64, 32, 32]             128
                 ReLU-14           [-1, 64, 32, 32]               0
               Conv2d-15           [-1, 64, 32, 32]          36,864
          BatchNorm2d-16           [-1, 64, 32, 32]             128
                 ReLU-17           [-1, 64, 32, 32]               0
           BasicBlock-18           [-1, 64, 32, 32]               0
               Conv2d-19          [-1, 128, 16, 16]          73,728
          BatchNorm2d-20          [-1, 128, 16, 16]             256
                 ReLU-21          [-1, 128, 16, 16]               0
               Conv2d-22          [-1, 128, 16, 16]         147,456
          BatchNorm2d-23          [-1, 128, 16, 16]             256
               Conv2d-24          [-1, 128, 16, 16]           8,192
          BatchNorm2d-25          [-1, 128, 16, 16]             256
                 ReLU-26          [-1, 128, 16, 16]               0
           BasicBlock-27          [-1, 128, 16, 16]               0
               Conv2d-28          [-1, 128, 16, 16]         147,456
          BatchNorm2d-29          [-1, 128, 16, 16]             256
                 ReLU-30          [-1, 128, 16, 16]               0
               Conv2d-31          [-1, 128, 16, 16]         147,456
          BatchNorm2d-32          [-1, 128, 16, 16]             256
                 ReLU-33          [-1, 128, 16, 16]               0
           BasicBlock-34          [-1, 128, 16, 16]               0
               Conv2d-35            [-1, 256, 8, 8]         294,912
          BatchNorm2d-36            [-1, 256, 8, 8]             512
                 ReLU-37            [-1, 256, 8, 8]               0
               Conv2d-38            [-1, 256, 8, 8]         589,824
          BatchNorm2d-39            [-1, 256, 8, 8]             512
               Conv2d-40            [-1, 256, 8, 8]          32,768
          BatchNorm2d-41            [-1, 256, 8, 8]             512
                 ReLU-42            [-1, 256, 8, 8]               0
           BasicBlock-43            [-1, 256, 8, 8]               0
               Conv2d-44            [-1, 256, 8, 8]         589,824
          BatchNorm2d-45            [-1, 256, 8, 8]             512
                 ReLU-46            [-1, 256, 8, 8]               0
               Conv2d-47            [-1, 256, 8, 8]         589,824
          BatchNorm2d-48            [-1, 256, 8, 8]             512
                 ReLU-49            [-1, 256, 8, 8]               0
           BasicBlock-50            [-1, 256, 8, 8]               0
               Conv2d-51            [-1, 512, 4, 4]       1,179,648
          BatchNorm2d-52            [-1, 512, 4, 4]           1,024
                 ReLU-53            [-1, 512, 4, 4]               0
               Conv2d-54            [-1, 512, 4, 4]       2,359,296
          BatchNorm2d-55            [-1, 512, 4, 4]           1,024
               Conv2d-56            [-1, 512, 4, 4]         131,072
          BatchNorm2d-57            [-1, 512, 4, 4]           1,024
                 ReLU-58            [-1, 512, 4, 4]               0
           BasicBlock-59            [-1, 512, 4, 4]               0
               Conv2d-60            [-1, 512, 4, 4]       2,359,296
          BatchNorm2d-61            [-1, 512, 4, 4]           1,024
                 ReLU-62            [-1, 512, 4, 4]               0
               Conv2d-63            [-1, 512, 4, 4]       2,359,296
          BatchNorm2d-64            [-1, 512, 4, 4]           1,024
                 ReLU-65            [-1, 512, 4, 4]               0
           BasicBlock-66            [-1, 512, 4, 4]               0
    AdaptiveAvgPool2d-67            [-1, 512, 1, 1]               0
               Linear-68                   [-1, 11]           5,643
    ================================================================
    Total params: 11,182,155
    Trainable params: 11,182,155
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.19
    Forward/backward pass size (MB): 20.50
    Params size (MB): 42.66
    Estimated Total Size (MB): 63.35
    ----------------------------------------------------------------
    
  • student network

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1         [-1, 16, 128, 128]             448
           BatchNorm2d-2         [-1, 16, 128, 128]              32
                 ReLU6-3         [-1, 16, 128, 128]               0
             MaxPool2d-4           [-1, 16, 64, 64]               0
                Conv2d-5           [-1, 16, 64, 64]             160
           BatchNorm2d-6           [-1, 16, 64, 64]              32
                 ReLU6-7           [-1, 16, 64, 64]               0
                Conv2d-8           [-1, 32, 64, 64]             544
             MaxPool2d-9           [-1, 32, 32, 32]               0
               Conv2d-10           [-1, 32, 32, 32]             320
          BatchNorm2d-11           [-1, 32, 32, 32]              64
                ReLU6-12           [-1, 32, 32, 32]               0
               Conv2d-13           [-1, 64, 32, 32]           2,112
            MaxPool2d-14           [-1, 64, 16, 16]               0
               Conv2d-15           [-1, 64, 16, 16]             640
          BatchNorm2d-16           [-1, 64, 16, 16]             128
                ReLU6-17           [-1, 64, 16, 16]               0
               Conv2d-18          [-1, 128, 16, 16]           8,320
            MaxPool2d-19            [-1, 128, 8, 8]               0
               Conv2d-20            [-1, 128, 8, 8]           1,280
          BatchNorm2d-21            [-1, 128, 8, 8]             256
                ReLU6-22            [-1, 128, 8, 8]               0
               Conv2d-23            [-1, 256, 8, 8]          33,024
               Conv2d-24            [-1, 256, 8, 8]           2,560
          BatchNorm2d-25            [-1, 256, 8, 8]             512
                ReLU6-26            [-1, 256, 8, 8]               0
               Conv2d-27            [-1, 256, 8, 8]          65,792
               Conv2d-28            [-1, 256, 8, 8]           2,560
          BatchNorm2d-29            [-1, 256, 8, 8]             512
                ReLU6-30            [-1, 256, 8, 8]               0
               Conv2d-31            [-1, 256, 8, 8]          65,792
               Conv2d-32            [-1, 256, 8, 8]           2,560
          BatchNorm2d-33            [-1, 256, 8, 8]             512
                ReLU6-34            [-1, 256, 8, 8]               0
               Conv2d-35            [-1, 256, 8, 8]          65,792
    AdaptiveAvgPool2d-36            [-1, 256, 1, 1]               0
               Linear-37                   [-1, 11]           2,827
    ================================================================
    Total params: 256,779
    Trainable params: 256,779
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.19
    Forward/backward pass size (MB): 13.13
    Params size (MB): 0.98
    Estimated Total Size (MB): 14.29
    ----------------------------------------------------------------
    

    From the two outputs above, the student network is much smaller than the teacher network: the teacher network has 11,182,155 parameters in total, while the student network has only 256,779. Since a model with more parameters generally has greater capacity to fit the data, the much larger teacher network can be expected to reach a higher accuracy than the student network.

Methods to improve the performance of the student model

  • From the analysis above, if the network structure may be modified, the student model can be improved by changing its structure, for example by making it deeper.
  • Secondly, a more powerful teacher model can be used for knowledge distillation to reach higher accuracy.
  • It is also possible to train more thoroughly by enlarging the data set.
  • Better results can also be sought by tuning the distillation hyperparameters, such as the temperature and the weight between the soft and hard losses (a sketch of a typical distillation loss follows this list).
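As an illustration of the last point, below is a minimal sketch of a typical distillation loss. The temperature T and the soft/hard weight alpha are the usual hyperparameters to tune; the exact loss used by the course code is not reproduced here, so treat this as a generic example rather than the original implementation:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=10.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions,
    # scaled by T*T so its gradients stay comparable to the hard term.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    # Hard-target term: ordinary cross entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard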

Model pruning exercise

Following the single-module pruning example, prune the bias of conv1 using the L1-unstructured method.

To prune the bias instead of the weight, only the parameter name passed to the pruning call needs to change. The pruning code is basically the same as on the website; only the following parts differ (see 模型剪枝1.py for the complete code):

module = model.conv1
print(module.bias)                                       # bias before pruning
prune.l1_unstructured(module, name="bias", amount=0.3)   # zero the 30% smallest-magnitude entries
print(module.bias)                                       # bias after pruning

Running the above code gives:

Parameter containing:
tensor([-0.2817, -0.0636,  0.0237,  0.2616, -0.3117, -0.0650], device='cuda:0',
       requires_grad=True)
tensor([-0.2817, -0.0000,  0.0000,  0.2616, -0.3117, -0.0650], device='cuda:0',
       grad_fn=<MulBackward0>)

It can be seen that the 2nd and 3rd values are set to 0 (pruned away), while the others remain unchanged.
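As an optional follow-up (a sketch using the same module as above): l1_unstructured only registers a mask (bias_mask) alongside the original tensor (bias_orig), so calling prune.remove is what makes the zeroing permanent:

prune.remove(module, 'bias')              # bake the mask into the parameter
print(list(module.named_parameters()))    # bias is a plain Parameter again, with the zeros kept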

In the practical case, does batch size have an impact on pruning performance? What about other hyperparameters?

Adjust the pruning parameters one at a time and carry out the following experiments. For the relevant code, see 模型剪枝2.py.

  • The impact of batch size

    First, the data set is reduced, and then the batch size is set to 24, 48, and 72 to compare the pruning results (see 模型剪枝2.py for the code). The results are as follows:

    Results after rebuilding the network:

    • The estimated size of the model when batchsize is 72 is 52.85MB
    • The estimated size of the model when the batchsize is 48 is 52.85MB
    • The estimated size of the model when the batchsize is 24 is 52.85MB

    It can be seen that batch size has no effect on the pruning result.

  • The impact of prune_rate

    Set prune_rate to 0.75, 0.85, and 0.95 for the experiment.

    Results after rebuilding the network:

    • The estimated size of the model when prune_rate is 0.75 is 48.90MB
    • The estimated size of the model when prune_rate is 0.85 is 50.61MB
    • The estimated model size when prune_rate is 0.95 is 52.85MB

    It can be found that the smaller the prune_rate, the better the pruning compression effect.

  • Effect of prune_count

    Set prune_count to 1, 2, and 3 for the experiment.

    Results after rebuilding the network:

    • The estimated size of the model when prune_count is 1 is 53.74MB
    • The estimated size of the model when prune_count is 2 is 53.29MB
    • The estimated size of the model when prune_count is 3 is 52.85MB

    It can be seen that the smaller the prune_count, the worse the pruning compression effect (a sketch of one possible iterative pruning loop follows this list).
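模型剪枝2.py itself is not reproduced here, so the following is only a rough sketch of how prune_rate and prune_count could interact, assuming each of the prune_count rounds rebuilds the student network with channel widths scaled by prune_rate (the helper name is hypothetical):

def iterative_channel_prune(prune_rate=0.95, prune_count=3, base=16):
    model = StudentNet(base=base)
    for i in range(prune_count):
        width = prune_rate ** (i + 1)        # overall fraction of channels kept so far
        pruned = StudentNet(base=base, width_mult=width)
        # In the real exercise the surviving weights would be copied over and the
        # smaller network fine-tuned here; this sketch only tracks the sizes.
        print(f"round {i + 1}: width_mult={width:.3f}, "
              f"{sum(p.numel() for p in pruned.parameters()):,} parameters")
        model = pruned
    return model

Under this reading, a smaller prune_rate removes more channels per round and more rounds (a larger prune_count) compound the reduction, which is consistent with the measured model sizes above.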

Parameter quantization exercise

Consult the PyTorch reference documentation, practice other quantization methods, and compare their performance.

After consulting the PyTorch documentation, I found that PyTorch provides an API called Eager Mode Quantization. It offers three quantization modes; here I used its dynamic quantization and static quantization functions. Next, I will use this API to quantize the student network model.

Dynamic Model Quantization

According to the official documentation, dynamic quantization is relatively simple: it only requires specifying the model, the layers to quantize, and the quantization dtype. However, dynamic quantization generally works only on linear and LSTM layers, not on convolutional layers. Since student_net consists mostly of convolutional layers, the effect of dynamic quantization is expected to be limited. The code implementation is in 动态量化.py; only the snippets not already shown on the website are reproduced below:

  • load model

    student_net_fp32 = StudentNet(base=16)
    device = "cpu"
    student_net_fp32.load_state_dict(torch.load(f'./student_custom_small.bin'))
    print('Model Loaded')
    
  • Model Dynamic Quantization

    student_net_int8 = torch.quantization.quantize_dynamic(
        student_net_fp32,            # original fp32 model
        {torch.nn.Linear},           # dynamically quantize only the Linear layers
        dtype=torch.qint8)           # weights quantized to int8
    
  • Validation set loading and model time efficiency evaluation

    valid_dataloader = data_load()
    student_net_fp32.eval()
    student_net_int8.eval()
    fp32_st = time.time()
    valid_loss_fp32 = run_test_epoch(valid_dataloader, student_net_fp32)
    fp32_time = time.time() - fp32_st
    int8_st = time.time()
    valid_loss_int8 = run_test_epoch(valid_dataloader, student_net_int8)
    int8_time = time.time() - int8_st
    print("valid_loss_fp32:",valid_loss_fp32,",time:",fp32_time)
    print("valid_loss_int8:",valid_loss_int8,",time:",int8_time)
    
  • Model size comparison (code reference: https://github.com/pytorch/tutorials/blob/master/recipes_source/recipes/dynamic_quantization.py)

    def print_size_of_model(model, label=""):
        torch.save(model.state_dict(), "temp.p")
        size=os.path.getsize("temp.p")
        print("model: ",label,' \t','Size (KB):', size/1e3)
        os.remove('temp.p')
        return size
    
    # Compare model sizes
    f=print_size_of_model(student_net_fp32,"fp32")
    q=print_size_of_model(student_net_int8,"int8")
    print("{0:.2f} times smaller".format(f/q))
    

The final running results are as follows:

[Screenshot of the dynamic quantization results (image not available); the numbers are summarized below.]

It can be seen that dynamic quantization does not help much here. In terms of inference time, the int8 model is actually slower than the fp32 original, so the quantized result is worse than before quantization, while the accuracy of the two is basically the same. As for the final model size, the quantized model has almost no advantage: it is 1045 KB versus 1053 KB before quantization, which is hardly a difference.
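A quick way to see why (a sketch): printing the converted model shows which modules were actually replaced; only the final nn.Linear becomes a dynamically quantized module, while every Conv2d stays in fp32, which is why the file size barely moves.

print(student_net_int8)   # only the Linear classifier appears as a dynamically quantized module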

Static model quantization

Static model quantization is somewhat more complicated than dynamic quantization. Like dynamic quantization, it converts the network's weights from float32 to int8, but there is a big difference: static quantization needs the training set (or data with a similar distribution) to be fed through the model first, so that the quantization parameters of the activations can be calculated from the input distribution of each op. Static quantization is better suited to convolutional neural networks, and the student_net used in this experiment is a convolutional network, so static quantization should work better here. The code implementation is in 静态量化.py and is largely the same as the dynamic-quantization code; the quantization-specific part is shown below:

valid_dataloader = data_load()
student_net_fp32.eval()
student_net_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
student_net_fp32_prepared = torch.quantization.prepare(student_net_fp32)
# First feed some data through the prepared model for calibration
for batch_data in tqdm(valid_dataloader):
    # Fetch the data
    inputs, hard_labels = batch_data
    # Calibration/validation only, so no gradients are needed
    with torch.no_grad():
        student_net_fp32_prepared(inputs.to(device))

student_net_int8 = torch.quantization.convert(student_net_fp32_prepared)

In addition to configuring the quantization itself, the network structure also needs to be modified: quantization and dequantization stubs must be defined at initialization and inserted into forward, as follows:

class StudentNet(nn.Module):
    def __init__(self, base=16, width_mult=1):
        super(StudentNet, self).__init__()
        multiplier = [1, 2, 4, 8, 16, 16, 16, 16]
        bandwidth = [base * m for m in multiplier]  # number of output channels of each layer
        for i in range(3, 7):  # prune layers 3/4/5/6
            bandwidth[i] = int(bandwidth[i] * width_mult)
        self.cnn = nn.Sequential(...)
        # Map the CNN output directly to 11 dimensions as the final output
        self.fc = nn.Sequential(
            nn.Linear(bandwidth[7], 11)
        )
        self.quant = torch.quantization.QuantStub()      # quantize inputs at the start of forward
        self.dequant = torch.quantization.DeQuantStub()  # dequantize outputs at the end of forward

    def forward(self, x):
        x = self.quant(x)
        x = self.cnn(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.dequant(x)
        return x

Finally, inference is run on the validation set, giving the following results:

[Screenshot of the static quantization results (image not available); the numbers are summarized below.]

It can be seen that the accuracy after quantization drops sharply, to roughly half of the accuracy before quantization, although the running time after quantization is slightly better. The biggest gain is model size: the quantized model is nearly three times smaller than the unquantized one. However, even with a model three times smaller, the accuracy is too low for the model to be usable. The accuracy problem may be related to the network structure.
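One commonly suggested remedy for such accuracy drops in static quantization is to fuse Conv2d + BatchNorm2d pairs before calling prepare(). Whether it would help here is untested; the sketch below (the fuse_conv_bn helper is an assumption, not part of the original code) shows how such fusion could be applied to StudentNet's sequential blocks:

import torch
from torch import nn

def fuse_conv_bn(model):
    # Fuse every adjacent Conv2d + BatchNorm2d pair inside each nn.Sequential.
    # Fusion must be done on a model in eval() mode, before torch.quantization.prepare().
    for _, seq in model.named_modules():
        if not isinstance(seq, nn.Sequential):
            continue
        names = [n for n, _ in seq.named_children()]
        for a, b in zip(names, names[1:]):
            if isinstance(getattr(seq, a), nn.Conv2d) and isinstance(getattr(seq, b), nn.BatchNorm2d):
                torch.quantization.fuse_modules(seq, [[a, b]], inplace=True)
    return model

# student_net_fp32.eval()
# fuse_conv_bn(student_net_fp32)   # then set qconfig, prepare, calibrate, and convert as before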

Formula detection model compression optimization

The formula detection model is trained with yolo, using yolov5s as the pre-trained model. After training, there is still plenty of room for optimization: for example, the model can be compressed by pruning, quantization, etc. to save storage space, and inference can be accelerated as well. Although such compression reduces the accuracy of the model somewhat, the savings in storage and the gain in inference throughput are usually worth it. Below, the formula detection model is compressed and optimized with model pruning and quantization.

Model size and speed before optimization

This formula detection model has already been trained, yielding the weight file equation.pt. First, check the model's inference performance on the validation set with the val.py script that ships with yolo. After entering the yolov5 folder, run the following command to evaluate:

python val.py --weights ../equation.pt --data equation.yaml --img 640

[Screenshot of the val.py evaluation output before optimization (image not available); the metrics are summarized below.]

Its precision, recall, and mAP50 are 0.997, 0.999, and 0.994 respectively, with a preprocessing time of 1.8 ms per image and an inference time of 227.9 ms per image.

Formula detection model pruning

  • The pruning method provided by yolo

    According to yolo's documentation (https://github.com/ultralytics/yolov5/issues/304), a simple form of pruning can be achieved by inserting model pruning statements into val.py.

    The following code needs to be added around line 156 of the yolo source file val.py.

    [Screenshot of the insertion point in val.py (image not available); the added code is shown below.]

    # prune
    from utils.torch_utils import prune
    prune(model, 0.3)
    

    Looking into this code, it turns out that yolo already ships a pruning utility in the `torch_utils.py` file under the `utils` folder; the specific code is as follows:

    def prune(model, amount=0.3):
        # Prune model to requested global sparsity
        import torch.nn.utils.prune as prune
        for name, m in model.named_modules():
            if isinstance(m, nn.Conv2d):
                prune.l1_unstructured(m, name='weight', amount=amount)  # prune
                prune.remove(m, 'weight')  # make permanent
        LOGGER.info(f'Model pruned to {sparsity(model):.3g} global sparsity')
    

    It can be seen that it uses PyTorch's pruning API, applying 30% L1-unstructured pruning by default to every Conv2d layer.

    After embedding the above code in val.py, start inference on the validation set and see the changes. The result is as follows:

    [Screenshot of the val.py output after pruning (image not available); the changes are summarized below.]

    Its precision, recall, and mAP50 values have all decreased slightly, while the running time has barely changed. Checking the yolo github issue afterwards, other users report similar results: pruning this way has essentially no effect, and the model size is not compressed (a sparsity-check sketch follows this list).

  • Alternative pruning methods

    In addition to the pruning method provided by yolo, another pruning method was found on the Internet (https://github.com/ZJU-lishuang/yolov5_prune), so this method was also tried.

    However, after testing, this method turned out to be immature: even after working through many errors it could not be made to run, so this pruning approach was ultimately abandoned.
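To back up the guess that yolo's prune() only zeroes weights rather than removing them, a small check such as the sketch below can be used (the global_sparsity helper is an assumption, not part of the yolo code). It measures the fraction of zero conv weights, which should rise to roughly 30% after prune(model, 0.3), while the parameter count and the file size stay the same:

import torch.nn as nn

def global_sparsity(model):
    zeros, total = 0, 0
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            zeros += int((m.weight == 0).sum())
            total += m.weight.numel()
    return zeros / total

# print(f"{global_sparsity(model):.1%} of conv weights are zero after pruning")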

Formula detection model quantization

I tried to find a quantization method for yolo's detection model but could not. I eventually found the corresponding yolov5 quantization issue on github, which was opened in 2020 and was still unresolved in 2022; the yolo author stated that yolo running on the CPU cannot perform int8 quantization, so quantization of this model was abandoned.

https://github.com/ultralytics/yolov5/issues/1288

Formula recognition model compression optimization

For the formula recognition model, I use model quantization for compression and optimization. The recognition model here is a text recognition model I had previously set aside: it is provided directly by easyocr. I later switched to paddleocr for training and therefore abandoned easyocr, but since the paddleocr model has not been trained yet, the easyocr model is used as the subject of this compression experiment.

For the compression itself, I chose quantization, converting the 32-bit floating-point weights to int8 to reduce storage and potentially speed up inference. Here it is only used to compress the model; inference is not evaluated.

  • Model download

    The model is downloaded from the author's github: https://github.com/JaidedAI/EasyOCR/blob/master/custom_model.md

    After downloading, there are three files, custom_example.pth, custom_example.py, and custom_example.yaml: the weight file, the network definition, and the configuration file respectively. Only the first two are used here.

    [Screenshot of the downloaded custom_example files (image not available)]

  • model loading

    To load the model, you can edit custom_example.py directly,

    adding the following code on top of the original source to load the weights:

    # Load the model
    model = Model(input_channel=1, output_channel=256, hidden_size=256, num_class=97)
    dic = torch.load(f'./custom_example.pth')
    model.load_state_dict(dic, False)   # strict=False: ignore keys that do not match
    
  • dynamic quantization

    This neural network contains LSTM layers, so dynamic quantization is suitable here. The dynamic quantization code is as follows:

    model_int8 = torch.quantization.quantize_dynamic(
        model,                               # original fp32 model
        {torch.nn.Linear, torch.nn.LSTM},    # dynamically quantize Linear and LSTM layers
        dtype=torch.qint8)
    
  • Quantitative results before and after comparison

    For the quantization results, only the model size is compared here; a function is defined to obtain the size of each model. The code is as follows:

    def print_size_of_model(model, label=""):
        torch.save(model.state_dict(), "temp.p")
        size=os.path.getsize("temp.p")
        print("model: ",label,' \t','Size (KB):', size/1e3)
        os.remove('temp.p')
        return size
    
    # Compare model sizes
    f=print_size_of_model(model,"fp32")
    q=print_size_of_model(model_int8,"int8")
    

    The output results are as follows:

    [Screenshot of the model size comparison output (image not available)]

    It can be seen that the model is compressed to about half (1/2) of its original size, indicating that dynamic quantization has optimized the model to a certain extent.

Experimental results

In this experiment, the exercises in the basic requirements were completed, including exploring how changing the number of channels in the architecture design affects the network, why the student network does not match the teacher network, the impact of the pruning parameters, and the implementation of PyTorch quantization methods. In the architecture design exercise, the model size was found to be positively correlated with the base and width_mult values. In the knowledge distillation exercise, the factor limiting further improvement of the student network is likely its structure: the teacher network is deeper and performs better. In the quantization exercise, dynamic and static quantization were reproduced, showing that dynamic quantization is better suited to networks with linear and LSTM layers, while static quantization is better suited to convolutional networks.
Beyond the basic requirements, I also tried compression optimization on the formula detection model and the formula content recognition model. For the detection model, since yolo is used for training and inference, yolo's pruning interface was used directly for compression; however, the result was poor: the accuracy dropped slightly while neither the model size nor the inference speed changed. This is probably because the pruning interface only sets parameters to 0 instead of removing them, and it may also be related to running on a CPU. For the content recognition model, I used the easyocr model and applied dynamic quantization to convert its parameters from 32-bit floating point to 8-bit integers; in the end the model size was roughly halved, verifying that the optimization works.

Origin blog.csdn.net/weixin_51735061/article/details/132010835