Series of experiments
Deep learning practice - Convolutional neural network practice: Crack identification
Deep learning practice - Recurrent neural network practice
Deep learning practice - Model deployment optimization practice
Deep learning practice - Model inference optimization practice
Deep Learning Practice - Model Inference Optimization Exercise
Source address: https://pan.baidu.com/s/1PuWZF2DkG0-F5pQLMIkTcQ?pwd=c24s
Model inference optimization exercise
Architecture Design Exercise
By modifying the code, explore how each StudentNet parameter affects the model's parameter count.
In architecture design, compression is achieved mainly by reducing the number of parameters in the neural network. Here the model is compressed by lowering the number of channels, either directly or by pruning channels. In the source code provided on the website, the model exposes two parameters for adjusting the channels. The first is base, which directly sets the initial number of channels. The second is width_mult, the pruning control factor; a value of 1 means no pruning. Number of channels after pruning = number of channels before pruning * width_mult.
From these definitions we can expect that the smaller base is, the smaller the model becomes, and likewise that the smaller width_mult is, the smaller the model becomes. Let's verify this hypothesis by modifying the code.
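The channel rule described above can be sketched in plain Python. The multiplier list and the range of pruned layers (3 to 6) are taken from the StudentNet definition shown later in this post; the helper name channel_widths is only illustrative and is not part of the original code.

```python
def channel_widths(base=16, width_mult=1.0):
    """Return the output channel count of each of the 8 StudentNet blocks."""
    multiplier = [1, 2, 4, 8, 16, 16, 16, 16]
    widths = [base * m for m in multiplier]
    # Only blocks 3..6 are scaled by width_mult; the first three blocks
    # and the last block keep the width given by base.
    for i in range(3, 7):
        widths[i] = int(widths[i] * width_mult)
    return widths


print(channel_widths())                # default widths
print(channel_widths(base=12))         # a smaller base shrinks every block
print(channel_widths(width_mult=0.8))  # width_mult shrinks blocks 3..6 only
```

Because base rescales every block while width_mult touches only four of the eight, base is expected to have the larger effect on model size.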
- Default parameter output
First, print the layers and parameter counts of the network with the default values. The main code is as follows; the complete code can be found in 架构设计练习.py.
model_default = StudentNet()
model_default.eval()
summary(model_default.to('cuda:0'), input_size=(3, 128, 128))
The output of the above code is as follows:
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 16, 128, 128]             448
       BatchNorm2d-2         [-1, 16, 128, 128]              32
             ReLU6-3         [-1, 16, 128, 128]               0
         MaxPool2d-4           [-1, 16, 64, 64]               0
            Conv2d-5           [-1, 16, 64, 64]             160
       BatchNorm2d-6           [-1, 16, 64, 64]              32
             ReLU6-7           [-1, 16, 64, 64]               0
            Conv2d-8           [-1, 32, 64, 64]             544
         MaxPool2d-9           [-1, 32, 32, 32]               0
           Conv2d-10           [-1, 32, 32, 32]             320
      BatchNorm2d-11           [-1, 32, 32, 32]              64
            ReLU6-12           [-1, 32, 32, 32]               0
           Conv2d-13           [-1, 64, 32, 32]           2,112
        MaxPool2d-14           [-1, 64, 16, 16]               0
           Conv2d-15           [-1, 64, 16, 16]             640
      BatchNorm2d-16           [-1, 64, 16, 16]             128
            ReLU6-17           [-1, 64, 16, 16]               0
           Conv2d-18          [-1, 128, 16, 16]           8,320
        MaxPool2d-19            [-1, 128, 8, 8]               0
           Conv2d-20            [-1, 128, 8, 8]           1,280
      BatchNorm2d-21            [-1, 128, 8, 8]             256
            ReLU6-22            [-1, 128, 8, 8]               0
           Conv2d-23            [-1, 256, 8, 8]          33,024
           Conv2d-24            [-1, 256, 8, 8]           2,560
      BatchNorm2d-25            [-1, 256, 8, 8]             512
            ReLU6-26            [-1, 256, 8, 8]               0
           Conv2d-27            [-1, 256, 8, 8]          65,792
           Conv2d-28            [-1, 256, 8, 8]           2,560
      BatchNorm2d-29            [-1, 256, 8, 8]             512
            ReLU6-30            [-1, 256, 8, 8]               0
           Conv2d-31            [-1, 256, 8, 8]          65,792
           Conv2d-32            [-1, 256, 8, 8]           2,560
      BatchNorm2d-33            [-1, 256, 8, 8]             512
            ReLU6-34            [-1, 256, 8, 8]               0
           Conv2d-35            [-1, 256, 8, 8]          65,792
AdaptiveAvgPool2d-36            [-1, 256, 1, 1]               0
           Linear-37                   [-1, 11]           2,827
================================================================
Total params: 256,779
Trainable params: 256,779
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 13.13
Params size (MB): 0.98
Estimated Total Size (MB): 14.29
----------------------------------------------------------------
- Result of lowering the base value
model_base12 = StudentNet(base=12)
model_base12.eval()
summary(model_base12.to('cuda:0'), input_size=(3, 128, 128))
The result is as follows:
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 12, 128, 128]             336
       BatchNorm2d-2         [-1, 12, 128, 128]              24
             ReLU6-3         [-1, 12, 128, 128]               0
         MaxPool2d-4           [-1, 12, 64, 64]               0
            Conv2d-5           [-1, 12, 64, 64]             120
       BatchNorm2d-6           [-1, 12, 64, 64]              24
             ReLU6-7           [-1, 12, 64, 64]               0
            Conv2d-8           [-1, 24, 64, 64]             312
         MaxPool2d-9           [-1, 24, 32, 32]               0
           Conv2d-10           [-1, 24, 32, 32]             240
      BatchNorm2d-11           [-1, 24, 32, 32]              48
            ReLU6-12           [-1, 24, 32, 32]               0
           Conv2d-13           [-1, 48, 32, 32]           1,200
        MaxPool2d-14           [-1, 48, 16, 16]               0
           Conv2d-15           [-1, 48, 16, 16]             480
      BatchNorm2d-16           [-1, 48, 16, 16]              96
            ReLU6-17           [-1, 48, 16, 16]               0
           Conv2d-18           [-1, 96, 16, 16]           4,704
        MaxPool2d-19             [-1, 96, 8, 8]               0
           Conv2d-20             [-1, 96, 8, 8]             960
      BatchNorm2d-21             [-1, 96, 8, 8]             192
            ReLU6-22             [-1, 96, 8, 8]               0
           Conv2d-23            [-1, 192, 8, 8]          18,624
           Conv2d-24            [-1, 192, 8, 8]           1,920
      BatchNorm2d-25            [-1, 192, 8, 8]             384
            ReLU6-26            [-1, 192, 8, 8]               0
           Conv2d-27            [-1, 192, 8, 8]          37,056
           Conv2d-28            [-1, 192, 8, 8]           1,920
      BatchNorm2d-29            [-1, 192, 8, 8]             384
            ReLU6-30            [-1, 192, 8, 8]               0
           Conv2d-31            [-1, 192, 8, 8]          37,056
           Conv2d-32            [-1, 192, 8, 8]           1,920
      BatchNorm2d-33            [-1, 192, 8, 8]             384
            ReLU6-34            [-1, 192, 8, 8]               0
           Conv2d-35            [-1, 192, 8, 8]          37,056
AdaptiveAvgPool2d-36            [-1, 192, 1, 1]               0
           Linear-37                   [-1, 11]           2,123
================================================================
Total params: 147,563
Trainable params: 147,563
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 9.85
Params size (MB): 0.56
Estimated Total Size (MB): 10.60
----------------------------------------------------------------
Compared with the default, the number of channels and parameters in each layer decreases, so the model is compressed. The base value was then lowered step by step, and the following figure was drawn with model size as the dependent variable and base as the independent variable.
It can be seen that the size of the model is roughly proportional to the base value. In the two runs above, for example, lowering base from 16 to 12 reduces the estimated total size from 14.29 MB to 10.60 MB, while the parameter count falls faster, from 256,779 to 147,563.
- Result of lowering the width_mult value
model_mul0_8 = StudentNet(width_mult=0.8)
model_mul0_8.eval()
summary(model_mul0_8.to('cuda:0'), input_size=(3, 128, 128))
The result is as follows:
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 16, 128, 128]             448
       BatchNorm2d-2         [-1, 16, 128, 128]              32
             ReLU6-3         [-1, 16, 128, 128]               0
         MaxPool2d-4           [-1, 16, 64, 64]               0
            Conv2d-5           [-1, 16, 64, 64]             160
       BatchNorm2d-6           [-1, 16, 64, 64]              32
             ReLU6-7           [-1, 16, 64, 64]               0
            Conv2d-8           [-1, 32, 64, 64]             544
         MaxPool2d-9           [-1, 32, 32, 32]               0
           Conv2d-10           [-1, 32, 32, 32]             320
      BatchNorm2d-11           [-1, 32, 32, 32]              64
            ReLU6-12           [-1, 32, 32, 32]               0
           Conv2d-13           [-1, 64, 32, 32]           2,112
        MaxPool2d-14           [-1, 64, 16, 16]               0
           Conv2d-15           [-1, 64, 16, 16]             640
      BatchNorm2d-16           [-1, 64, 16, 16]             128
            ReLU6-17           [-1, 64, 16, 16]               0
           Conv2d-18          [-1, 102, 16, 16]           6,630
        MaxPool2d-19            [-1, 102, 8, 8]               0
           Conv2d-20            [-1, 102, 8, 8]           1,020
      BatchNorm2d-21            [-1, 102, 8, 8]             204
            ReLU6-22            [-1, 102, 8, 8]               0
           Conv2d-23            [-1, 204, 8, 8]          21,012
           Conv2d-24            [-1, 204, 8, 8]           2,040
      BatchNorm2d-25            [-1, 204, 8, 8]             408
            ReLU6-26            [-1, 204, 8, 8]               0
           Conv2d-27            [-1, 204, 8, 8]          41,820
           Conv2d-28            [-1, 204, 8, 8]           2,040
      BatchNorm2d-29            [-1, 204, 8, 8]             408
            ReLU6-30            [-1, 204, 8, 8]               0
           Conv2d-31            [-1, 204, 8, 8]          41,820
           Conv2d-32            [-1, 204, 8, 8]           2,040
      BatchNorm2d-33            [-1, 204, 8, 8]             408
            ReLU6-34            [-1, 204, 8, 8]               0
           Conv2d-35            [-1, 256, 8, 8]          52,480
AdaptiveAvgPool2d-36            [-1, 256, 1, 1]               0
           Linear-37                   [-1, 11]           2,827
================================================================
Total params: 179,637
Trainable params: 179,637
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 12.72
Params size (MB): 0.69
Estimated Total Size (MB): 13.59
----------------------------------------------------------------
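It can be seen that the size of the model is also positively correlated with the width_mult value, but the achievable compression is more limited than with base, because width_mult only changes the channel counts of some layers (from Conv2d-18 onward in the summary above), whereas base changes every layer.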
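The totals in the three summaries above can be checked by hand. The sketch below counts parameters for a network with the layer layout that can be read off those summaries: one ordinary 3x3 convolution with BatchNorm, followed by seven blocks of depthwise 3x3 convolution, BatchNorm, and pointwise 1x1 convolution, and a final linear layer. The function name count_params is illustrative, and the layout is inferred from the printed tables rather than from the StudentNet source.

```python
def count_params(widths, in_channels=3, num_classes=11):
    """Count trainable parameters; widths[i] is the output width of block i."""
    # Block 0: ordinary 3x3 convolution (weights + bias), then BatchNorm.
    total = in_channels * widths[0] * 9 + widths[0]
    total += 2 * widths[0]
    # Blocks 1..7: depthwise 3x3 conv, BatchNorm, pointwise 1x1 conv.
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        total += c_in * 9 + c_in        # depthwise conv weights + bias
        total += 2 * c_in               # BatchNorm scale and shift
        total += c_in * c_out + c_out   # pointwise conv weights + bias
    # Final linear classifier.
    total += widths[-1] * num_classes + num_classes
    return total


default = count_params([16, 32, 64, 128, 256, 256, 256, 256])
base12 = count_params([12, 24, 48, 96, 192, 192, 192, 192])
mult08 = count_params([16, 32, 64, 102, 204, 204, 204, 256])
print(default, base12, mult08)
print(base12 / default, mult08 / default)
```

The three counts match the "Total params" lines printed by summary. Relative to the default, base=12 keeps roughly 0.57 of the parameters and width_mult=0.8 roughly 0.70, mostly because the pointwise convolutions scale with the product of their input and output widths.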
Knowledge distillation exercise
The case shows that the distilled student model performs much worse than the pre-trained teacher model. Please analyze the reasons and explore ways to further improve the performance of the student model.
Reason:
In the case on the website, the student network has been trained for many epochs and should in theory approach the accuracy of the teacher network, yet the results show it is still much worse. There are two major differences between the student network and the teacher network. First, the teacher network has been fully trained, while the student network starts out untrained. Second, the structures of the two networks are different.
The first difference can be eliminated by sufficient training through knowledge distillation, but the second cannot. Therefore, one reason the student does not match the teacher is likely its network structure. The structures of the teacher network and the student network are printed for comparison with the following code (see 知识蒸馏.py for the complete code).
teacher_net = models.resnet18(pretrained=False, num_classes=11)
teacher_net.load_state_dict(torch.load(f'./teacher_resnet18.bin'))
student_net = StudentNet(base=16)
print("teacher Net")
summary(teacher_net.to('cuda:0'), input_size=(3, 128, 128))
print("\n\n\nstudent Net")
summary(student_net.to('cuda:0'), input_size=(3, 128, 128))
- Teacher network
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 64, 64, 64]           9,408
       BatchNorm2d-2           [-1, 64, 64, 64]             128
              ReLU-3           [-1, 64, 64, 64]               0
         MaxPool2d-4           [-1, 64, 32, 32]               0
            Conv2d-5           [-1, 64, 32, 32]          36,864
       BatchNorm2d-6           [-1, 64, 32, 32]             128
              ReLU-7           [-1, 64, 32, 32]               0
            Conv2d-8           [-1, 64, 32, 32]          36,864
       BatchNorm2d-9           [-1, 64, 32, 32]             128
             ReLU-10           [-1, 64, 32, 32]               0
       BasicBlock-11           [-1, 64, 32, 32]               0
           Conv2d-12           [-1, 64, 32, 32]          36,864
      BatchNorm2d-13           [-1, 64, 32, 32]             128
             ReLU-14           [-1, 64, 32, 32]               0
           Conv2d-15           [-1, 64, 32, 32]          36,864
      BatchNorm2d-16           [-1, 64, 32, 32]             128
             ReLU-17           [-1, 64, 32, 32]               0
       BasicBlock-18           [-1, 64, 32, 32]               0
           Conv2d-19          [-1, 128, 16, 16]          73,728
      BatchNorm2d-20          [-1, 128, 16, 16]             256
             ReLU-21          [-1, 128, 16, 16]               0
           Conv2d-22          [-1, 128, 16, 16]         147,456
      BatchNorm2d-23          [-1, 128, 16, 16]             256
           Conv2d-24          [-1, 128, 16, 16]           8,192
      BatchNorm2d-25          [-1, 128, 16, 16]             256
             ReLU-26          [-1, 128, 16, 16]               0
       BasicBlock-27          [-1, 128, 16, 16]               0
           Conv2d-28          [-1, 128, 16, 16]         147,456
      BatchNorm2d-29          [-1, 128, 16, 16]             256
             ReLU-30          [-1, 128, 16, 16]               0
           Conv2d-31          [-1, 128, 16, 16]         147,456
      BatchNorm2d-32          [-1, 128, 16, 16]             256
             ReLU-33          [-1, 128, 16, 16]               0
       BasicBlock-34          [-1, 128, 16, 16]               0
           Conv2d-35            [-1, 256, 8, 8]         294,912
      BatchNorm2d-36            [-1, 256, 8, 8]             512
             ReLU-37            [-1, 256, 8, 8]               0
           Conv2d-38            [-1, 256, 8, 8]         589,824
      BatchNorm2d-39            [-1, 256, 8, 8]             512
           Conv2d-40            [-1, 256, 8, 8]          32,768
      BatchNorm2d-41            [-1, 256, 8, 8]             512
             ReLU-42            [-1, 256, 8, 8]               0
       BasicBlock-43            [-1, 256, 8, 8]               0
           Conv2d-44            [-1, 256, 8, 8]         589,824
      BatchNorm2d-45            [-1, 256, 8, 8]             512
             ReLU-46            [-1, 256, 8, 8]               0
           Conv2d-47            [-1, 256, 8, 8]         589,824
      BatchNorm2d-48            [-1, 256, 8, 8]             512
             ReLU-49            [-1, 256, 8, 8]               0
       BasicBlock-50            [-1, 256, 8, 8]               0
           Conv2d-51            [-1, 512, 4, 4]       1,179,648
      BatchNorm2d-52            [-1, 512, 4, 4]           1,024
             ReLU-53            [-1, 512, 4, 4]               0
           Conv2d-54            [-1, 512, 4, 4]       2,359,296
      BatchNorm2d-55            [-1, 512, 4, 4]           1,024
           Conv2d-56            [-1, 512, 4, 4]         131,072
      BatchNorm2d-57            [-1, 512, 4, 4]           1,024
             ReLU-58            [-1, 512, 4, 4]               0
       BasicBlock-59            [-1, 512, 4, 4]               0
           Conv2d-60            [-1, 512, 4, 4]       2,359,296
      BatchNorm2d-61            [-1, 512, 4, 4]           1,024
             ReLU-62            [-1, 512, 4, 4]               0
           Conv2d-63            [-1, 512, 4, 4]       2,359,296
      BatchNorm2d-64            [-1, 512, 4, 4]           1,024
             ReLU-65            [-1, 512, 4, 4]               0
       BasicBlock-66            [-1, 512, 4, 4]               0
AdaptiveAvgPool2d-67            [-1, 512, 1, 1]               0
           Linear-68                   [-1, 11]           5,643
================================================================
Total params: 11,182,155
Trainable params: 11,182,155
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 20.50
Params size (MB): 42.66
Estimated Total Size (MB): 63.35
----------------------------------------------------------------
- Student network
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 16, 128, 128]             448
       BatchNorm2d-2         [-1, 16, 128, 128]              32
             ReLU6-3         [-1, 16, 128, 128]               0
         MaxPool2d-4           [-1, 16, 64, 64]               0
            Conv2d-5           [-1, 16, 64, 64]             160
       BatchNorm2d-6           [-1, 16, 64, 64]              32
             ReLU6-7           [-1, 16, 64, 64]               0
            Conv2d-8           [-1, 32, 64, 64]             544
         MaxPool2d-9           [-1, 32, 32, 32]               0
           Conv2d-10           [-1, 32, 32, 32]             320
      BatchNorm2d-11           [-1, 32, 32, 32]              64
            ReLU6-12           [-1, 32, 32, 32]               0
           Conv2d-13           [-1, 64, 32, 32]           2,112
        MaxPool2d-14           [-1, 64, 16, 16]               0
           Conv2d-15           [-1, 64, 16, 16]             640
      BatchNorm2d-16           [-1, 64, 16, 16]             128
            ReLU6-17           [-1, 64, 16, 16]               0
           Conv2d-18          [-1, 128, 16, 16]           8,320
        MaxPool2d-19            [-1, 128, 8, 8]               0
           Conv2d-20            [-1, 128, 8, 8]           1,280
      BatchNorm2d-21            [-1, 128, 8, 8]             256
            ReLU6-22            [-1, 128, 8, 8]               0
           Conv2d-23            [-1, 256, 8, 8]          33,024
           Conv2d-24            [-1, 256, 8, 8]           2,560
      BatchNorm2d-25            [-1, 256, 8, 8]             512
            ReLU6-26            [-1, 256, 8, 8]               0
           Conv2d-27            [-1, 256, 8, 8]          65,792
           Conv2d-28            [-1, 256, 8, 8]           2,560
      BatchNorm2d-29            [-1, 256, 8, 8]             512
            ReLU6-30            [-1, 256, 8, 8]               0
           Conv2d-31            [-1, 256, 8, 8]          65,792
           Conv2d-32            [-1, 256, 8, 8]           2,560
      BatchNorm2d-33            [-1, 256, 8, 8]             512
            ReLU6-34            [-1, 256, 8, 8]               0
           Conv2d-35            [-1, 256, 8, 8]          65,792
AdaptiveAvgPool2d-36            [-1, 256, 1, 1]               0
           Linear-37                   [-1, 11]           2,827
================================================================
Total params: 256,779
Trainable params: 256,779
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 13.13
Params size (MB): 0.98
Estimated Total Size (MB): 14.29
----------------------------------------------------------------
The two outputs show that the student network is much smaller than the teacher network: the teacher has 11,182,155 parameters, while the student has only 256,779, roughly 44 times fewer. A model with more parameters generally has greater capacity and can fit the data better, so the teacher network is expected to reach a higher accuracy than the student network.
Methods to improve the performance of the student model
- As the analysis above suggests, if the network structure may be modified, the student can be improved by changing its structure.
- A more powerful teacher model can be used for knowledge distillation to achieve higher accuracy.
- More thorough training is possible by enlarging the data set.
- Better results may be found by tuning the hyperparameters.
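For the last point, two hyperparameters that are commonly tuned in distillation are the temperature T and the weight alpha between the soft (teacher) loss and the hard (label) loss. The sketch below computes such a loss for a single sample in plain Python so that it runs without PyTorch. It assumes the usual Hinton-style formulation; the loss used in the case on the website may differ, and the default values T=20.0 and alpha=0.5 are placeholders, not values taken from the case.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, T=20.0, alpha=0.5):
    """(1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
    # Hard loss: cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL divergence between the softened distributions.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return (1 - alpha) * hard + alpha * T * T * soft


teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher, label=2))  # student agrees
print(distillation_loss([3.0, 2.0, 1.0], teacher, label=2))  # student disagrees
```

A larger T softens both distributions so that the teacher's relative preferences among wrong classes carry more weight; alpha controls how much the student follows the teacher versus the ground-truth labels.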
Model pruning exercise
For the example in the single-module pruning section, prune the bias of conv1 with the L1 unstructured method.
To prune the bias, only the name argument needs to be changed from weight to bias. The pruning code is otherwise the same as on the website; only the following part changes (see 模型剪枝1.py for the complete code):
module = model.conv1
print(module.bias)
prune.l1_unstructured(module,name="bias",amount=0.3)
print(module.bias)
Running the above code produces:
Parameter containing:
tensor([-0.2817, -0.0636, 0.0237, 0.2616, -0.3117, -0.0650], device='cuda:0',
requires_grad=True)
tensor([-0.2817, -0.0000, 0.0000, 0.2616, -0.3117, -0.0650], device='cuda:0',
grad_fn=<MulBackward0>)
It can be seen that the 2nd and 3rd values, the two with the smallest absolute values, have been set to 0 (pruned), while the others remain unchanged.
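The selection rule behind this output can be reproduced in a few lines of plain Python. This is a simplified sketch of L1 unstructured pruning, not PyTorch's implementation: it assumes the number of pruned entries is round(amount * n) and that the entries with the smallest absolute values are zeroed. The function name is illustrative.

```python
def l1_unstructured(values, amount):
    """Zero the round(amount * n) entries with the smallest absolute value."""
    n_prune = round(amount * len(values))
    # Indices ordered by magnitude, smallest first.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else v for i, v in enumerate(values)]


bias = [-0.2817, -0.0636, 0.0237, 0.2616, -0.3117, -0.0650]
print(l1_unstructured(bias, amount=0.3))
```

With six bias values and amount=0.3, 0.3 * 6 = 1.8 rounds to 2, so two entries are pruned: 0.0237 and -0.0636. The value -0.0650 survives because its magnitude is slightly larger.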
In the hands-on case, does the batch size affect pruning performance? What about other hyperparameters?
Adjust the pruning parameters one at a time and carry out the following experiments. For the relevant code, see 模型剪枝2.py.
- The impact of batch size
First, the data set is reduced, and then the batch size is set to 24, 48, and 72 to compare the pruning results. The resulting networks have the following sizes:
- The estimated size of the model when the batch size is 72 is 52.85MB
- The estimated size of the model when the batch size is 48 is 52.85MB
- The estimated size of the model when the batch size is 24 is 52.85MB
Batch size has no effect on the size of the pruned model.
- The impact of prune_rate
Set prune_rate to 0.75, 0.85, and 0.95 in turn.
The resulting networks have the following sizes:
- The estimated size of the model when prune_rate is 0.75 is 48.90MB
- The estimated size of the model when prune_rate is 0.85 is 50.61MB
- The estimated size of the model when prune_rate is 0.95 is 52.85MB
The smaller the prune_rate, the stronger the compression achieved by pruning.
- The impact of prune_count
Set prune_count to 1, 2, and 3 in turn.
The resulting networks have the following sizes:
- The estimated size of the model when prune_count is 1 is 53.74MB
- The estimated size of the model when prune_count is 2 is 53.29MB
- The estimated size of the model when prune_count is 3 is 52.85MB
The smaller the prune_count, the weaker the compression achieved by pruning.
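One way to read the prune_rate and prune_count results together is sketched below. It assumes, and this is not confirmed by the text above, that 模型剪枝2.py multiplies the network's width_mult by prune_rate once per pruning round, so that the final width factor is prune_rate ** prune_count. It also assumes that the runs share prune_rate = 0.95 and prune_count = 3 as defaults, which is suggested by the 52.85MB figure appearing in all three experiments. The function name effective_width_mult is illustrative.

```python
def effective_width_mult(prune_rate, prune_count):
    """Width factor left after prune_count rounds at the given prune_rate."""
    width_mult = 1.0
    for _ in range(prune_count):
        width_mult *= prune_rate  # each round keeps prune_rate of the width
    return width_mult


# Varying prune_rate with three pruning rounds.
for rate in (0.75, 0.85, 0.95):
    print("prune_rate", rate, "->", round(effective_width_mult(rate, 3), 3))

# Varying prune_count at prune_rate = 0.95.
for count in (1, 2, 3):
    print("prune_count", count, "->", round(effective_width_mult(0.95, count), 3))
```

Under these assumptions, a lower prune_rate or a higher prune_count both leave a smaller width factor, which is consistent with the ordering of the model sizes reported above.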
Parameter quantization exercise
Consult the PyTorch reference documentation, practice other quantization methods, and compare their performance.
After consulting the PyTorch documentation, I found that PyTorch provides an API called Eager Mode Quantization. This API offers three quantization modes; here I used its dynamic quantization and static quantization to quantize the student network model.
Dynamic model quantization
According to the official documentation, dynamic quantization is relatively simple: it only requires the model, the layer types to quantize, and the quantized data type. However, dynamic quantization generally applies only to linear and LSTM layers, not to convolutional layers. Since student_net consists mostly of convolutional layers, dynamic quantization is not expected to help much. See 动态量化.py for the complete code. Only the code snippets that are not shown on the website are listed below:
- Load model
student_net_fp32 = StudentNet(base=16)
device = "cpu"
student_net_fp32.load_state_dict(torch.load(f'./student_custom_small.bin'))
print('Model Loaded')
- Dynamic model quantization
student_net_int8 = torch.quantization.quantize_dynamic(
    student_net_fp32, {torch.nn.Linear}, dtype=torch.qint8)
- Validation set loading and timing evaluation
valid_dataloader = data_load()
student_net_fp32.eval()
student_net_int8.eval()
fp32_st = time.time()
valid_loss_fp32 = run_test_epoch(valid_dataloader, student_net_fp32)
fp32_time = time.time() - fp32_st
int8_st = time.time()
valid_loss_int8 = run_test_epoch(valid_dataloader, student_net_int8)
int8_time = time.time() - int8_st
print("valid_loss_fp32:", valid_loss_fp32, ",time:", fp32_time)
print("valid_loss_int8:", valid_loss_int8, ",time:", int8_time)
- Model size comparison (code reference: https://github.com/pytorch/tutorials/blob/master/recipes_source/recipes/dynamic_quantization.py)
import os
def print_size_of_model(model, label=""):
    # Save the weights to a temporary file and report its size on disk
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print("model: ", label, ' \t', 'Size (KB):', size/1e3)
    os.remove('temp.p')
    return size
# Compare model sizes
f = print_size_of_model(student_net_fp32, "fp32")
q = print_size_of_model(student_net_int8, "int8")
print("{0:.2f} times smaller".format(f/q))
The final results are as follows:
Dynamic quantization does not help much here. The int8 model actually takes longer for inference than the original fp32 model, while the accuracy of the two is basically the same. The quantized model also has almost no size advantage: it is 1045KB, compared with 1053KB before quantization.
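The small size difference is what a back-of-the-envelope estimate predicts. Only the final Linear layer (256 inputs, 11 outputs, as shown in the student network summary) is quantized, and storing one weight as int8 instead of float32 saves 3 bytes. The sketch below is a rough estimate only: it ignores the extra scale and zero-point values and any serialization overhead in the saved file, and the function name is illustrative.

```python
def dynamic_quant_savings(in_features, out_features, fp_bytes=4, int_bytes=1):
    """Bytes saved by storing a Linear layer's weight matrix as int8."""
    return in_features * out_features * (fp_bytes - int_bytes)


saved = dynamic_quant_savings(256, 11)
total_params = 256_779  # total parameter count of the default StudentNet
print("bytes saved:", saved)
print("share of parameters quantized:", 256 * 11 / total_params)
```

The estimate is about 8.4KB, in line with the observed drop from 1053KB to 1045KB. Only about 1% of the parameters sit in the Linear layer, so dynamic quantization has little to work with in this network.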
Static model quantization
Static quantization is somewhat more involved than dynamic quantization. Both convert the network's weights from float32 to int8, but static quantization additionally requires feeding the training set, or data with a similar distribution, through the model so that the quantization parameters of the activations can be computed from the distribution of each op's input. Static quantization is better suited to convolutional neural networks, and the student_net used in this experiment is a convolutional neural network, so static quantization should work better here. See 静态量化.py for the complete code. Most of the code is the same as for dynamic quantization; the quantization code itself is shown below:
valid_dataloader = data_load()
student_net_fp32.eval()
student_net_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
student_net_fp32_prepared = torch.quantization.prepare(student_net_fp32)
# First run some data through the model for calibration
for batch_data in tqdm(valid_dataloader):
    # Get the data
    inputs, hard_labels = batch_data
    # No gradients are needed when only running validation data
    with torch.no_grad():
        student_net_fp32_prepared(inputs.to(device))
student_net_int8 = torch.quantization.convert(student_net_fp32_prepared)
Besides configuring quantization, the network definition also needs to be changed. A quantization stub and a dequantization stub must be defined during initialization and applied in forward, as follows:
class StudentNet(nn.Module):
    def __init__(self, base=16, width_mult=1):
        super(StudentNet, self).__init__()
        multiplier = [1, 2, 4, 8, 16, 16, 16, 16]
        bandwidth = [base * m for m in multiplier]  # number of output channels of each layer
        for i in range(3, 7):  # prune layers 3/4/5/6
            bandwidth[i] = int(bandwidth[i] * width_mult)
        self.cnn = nn.Sequential(...)
        # Map the CNN output directly to 11 dimensions as the final output
        self.fc = nn.Sequential(
            nn.Linear(bandwidth[7], 11)
        )
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
    def forward(self, x):
        x = self.quant(x)
        x = self.cnn(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.dequant(x)
        return x
Finally, inference is run on the validation set, giving the following results:
The accuracy after quantization is clearly lower, differing from the accuracy before quantization by about a factor of two, while the running time after quantization is slightly better. The biggest gain is model size: the quantized model is nearly three times smaller than the unquantized one. Even so, its accuracy is too low for it to be used effectively. The accuracy problem may be related to the network structure.
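The calibration step used by static quantization, deriving activation quantization parameters from observed data, can be illustrated with a simplified min/max scheme in plain Python. This is only a sketch of the idea: the observers PyTorch actually uses are more elaborate, and the function names and the 0..255 range here are assumptions for illustration.

```python
def calibrate(samples, qmin=0, qmax=255):
    """Derive scale and zero point from the observed value range."""
    lo = min(min(samples), 0.0)  # the range must contain zero
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer code, clamping to the valid range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


activations = [-1.0, -0.25, 0.0, 0.5, 3.0]  # stand-in calibration data
scale, zp = calibrate(activations)
restored = [dequantize(quantize(x, scale, zp), scale, zp) for x in activations]
print("scale:", scale, "zero point:", zp)
print("restored:", restored)
```

Each restored value differs from the original by at most half a quantization step. If the calibration data does not cover the range seen at inference time, values are clamped and the error grows, which is one reason the calibration data should resemble the training distribution.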
Equation detection model compression optimization
The equation detection model is trained with yolo, using yolov5s as the pre-trained model. After training, there is still plenty of room for optimization. For example, the model can be compressed by pruning, quantization, and so on to save storage space, and inference can be sped up at the same time. Although such compression reduces the accuracy of the model, the savings in storage space and the gain in inference speed can still make it worthwhile. Below, the detection model is compressed and optimized by pruning and quantization.
Model size and speed before optimization
The equation detection model has already been trained, yielding the weight file equation.pt. First, the val.py script that ships with yolo is used to check the model's inference performance on the validation set. After entering the yolov5 folder, run the following command to evaluate:
python val.py --weights ../equation.pt --data equation.yaml --img 640
Its precision, recall, and mAP50 are 0.997, 0.999, and 0.994 respectively, with a preprocessing time of 1.8ms per image and an inference time of 227.9ms per image.
Equation detection model pruning
- The pruning method provided by yolo
According to yolo's documentation (https://github.com/ultralytics/yolov5/issues/304), simple pruning can be achieved by inserting pruning statements into val.py. The following code needs to be added at line 156 of the yolo source file val.py:
# prune
from utils.torch_utils import prune
prune(model, 0.3)
Searching for this code shows that yolo ships its own pruning tool, in the torch_utils.py file under the utils folder. The code is as follows:
def prune(model, amount=0.3):
    # Prune model to requested global sparsity
    import torch.nn.utils.prune as prune
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            prune.l1_unstructured(m, name='weight', amount=amount)  # prune
            prune.remove(m, 'weight')  # make permanent
    LOGGER.info(f'Model pruned to {sparsity(model):.3g} global sparsity')
It uses PyTorch's pruning API and, by default, prunes 30% of the weights of every convolutional layer.
After inserting the above code into val.py, run inference on the validation set again and observe the changes. The result is as follows:
Precision, recall, and mAP50 have all decreased slightly, while the running time is basically unchanged. The github issue for yolo reports similar results. Pruning in this way brings essentially no benefit, and the model size is not reduced.
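Why the size does not shrink can be seen with a small plain-Python illustration. Unstructured pruning only sets selected weights to zero; the tensor keeps its shape, so a dense serialization takes exactly as many bytes as before. The helper names and the example weights below are made up for illustration and are not part of yolo.

```python
from array import array


def zero_smallest(values, amount):
    """Set the round(amount * n) smallest-magnitude values to zero."""
    k = round(amount * len(values))
    if k == 0:
        return list(values)
    threshold = sorted(abs(v) for v in values)[k - 1]
    return [0.0 if abs(v) <= threshold else v for v in values]


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


def dense_size_bytes(values):
    """Size of the values when stored densely as 32-bit floats."""
    return len(array('f', values).tobytes())


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.6, 0.4, -0.8]
pruned = zero_smallest(weights, 0.3)
print("sparsity:", sparsity(pruned))
print("bytes before/after:", dense_size_bytes(weights), dense_size_bytes(pruned))
```

The sparsity rises to 0.3, but the stored size is identical. A real saving requires either removing whole channels (structured pruning) or a sparse storage format and kernels that can skip zeros.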
- Alternative pruning methods
Besides the pruning method provided by yolo, another pruning implementation was found online (https://github.com/ZJU-lishuang/yolov5_prune) and tried on the model.
However, testing showed that this implementation is not mature: even after working through many errors it could not proceed to the next step, so this pruning approach was finally abandoned.
Equation detection model quantization
I looked for a way to quantize yolo's detection model but did not find one. I eventually found an issue on github about quantizing yolov5 models; it was raised in 2020 and was still unresolved in 2022. The yolo author said that yolo running on the CPU cannot perform int8 quantization, so quantization of this model was finally abandoned.
https://github.com/ultralytics/yolov5/issues/1288
Equation recognition model compression optimization
For the equation recognition model, I use quantization for compression and optimization. The recognition model here is a text recognition model that I had previously set aside; it is provided directly by easyocr. I later chose paddleocr for training and abandoned easyocr, but the paddleocr model has not been trained yet, so the easyocr model is used as the experimental object here.
For compression, I chose quantization, converting 32-bit floating-point numbers to int8 to reduce storage and speed up inference. Here only the model compression is evaluated; inference is not evaluated.
- Model download
The model is downloaded via the link on the author's github: https://github.com/JaidedAI/EasyOCR/blob/master/custom_model.md
After the download completes, there are three files, custom_example.pth, custom_example.py, and custom_example.yaml, which are the weight file, the network definition file, and the configuration file respectively. Only the first two are used here.
- Model loading
The model can be loaded by editing custom_example.py directly and adding the following code to the source:
# Load the model
model = Model(input_channel=1, output_channel=256, hidden_size=256, num_class=97)
dic = torch.load(f'./custom_example.pth')
model.load_state_dict(dic, False)
- Dynamic quantization
This network contains many LSTM layers, so dynamic quantization is suitable here. The dynamic quantization code is as follows:
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8)
- Comparison of results before and after quantization
Only the model size is considered here. A function is defined to obtain and compare the model sizes:
def print_size_of_model(model, label=""):
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print("model: ", label, ' \t', 'Size (KB):', size/1e3)
    os.remove('temp.p')
    return size
# Compare model sizes
f = print_size_of_model(model, "fp32")
q = print_size_of_model(model_int8, "int8")
The output is as follows:
The model is compressed to about 1/2 of its original size, showing that dynamic quantization has optimized the model to a certain extent.
Experimental results
In this experiment, the exercises in the basic requirements were completed, including exploring how changes in the number of channels in the architecture design affect the neural network, why the student network does not match the teacher network, how the pruning hyperparameters affect the result, and how to apply PyTorch's quantization methods. The architecture design exercise showed that model size is positively correlated with both base and width_mult. The knowledge distillation exercise suggested that what keeps the student network from improving further may be its network structure: the teacher network is deeper than the student network and performs better. In the quantization exercise, dynamic and static quantization were both reproduced, and it was found that dynamic quantization is better suited to networks built from linear and LSTM layers, while static quantization is better suited to convolutional neural networks.
Beyond the basic requirements, I also tried compression optimization on the equation detection model and the equation content extraction model. Because the detection model is trained and used with the yolo tooling, yolo's own pruning interface was used for compression. The result was not good: accuracy dropped slightly, while neither the model size nor the inference speed changed. My guess is that this is because the pruning interface sets parameters to 0 rather than removing them, and it may also be related to running on a CPU. For content extraction I used the easyocr model and applied quantization to convert its parameters from 32-bit floating-point numbers to 8-bit integers. The model size was reduced overall, which confirms that the optimization succeeded.