Deploying a CRNN model with RKNN: a record of pitfalls and optimizations

Preface

Some time ago I used RKNN to deploy a text recognition model. Text recognition typically uses the CRNN model, whose structure is fairly simple: convolution + LSTM + fully connected, all veteran-level operators. I expected the deployment to go smoothly, but it turned out there were still plenty of pitfalls, so I am writing this article to record the process of stepping on them and optimizing around them.

1. Before deployment

After receiving the requirement, my first thought was to use PaddleOCR's open-source general text recognition models directly. PaddleOCR provides three general recognition models, all based on the CRNN architecture: a MobileNetV3 version, an LCNet version, and a ResNet34 version. The first two are mobile models and the last is a server model. Since the deployment target is an embedded board, the server model was out of the question, which left only the mobile models. In testing, though, their accuracy on my specific scenario was a bit lacking. These models are decent for general text recognition overall, but my scenario demands a fairly high recognition rate, so the two mobile models could not be used as-is. There was no way around it: I had to train a model myself.

Based on the text_renderer repository, I generated a batch of benchmark datasets for training the recognition model: roughly 6.5 million text-image samples covering more than 100 Chinese fonts, used for retraining the model.

Model training was based on PytorchOCR (I really cannot get used to PaddleOCR).

At first I trained the MobileNetV3 version of CRNN, but the accuracy and generalization of the resulting model were quite poor. ResNet18 and ResNet34 backbones were ruled out because they are too heavy to embed on the mobile side, so I chose the smallest RepVGG model as the backbone instead. The final trained model was excellent in both accuracy and speed; its only drawback is that the weights are a bit larger, but that is not a big problem, since int8 quantization can shrink the model to about a quarter of its original size.

Initially, my deployment network consisted of: RepVGG (deploy mode) + LSTM + Linear + CTC. Then the painful deployment process began.
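For reference, here is a minimal PyTorch sketch of that head (the class name and channel sizes are my own placeholders, not the exact PytorchOCR code; the RepVGG backbone is assumed to collapse the feature map height to 1):

```python
import torch.nn as nn

class CRNNHead(nn.Module):
    """Sketch of the original head: bidirectional LSTM + Linear, trained with CTC loss.
    Layer sizes are placeholders, not the exact PytorchOCR configuration."""
    def __init__(self, in_channels, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden_size, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):                    # x: (N, C, 1, W) features from the RepVGG backbone
        x = x.squeeze(2).permute(0, 2, 1)    # -> (N, W, C) sequence along the width axis
        x, _ = self.lstm(x)                  # -> (N, W, 2 * hidden_size)
        return self.fc(x)                    # -> (N, W, num_classes), fed to CTCLoss
```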

2. Model deployment

At first I checked RKNN's operator support documentation and found that every operator used in my model was supported, so I assumed the deployment would go smoothly. It turned out...

2.1 Pitfall 1: The LSTM operator is not supported

LSTM is a long-established operator that should have been adapted to every framework ages ago, and the RKNN documentation says it is supported. Yet when I converted the model from ONNX to RKNN, an error was reported, as shown below:
[Screenshot: error reported when converting the ONNX model containing LSTM to RKNN]
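For context, my conversion script looked roughly like the sketch below (rknn-toolkit Python API; file paths are placeholders, and the preprocessing and target-platform options passed to config() vary between toolkit versions, so check the docs for yours):

```python
from rknn.api import RKNN

rknn = RKNN()

# Normalization and target-platform options would normally go here;
# their parameter names differ between rknn-toolkit versions.
rknn.config()

# Load the CRNN model exported from PyTorch to ONNX.
if rknn.load_onnx(model='crnn_lstm.onnx') != 0:
    raise RuntimeError('load_onnx failed')

# int8 quantization; dataset.txt is a list of calibration images.
# In my case the LSTM model failed during this conversion step.
if rknn.build(do_quantization=True, dataset='./dataset.txt') != 0:
    raise RuntimeError('build failed')

rknn.export_rknn('./crnn_lstm.rknn')
rknn.release()
```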
I then searched RKNN's official forum and found nothing but other people's conversion error reports, with no solutions. Isn't that ridiculous?
[Screenshot: forum search results showing the same conversion errors with no answers]
Finally I asked in their QQ group, but got no result; I was told to go ask on Redmine, and I do not even know where that is. Forget it, let's find another way to solve this.

Can the LSTM layer be removed? The answer is yes. A typical example is DenseNet-OCR, which drops the LSTM and keeps the rest of the structure, and its final recognition results are still very good. The drawback is that the sequential context between characters is lost, so some characters may be recognized incorrectly.

So I modified the structure above, removed the LSTM layer, kept only RepVGG + Linear + CTC, and retrained.

During training I found that the model converged faster with the LSTM removed, and the accuracy was basically the same. Once I had the new model, I figured the structure was now so simple that nothing else could go wrong. The result...

2.2 Pitfall 2: Fully connected layers are not quantized

The model conversion went smoothly this time, but the inference speed of the quantized model turned out to be extremely slow, as shown below:
[Screenshot: inference timing of the quantized model, extremely slow]

I then went looking for the reason and, after much searching, found that in the converted quantized model the fully connected layers were extremely time-consuming. The visualization showed that the fully connected layers had not been quantized at all (dark blue means quantized, light blue means not quantized). That explains it: because there are many character classes, the fully connected layers carry a lot of parameters, and without quantization they are very slow.

[Screenshot: quantization visualization showing the fully connected layers left unquantized]
You can also check the time spent in each layer's output, as shown below; the last two fully connected layers clearly take most of the time:
[Screenshot: per-layer timing, dominated by the last two fully connected layers]
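A rough sketch of how to pull that per-layer timing with the toolkit (argument names are from the rknn-toolkit versions I used; on-device profiling additionally needs a target and device id):

```python
from rknn.api import RKNN
import cv2

rknn = RKNN()
rknn.load_rknn('./crnn.rknn')

# perf_debug=True enables layer-by-layer timing in the performance report.
rknn.init_runtime(perf_debug=True)

img = cv2.imread('test_line.png')
perf = rknn.eval_perf(inputs=[img], is_print=True)  # prints a per-layer time table

rknn.release()
```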
So I started thinking about whether these two layers could be optimized.

Since the fully connected layers are not quantized, could I replace them with convolutional layers? The answer is yes: a fully convolutional network can also perform classification, and many object detectors are basically fully convolutional. So I decided to rewrite the last two layers as 1x1 convolutions, that is, after the backbone network, two 1x1 convolution layers are added for classification. The head output therefore changes to the structure below, and finally the 1x1 convolution output is reshaped into the output format we need. Then I retrained.
[Diagram: new head structure with two 1x1 convolution layers for classification]
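A minimal sketch of that change in PyTorch (channel sizes are placeholders; the point is just swapping the two nn.Linear layers for 1x1 nn.Conv2d layers and reshaping afterwards):

```python
import torch.nn as nn

class ConvCTCHead(nn.Module):
    """Classification head built from two 1x1 convolutions instead of two Linear layers.
    The parameter count is the same as the fully connected version."""
    def __init__(self, in_channels, mid_channels, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(mid_channels, num_classes, kernel_size=1)

    def forward(self, x):                     # x: (N, C, 1, W) features from the backbone
        x = self.conv2(self.conv1(x))         # -> (N, num_classes, 1, W)
        return x.squeeze(2).permute(0, 2, 1)  # reshape to (N, W, num_classes) for CTC
```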
After testing, whether fully connected layers or 1x1 convolutions are used as the classification head, the parameter count is unchanged and the accuracy is almost unchanged. The key is the speed after RKNN quantization. Without further ado, the quantization process this time was normal; let's go straight to the test results:
[Screenshot: test results of the fully quantized model]

The input is a 32x448 image (by the way, the fully connected test above used a 32x224 input), and the speed is blazing fast: inference takes only 5.4 ms, compared with more than 100 ms before. Changing just two layers made it take off! Also take a look at the per-layer timing:
[Screenshot: per-layer timing after the change]
Because every layer is now quantized, the time per layer is very even, and inference consistently takes around 5 ms. The accuracy is only slightly lower than without quantization, so this optimization was very successful.
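For completeness, running the exported model then comes down to a few calls (again a sketch; the preprocessing must match what was configured at conversion time, and the target/device arguments depend on your setup):

```python
from rknn.api import RKNN
import cv2
import numpy as np

rknn = RKNN()
rknn.load_rknn('./crnn.rknn')
rknn.init_runtime()                      # pass target='...' to run on the board instead of the simulator

img = cv2.imread('line.png')
img = cv2.resize(img, (448, 32))         # width x height used in the test above

outputs = rknn.inference(inputs=[img])   # list of output arrays, here (N, W, num_classes)
pred = np.argmax(outputs[0], axis=-1)    # per-step argmax; greedy CTC decoding then
                                         # collapses repeats and drops the blank index
rknn.release()
```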

3. Closing remarks

This article mainly covered optimizing the deployment of CRNN on RKNN. On removing the LSTM layer: although LSTM captures strong sequence information, in my experience most scenarios that are not particularly complex do not actually need that much sequence context. Combined with my business scenario, removing it here had little impact; on the contrary, training converged faster and accuracy was higher. Of course, generalization may not be as good as with LSTM, so weigh that with your own experiments. The second point is using 1x1 convolutions instead of fully connected layers as the classification head. In my tests the accuracy difference was negligible, and I also found that ncnn's int8 quantization of fully connected layers seems to have some issues, so when necessary, rewriting the classification head as 1x1 convolutions will greatly improve deployment speed.

Origin blog.csdn.net/qq_39056987/article/details/122128191