Notes on accelerating K-Net with TensorRT


Since the advent of DETR, the ideas of set prediction and bipartite matching have taken visual detection by storm, promising to eliminate NMS entirely and make detection truly end-to-end. Among these methods, instance/panoptic segmentation algorithms such as MaskFormer and K-Net have shown very impressive results. So can TensorRT be used to accelerate them at half precision, making real-time deployment, or even deployment to edge devices, possible? The answer is yes. Today, let's run an experiment that uses TensorRT to accelerate K-Net to under 20 ms.

PyTorch to ONNX

Converting K-Net from PyTorch to ONNX is relatively straightforward, since there are basically no layers that ONNX does not support. You only need to replace the torch.einsum calls in the code with the equivalent torch.matmul operations beforehand, then simplify the exported graph with onnxsim and you're done.
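As a rough sketch (the einsum pattern below is illustrative rather than K-Net's exact code, and the file names are placeholders), a matrix-product style einsum can be rewritten with matmul plus reshape, and the export/simplify step looks like this:

```python
import onnx
import torch
from onnxsim import simplify

def kernels_x_features(kernels: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """Rewrite of an einsum like 'bnc,bchw->bnhw' using matmul + reshape."""
    b, c, h, w = feats.shape
    out = torch.matmul(kernels, feats.view(b, c, h * w))  # (B, N, H*W)
    return out.view(b, -1, h, w)                          # (B, N, H, W)

def export_and_simplify(model: torch.nn.Module, dummy_input: torch.Tensor,
                        out_path: str = "knet_sim.onnx") -> None:
    """Export to ONNX, then fold constants and clean up the graph with onnxsim."""
    torch.onnx.export(model, dummy_input, "knet.onnx", opset_version=11)
    simplified, ok = simplify(onnx.load("knet.onnx"))
    assert ok, "onnxsim simplification failed"
    onnx.save(simplified, out_path)
```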

ONNX to TensorRT

We use the trtexec tool that ships with TensorRT to convert the ONNX model directly to a TensorRT FP16 engine. With TensorRT 7.1 the conversion failed with an error that seemed to come from the instance norm layer. Not wanting to waste any effort on bugs in an old TensorRT version, I simply upgraded to the latest 8.2 and converted again, and this time it succeeded.
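The conversion step looks roughly like the following (file names are placeholders; the command can of course be run directly from a shell instead of through Python):

```python
import subprocess

# Build an FP16 engine from the simplified ONNX model; trtexec also prints
# the latency statistics shown below after benchmarking the engine.
subprocess.run(
    ["trtexec", "--onnx=knet_sim.onnx", "--fp16", "--saveEngine=knet_fp16.engine"],
    check=True,
)
```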

 [I] Latency: min = 15.2039 ms, max = 19.5652 ms, mean = 15.8112 ms, median = 15.7288 ms, percentile(99%) = 19.3462 ms
 [I] End-to-End Host Latency: min = 15.4692 ms, max = 19.8303 ms, mean = 16.0644 ms, median = 15.9871 ms, percentile(99%) = 19.6008 ms
 [I] Enqueue Time: min = 12.0681 ms, max = 16.5934 ms, mean = 12.4658 ms, median = 12.3762 ms, percentile(99%) = 16.0362 ms
 [I] H2D Latency: min = 1.15027 ms, max = 1.24194 ms, mean = 1.21364 ms, median = 1.21387 ms, percentile(99%) = 1.23401 ms
 [I] GPU Compute Time: min = 11.8992 ms, max = 16.2252 ms, mean = 12.4986 ms, median = 12.4148 ms, percentile(99%) = 16.0308 ms
 [I] D2H Latency: min = 2.09253 ms, max = 2.12573 ms, mean = 2.09892 ms, median = 2.0979 ms, percentile(99%) = 2.1109 ms
 [I] Total Host Walltime: 2.46604 s
 [I] Total GPU Compute Time: 2.46222 s
 [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
 [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
 [I] Explanations of the performance metrics are printed in the verbose logs.

We can see that on a 3090, the mean latency of K-Net measured by trtexec reaches an astonishing ~15 ms.

Did I convert it the wrong way? Is the poor FP16 accuracy caused by bilinear align_corners=False?

Does the blog post end here? Not quite. We take the converted K-Net, grab a random image, run inference and look at the output... something seems wrong with the mask.

So I converted K-Net again, this time at FP32 precision. Now the mask looks correct, but the mean latency is 41 ms, roughly the same as before acceleration.


So where is the problem? Based on years of experience converting models, I first suspected the Resize layer, which corresponds to PyTorch's interpolate upsampling. Back in the TensorRT 6.0 era in 2019, someone had already reported a bug when Resize uses the bilinear algorithm with align_corners=False:

Steps To Reproduce
build_engine builds a network with only one layer, IResizeLayer
compare prints the trt result and the torch result
got different results with align_corners=False

scale_factor: 2
align_corners: False
torch.Size([1, 2, 2])
build_engine, scale_factor 2 align_corners False
1
<tensorrt.tensorrt.ILayer object at 0x7fb423d4c848> LayerType.RESIZE
[TensorRT] WARNING: Tensor DataType is determined at build time for  tensors not marked as input or output.
[TensorRT] INFO: Detected 1 inputs and 1 output network tensors.
>>>>> test_bilinear2d.trt
compare, scale_factor 2 align_corners False
----------torch output
tensor([[[[-0.2651, -0.1464,  0.0908,  0.2095],
          [-0.0041, -0.0402, -0.1124, -0.1485],
          [ 0.5179,  0.1723, -0.5188, -0.8644],
          [ 0.7789,  0.2786, -0.7220, -1.2223]]]], device='cuda:0')
==========trt output
tensor([[[-0.2651, -0.0278,  0.2095,  0.2095],
         [ 0.2569, -0.1248, -0.5064, -0.5064],
         [ 0.7789, -0.2217, -1.2223, -1.2223],
         [ 0.7789, -0.2217, -1.2223, -1.2223]]], device='cuda:0')

But setting align_corners=True, we get the same result.
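For reference, the PyTorch side of this comparison can be reproduced in a couple of lines (input values are random; only the behaviour of the two modes matters):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 2, 2)  # a tiny feature map, as in the issue above
up_default = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
up_aligned = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)
print(up_default)  # the reference that old TensorRT versions failed to match
print(up_aligned)  # matches TensorRT even on the buggy versions
```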

Has Lao Huang (NVIDIA) really not fixed this bug after all these years? So I changed every align_corners=False in the K-Net source code to align_corners=True, regenerated the ONNX, and expected success this time. But the road to the light is always tortuous: with align_corners=True, the FP32 and FP16 results of K-Net still differ, and the behaviour is exactly the same as with align_corners=False. So it isn't the Resize layer after all? Did I blame you wrongly, Lao Huang?

The culprit: FP16 precision overflow

Typing "TensorRT FP16" into the browser search bar, the first article that came up caught my attention. It explains why an FP16 model can go wrong:

Most likely, the computed value at some op node overflows because FP16's dynamic range and precision are insufficient for a certain layer of the model. One bad value then drags everything with it, and every layer downstream of it falls apart.
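FP16's range really is tiny compared with FP32; a two-line check makes the failure mode concrete:

```python
import numpy as np

print(np.float16(70000.0))              # inf: 70000 already exceeds FP16's range
print(float(np.finfo(np.float16).max))  # 65504.0, the largest finite FP16 value
```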

At the same time, I noticed that TensorRT 8 ships with a model analysis tool, polygraphy, whose convert subcommand can record how each layer of a TensorRT engine was built into a JSON file, and can later use that JSON to make a newly built engine adopt exactly the same per-layer choices. So an idea came to mind: manually edit the precision recorded for each layer in replay.json, then use the edited file to guide the build of a new engine, so that the overflowing layer is computed in FP32. Problem solved! But which layer is the one that overflows?
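A sketch of producing such a replay file, assuming polygraphy's --save-tactics option (flag names may differ slightly between polygraphy versions, and file names are placeholders):

```python
import subprocess

# Build an FP16 engine while recording, layer by layer, which implementation,
# tactic and tensor formats/types TensorRT chose; the record goes to replay.json.
subprocess.run(
    [
        "polygraphy", "convert", "knet_sim.onnx",
        "--convert-to", "trt", "--fp16",
        "--save-tactics", "replay.json",
        "-o", "knet_fp16.engine",
    ],
    check=True,
)
```

Opening the replay.json generated for K-Net, besides the usual conv, reshape, mul, add, resize and relu layers, there is a ForeignNode entry that looks suspicious: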

        "{ForeignNode[Conv_623...Reshape_1553]}": {
            "implementation": -2147483609,
            "tactic": 0,
            "inputs": [
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ],
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ],
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ],
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ],
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ],
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ]
            ],
            "outputs": [
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ],
                [
                    "TensorFormat.LINEAR",
                    "DataType.HALF"
                ]
            ],
            "polygraphy_class": "Algorithm"
        },


Judging from its name, this ForeignNode fuses a large part of the back half of the model, so let's start with it! Change all of ForeignNode's inputs and outputs to FLOAT, then use the modified replay.json to guide the TensorRT engine build once more, as sketched below.
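A sketch of that edit, assuming the per-layer records are laid out as in the JSON above and that polygraphy accepts the edited file via --load-tactics (names and paths are placeholders):

```python
import json
import subprocess

def force_foreign_node_fp32(node):
    """Recursively find ForeignNode entries and switch their I/O dtypes to FP32."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.startswith("{ForeignNode") and isinstance(value, dict):
                for ports in ("inputs", "outputs"):
                    value[ports] = [[fmt, "DataType.FLOAT"] for fmt, _ in value[ports]]
            else:
                force_foreign_node_fp32(value)
    elif isinstance(node, list):
        for item in node:
            force_foreign_node_fp32(item)

with open("replay.json") as f:
    replay = json.load(f)
force_foreign_node_fp32(replay)
with open("replay_fp32_foreign.json", "w") as f:
    json.dump(replay, f, indent=4)

# Rebuild the FP16 engine, letting the edited replay pin the ForeignNode's
# inputs and outputs to FP32.
subprocess.run(
    [
        "polygraphy", "convert", "knet_sim.onnx",
        "--convert-to", "trt", "--fp16",
        "--load-tactics", "replay_fp32_foreign.json",
        "-o", "knet_fp16_fixed.engine",
    ],
    check=True,
)
```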
We tested again, and the FP16 K-Net model finally produces the correct result. The speed barely changes compared with the unmodified FP16 engine, only about 1 ms slower.

 [I] Latency: min = 16.1807 ms, max = 18.1675 ms, mean = 16.6139 ms, median = 16.5773 ms, percentile(99%) = 17.1135 ms
 [I] End-to-End Host Latency: min = 17.1362 ms, max = 19.1318 ms, mean = 17.5773 ms, median = 17.5393 ms, percentile(99%) = 18.119 ms
 [I] Enqueue Time: min = 12.9159 ms, max = 14.7864 ms, mean = 13.2425 ms, median = 13.205 ms, percentile(99%) = 13.754 ms
 [I] H2D Latency: min = 1.19037 ms, max = 1.26184 ms, mean = 1.23727 ms, median = 1.23682 ms, percentile(99%) = 1.25903 ms
 [I] GPU Compute Time: min = 12.845 ms, max = 14.8296 ms, mean = 13.2777 ms, median = 13.2403 ms, percentile(99%) = 13.826 ms
 [I] D2H Latency: min = 2.09375 ms, max = 2.11121 ms, mean = 2.0989 ms, median = 2.09802 ms, percentile(99%) = 2.10938 ms
 [I] Total Host Walltime: 2.47392 s
 [I] Total GPU Compute Time: 2.46965 s
 [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
 [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
 [I] Explanations of the performance metrics are printed in the verbose logs.

Of course, eyeballing the mask does not prove that the TensorRT-accelerated model is correct. To make sure its output really matches the model before acceleration, you need to compare the relative and absolute errors of the outputs directly, but that is beyond the scope of this article.
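For completeness, such a comparison can be as simple as the following sketch (the two arrays stand for the original PyTorch output and the TensorRT output for the same input):

```python
import numpy as np

def report_error(ref: np.ndarray, test: np.ndarray, eps: float = 1e-6) -> None:
    """Print max/mean absolute and relative error between reference and test outputs."""
    abs_err = np.abs(ref - test)
    rel_err = abs_err / (np.abs(ref) + eps)
    print(f"abs err: max={abs_err.max():.6f}, mean={abs_err.mean():.6f}")
    print(f"rel err: max={rel_err.max():.6f}, mean={rel_err.mean():.6f}")
```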

------------------ Update ------------------

Further testing on a Jetson Xavier revealed more TensorRT problems: K-Net does not behave the same on Xavier as on the 3090, and even the FP32 K-Net model may produce no output at all on Xavier. After several days of struggling, my conclusion is that TensorRT fuses layers differently on Xavier than on the 3090. On Xavier, TensorRT fuses layernorm and multi-head attention (in particular the softmax, pow, sqrt, etc. ops) incorrectly, which leads to completely wrong results; presumably nobody at NVIDIA expected anyone to run a transformer-like model on Jetson. In the end, when exporting K-Net to ONNX, I replaced all layernorms with instancenorm and forced the second half of the model to FP32 precision. With that, correct results can finally be obtained on Xavier, at about 150 ms.
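A minimal sketch of the layernorm-to-instancenorm substitution (module and argument names are mine, not K-Net's; it assumes (B, N, C) inputs and a token count N that is fixed at export time):

```python
import torch
import torch.nn as nn

class LayerNormAsInstanceNorm(nn.Module):
    """Replacement for nn.LayerNorm(embed_dim) on (B, N, C) tensors.

    InstanceNorm1d normalizes over the last dimension for each (batch, channel)
    pair, so treating the N tokens as channels reproduces LayerNorm's statistics;
    the affine weight/bias are then applied manually over the C dimension.
    """

    def __init__(self, num_tokens: int, embed_dim: int, eps: float = 1e-5):
        super().__init__()
        self.norm = nn.InstanceNorm1d(num_tokens, eps=eps, affine=False)
        self.weight = nn.Parameter(torch.ones(embed_dim))
        self.bias = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        x = self.norm(x)                                  # normalize over C per (B, N)
        return x * self.weight + self.bias
```

Forcing the second half of the model to FP32 can then be done in the same way as on the 3090, for example with the edited replay.json trick described above.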


Original post: blog.csdn.net/blanokvaffy/article/details/121723023