Using ncnn to call the GPU for inference on the Jetson TX2

Preface

I recently got a Jetson TX2 board. Normally TensorRT is the obvious first choice when deploying on NVIDIA hardware, but I have written before about deploying ncnn on ARM boards. Back then I used RK-series boards, which could not use the on-board GPU for inference, mainly because of missing Vulkan support. The Jetson boards, however, are ARM Linux systems that ship with a Vulkan driver, so in theory the GPU can be used for inference, even though the operator coverage and performance of ncnn's Vulkan backend are currently not great (I hear the author is already optimizing this area). Out of curiosity about ncnn's GPU inference speed, I decided to compare CPU and GPU inference with ncnn, based on my previous project.

1. ncnn vulkan

When ncnn calls the GPU it relies on Vulkan, so make sure the Vulkan SDK is installed correctly before trying this. On the Jetson boards, Vulkan is installed by default after flashing the system, which you can verify with the vulkaninfo command. For other platforms you can search the Vulkan supported device list; checking the TX2 there shows it is supported. The Vulkan SDK for other platforms can be downloaded from: https://vulkan.lunarg.com/sdk/home
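
Besides vulkaninfo, you can also check from code whether ncnn itself can see a Vulkan device. A minimal sketch, assuming ncnn was built with -DNCNN_VULKAN=ON and installed so that its headers are on the include path:

// check_vulkan.cpp - minimal sketch, assumes an ncnn build with -DNCNN_VULKAN=ON
#include <cstdio>

#include "ncnn/gpu.h"   // adjust the include prefix to your install layout

int main()
{
    // number of Vulkan-capable devices ncnn can use; 0 means inference falls back to the CPU
    int gpu_count = ncnn::get_gpu_count();
    printf("ncnn sees %d Vulkan device(s)\n", gpu_count);
    return 0;
}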

2. Compile ncnn

After making sure the platform supports Vulkan, the next step is to compile ncnn. Since cmake is available on the board, you can compile directly on it without cross-compiling. Open a terminal on the TX2 and build with the following commands:

git clone https://github.com/Tencent/ncnn.git
cd ncnn
git submodule update --init

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/jetson.toolchain.cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..
make -j$(nproc)
make install 

Since I already have the code from the previous project, after compiling I only need to care about the include and lib folders in the build/install directory. These replace the files from the earlier build that had no Vulkan support, and their locations are referenced in the CMakeLists.txt file.

After compilation, copy all the param files from the ncnn/benchmark directory into the directory of the benchncnn executable under ncnn/build and run the benchmark. First the CPU, with the command ./benchncnn 1 1 0 -1, then the GPU, with the command ./benchncnn 1 1 0 0 (the arguments are loop count, thread count, powersave mode, and GPU device index, where -1 means CPU only). Comparing the two runs, the speedup is considerable, although it is clearly not friendly to the int8 models; presumably the GPU device does not support int8 inference and those operations fall back to the CPU. Overall, though, the acceleration is obvious.

3. Calling Vulkan and modifying CMakeLists.txt

Calling Vulkan is very simple; just add one line of code:

ncnn::Net net;
net.opt.use_vulkan_compute = true;                 // add this line
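
For context, here is a slightly fuller sketch of how that flag fits into a normal ncnn inference flow; the model file names, blob names, and input size below are placeholders for whatever your own model uses:

#include "ncnn/net.h"

int main()
{
    ncnn::Net net;
    net.opt.use_vulkan_compute = true;        // enable the Vulkan backend

    // hypothetical model files; substitute your own
    if (net.load_param("det.param") || net.load_model("det.bin"))
        return -1;

    ncnn::Mat in(320, 320, 3);                // dummy 320x320x3 input
    in.fill(0.5f);

    ncnn::Extractor ex = net.create_extractor();
    ex.input("input", in);                    // hypothetical input blob name

    ncnn::Mat out;
    ex.extract("output", out);                // runs on the GPU when a Vulkan device is available

    return 0;
}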

Then modify the CMakeLists.txt file: point the header file, library file, and static library paths at the corresponding folders under the install directory produced by the ncnn build above, adjusted to your own paths.
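
One way to wire this up is to use the cmake config files that the install step generates; a minimal sketch, where the project name and the install path are placeholders to adapt:

cmake_minimum_required(VERSION 3.10)
project(ocr_demo)                                   # hypothetical project name

# point ncnn_DIR at the install tree produced by the build above (adjust the path)
set(ncnn_DIR /path/to/ncnn/build/install/lib/cmake/ncnn)
find_package(ncnn REQUIRED)

add_executable(ocr_demo main.cpp)
# the imported ncnn target carries the include dirs and the Vulkan/thread dependencies
target_link_libraries(ocr_demo ncnn)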
I expected it to compile smoothly, but the following error was reported during compilation:
(screenshot of the compilation error)
After struggling for a long time I found a workaround: modify the lib/cmake/ncnn/ncnn.cmake file under the install directory, comment out the offending line (shown in the screenshot), and recompile:
(screenshot of the commented-out line in ncnn.cmake)
Now for a comparison. First the CPU inference results:
(screenshots of the CPU detection and recognition times)
ID card recognition time on the CPU:
(screenshot of the CPU ID card recognition time)
Then the GPU inference results:
(screenshot of the GPU detection and recognition times)
The speedup mainly comes from the text detection model. The recognition model was already very fast on the CPU, so it is close to a bottleneck there, while detection gains a lot. But why is the recognition model's output wrong on the GPU? I never tracked down the cause. Is it because of the LSTM? It feels like some layers are not supported. Fortunately, I still have a self-trained recognition model; for its structure, see the model optimization described in my earlier article on the rknn deployment.
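
If one model misbehaves on the Vulkan backend, a pragmatic workaround is to enable Vulkan only for the nets that benefit from it. A sketch of that idea, with hypothetical model file names:

#include "ncnn/net.h"

// sketch: enable Vulkan only for the detection net and keep the (LSTM-based)
// recognition net on the CPU; "dbnet.*" and "crnn.*" are hypothetical file names
int load_nets(ncnn::Net& det_net, ncnn::Net& rec_net)
{
    det_net.opt.use_vulkan_compute = true;    // detection benefits a lot from the GPU
    if (det_net.load_param("dbnet.param") || det_net.load_model("dbnet.bin"))
        return -1;

    rec_net.opt.use_vulkan_compute = false;   // recognition stays on the CPU to avoid wrong output
    if (rec_net.load_param("crnn.param") || rec_net.load_model("crnn.bin"))
        return -1;

    return 0;
}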

My recognition model is a pure convolutional structure (RepVGG), so there is hardly a simpler model to deploy; it is genuinely easy to deploy and there is no need to worry about unsupported layers during conversion. So here is the result with the last recognition model I trained myself:
(timing table for the self-trained recognition model)
From the table above you can see that both detection and recognition speed improve greatly. Then the ID card recognition result:
(screenshot of the GPU ID card recognition time)
Compared with the CPU numbers above, the speedup is still quite large.

However, compared with TensorRT this speedup is probably still modest; after all, ncnn mainly targets CPU optimization on the ARM platform. I will publish a follow-up article comparing the TX2 speed against TensorRT.

4. Other notes on ncnn Vulkan

Because ncnn's Vulkan backend is not yet heavily optimized, you will inevitably run into problems when using it. For common issues, see the relevant chapter of the ncnn documentation:

https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-vulkan


Origin blog.csdn.net/qq_39056987/article/details/124476329