[ZCU102 Embedded Development] Deploying a Vitis-AI-based yolov5 object detection model on the ZCU102 development board


Foreword

I originally planned to do Vitis-AI development on a ZCU106, but Xilinx provides little documentation for the 106, and the yolov5 model I need to port requires Vitis-AI 2.0 or later to support a newer PyTorch version, which in turn requires newer versions of Vitis and the other tools. With no reference material to lean on, I found a lab willing to lend me a ZCU102 development board so I could get the basic flow working first. This post records the whole process of porting the yolov5 model.

Development environment

Hardware: ZCU102 development board
PC operating system: Ubuntu 18.04.4 (the wrong Ubuntu version will make the Xilinx tools report all kinds of strange errors; the Ubuntu versions each Xilinx tool supports are listed in its documentation. A classic counterexample: Ubuntu 18.04.6 is not a supported release, yet it is the Ubuntu 18 image the official website downloads by default.)
PC object detection environment: PyTorch 1.8.0 + CUDA 11.1
PC Xilinx development environment: Vitis 2022.1 + PetaLinux 2022.1 + Xilinx Runtime 2022.1 + Vitis-AI 2.5.0
Object detection model: yolov5 (version 6.0)

Overall process

The overall flow of porting the model is as follows:
(figure: model porting flow chart)

1. Model training

Before training, first consult the DPUCZDX8G product guide for the ZCU102 to see which neural-network operators the DPU supports (shown in the figure below). The guide also restricts the input and output sizes of each operator; those limits are not listed here, so if you have modified any operators of the yolov5 model yourself, compare them carefully against the guide.
(figure: DPU-supported operators from the DPUCZDX8G product guide)
The activation function in yolov5 version 6.0 is SiLU, which the DPU does not support. It should be possible to implement SiLU through Vitis-AI's custom OP mechanism, but I have not worked that out yet, so here I replaced the SiLU activation with the LeakyReLU used by older versions of yolov5. The files to modify are common.py and experimental.py, as shown below. I changed three activation functions in total, which resolved the quantization errors caused by SiLU.

# before modification
self.act = nn.SiLU()
# after modification
self.act = nn.LeakyReLU(0.1, inplace=True)
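
As an aside, if you prefer not to edit common.py and experimental.py by hand, the same replacement can be done by walking the modules of an already-constructed model. This is only a sketch of that alternative (the swap still has to happen before training, not just at export time), not the approach used in this post; the helper name is my own:

    import torch.nn as nn

    def replace_silu_with_leakyrelu(module: nn.Module) -> nn.Module:
        # recursively swap every nn.SiLU child for nn.LeakyReLU(0.1);
        # sketch only -- the names here are my own, not from the yolov5 code base
        for name, child in module.named_children():
            if isinstance(child, nn.SiLU):
                setattr(module, name, nn.LeakyReLU(0.1, inplace=True))
            else:
                replace_silu_with_leakyrelu(child)
        return module

    model = replace_silu_with_leakyrelu(model)  # apply before training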

After modifying the activation function, train the model in the usual yolov5 way to obtain a yolov5 model that detects the objects in your own data set.

2. Model quantization

The overall model quantization flow is described in the UG1414 document; the flow chart is shown below.
(figure: Vitis-AI quantization flow chart from UG1414)
The document also lists what needs to be done when quantizing a user-defined model:
(figure: UG1414 requirements for quantizing custom models)
This means the feature-extraction path of the yolov5 model has to be analyzed at the code level. The feature-extraction process itself handles the data directly with PyTorch tensor operators, but in the detection layer there is a piece of code that processes the final three feature layers without going through those tensor operators, so when quantizing the model this code has to be commented out and added to the detection function instead. It lives in the Detect class of yolo.py:

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x[i](bs,self.no * self.na,20,20) to x[i](bs,self.na,20,20,self.no)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                y = x[i].sigmoid() # (tensor): (b, self.na, h, w, self.no)
                if self.inplace:
                    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
                    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, y[..., 4:]), -1)
                z.append(y.view(bs, -1, self.no)) # z (list[P3_pred]): Torch.Size(b, n_anchors, self.no)

        return x if self.training else (torch.cat(z, 1), x)

After modification, it looks like this:

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x[i](bs,self.no * self.na,20,20) to x[i](bs,self.na,20,20,self.no)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
        return x

When quantizing, this decode code has to be applied to the output of the quantized model instead, so that yolov5's subsequent post-processing can still produce detection results. The code also uses the _make_grid function from the Detect class, which therefore needs to be copied into the quantization script as well, as shown below. The main point is to move out every parameter the Detect class relies on; if your custom yolov5 model changes any of these parameters, change them here accordingly.

        # model inference
        x = model(im)  # 'model' here is already the quantized model; x is its output

        nc = 11  # number of classes
        no = nc + 5 + 180  # outputs per anchor (the extra 180 comes from my customized model)
        anchors = [[1.25, 1.625, 2, 3.75, 4.125, 2.875], [1.875, 3.8125, 3.875, 2.8125, 3.6875, 7.4375], [3.625, 2.8125, 4.875, 6.1875, 11.65625, 10.1875]]
        nl = 3  # number of detection layers
        na = 3  # number of anchors
        grid = [torch.zeros(1).to(device)] * nl  # init grid
        anchors = torch.tensor(anchors).float().to(device).view(nl, -1, 2)
        anchor_grid=[torch.zeros(1).to(device)] * nl
        stride = [8, 16, 32]

        z = []
        for i in range(nl):
            bs, _, ny, nx, _no = x[i].shape
            if grid[i].shape[2:4] != x[i].shape[2:4]:
                    grid[i], anchor_grid[i] = _make_grid(anchors, stride, nx, ny, i)

            y = x[i].sigmoid() # (tensor): (b, self.na, h, w, self.no)
            y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + grid[i]) * stride[i]  # xy
            y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid[i]  # wh
            z.append(y.view(bs, -1, no)) # z (list[P3_pred]): Torch.Size(b, n_anchors, self.no)
        out, train_out = torch.cat(z, 1), x
def _make_grid(anchors,stride,nx=20, ny=20, i=0):
    d = anchors[i].device
    shape = 1, 3, ny, nx, 2
    y, x = torch.arange(ny, device=d), torch.arange(nx, device=d)
    yv, xv = torch.meshgrid([y, x])
    grid = torch.stack((xv, yv), 2).expand(shape).float()  # add grid offset, i.e. y = 2.0 * x - 0.5
    anchor_grid = (anchors[i].clone() * stride[i]).view((1, 3, 1, 1, 2)).expand(shape).float()
    return grid, anchor_grid
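
For completeness, once `out` has been assembled this way it can go through yolov5's standard post-processing just as in val.py. The sketch below assumes a stock yolov5 head (my model's extra angle outputs need different handling), the thresholds are illustrative, and original_shape is a placeholder for the source image's height and width:

    from utils.general import non_max_suppression, scale_coords

    # out: (bs, total_anchors, no) predictions assembled above
    pred = non_max_suppression(out, conf_thres=0.25, iou_thres=0.45)
    for det in pred:  # one (n, 6) tensor of x1, y1, x2, y2, conf, cls per image
        if len(det):
            # map boxes from the 1024x1024 network input back to the original image size
            det[:, :4] = scale_coords((1024, 1024), det[:, :4], original_shape).round()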

After adjusting the yolov5 model as the official documentation requires, refer to the official PyTorch quantization example (MNIST handwritten-digit recognition) to write a quantization script. Quantization is done in two steps. The first step generates the quantization configuration file:

from pytorch_nndct.apis import torch_quantizer

    # load the yolov5 model
    model = DetectMultiBackend(file_path)

    input = torch.randn([1, 3, 1024, 1024], device=device)
    quantizer = torch_quantizer(
        quant_mode, model, (input), device=device, bitwidth=8)

    quant_model = quantizer.quant_model
    quant_model = quant_model.to(device)

    # run the quantized model; adapt the evaluate function from yolov5's val.py
    # together with the feature post-processing described above
    print(evaluate(model=quant_model))

    # generate the quantization config file
    quantizer.export_quant_config()

The second step generates the quantized xmodel:

from pytorch_nndct.apis import torch_quantizer

    # load the yolov5 model
    model = DetectPrunedMultiBackend(file_path)

    input = torch.randn([1, 3, 1024, 1024],device=device)
    quantizer = torch_quantizer(
        quant_mode, model, (input), device=device,bitwidth=8)

    quant_model = quantizer.quant_model
    quant_model = quant_model.to(device)

    print(evaluate(model=quant_model))

    # generate the xmodel
    if deploy:
        quantizer.export_xmodel(deploy_check=False)

These two pieces of code are nearly identical. The official quantization example also includes fast fine-tuning of the quantized model, split across these two steps: the first step trains the quantized model and quickly fine-tunes the quantized parameters, and the second step reads the parameters saved in the first step and generates the xmodel file. However, since that API is poorly documented I have not figured out how to adapt the training loss here, and the changes I made to my own yolov5 model altered the loss function significantly, so I have put the fast fine-tuning feature on hold for now. If anyone knows how to use it, please let me know in the comments.
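
For reference, the way I tie the two steps together follows the control flow of the official example: the same script is run twice, once with quant_mode set to 'calib' (which calibrates and writes the quantization config) and once with 'test' plus a deploy switch (which exports the xmodel). A minimal sketch under those assumptions, with argument names of my own choosing; the model-loading call, file_path and evaluate refer to the same names used in the snippets above:

    import argparse
    import torch
    from pytorch_nndct.apis import torch_quantizer

    parser = argparse.ArgumentParser()
    parser.add_argument('--quant_mode', default='calib', choices=['calib', 'test'])
    parser.add_argument('--deploy', action='store_true')
    args = parser.parse_args()

    device = torch.device('cpu')
    model = DetectMultiBackend(file_path)              # load the float yolov5 weights as above
    dummy_input = torch.randn([1, 3, 1024, 1024], device=device)

    quantizer = torch_quantizer(args.quant_mode, model, (dummy_input), device=device, bitwidth=8)
    quant_model = quantizer.quant_model.to(device)

    print(evaluate(model=quant_model))                 # forward passes drive calibration / evaluation

    if args.quant_mode == 'calib':
        quantizer.export_quant_config()                # step 1: write the quantization config
    elif args.deploy:
        quantizer.export_xmodel(deploy_check=False)    # step 2: export the xmodel for compilation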

With the Python scripts written, they have to be run inside the Vitis-AI Docker environment. I used the latest Vitis-AI 2.5 with the CPU version of the Docker image. Running the quantization script in the PyTorch conda environment inside Docker produces the pre-compilation xmodel file; in the next step this model is compiled for the DPUCZDX8G configuration used by the ZCU102 board. To make debugging the quantization easier, though, I also installed the Python source of the Vitis-AI quantizer into the conda environment on my own Ubuntu machine (the PyTorch version of the quantizer source can be found in the Vitis-AI repository). Once that is installed, the quantizer can be used outside Docker, which is convenient for debugging.

3. Model compilation

This step is straightforward. In the PyTorch environment inside Docker, use vai_c_xir, the compiler for PyTorch models, to compile the xmodel file generated in the previous step. The command I used is shown below: -x specifies the xmodel from the previous step, -a specifies the architecture file for the DPU and development board, -o specifies the output directory, and -n specifies the name of the output model. If this step completes without errors, you get a model with a single DPU subgraph; the output of a successful compilation is shown in the figure below the command.

vai_c_xir -x ./DetectMultiBackend_int.xmodel -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/ZCU102/arch.json -o ./ -n model

(figure: vai_c_xir output for a successful compilation with a single DPU subgraph)
If compilation instead produces a model with multiple DPU subgraphs, the model was not fully quantized and compiled, for one of two reasons: first, not all functions other than the forward pass were removed as required; second, the model contains operators the DPU cannot recognize. Either one causes the model to be split into multiple subgraphs during compilation. To run such a model you have to read the outputs of the individual subgraphs in sequence in your own code and re-implement the functions/operators that were not compiled, which greatly increases the workload and is very troublesome.
For reference, my compiled model is about one third the size of the model before compilation.
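
A quick way to confirm how many DPU subgraphs the compiled model actually contains is to open it with the xir Python bindings, the same API the on-board test script below uses. A small sketch (the file name model.xmodel is whatever your -o/-n options produced):

    import xir

    graph = xir.Graph.deserialize("model.xmodel")   # output of vai_c_xir above
    children = graph.get_root_subgraph().toposort_child_subgraph()
    dpu_subgraphs = [s for s in children
                     if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]
    print("DPU subgraphs:", len(dpu_subgraphs))     # should be 1 for a fully compiled model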

4. Running on the development board

After obtaining the compiled model, the runtime environment on the development board has to be prepared.

The first step is to bring up an embedded system on the board. There are plenty of official guides for the ZCU102, and you can directly download the ZCU102 embedded system image linked in the UG1414 document. The image is version 2022.1, with a DPU instantiated on the PL side and the driver set up on the PS side, so it is a ready-to-use DPU development environment. After downloading, burn the image to an SD card with an SD-card imaging tool to make a ZCU102 boot disk, set the board to SD-card boot mode, and debug it from the PC over UART with minicom. Once the network interface is configured, the board can reach the external network.

The second step is to build torch for the Python environment on the board. Although the compiled model itself no longer runs through torch.nn operators, the pre- and post-processing code of yolov5 makes heavy use of tensor operations, and I did not have time to rewrite it all in numpy bit by bit, so I chose to build PyTorch on the board. I copied the source code to the board and compiled it there: follow the instructions on GitHub to clone the source and all its submodules, then build with the commands below (the board is missing this and that, so torch cannot be built exactly as the simple instructions on GitHub describe). Compilation took about six hours:

git submodule update --remote third_party/protobuf
USE_CUDA=0 USE_MKLDNN=0 USE_QNNPACK=0 USE_NNPACK=0 USE_DISTRIBUTED=0 BUILD_CAFFE2=0 BUILD_CAFFE2_OPS=0 python3 setup.py build
python3 setup.py develop && python3 -c "import torch"

The third step is to install the other Python dependencies of yolov5. Among them only PyTorch contains C++; the rest are pure Python, so only torch needs to be compiled with the board's own toolchain, and everything else can be installed directly with pip from whl files. (One complaint: the image does not ship with pip, so pip has to be installed first. During this process use the date command to set the board's clock in advance, preferably in sync with the real date, otherwise downloads occasionally fail with strange errors.)

The fourth step, once the environment is ready, is a test script on the board. There is an official test script for PyTorch models; you can refer to it for the APIs that load the model into the DPU and run it. The relevant code is as follows:

from typing import List

import vart
import xir

def get_child_subgraph_dpu(graph: "Graph") -> List["Subgraph"]:
    assert graph is not None, "'graph' should not be None."
    root_subgraph = graph.get_root_subgraph()
    assert (root_subgraph is not None), "Failed to get root subgraph of input Graph object."
    if root_subgraph.is_leaf:
        return []
    child_subgraphs = root_subgraph.toposort_child_subgraph()
    assert child_subgraphs is not None and len(child_subgraphs) > 0
    return [
        cs
        for cs in child_subgraphs
        if cs.has_attr("device") and cs.get_attr("device").upper() == "DPU"
    ]

# read all subgraphs of the model (after quantization there is only one) and load it onto the DPU
g = xir.Graph.deserialize(model)
subgraphs = get_child_subgraph_dpu(g)
all_dpu_runners = []
for i in range(threads):
    all_dpu_runners.append(vart.Runner.create_runner(subgraphs[0], "run"))
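
Once the runner exists, inference itself follows the pattern of the official VART examples: allocate numpy buffers matching the runner's input and output tensor shapes, submit them with execute_async, and wait for the job to complete. A minimal single-thread sketch (img stands for the 1024×1024×3 int8 image prepared as described below):

    import numpy as np

    runner = all_dpu_runners[0]
    input_tensors = runner.get_input_tensors()
    output_tensors = runner.get_output_tensors()

    # int8 buffers shaped like the DPU's input / output tensors
    input_data = [np.empty(tuple(input_tensors[0].dims), dtype=np.int8, order="C")]
    output_data = [np.empty(tuple(t.dims), dtype=np.int8, order="C") for t in output_tensors]

    input_data[0][...] = img                       # already scaled by input_scale and cast to int8
    job_id = runner.execute_async(input_data, output_data)
    runner.wait(job_id)
    # output_data now holds the raw fixed-point feature maps to be rescaled and reshaped below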

Now the yolov5 test program val.py has to be adapted to the input and output format of the DPU model. During quantization the model's inputs and outputs change from floating-point numbers to fixed-point numbers, and the position of the decimal point is stored in the model. The easiest way to see this is to open the xmodel with netron and inspect the structure.
The input module is shown in the figure below. Two points matter here. The first is that the input data is an 8-bit fixed-point number with the decimal point at the seventh bit. The input of the yolov5 model is originally a normalized floating-point image, so it has to be multiplied by 2^7 = 128 and cast to an 8-bit integer; that is the float-to-fixed-point conversion. The first snippet below reads the decimal-point position of the model input, and the second processes the input image. (I pass the input_scale obtained in the first snippet into the dataloader class and do the multiplication and format conversion while rearranging the image dimensions; see the code under the second point.)

    # read the decimal-point position of the quantized model's input, which gives the factor
    # input_scale needed when converting floating-point input to fixed point
    input_fixpos = all_dpu_runners[0].get_input_tensors()[0].get_attr("fix_point")
    input_scale = 2 ** input_fixpos

The second point is that the input image dimension expected by the DPU is batchsize×h×w×3 (1×1024×1024×3), while the image read through the dataloader is 1×3×1024×1024. The dataloader therefore has to be modified so that the input matches the DPU model: modify the __getitem__ method of the LoadImagesAndLabels class in datasets.py and append a piece of code at the end that adjusts the dimension order and data format.

(figure: the model input viewed in netron)

    # rearrange the image into the fixed-point NHWC layout the DPU expects;
    # the +0.5 rounds to the nearest integer before the int8 cast
    img = torch.from_numpy(img)
    img = img.permute(1, 2, 0).float().numpy() / 255 * self.inputscale + 0.5
    img = img.astype(np.int8)

The output consists of three feature layers; the smallest layer is used as the example here.
(figures: the smallest output layer viewed in netron, showing the download node and the fix node)

Key points:
1. The DPU's output at the download node is 1×32×32×588, not the 1×3×32×32×196 shown at the fix node, so in post-processing 1×32×32×588 has to be converted back to 1×3×32×32×196. Following the original yolov5 processing, first permute 1×32×32×588 to 1×588×32×32, then reshape it to 1×3×196×32×32, and finally permute to 1×3×32×32×196.
2. At the fix node you can see the output is an 8-bit signed fixed-point number with the decimal point at the third bit, but the values are stored as integers. In post-processing the integer data therefore has to be converted to floating point and divided by 2^3 = 8 before it can be used for NMS and the rest of the post-processing.
3. The decimal-point positions of the fixed-point data of the three feature layers may differ! In my case two layers have the decimal point at the third bit and one at the fourth (hence the divisors 8, 8 and 16 below), so pay close attention to this.
The specific code is as follows:

    # reshape (rather than view) because permute makes the tensor non-contiguous
    output[0] = (output[0].float() / 8).permute(0, 3, 1, 2).reshape(1, 3, 196, 128, 128).permute(0, 1, 3, 4, 2)
    output[1] = (output[1].float() / 8).permute(0, 3, 1, 2).reshape(1, 3, 196, 64, 64).permute(0, 1, 3, 4, 2)
    output[2] = (output[2].float() / 16).permute(0, 3, 1, 2).reshape(1, 3, 196, 32, 32).permute(0, 1, 3, 4, 2)
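
Instead of hard-coding the divisors 8 and 16, the output scale can also be read from the runner in the same way input_scale was read above, which guards against the per-layer fix_point changing between training runs (just make sure the order of get_output_tensors() matches the order in which you read the output buffers). A sketch along those lines, keeping my model's no = 196:

    # read each output's fix_point instead of hard-coding the divisors
    out_tensors = all_dpu_runners[0].get_output_tensors()
    output_scales = [1.0 / (2 ** t.get_attr("fix_point")) for t in out_tensors]

    for i in range(3):
        bs, ny, nx, _ = output[i].shape            # e.g. 1 x 128 x 128 x 588
        output[i] = (output[i].float() * output_scales[i]) \
            .permute(0, 3, 1, 2).reshape(bs, 3, 196, ny, nx).permute(0, 1, 3, 4, 2)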

Epilogue

After completing these steps, the detection boxes can be parsed on the ZCU102 development board. The feature-extraction stage currently reaches 30 fps with a 1024×1024 input, and detection accuracy is not significantly affected, which I consider a milestone. What I have shared in this post covers the main pitfalls I ran into along the way. The root cause of most of them is that the manuals say very little about the dimensions and formats of these inputs and outputs, so at every step I had to inspect things over and over with various tools and work out the meaning of the mostly uncommented official code. Troublesome as this was, it also deepened my understanding of how yolov5 processes its data. The complete code cannot be made public for certain reasons, so if anything is unclear please ask directly in the comments and I will do my best to discuss and learn with you. I will keep working on this part, and I hope readers with the same goal will speak up, discuss, and advise each other.


Origin blog.csdn.net/qq_36745999/article/details/126981630