3.8 CUDA Runtime API - using CUDA kernel functions to accelerate YOLOv5 post-processing

Foreword

Teacher Du launched the tensorRT high-performance deployment from scratch course. I went through it before without taking notes and have forgotten a lot, so this time I am going through it again and taking notes along the way.

This lesson covers the streamlined CUDA tutorial: using a CUDA kernel function to accelerate YOLOv5 post-processing.

The course outline can be seen in the mind map below

[Image: course outline mind map]

1. Yolov5 post-processing

Yolov5 is a classic model in object detection, so learning to decode its post-processing is well worth it. Here we only use kernel functions to decode YOLOv5's inference results and restore them to boxes, focusing on the problems post-processing solves while keeping performance in mind.

Some notes from experience:

  1. To study post-processing code, you can convert the PyTorch output to numpy, write it to a file with tobytes, and then read it back in C++. This makes it quick to investigate and troubleshoot problems without running tensorRT inference at all; it is essentially a way of controlling variables (see the C++ reading sketch after the Python save snippet in section 2)
  2. fast_nms_kernel may drop boxes in extreme cases, but such cases rarely occur in practice, and measurements show almost no effect
  3. Fast NMS is simple and efficient to implement in CUDA because it requires no sorting

2. Post-processing case

Let's take a look at YOLOv5's entire post-processing pipeline: decode + NMS.

Since the whole post-processing process can be a bit complicated, we can do it on the CPU first and then consider the work on the GPU.

To demonstrate the whole post-processing flow, we run inference with PyTorch, save the inference result via numpy, and then read it in C++ for post-processing. This also lets us check whether PyTorch's final result is consistent with our own post-processing result.

The code for saving inference results in numpy is as follows:

with open("../workspace/predict.data", "wb") as f:
   f.write(pred.cpu().data.numpy().tobytes())
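
On the C++ side, the raw bytes can be read back directly. The helper below is a minimal sketch (load_predict is a hypothetical name, not from the course code); it assumes predict.data holds nothing but the float32 values of the [n, 85] tensor, so the row count can be recovered from the file size once the column count is known:

// Minimal sketch: read the float32 bytes written by tobytes() back into memory.
// load_predict is a hypothetical helper; it assumes the file contains rows * cols
// floats ([n, 85]), so rows is derived from the file size given cols.
#include <cstdio>
#include <vector>

static std::vector<float> load_predict(const char* file, int cols, int& rows){
    std::vector<float> data;
    FILE* f = fopen(file, "rb");
    if(f == nullptr) return data;

    fseek(f, 0, SEEK_END);
    long nbytes = ftell(f);
    fseek(f, 0, SEEK_SET);

    rows = (int)(nbytes / sizeof(float) / cols);
    data.resize((size_t)rows * cols);
    fread(data.data(), 1, (size_t)nbytes, f);
    fclose(f);
    return data;
}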

The prediction output of YOLOv5 on the COCO dataset (the input to post-processing) is a tensor of shape [n, 85], where the 85 values are [cx, cy, width, height, objectness, 80 class confidences]

For the post-processing principles and more details, please refer to the detailed explanation of YOLOv5 inference and the high-performance preprocessing implementation

2.1 cpu_decode

Let's look at cpu_decode first. The key points of CPU decoding are:

  1. Avoid redundant computation: some mathematical operations take far more time than an if check, so reducing how often they run is the key to improving performance
  2. The NMS implementation can be optimized, e.g. with remove_flags and pre-allocated memory; reserve allocates memory for the output

The core code is as follows:

vector<Box> cpu_decode(float* predict, int rows, int cols, float confidence_threshold = 0.25f, float nms_threshold = 0.45f){

    vector<Box> boxes;
    int num_classes = cols - 5;
    for(int i = 0; i < rows; ++i){
        float* pitem = predict + i * cols;
        float objness = pitem[4];
        if(objness < confidence_threshold)
            continue;

        float* pclass = pitem + 5;
        int label     = std::max_element(pclass, pclass + num_classes) - pclass;
        float prob    = pclass[label];
        float confidence = prob * objness;
        if(confidence < confidence_threshold)
            continue;

        float cx     = pitem[0];
        float cy     = pitem[1];
        float width  = pitem[2];
        float height = pitem[3];
        float left   = cx - width * 0.5;
        float top    = cy - height * 0.5;
        float right  = cx + width * 0.5;
        float bottom = cy + height * 0.5;
        boxes.emplace_back(left, top, right, bottom, confidence, (float)label);
    }

    std::sort(boxes.begin(), boxes.end(), [](Box& a, Box& b){
        return a.confidence > b.confidence;
    });
    std::vector<bool> remove_flags(boxes.size());
    std::vector<Box> box_result;
    box_result.reserve(boxes.size());

    auto iou = [](const Box& a, const Box& b){
        float cross_left   = std::max(a.left, b.left);
        float cross_top    = std::max(a.top, b.top);
        float cross_right  = std::min(a.right, b.right);
        float cross_bottom = std::min(a.bottom, b.bottom);

        float cross_area = std::max(0.0f, cross_right - cross_left) * std::max(0.0f, cross_bottom - cross_top);
        float union_area = std::max(0.0f, a.right - a.left) * std::max(0.0f, a.bottom - a.top) 
                         + std::max(0.0f, b.right - b.left) * std::max(0.0f, b.bottom - b.top) - cross_area;
        if(cross_area == 0 || union_area == 0) return 0.0f;
        return cross_area / union_area;
    };

    for(int i = 0; i < boxes.size(); ++i){
        if(remove_flags[i]) continue;

        auto& ibox = boxes[i];
        box_result.emplace_back(ibox);
        for(int j = i + 1; j < boxes.size(); ++j){
            if(remove_flags[j]) continue;

            auto& jbox = boxes[j];
            if(ibox.label == jbox.label){
                // class matched
                if(iou(ibox, jbox) >= nms_threshold)
                    remove_flags[j] = true;
            }
        }
    }
    return box_result;
}
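
The Box type is not shown in the snippet above. A minimal definition consistent with how cpu_decode constructs it (emplace_back with left, top, right, bottom, confidence, label), plus a hypothetical call wiring in the data loaded from predict.data, might look like this (the struct would need to be declared before cpu_decode, and load_predict is the sketch from earlier; the real course code may differ):

// Hypothetical Box definition matching the fields cpu_decode emplaces.
struct Box{
    float left, top, right, bottom, confidence, label;
    Box(float left, float top, float right, float bottom, float confidence, float label)
        :left(left), top(top), right(right), bottom(bottom), confidence(confidence), label(label){}
};

// Usage sketch, assuming the includes from the load_predict snippet above.
int main(){
    int rows = 0;
    auto data  = load_predict("../workspace/predict.data", 85, rows);  // 85 = cx, cy, w, h, objectness + 80 classes
    auto boxes = cpu_decode(data.data(), rows, 85);
    printf("%d boxes kept after decode + nms\n", (int)boxes.size());
    return 0;
}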

The code can be divided into two main parts: prediction result decoding and non-maximum suppression

Prediction result decoding:

First, traverse each prediction box and filter the results with the confidence threshold (confidence_threshold). Then compute the class of the prediction box by selecting, among the 80 classes, the one with the highest probability as the box's label. Next, convert the box's center point, width, and height into top-left and bottom-right coordinates, and save the box's information into boxes.

Non-Maximum Suppression (NMS):

First, sort all prediction boxes in boxes in descending order of confidence to simplify the subsequent NMS pass. NMS is implemented mainly via the remove_flags flags; the prediction boxes that are not marked for removal are saved into box_result.

Key performance optimization points:

  • Prediction box filtering: during decoding, filter with the confidence threshold first, avoiding unnecessary downstream computation and processing
  • Prediction box sorting: pass references in the lambda, and use reserve to pre-allocate box_result to improve performance
  • Flag bits: during NMS, use the remove_flags flags to mark boxes that need to be removed, which is more efficient than exhaustively comparing every pair of boxes

2.2 gpu_decode

Now let's look at gpu_decode. The key points of GPU decoding are:

  1. To represent an output array whose length is not known in advance, use the layout [count, box1, box2, box3, ...]; this requires an upper limit on the number of boxes
  2. Append elements to the array with atomicAdd, which returns the index to write to
  3. As with cpu_decode, avoid unnecessary computation wherever possible

The decode core code is as follows:

static __global__ void decode_kernel(
    float* predict, int num_bboxes, int num_classes, float confidence_threshold, 
    float* invert_affine_matrix, float* parray, int max_objects, int NUM_BOX_ELEMENT
){
    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= num_bboxes) return;

    float* pitem     = predict + (5 + num_classes) * position;
    float objectness = pitem[4];
    if(objectness < confidence_threshold)
        return;

    float* class_confidence = pitem + 5;
    float confidence        = *class_confidence++;
    int label               = 0;
    for(int i = 1; i < num_classes; ++i, ++class_confidence){
        if(*class_confidence > confidence){
            confidence = *class_confidence;
            label      = i;
        }
    }

    confidence *= objectness;
    if(confidence < confidence_threshold)
        return;

    int index = atomicAdd(parray, 1);
    if(index >= max_objects)
        return;

    float cx         = *pitem++;
    float cy         = *pitem++;
    float width      = *pitem++;
    float height     = *pitem++;
    float left   = cx - width * 0.5f;
    float top    = cy - height * 0.5f;
    float right  = cx + width * 0.5f;
    float bottom = cy + height * 0.5f;
    // affine_project(invert_affine_matrix, left,  top,    &left,  &top);
    // affine_project(invert_affine_matrix, right, bottom, &right, &bottom);

    // left, top, right, bottom, confidence, class, keepflag
    float* pout_item = parray + 1 + index * NUM_BOX_ELEMENT;
    *pout_item++ = left;
    *pout_item++ = top;
    *pout_item++ = right;
    *pout_item++ = bottom;
    *pout_item++ = confidence;
    *pout_item++ = label;
    *pout_item++ = 1; // 1 = keep, 0 = ignore
}
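
The course's exact host code is not shown here, but a launch sketch under reasonable assumptions looks like the following: the output buffer holds 1 + max_objects * NUM_BOX_ELEMENT floats, its first float is the box counter and must be cleared before each launch, and one thread is started per candidate box. All names below are illustrative; data and rows come from the load_predict sketch in section 2.

// Host-side launch sketch; variable names and sizes are illustrative assumptions.
const int max_objects     = 1000;
const int NUM_BOX_ELEMENT = 7;      // left, top, right, bottom, confidence, class, keepflag
const int num_classes     = 80;
const int num_bboxes      = rows;   // e.g. 25200 candidate boxes for a 640x640 input
const float confidence_threshold = 0.25f;

float* predict_device = nullptr;
cudaMalloc(&predict_device, sizeof(float) * num_bboxes * (5 + num_classes));
cudaMemcpy(predict_device, data.data(), sizeof(float) * num_bboxes * (5 + num_classes), cudaMemcpyHostToDevice);

float* output_device = nullptr;
cudaMalloc(&output_device, sizeof(float) * (1 + max_objects * NUM_BOX_ELEMENT));
cudaMemset(output_device, 0, sizeof(float));    // clear the box counter stored in output_device[0]

int threads = 256;
int blocks  = (num_bboxes + threads - 1) / threads;   // one thread decodes one prediction box
decode_kernel<<<blocks, threads>>>(
    predict_device, num_bboxes, num_classes, confidence_threshold,
    nullptr /* invert_affine_matrix, unused here */, output_device, max_objects, NUM_BOX_ELEMENT);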

The gpu_decode code above is very similar to the CPU version. The kernel is launched with one thread per prediction box, and each thread decodes one box: position is the index of the current thread, predict is the starting address of all prediction boxes, and pitem is the starting address of the box handled by the current thread, as shown in the figure below:

Figure 2-1 pitem

Also, to store the decoded boxes we use an atomic add (atomicAdd), which avoids conflicts when multiple threads write to the output array at the same time and so guarantees correct results. Specifically, index = atomicAdd(parray, 1) adds 1 to the value at the memory location pointed to by parray and assigns the value before the addition to index, so index is the position of the currently processed box among all output boxes. To avoid exceeding the maximum number of boxes, the kernel returns directly once index reaches max_objects and the box is not processed further.

After a prediction box is decoded, its information needs to be stored. The first element of the output buffer, *parray, is the number of saved boxes, followed by the information of each box (NUM_BOX_ELEMENT values per box), as shown in the figure below.

Figure 2-2 pout_item

Of course, NMS can also be done in CUDA; the code is as follows:

static __global__ void fast_nms_kernel(float* bboxes, int max_objects, float threshold, int NUM_BOX_ELEMENT){

    int position = (blockDim.x * blockIdx.x + threadIdx.x);
    int count = min((int)*bboxes, max_objects);
    if (position >= count) 
        return;
    
    // left, top, right, bottom, confidence, class, keepflag
    float* pcurrent = bboxes + 1 + position * NUM_BOX_ELEMENT;
    for(int i = 0; i < count; ++i){
        float* pitem = bboxes + 1 + i * NUM_BOX_ELEMENT;
        if(i == position || pcurrent[5] != pitem[5]) continue;

        if(pitem[4] >= pcurrent[4]){
            if(pitem[4] == pcurrent[4] && i < position)
                continue;

            float iou = box_iou(
                pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3],
                pitem[0],    pitem[1],    pitem[2],    pitem[3]
            );

            if(iou > threshold){
                pcurrent[6] = 0;  // 1=keep, 0=ignore
                return;
            }
        }
    }
}
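
Assuming box_iou is a small __device__ helper that computes the IoU of two boxes (not shown above), the NMS kernel can be launched with one thread per possible output box; afterwards the [count, box1, box2, ...] buffer is copied back to the host and only boxes whose keepflag is still 1 are kept. A sketch, continuing the illustrative variable names from the decode launch sketch:

// Launch fast_nms_kernel (one thread per possible output box), then read back
// the results and filter on keepflag. nms_threshold is an assumed value.
const float nms_threshold = 0.45f;
int nms_threads = 256;
int nms_blocks  = (max_objects + nms_threads - 1) / nms_threads;
fast_nms_kernel<<<nms_blocks, nms_threads>>>(output_device, max_objects, nms_threshold, NUM_BOX_ELEMENT);

std::vector<float> output_host(1 + max_objects * NUM_BOX_ELEMENT);
cudaMemcpy(output_host.data(), output_device,
           sizeof(float) * output_host.size(), cudaMemcpyDeviceToHost);

int count = std::min((int)output_host[0], max_objects);
for(int i = 0; i < count; ++i){
    float* pbox = output_host.data() + 1 + i * NUM_BOX_ELEMENT;
    if(pbox[6] != 1) continue;                 // keepflag: 1 = keep, 0 = suppressed
    // pbox[0..3] = left, top, right, bottom; pbox[4] = confidence; pbox[5] = class label
}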

fast_nms_kernel may keep fewer boxes in extreme cases. For example, when several overlapping boxes have exactly the same confidence, the conditional checks combined with parallel execution can let a later box suppress an earlier one, causing the earlier box to be dropped.

It is worth noting that when measuring mAP, only the CPU version of NMS should be used: the mAP test needs the overlap of every box computed exactly, with sorting and suppression following a specific algorithm, whereas the parallel GPU NMS trades a little accuracy for speed and cannot meet the requirements of the mAP test.

The figures below compare PyTorch's result with our own post-processing result, and we can see that the results match.

Figure 2-3 PyTorch result

Figure 2-4 Result of our own post-processing implementation

Summary

In this lesson we studied the post-processing of the classic object detection algorithm YOLOv5. We first implemented the whole decode on the CPU; the CPU version already performs well and is suitable for running on some edge embedded devices. Then, based on the CPU version, we wrote a kernel function to accelerate the whole decoding process. Many things still need to be done by yourself, so practice more.

For more discussion of the code, please refer to the infer source-code reading of yolo.cu


Origin blog.csdn.net/qq_40672115/article/details/131627933