[Repost] YOLOv5 TensorRT precision alignment: a summary

I recently needed TensorRT acceleration for a YOLO project. Testing showed that C++ TensorRT inference with the .engine file gave results quite different from inference with the .pt file in yolo: some images matched, some had very different confidences, and some objects were simply not detected on the TRT side.
My engine was generated along the pt --> wts --> engine route.


Note: I only changed two things: I set the C++ parameters to match torch, and I changed the BN layer eps (the 1e-3 in the C++ code, to match torch).
My input images are all square, so I did not touch the rest of the coordinate handling. The original author used rectangular images, which is why they had to change more.


The author below really knows their stuff; many thanks to that article for solving my problem!
Blog link: https://blog.csdn.net/qq_35756383/article/details/126787282



---------------------- Original article below ----------------------

This article aligns the precision of the C++ (TensorRT) inference code for YOLOv5 v6.1 with the torch results, using yolov5-l as the example.

yolov5: https://github.com/ultralytics/yolov5

tensorrtx: https://github.com/wang-xinyu/tensorrtx (implementations of popular deep learning networks with the TensorRT network definition API)

Code for this article: yolov5-tenssort (TensorRT C++ inference precision alignment for yolov5 v6.1)

Experimental environment

  • Ubuntu20.04
  • TensorRT-7.2.3.4
  • OpenCV 3.4.8 (C++), 4.6.0 (torch)
  • CUDA11.1
  • RTX3060

Getting TensorRT running

git clone https://github.com/wang-xinyu/tensorrtx.git
cd tensorrtx/yolov5
mkdir build
cd build

Modify the CUDA and TensorRT paths in CMakeLists.txt, as well as the OpenCV version.
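In the tensorrtx CMakeLists.txt this generally means the include_directories/link_directories entries that point at the CUDA and TensorRT install locations, plus the find_package(OpenCV) call; the exact lines depend on the version of the repo you cloned.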

Run cmake:

cmake ..

Modify the corresponding parameters in yolov5.cpp and yololayer.h to suit your own dataset.
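As a rough guide (the values below are the tensorrtx defaults at the time and may differ in your checkout), the parameters usually touched are:

// yololayer.h -- adapt to your dataset
static constexpr int CLASS_NUM = 80;   // number of classes
static constexpr int INPUT_H = 640;    // must be divisible by 32
static constexpr int INPUT_W = 640;

// yolov5.cpp
#define NMS_THRESH 0.4
#define CONF_THRESH 0.5
#define BATCH_SIZE 1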

Compile; a new libmyplugins.so and a yolov5 executable will be generated under the build path:

make

Following README.md, convert the trained weights best.pt into best.wts with gen_wts.py inside the yolov5 repo, then copy the file to the tensorrtx/yolov5/build path:

git clone https://github.com/ultralytics/yolov5
cd yolov5
# change the device from cpu to gpu around line 28 of gen_wts.py:
#   device = select_device('0')
cp <path>/tensorrtx/yolov5/gen_wts.py ./
python gen_wts.py -w best.pt -o best.wts
cp best.wts <dir_path>/tensorrtx/yolov5/build/
cd <path>/tensorrtx/yolov5/build

Generate the engine; best.engine (the TensorRT model) will be created under the build path:

./yolov5 -s best.wts best.engine l
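The trailing l selects the yolov5-l depth/width configuration, matching the yolov5-l weights used in this article; the other model sizes (n/s/m/x) are passed the same way.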

Load the .engine file and run inference on the images in the specified directory:

./yolov5 -d best.engine <imgs_dir>

The inference results are written under the build path, and the inference time is printed.

Precision comparison: torch vs. TensorRT

1. TensorRT inference results

Add txt output to the C++ code:

// ------- yolov5.cpp, main(~)
std::string out_path;
// below cv::putText(~)
out_path = "_" + file_names[f - fcount + 1 + b];
write2txt(out_path.replace(out_path.find("."), 4, ".txt"), std::to_string((int)res[j].class_id), std::to_string(res[j].conf), r);

// ------- common.hpp
void write2txt(std::string txtpath, std::string cls, std::string conf, cv::Rect r){
    std::ofstream ofs;
    ofs.open(txtpath, std::ios::app);  // std::ios::app appends instead of overwriting
    // convert the rect to xmin, ymin, xmax, ymax
    int xmin, xmax, ymin, ymax;
    xmin = (int)r.x;
    ymin = (int)r.y;
    xmax = (int)(r.x + r.width);
    ymax = (int)(r.y + r.height);
    ofs << cls << " " << conf << " " << xmin << " " << ymin << " " << xmax << " " << ymax << std::endl;  // endl adds the newline
    ofs.close();
}

Change the C++ parameter values to match torch:

// yolov5.cpp
#define NMS_THRESH 0.45
#define CONF_THRESH 0.25
// yololayer.h
static constexpr float IGNORE_THRESH = 0.25f;

Run inference on the image; the output is:

0 0.926520 52 408 214 874
0 0.906347 214 412 321 860
0 0.870304 676 483 810 872
0 0.863786 0 621 63 868
45 0.950376 -50 101 883 817
55 0.904248 1 253 34 327

2. torch inference results

Run inference through yolov5/detect.py:

python detect.py --weights best.pt --source bus.jpg --save-txt --save-conf

The results are saved under runs/detect/exp/:

3 0.832716 0.618981 0.0382716 0.0824074 0.291247
0 0.041358 0.687963 0.082716 0.246296 0.602841
0 0.0240741 0.386111 0.045679 0.109259 0.658574
0 0.919136 0.618056 0.159259 0.369444 0.77239
55 0.0209877 0.268056 0.0419753 0.0694444 0.893587
0 0.327778 0.588426 0.166667 0.417593 0.907808
0 0.164815 0.592593 0.196296 0.431481 0.932
45 0.5 0.418519 1 0.681481 0.981999

To make comparison easier, change how detect.py writes the txt (the default format above is class, normalized xc yc w h, conf; the format below matches the C++ output: class, conf, xmin ymin xmax ymax in pixels):

# Write results
for *xyxy, conf, cls in reversed(det):
    c = int(cls)  # integer class
    if save_txt:  # Write to file
        line = (c, conf, *xyxy) if save_conf else (cls, *xyxy)
        with open(f'{txt_path}.txt', 'a') as f:
            f.write(('%s ') % line[0])
            f.write(('%g ' * (len(line) - 1)).rstrip() % line[1:] + '\n')
    if save_img or save_crop or view_img:  # Add bbox to image
        label = None if hide_labels else (names[c] if hide_conf else f'{names[c]} {conf:.2f}')
        annotator.box_label(xyxy, label, color=colors(c, True))
        if save_crop:
            save_one_box(xyxy, imc, file=save_dir / 'crops' / names[c] / f'{p.stem}.jpg', BGR=True)

The modified txt output:
3 0.291247 659 624 690 713
0 0.602841 0 610 67 876
0 0.658574 1 358 38 476
0 0.77239 680 468 809 867
55 0.893587 0 252 34 327
0 0.907808 198 410 333 861
0 0.932 54 407 213 873
45 0.981999 0 84 810 820

3. Comparing the results

For the same image, the C++ and torch results differ both in the number of detected objects and in the individual values, so the cause needs to be tracked down.

Troubleshooting and fixes

1. Image preprocessing

From the code, torch uses 640x* rectangular inference, padding with the value 114:

# utils/augmentations.py
def letterbox(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleFill=False, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)

    # Scale ratio (new / old)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better val mAP)
        r = min(r, 1.0)

    # Compute padding
    ratio = r, r  # width, height ratios
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding
    if auto:  # minimum rectangle
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding
    elif scaleFill:  # stretch
        dw, dh = 0.0, 0.0
        new_unpad = (new_shape[1], new_shape[0])
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios

    dw /= 2  # divide padding into 2 sides
    dh /= 2

    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border
    return im, ratio, (dw, dh)

The input image is first resized (bilinear interpolation) so that its long side becomes 640, and the short side is then padded up to the nearest multiple of 32.
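For example (numbers chosen here just to illustrate the arithmetic): a 1280x720 image gives r = min(640/720, 640/1280) = 0.5, so it is resized to 640x360; the remaining height padding 640 - 360 = 280 becomes 280 mod 32 = 24 with auto=True, i.e. 12 rows of value-114 border at the top and bottom, for a final 640x384 input.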

The C++ code, on the other hand, uses a 640x640 letterbox, padding with the value 128:

// preprocess.cu
__global__ void warpaffine_kernel(~){
    ...
    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;
    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;
    ...
}

void preprocess_kernel_img(~){
    ...
    warpaffine_kernel<<<blocks, threads, 0, stream>>>(
        src, src_width * 3, src_width,
        src_height, dst, dst_width,
        dst_height, 128, d2s, jobs);
}

Since switching the C++ side to dynamic input shapes is fairly involved, only the 640x640 inputs of the two pipelines are aligned here.

Disable torch's rectangular inference:

# utils/augmentations.py --> letterbox(~)
if auto:  # minimum rectangle
    pass
    # dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding

Change the C++ padding value to 114:

// preprocess.cu
void preprocess_kernel_img(~){
    ...
    warpaffine_kernel<<<blocks, threads, 0, stream>>>(
        src, src_width * 3, src_width,
        src_height, dst, dst_width,
        dst_height, 114, d2s, jobs);
}

Add code to dump the preprocessed image from both pipelines for inspection:

# utils/datasets.py
class LoadImages:
    ...
    # Padded resize
    img = letterbox(img0, self.img_size, stride=self.stride, auto=self.auto)[0]
    # print the RGB values of the 10x10 region starting at (400, 400)
    for i in range(400, 410):
        for j in range(400, 410):
            print("{}, {}, {}; ".format(img[i][j][0], img[i][j][1], img[i][j][2]), end='')
        print()
    ...

// yolov5.cpp
// image preprocessing
preprocess_kernel_img(img_device, img.cols, img.rows, buffer_idx, INPUT_W, INPUT_H, stream);
// copy the preprocessed result back to the CPU
float* recvCPU = (float*)malloc(size_image_dst * sizeof(float));
CUDA_CHECK(cudaMemcpy(recvCPU, buffer_idx, size_image_dst * sizeof(float), cudaMemcpyDeviceToHost));
cv::Mat resize_img(INPUT_H, INPUT_W, CV_8UC3);
for (int i = 0; i < INPUT_H; ++i){
    cv::Vec3b *p2 = resize_img.ptr<cv::Vec3b>(i);
    for (int j = 0; j < INPUT_W; ++j){
        p2[j][2] = round(recvCPU[i*INPUT_W + j] * 255);
        p2[j][1] = round(recvCPU[INPUT_W*INPUT_H + i*INPUT_W + j] * 255);
        p2[j][0] = round(recvCPU[2*INPUT_W*INPUT_H + i*INPUT_W + j] * 255);
    }
}
for (int i = 400; i < 410; i++) {
    uchar *data = resize_img.ptr<uchar>(i);  // ptr returns the address of row i, convenient for row-by-row access
    for (int j = 400*3; j < 400*3 + 10*3; j++) {  // 10 pixels x 3 channels, starting at column 400
        std::cout << (int)data[j] << ", ";
    }
    std::cout << std::endl;
}

Compare the preprocessed output of the two:

# torch
25, 1, 0; 25, 1, 0; 24, 1, 0; 25, 2, 0; 25, 2, 0; 24, 1, 0; 24, 1, 0; 25, 1, 0; 25, 2, 0; 26, 2, 1;
26, 0, 0; 27, 0, 3; 26, 1, 2; 25, 2, 0; 24, 1, 0; 24, 1, 0; 24, 1, 0; 27, 2, 0; 26, 0, 0; 26, 0, 0;
27, 0, 3; 26, 2, 4; 26, 0, 1; 26, 0, 0; 28, 2, 2; 27, 1, 1; 27, 1, 1; 27, 1, 1; 28, 2, 2; 28, 2, 2;
24, 0, 0; 25, 1, 1; 27, 1, 1; 28, 2, 2; 28, 2, 2; 27, 1, 1; 27, 1, 1; 27, 2, 1; 27, 2, 0; 27, 2, 0;
23, 2, 0; 23, 2, 0; 24, 1, 1; 25, 1, 1; 26, 2, 2; 25, 1, 1; 25, 2, 0; 24, 1, 0; 25, 2, 0; 26, 3, 1;
25, 1, 1; 25, 1, 1; 24, 0, 0; 25, 1, 1; 25, 2, 0; 25, 2, 0; 25, 2, 0; 26, 3, 1; 25, 2, 0; 25, 2, 0;
25, 1, 2; 26, 1, 2; 25, 1, 1; 24, 1, 0; 24, 2, 0; 24, 2, 0; 24, 2, 0; 24, 2, 0; 25, 3, 0; 26, 5, 0;
24, 0, 0; 25, 1, 2; 23, 1, 0; 23, 2, 0; 23, 2, 0; 23, 2, 0; 23, 2, 0; 24, 4, 2; 24, 4, 0; 24, 4, 0;
24, 3, 1; 22, 1, 0; 24, 3, 1; 23, 2, 1; 22, 1, 0; 23, 2, 0; 23, 3, 0; 24, 4, 0; 22, 2, 0; 25, 5, 1;
25, 3, 2; 23, 2, 1; 26, 5, 1; 26, 6, 2; 25, 4, 2; 28, 7, 5; 24, 3, 1; 29, 8, 6; 27, 6, 4; 28, 7, 4;
// c++
26, 1, 0, 26, 1, 0, 25, 1, 0, 25, 2, 0, 24, 1, 0, 24, 1, 0, 24, 1, 0, 25, 1, 0, 26, 2, 1, 27, 2, 1,
26, 0, 0, 27, 0, 3, 26, 2, 2, 25, 1, 0, 25, 1, 0, 24, 1, 0, 25, 1, 0, 27, 2, 0, 26, 0, 0, 26, 0, 0,
27, 0, 3, 26, 2, 4, 26, 0, 0, 26, 0, 0, 28, 2, 2, 27, 1, 1, 28, 2, 2, 27, 1, 1, 28, 2, 2, 28, 2, 2,
24, 0, 0, 25, 1, 1, 27, 1, 1, 29, 3, 3, 28, 2, 2, 27, 1, 1, 27, 1, 1, 27, 2, 0, 28, 3, 1, 28, 3, 1,
23, 2, 0, 23, 2, 0, 25, 1, 1, 26, 2, 2, 26, 2, 2, 25, 1, 1, 25, 2, 0, 25, 2, 0, 25, 2, 0, 26, 3, 1,
25, 1, 1, 25, 1, 1, 24, 0, 0, 25, 2, 1, 25, 2, 0, 25, 2, 0, 25, 2, 0, 26, 3, 1, 25, 2, 0, 25, 3, 0,
25, 1, 2, 26, 0, 2, 25, 1, 1, 24, 2, 0, 24, 2, 0, 24, 2, 0, 24, 2, 0, 24, 2, 0, 25, 4, 0, 26, 5, 0,
24, 1, 0, 24, 1, 2, 23, 2, 0, 23, 2, 0, 23, 2, 0, 23, 2, 0, 23, 2, 0, 24, 4, 1, 24, 4, 0, 24, 4, 0,
24, 3, 1, 22, 1, 0, 25, 4, 2, 23, 2, 0, 22, 1, 0, 23, 2, 0, 23, 3, 0, 24, 4, 0, 22, 2, 0, 26, 6, 1,
24, 3, 2, 23, 2, 1, 26, 6, 1, 26, 5, 2, 25, 4, 2, 29, 8, 6, 24, 3, 1, 30, 9, 7, 28, 7, 5, 29, 8, 6,

The values still do not match.

According to the Zhihu article “一篇文章为你讲透双线性插值” (an explainer on bilinear interpolation), making the geometric centers of the source and destination images coincide corresponds to the following mapping.
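That is, a destination pixel (dst_x, dst_y) maps back to the source as (the standard center-aligned form):

src_x = (dst_x + 0.5) * (src_width / dst_width) - 0.5
src_y = (dst_y + 0.5) * (src_height / dst_height) - 0.5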

Accordingly, modify the bilinear interpolation in the C++ code:

// preprocess.cu
__global__ void warpaffine_kernel(~){
    ...
    // float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;
    // float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;
    // map a point in the destination image back to the source image
    float src_x = m_x1 * (dx + 0.5f) + m_y1 * (dy + 0.5f) + m_z1 - 0.5f;
    float src_y = m_x2 * (dx + 0.5f) + m_y2 * (dy + 0.5f) + m_z2 - 0.5f;
    ...
}

Compare the preprocessed output again:

# torch
25, 1, 0; 25, 1, 0; 24, 1, 0; 25, 2, 0; 25, 2, 0; 24, 1, 0; 24, 1, 0; 25, 1, 0; 25, 2, 0; 26, 2, 1;
26, 0, 0; 27, 0, 3; 26, 1, 2; 25, 2, 0; 24, 1, 0; 24, 1, 0; 24, 1, 0; 27, 2, 0; 26, 0, 0; 26, 0, 0;
27, 0, 3; 26, 2, 4; 26, 0, 1; 26, 0, 0; 28, 2, 2; 27, 1, 1; 27, 1, 1; 27, 1, 1; 28, 2, 2; 28, 2, 2;
24, 0, 0; 25, 1, 1; 27, 1, 1; 28, 2, 2; 28, 2, 2; 27, 1, 1; 27, 1, 1; 27, 2, 1; 27, 2, 0; 27, 2, 0;
23, 2, 0; 23, 2, 0; 24, 1, 1; 25, 1, 1; 26, 2, 2; 25, 1, 1; 25, 2, 0; 24, 1, 0; 25, 2, 0; 26, 3, 1;
25, 1, 1; 25, 1, 1; 24, 0, 0; 25, 1, 1; 25, 2, 0; 25, 2, 0; 25, 2, 0; 26, 3, 1; 25, 2, 0; 25, 2, 0;
25, 1, 2; 26, 1, 2; 25, 1, 1; 24, 1, 0; 24, 2, 0; 24, 2, 0; 24, 2, 0; 24, 2, 0; 25, 3, 0; 26, 5, 0;
24, 0, 0; 25, 1, 2; 23, 1, 0; 23, 2, 0; 23, 2, 0; 23, 2, 0; 23, 2, 0; 24, 4, 2; 24, 4, 0; 24, 4, 0;
24, 3, 1; 22, 1, 0; 24, 3, 1; 23, 2, 1; 22, 1, 0; 23, 2, 0; 23, 3, 0; 24, 4, 0; 22, 2, 0; 25, 5, 1;
25, 3, 2; 23, 2, 1; 26, 5, 1; 26, 6, 2; 25, 4, 2; 28, 7, 5; 24, 3, 1; 29, 8, 6; 27, 6, 4; 28, 7, 4;
// c++
25, 1, 0, 25, 1, 0, 25, 1, 0, 25, 2, 0, 25, 2, 0, 24, 1, 0, 24, 1, 0, 25, 1, 0, 26, 2, 0, 26, 2, 1,
26, 0, 0, 27, 0, 3, 26, 1, 2, 25, 2, 0, 24, 1, 0, 24, 1, 0, 25, 1, 0, 27, 2, 0, 26, 0, 0, 26, 0, 0,
27, 0, 3, 26, 2, 4, 26, 0, 1, 26, 0, 0, 28, 2, 2, 27, 1, 1, 27, 1, 1, 27, 1, 1, 28, 2, 2, 28, 2, 2,
24, 0, 0, 25, 1, 1, 27, 1, 1, 28, 2, 2, 28, 2, 2, 27, 1, 1, 27, 1, 1, 27, 2, 1, 27, 2, 0, 27, 2, 0,
23, 2, 0, 23, 2, 0, 24, 1, 1, 25, 1, 1, 26, 2, 2, 25, 1, 1, 25, 2, 0, 24, 1, 0, 25, 2, 0, 26, 3, 1,
25, 1, 1, 25, 1, 1, 24, 0, 0, 25, 1, 1, 25, 2, 0, 25, 2, 0, 25, 2, 0, 26, 3, 1, 25, 2, 0, 25, 2, 0,
25, 1, 2, 26, 1, 2, 25, 1, 1, 24, 2, 0, 24, 2, 0, 24, 2, 0, 24, 2, 0, 24, 2, 0, 25, 3, 1, 26, 5, 0,
24, 1, 0, 25, 1, 3, 23, 2, 0, 23, 2, 0, 23, 2, 0, 23, 2, 0, 23, 2, 0, 25, 4, 2, 24, 4, 0, 24, 4, 0,
24, 3, 1, 22, 1, 0, 24, 3, 1, 23, 2, 1, 22, 1, 0, 23, 2, 0, 23, 3, 0, 24, 4, 0, 23, 3, 0, 25, 5, 1,
25, 4, 2, 23, 2, 1, 26, 5, 1, 26, 6, 2, 25, 4, 2, 28, 7, 5, 24, 3, 2, 29, 8, 6, 27, 6, 4, 28, 7, 5,

The results are now essentially identical, with only slight residual differences; the image preprocessing can be considered aligned.

2. Network structure

Comparing the torch and C++ network implementations turns up no structural discrepancies. The one thing to check is the BN layer, which uses the default parameters in torch:

# models/common.py
self.bn = nn.BatchNorm2d(c2)

where eps is 1e-5.

The C++ BN layer uses eps = 1e-3:

// common.hpp
IScaleLayer* bn = addBatchNorm2d(network, weightMap, *cat->getOutput(0), lname + ".bn", 1e-3);

Make the corresponding change.
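That is, a one-line sketch of the fix (the only change is the eps argument):

// common.hpp
IScaleLayer* bn = addBatchNorm2d(network, weightMap, *cat->getOutput(0), lname + ".bn", 1e-5);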

3. Post-processing of network output

torch:

# utils/general.py
def non_max_suppression(prediction, conf_thres=0.25, iou_thres=0.45, classes=None, agnostic=False, multi_label=False, labels=(), max_det=300):
    """Runs Non-Maximum Suppression (NMS) on inference results
    Returns:
        list of detections, on (n,6) tensor per image [xyxy, conf, cls]
    """
    nc = prediction.shape[2] - 5  # number of classes
    xc = prediction[..., 4] > conf_thres  # obj_conf > conf_thres

    # Checks
    assert 0 <= conf_thres <= 1, f'Invalid Confidence threshold {conf_thres}, valid values are between 0.0 and 1.0'
    assert 0 <= iou_thres <= 1, f'Invalid IoU {iou_thres}, valid values are between 0.0 and 1.0'

    # Settings
    min_wh, max_wh = 2, 7680  # (pixels) minimum and maximum box width and height
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()
    time_limit = 10.0  # seconds to quit after
    redundant = True  # require redundant detections
    multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)
    merge = False  # use merge-NMS

    t = time.time()
    output = [torch.zeros((0, 6), device=prediction.device)] * prediction.shape[0]
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x[xc[xi]]  # confidence

        # Cat apriori labels if autolabelling
        if labels and len(labels[xi]):
            lb = labels[xi]
            v = torch.zeros((len(lb), nc + 5), device=x.device)
            v[:, :4] = lb[:, 1:5]  # box
            v[:, 4] = 1.0  # conf
            v[range(len(lb)), lb[:, 0].long() + 5] = 1.0  # cls
            x = torch.cat((x, v), 0)

        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box (center x, center y, width, height) to (x1, y1, x2, y2)
        box = xywh2xyxy(x[:, :4])

        # Detections matrix nx6 (xyxy, conf, cls)
        if multi_label:
            i, j = (x[:, 5:] > conf_thres).nonzero(as_tuple=False).T
            x = torch.cat((box[i], x[i, j + 5, None], j[:, None].float()), 1)
        else:  # best class only
            conf, j = x[:, 5:].max(1, keepdim=True)
            x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]  # conf > conf_thres

        # Filter by class
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]

        # Apply finite constraint
        # if not torch.isfinite(x).all():
        #     x = x[torch.isfinite(x).all(1)]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_nms:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence
        ...

So in torch at most max_nms = 30000 boxes enter NMS, each box must have obj_conf greater than conf_thres = 0.25, and the total conf (= obj_conf * cls_conf) must also be greater than conf_thres.
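For example, a box with obj_conf = 0.4 and a best class probability of 0.5 has conf = 0.2: torch drops it (0.2 < 0.25), while the original C++ filter below keeps it, because it only compares obj_conf = 0.4 against the threshold.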

c++:

// yololayer.cu
__global__ void CalDetection(~){
    ...
    for (int k = 0; k < CHECK_COUNT; ++k) {
        ...
        if (box_prob < IGNORE_THRESH) continue;
        ...
        int count = (int)atomicAdd(res_count, 1);
        if (count >= maxoutobject) return;
        ...
    }
    ...
}

It only checks obj_conf and never checks the total conf, so add the check:

// yololayer.cu
__global__ void CalDetection(~){
    ...
    float max_cls_prob = 0.0;
    for ...
    if (box_prob * max_cls_prob < IGNORE_THRESH) continue;  // conf < thres
    ...
}
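The snippet above is abbreviated; a fuller sketch of the intended change (variable names approximate the tensorrtx kernel and may not match your checkout exactly) is:

// yololayer.cu -- inside CalDetection(~), for each anchor k
float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);  // obj_conf
int class_id = 0;
float max_cls_prob = 0.0;
for (int i = 5; i < info_len_i; ++i) {  // find the best class probability
    float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);
    if (p > max_cls_prob) { max_cls_prob = p; class_id = i - 5; }
}
if (box_prob * max_cls_prob < IGNORE_THRESH) continue;  // filter on the total conf, as torch does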

4. nms post-processing

torch:

# utils/general.py
def non_max_suppression(prediction, conf_thres=0.25, iou_thres=0.45, classes=None, agnostic=False, multi_label=False, labels=(), max_det=300):
    ...
    # Batched NMS
    c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
    boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
    if i.shape[0] > max_det:  # limit detections
        i = i[:max_det]
    if merge and (1 < n < 3E3):  # Merge NMS (boxes merged using weighted mean)
        # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)
        iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix
        weights = iou * scores[None]  # box weights
        x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(1, keepdim=True)  # merged boxes
        if redundant:
            i = i[iou.sum(1) > 1]  # require redundancy
    output[xi] = x[i]
    ...

After NMS, if more than max_det = 1000 boxes remain (detect.py's default), only the 1000 highest-confidence boxes are kept.

In C++, add a corresponding limit on the number of outputs:

// common.hpp
void nms(~){
    ...
    for (auto it = m.begin(); it != m.end(); it++) {
        ...
        // keep only the top MAX_OUTPUT_BBOX_COUNT results by conf
        std::sort(res.begin(), res.end(), cmp);
        if (res.size() > Yolo::MAX_OUTPUT_BBOX_COUNT) {
            res.erase(res.begin() + Yolo::MAX_OUTPUT_BBOX_COUNT, res.end());
        }
    }
}

5. Coordinate conversion post-processing

torch:

# utils/general.py
def scale_coords(img1_shape, coords, img0_shape, ratio_pad=None):
    # Rescale coords (xyxy) from img1_shape to img0_shape
    if ratio_pad is None:  # calculate from img0_shape
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])  # gain = old / new
        pad = (img1_shape[1] - img0_shape[1] * gain) / 2, (img1_shape[0] - img0_shape[0] * gain) / 2  # wh padding
    else:
        gain = ratio_pad[0][0]
        pad = ratio_pad[1]
    coords[:, [0, 2]] -= pad[0]  # x padding
    coords[:, [1, 3]] -= pad[1]  # y padding
    coords[:, :4] /= gain
    clip_coords(coords, img0_shape)
    return coords


def clip_coords(boxes, shape):
    # Clip bounding xyxy bounding boxes to image shape (height, width)
    if isinstance(boxes, torch.Tensor):  # faster individually
        boxes[:, 0].clamp_(0, shape[1])  # x1
        boxes[:, 1].clamp_(0, shape[0])  # y1
        boxes[:, 2].clamp_(0, shape[1])  # x2
        boxes[:, 3].clamp_(0, shape[0])  # y2
    else:  # np.array (faster grouped)
        boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, shape[1])  # x1, x2
        boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, shape[0])  # y1, y2

c++:

// common.hpp
cv::Rect get_rect(cv::Mat& img, float bbox[4]) {
    float l, r, t, b;
    float r_w = Yolo::INPUT_W / (img.cols * 1.0);
    float r_h = Yolo::INPUT_H / (img.rows * 1.0);
    if (r_h > r_w) {
        l = bbox[0] - bbox[2] / 2.f;
        r = bbox[0] + bbox[2] / 2.f;
        t = bbox[1] - bbox[3] / 2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;
        b = bbox[1] + bbox[3] / 2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;
        l = l / r_w;
        r = r / r_w;
        t = t / r_w;
        b = b / r_w;
    } else {
        l = bbox[0] - bbox[2] / 2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;
        r = bbox[0] + bbox[2] / 2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;
        t = bbox[1] - bbox[3] / 2.f;
        b = bbox[1] + bbox[3] / 2.f;
        l = l / r_h;
        r = r / r_h;
        t = t / r_h;
        b = b / r_h;
    }
    return cv::Rect(round(l), round(t), round(r - l), round(b - t));
}

The conversion method differs slightly, and the coordinates are never checked against the image bounds.
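For example, the earlier C++ output contained a box starting at x = -50; torch's clip_coords clamps such values to 0, which accounts for part of the coordinate differences.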

After modification:

// common.hpp
float clip_coords(float x, int xmin, int xmax) {
    if (x < xmin) {
        x = xmin;
    }
    if (x > xmax) {
        x = xmax;
    }
    return x;
}

// mirrors yolov5/utils/general.py xywh2xyxy(~) and scale_coords(~)
cv::Rect get_rect(cv::Mat& img, float bbox[4]) {
    // xc,yc,w,h --> xmin,ymin,xmax,ymax
    float l, r, t, b;
    l = bbox[0] - bbox[2] / 2.f;
    r = bbox[0] + bbox[2] / 2.f;
    t = bbox[1] - bbox[3] / 2.f;
    b = bbox[1] + bbox[3] / 2.f;
    // Rescale coords (xyxy) from dst shape (640x640) to src shape
    float pad[2];
    float gain = std::min((float)Yolo::INPUT_W / (float)img.cols, (float)Yolo::INPUT_H / (float)img.rows);
    pad[0] = (Yolo::INPUT_W - img.cols * gain) / 2;
    pad[1] = (Yolo::INPUT_H - img.rows * gain) / 2;
    l = (l - pad[0]) / gain;  // x padding
    r = (r - pad[0]) / gain;
    t = (t - pad[1]) / gain;  // y padding
    b = (b - pad[1]) / gain;
    // clip out-of-bounds coordinates
    l = clip_coords(l, 0, img.cols);
    r = clip_coords(r, 0, img.cols);
    t = clip_coords(t, 0, img.rows);
    b = clip_coords(b, 0, img.rows);
    // xmin,ymin,xmax,ymax --> xmin, ymin, w, h
    return cv::Rect(round(l), round(t), round(r - l), round(b - t));
}

6. Comparing the results

c++:
45 0.973848 0 126 810 797
0 0.931803 50 408 215 875
55 0.923260 0 254 33 328
0 0.922524 215 412 323 863
0 0.917015 677 485 810 871
0 0.883489 0 622 64 868
3 0.594060 119 768 156 816
torch:
3 0.253175 120 767 158 815
0 0.859743 677 484 810 872
0 0.863529 1 620 63 868
55 0.906701 0 254 33 328
0 0.907883 214 412 321 861
0 0.922536 52 408 214 876
45 0.962665 0 106 810 813

The number of predicted objects now matches and the coordinate values correspond closely; the confidences still differ somewhat, with the C++ values generally higher than torch's.


Source: blog.csdn.net/gdxb666/article/details/130816452