[CV study notes] tensorrtx-yolov5 line-by-line code analysis

1. Introduction

TensorRTx (hereinafter trtx) is a very popular open-source library that builds network structures directly with the TensorRT API to achieve TensorRT acceleration. The author explains why he does not use the ONNX parser for acceleration and instead builds the network with the lowest-level API. The reasons are as follows:

  • Flexible: easy to modify any layer of the model, e.g. deleting, adding, or replacing layers
  • Debuggable: easy to inspect the output of any intermediate layer of the model
  • Chance to learn: building the network by hand deepens your understanding of the model structure

Although the onnx2trt route works without problems in most cases today, trtx lets us master the lower-level principles and code, which helps when deploying and optimizing models. Below, yolov5s under the trtx framework is used as an example to analyze, line by line, how trtx works.

TensorRTx project link: https://github.com/wang-xinyu/tensorrtx.

2. Step-by-step analysis

In trtx, accelerating a model is divided into two steps:

  • Extract the PyTorch model parameters into a .wts file
  • Build the network structure with TensorRT's low-level API and fill it with the parameters from the .wts file

2.1 get_wts.py

First, you need to extract the model parameters from PyTorch. They are stored much like blobs in Caffe: each operation has a corresponding name, a data length, and the data itself.

for k, v in model.state_dict().items():
    # k -> the blob's name
    vr = v.reshape(-1).cpu().numpy()  # vr -> the flattened data; len(vr) is the data length
    f.write('{} {} '.format(k, len(vr)))
    for vv in vr:
        f.write(' ')
        f.write(struct.pack('>f', float(vv)).hex())  # encode each value as big-endian hex
    f.write('\n')  # one blob per line

Running get_wts.py extracts the model parameters from the yolov5s checkpoint. Opening the resulting yolov5s.wts shows the following:

[Figure: the first lines of yolov5s.wts]

The 351 on the first line is the total number of blobs; model.0.conv.weight on the second line is the name of the first blob, 3456 is that blob's data length, and 3a198000 3ca58000 ... are the actual parameters.
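
To make the encoding concrete, here is a small standalone sketch (not part of trtx; decodeWtsValue is an illustrative name) of how one hex token from a .wts line decodes back into a float:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Decode one hex token from a .wts line (e.g. "3a198000") back into a float.
// get_wts.py wrote each value with struct.pack('>f', v).hex(), i.e. the raw
// big-endian IEEE-754 bits of the float as 8 hex characters.
float decodeWtsValue(const std::string& hexToken) {
    uint32_t bits = static_cast<uint32_t>(std::stoul(hexToken, nullptr, 16));
    float value;
    std::memcpy(&value, &bits, sizeof(value));  // reinterpret the bits as a float
    return value;
}

int main() {
    // "3a198000" is the first value of model.0.conv.weight in the figure above
    std::printf("%f\n", decodeWtsValue("3a198000"));
    return 0;
}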

After obtaining these parameters, we can proceed with acceleration the trtx way.

2.2 Constructing the engine

Before converting the .wts file into an engine, you need to know the model's network structure very well. If you are unsure, refer to the yolov5 network-structure diagram by Sunflower's Little Mung Bean. Once you understand the yolov5 structure, you can start building the network with the TensorRT API. The model-building code is in the build_det_engine function in model.cpp. This article maps the code directly onto the yolov5 network-structure diagram, so you can follow the code and the diagram side by side.
[Figure: yolov5s network-structure diagram annotated with the corresponding build_det_engine code]

// yolov5_det.cpp
void serialize_engine(...) {
    if (is_p6) {
        ...
    } else {
        // using yolov5s as the example
        engine = build_det_engine(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);
    }
    // serialize the engine
    IHostMemory* serialized_engine = engine->serialize();
    std::ofstream p(engine_name, std::ios::binary);
    // write the serialized engine to a file
    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());
}
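
At inference time the reverse happens: the serialized plan is read back from disk and deserialized. This is not part of the snippet above, but a minimal sketch using the standard TensorRT runtime API looks like this (load_engine is an illustrative name; the logger is whatever ILogger implementation the project defines):

#include <NvInfer.h>
#include <fstream>
#include <string>
#include <vector>

// Reverse of serialize_engine(): read the plan file back and deserialize it.
// Note: the YoloLayer plugin must already be registered (it is, via its
// registration macro) or deserializing the plugin layer will fail.
nvinfer1::ICudaEngine* load_engine(const std::string& engine_name, nvinfer1::ILogger& logger) {
    std::ifstream file(engine_name, std::ios::binary);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> blob(size);
    file.read(blob.data(), size);

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), size);
}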

model.cpp

// parses the .wts file written by get_wts.py
static std::map<std::string, Weights> loadWeights(const std::string file) {
    std::map<std::string, Weights> weightMap;
    std::ifstream input(file);

    int32_t count;  // first line of the .wts file: 351 blobs in total
    input >> count;
    // each following line is one blob: name + data length + values
    while (count--) {
        // the parameters of one blob
        Weights wt{DataType::kFLOAT, nullptr, 0};
        uint32_t size;     // the blob's data length
        std::string name;  // the blob's name
        input >> name >> std::dec >> size;

        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));
        for (uint32_t x = 0, y = size; x < y; ++x) {
            input >> std::hex >> val[x];  // read the hex-encoded values into val
        }
        wt.values = val;
        wt.count = size;
        // each blob name maps to one wt
        weightMap[name] = wt;
    }
    return weightMap;
}
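
One detail worth noting: each blob's values live in a heap buffer owned by the weightMap. TensorRT copies the weights into the engine while building, so once buildEngineWithConfig has returned, trtx can release them roughly like this:

// in build_det_engine, after buildEngineWithConfig() returns: the engine owns
// its own copy of the weights, so the buffers allocated in loadWeights can go
for (auto& mem : weightMap) {
    free((void*)(mem.second.values));
}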


ICudaEngine* build_det_engine(...) {
    // initialize the network definition
    INetworkDefinition* network = builder->createNetworkV2(0U);
    // define the model input
    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});
    // load the parameters extracted from the PyTorch model
    std::map<std::string, Weights> weightMap = loadWeights(wts_name);

    // add the network structure layer by layer; the code has been matched
    // one-to-one with the diagram above (a sketch of one building block
    // follows after this function)

    // add the yolo post-processing (decode) module, implemented as a plugin
    auto yolo = addYoLoLayer(network, weightMap, "model.24", std::vector<IConvolutionLayer*>{det0, det1, det2});
    network->markOutput(*yolo->getOutput(0));  // make the plugin's output (the decoded boxes) the model's final output

#if defined(USE_FP16)
    // FP16
    config->setFlag(BuilderFlag::kFP16);
#elif defined(USE_INT8)
    // INT8 quantization
    std::cout << "Your platform support int8: " << (builder->platformHasFastInt8() ? "true" : "false") << std::endl;
    assert(builder->platformHasFastInt8());
    config->setFlag(BuilderFlag::kINT8);
    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, "./coco_calib/", "int8calib.table", kInputTensorName);
    config->setInt8Calibrator(calibrator);
#endif
    // build the engine from the network definition
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    return engine;
}
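
To give a feel for the elided "layer by layer" step, here is a hedged sketch of the basic Conv + BatchNorm + SiLU unit that build_det_engine chains together. convBnSilu and the ".bn.shift"/".bn.scale" weight keys are names made up for this sketch; the real convBlock in model.cpp computes the BN scale/shift from the running statistics in the weightMap itself and also handles groups and padding rules:

#include <NvInfer.h>
#include <map>
#include <string>
using namespace nvinfer1;

// Illustrative sketch of one Conv + BN + SiLU building block.
static ILayer* convBnSilu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap,
                          ITensor& input, int outch, int ksize, int stride, const std::string& lname) {
    Weights emptywts{DataType::kFLOAT, nullptr, 0};
    // convolution weights come straight from the map parsed out of the .wts file
    IConvolutionLayer* conv = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize},
                                                        weightMap[lname + ".conv.weight"], emptywts);
    conv->setStrideNd(DimsHW{stride, stride});
    conv->setPaddingNd(DimsHW{ksize / 2, ksize / 2});

    // BatchNorm expressed as a per-channel scale layer (shift, scale, power)
    IScaleLayer* bn = network->addScale(*conv->getOutput(0), ScaleMode::kCHANNEL,
                                        weightMap[lname + ".bn.shift"], weightMap[lname + ".bn.scale"], emptywts);

    // SiLU(x) = x * sigmoid(x), built from two primitive layers
    IActivationLayer* sig = network->addActivation(*bn->getOutput(0), ActivationType::kSIGMOID);
    IElementWiseLayer* silu = network->addElementWise(*bn->getOutput(0), *sig->getOutput(0),
                                                      ElementWiseOperation::kPROD);
    return silu;
}
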
3. Plugin

I am still learning about plugins myself; the following is my basic understanding gained while reading the trtx-yolov5 code. The original author appends a decoding plugin after the model to obtain the bboxes from each feature layer. The calling code is in model.cpp.

auto yolo = addYoLoLayer(network, weightMap, "model.24", std::vector<IConvolutionLayer*>{det0, det1, det2});

static IPluginV2Layer* addYoLoLayer(...) {
    // look up the plugin creator registered as "YoloLayer_TRT"; this fails if the plugin cannot be found
    auto creator = getPluginRegistry()->getPluginCreator("YoloLayer_TRT", "1");

    // the plugin's data
    PluginField plugin_fields[2];
    int netinfo[5] = {kNumClass, kInputW, kInputH, kMaxNumOutputBbox, (int)is_segmentation};  // dimension info
    plugin_fields[0].data = netinfo;
    plugin_fields[0].length = 5;
    plugin_fields[0].name = "netinfo";
    plugin_fields[0].type = PluginFieldType::kFLOAT32;

    // the collection of all plugin fields (the second field, the anchor kernels, is elided here)
    PluginFieldCollection plugin_data;
    plugin_data.nbFields = 2;
    plugin_data.fields = plugin_fields;
    // create the plugin object
    IPluginV2 *plugin_obj = creator->createPlugin("yololayer", &plugin_data);
}
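
On the creator side, createPlugin unpacks this PluginFieldCollection and calls the plugin constructor shown in the next snippet. Here is a simplified sketch of what YoloPluginCreator does in yololayer.cu (details abridged and possibly differing from the real code):

#include <cassert>
#include <cstring>
#include <vector>

// Simplified sketch: unpack the two fields assembled in addYoLoLayer and
// construct the plugin. The registration macro at the end is what makes
// getPluginCreator("YoloLayer_TRT", "1") succeed.
IPluginV2* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) {
    assert(fc->nbFields == 2);
    const int* p_netinfo = reinterpret_cast<const int*>(fc->fields[0].data);  // "netinfo"
    int class_count = p_netinfo[0];
    int input_w = p_netinfo[1];
    int input_h = p_netinfo[2];
    int max_output_object_count = p_netinfo[3];
    bool is_segmentation = (bool)p_netinfo[4];
    // fields[1] carries the anchors for each detection scale as YoloKernel records
    std::vector<YoloKernel> kernels(fc->fields[1].length);
    std::memcpy(kernels.data(), fc->fields[1].data, kernels.size() * sizeof(YoloKernel));
    return new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, is_segmentation, kernels);
}

// registers the creator with TensorRT's global plugin registry at load time
REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);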

The plugin implementation is in yololayer.h / yololayer.cu:

class API YoloLayerPlugin : public IPluginV2IOExt {
    // the plugin name; the registry lookup in addYoLoLayer matches against it
    const char* getPluginType() const TRT_NOEXCEPT override {
        return "YoloLayer_TRT";
    }

    // plugin constructor
    YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, bool is_segmentation, const std::vector<YoloKernel>& vYoloKernel) {
        /*
            classCount:      number of classes
            netWidth:        network input width
            netHeight:       network input height
            maxOut:          maximum number of detections
            is_segmentation: whether instance segmentation is included
            vYoloKernel:     the anchor parameters
        */
    }
};

// code executed when the plugin runs
void YoloLayerPlugin::forwardGpu(...) {
    // output size per batch element; the leading "1 +" reserves the first
    // float for the decoded-box count (see the Detection sketch below)
    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);

    // zero the count slot of each batch element's output buffer
    for (int idx = 0; idx < batchSize; ++idx) {
        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));
    }

    // iterate over the three detection scales
    for (unsigned int i = 0; i < mYoloKernel.size(); ++i) {
        // launch the decode kernel
        CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>(...);
    }
}
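
forwardGpu's outputElem arithmetic implies the output layout: per batch element, one leading float holding the number of decoded boxes, followed by up to mMaxOutObject fixed-size records. A sketch of the record (the real Detection struct is defined in the plugin headers; the mask field only matters for the segmentation variant):

// Sketch of the per-box record written by the decode kernel. With this layout,
// outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float):
// one leading float for the box count, then up to mMaxOutObject records.
struct alignas(float) Detection {
    float bbox[4];   // center x, center y, width, height (input-image scale)
    float conf;      // objectness * best class probability
    float class_id;  // stored as float so the record is uniformly float-typed
    float mask[32];  // mask coefficients, used only by the segmentation variant
};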

__global__ void CalDetection(...) {
    // input:  the model's raw output for this scale
    // output: where the decoded boxes are written
    // global index of the current thread
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int total_grid = yoloWidth * yoloHeight;  // number of grid cells on this feature map
    int bnIdx = idx / total_grid;             // which batch element this thread belongs to
    idx = idx - total_grid * bnIdx;           // offset of this thread within its feature map
    // x, y, w, h, objectness score + 80 class scores
    int info_len_i = 5 + classes;
    // with instance segmentation, add the 32 mask coefficients
    if (is_segmentation) info_len_i += 32;

    // start address of batch element bnIdx's inference results
    const float* curInput = input + bnIdx * (info_len_i * total_grid * kNumAnchor);
    // iterate over the kNumAnchor anchors of this scale
    for (int k = 0; k < kNumAnchor; ++k) {
        // objectness confidence of this box
        float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);
        if (box_prob < kIgnoreThresh) continue;
        int class_id = 0;
        float max_cls_prob = 0.0f;
        for (int i = 5; i < 5 + classes; ++i) {
            // probability of each class
            float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);
            // keep the highest class probability and its class ID
            if (p > max_cls_prob) {
                max_cls_prob = p;
                class_id = i - 5;
            }
        }
        // the first float of this batch element's output buffer holds the box count
        float *res_count = output + bnIdx * outputElem;
        // atomically reserve a slot and count the decoded box
        int count = (int)atomicAdd(res_count, 1);
        // below, the predicted center/width/height are mapped back to the
        // input-image scale using the formulas from the paper
        ...
    }
}
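
The Logist() used in the kernel is simply the sigmoid function applied on the GPU; in yololayer.cu it is a small device function along these lines:

// sigmoid computed on the device; CalDetection applies it to the raw
// objectness and class scores before comparing against kIgnoreThresh
__device__ float Logist(float data) {
    return 1.0f / (1.0f + expf(-data));
}
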
4. Summary

Through this in-depth reading of the trtx open-source code, I learned how to accelerate a model with the TensorRT API and how the plugin is implemented. I will continue studying the other parts of trtx in the future.

Original post: https://blog.csdn.net/weixin_42108183/article/details/133212765