TensorRT learning (2): Pytorch model to TensorRT C++ process

Foreword

  https://github.com/wang-xinyu/pytorchx
  https://github.com/wang-xinyu/tensorrtx
  This article selects two simple examples of lenet and mlp to learn TensorRT's C++ API.
  The code snippets skip the assert checks and the destroy/memory-release parts, keeping only the core logic.

1. pytorchx

  There are two files in pytorchx: model.py and inference.py. model.py builds the pytorch model and saves the weights as model.pth; inference.py converts model.pth into model.wts.
  The model.wts file is not complicated; it simply stores the model weights in text form so that they can be read back in C++:

# mlp
2
linear.weight 1 3fffdecf
linear.bias 1 3b16e580

# lenet
10
conv1.weight 150 be40ee1b bd20baba ...
conv1.bias 6 bd32705a 3e2182a8 ...
conv2.weight 2400 3c6f2224 3c69308f ...
conv2.bias 16 bd183967 bcb1ac89 ...
fc1.weight 48000 3c162c20 bd25196a ...
fc1.bias 120 3d3c3d4a bc64b947 ...
fc2.weight 10080 bce095a3 3d33b9dc ...
fc2.bias 84 bc71eaa0 3d9b276d ...
fc3.weight 840 3c25286d 3d855351 ...
fc3.bias 10 bdbe4bb8 3b119ed1 ...
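
  To make the format concrete, here is a rough sketch of how such a file can be parsed into TensorRT Weights on the C++ side (the real loadWeights in tensorrtx follows the same idea, but details may differ): the first line gives the number of tensors, and each following line holds a name, an element count, and that many 32-bit floats written as hex.

#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include "NvInfer.h"

// Sketch of a .wts parser; the returned buffers stay alive until the engine is built.
std::map<std::string, nvinfer1::Weights> loadWeightsSketch(const std::string& path) {
    std::map<std::string, nvinfer1::Weights> weightMap;
    std::ifstream input(path);
    int32_t count;
    input >> count;                          // first line: number of weight tensors
    while (count--) {
        std::string name;
        uint32_t size;
        input >> name >> std::dec >> size;   // e.g. "conv1.weight 150 ..."
        uint32_t* values = new uint32_t[size];
        for (uint32_t i = 0; i < size; ++i) {
            input >> std::hex >> values[i];  // each value is the bit pattern of a float, in hex
        }
        weightMap[name] = nvinfer1::Weights{nvinfer1::DataType::kFLOAT, values, static_cast<int64_t>(size)};
    }
    return weightMap;
}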

  That is all the Python side does. The code is simple and not analyzed here; the more painful C++ part follows.

2. tensorrtx

2.1 cmake

  We will not dig into the cmake process for now; the CMakeLists.txt files are essentially the same across models, and the main content is the fancy model.cpp. It has two parts: ./model -s generates the engine file model.engine, and ./model -d runs inference.
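
  For orientation, here is a rough sketch (an assumption about the typical layout, not the full file) of how model.cpp dispatches on the command-line flag; the real code also builds/reads the engine and releases resources.

#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc == 2 && std::string(argv[1]) == "-s") {
        // build the network with the TensorRT API, serialize it, and write model.engine
    } else if (argc == 2 && std::string(argv[1]) == "-d") {
        // read model.engine, deserialize it, and run doInference
    } else {
        std::cerr << "usage: ./model -s   (serialize the model to model.engine)" << std::endl
                  << "       ./model -d   (deserialize the engine and run inference)" << std::endl;
        return -1;
    }
    return 0;
}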

2.2 Generating Engine

2.2.1 Main process

  The main flow of these two models is the same: create a modelStream → build the model with APIToModel → write model.engine.

IHostMemory* modelStream{nullptr};
APIToModel(1, &modelStream);
std::ofstream p("../model.engine", std::ios::binary);
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());

2.2.2 APIToModel

  The process is again the same for both: read the logger configuration from logging.h and create the model. The author describes this header as "A logger file for using NVIDIA TRT API (mostly same for all models)", so it can be used as-is and skipped for now. The core clearly lies in createModelEngine.

void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream){
    // Create builder with the help of logger
    IBuilder *builder = createInferBuilder(gLogger);
    // Create hardware configs
    IBuilderConfig *config = builder->createBuilderConfig();
    // Build an engine
    ICudaEngine* engine = createModelEngine(maxBatchSize, builder, config, DataType::kFLOAT);
    assert(engine != nullptr);
    // serialize the engine into binary stream
    (*modelStream) = engine->serialize();
    // free up the memory
    engine->destroy();
    builder->destroy();
}

2.2.3 createModelEngine

(1)
  Apart from the network construction, the only difference between the two models in this function is the workspace size. setMaxWorkspaceSize is described in the official blog; roughly, it sets how much GPU scratch memory the builder may allocate, and the unit is bytes, so 1ULL << 30 is 1 GB.
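
  As a quick check on the units, a small illustrative snippet (the values match the ones that appear in the code below):

// setMaxWorkspaceSize takes a size in bytes:
//   1ULL << 30 = 1073741824 bytes = 1 GiB   (the value discussed in the official blog)
//   16 << 20   =   16777216 bytes = 16 MiB  (used by lenet below)
//   1 << 20    =    1048576 bytes = 1 MiB   (used by mlp below)
config->setMaxWorkspaceSize(1ULL << 30);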

"两个函数的输入是相同的"
ICudaEngine* createMLPEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt)
ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)

"读取权重, wts文件的格式是相同的, 估计是通用的, 读进来是个字典, 跳过读取细节"
std::map<std::string, Weights> weightMap = loadWeights("../mlp.wts");
std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
"创建空模型"
INetworkDefinition* network = builder->createNetworkV2(0U);
INetworkDefinition* network = builder->createNetworkV2(0U);
"创建输入, name 和 type 是一样的, lenet的维度是 1, 28, 28"
"这里看起来比pytorch的输入少了一个batchsize的维度, 但在后面有 setMaxBatchSize"
ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{
    
    1, 1, 1});
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{
    
    1, INPUT_H, INPUT_W});
"构建网络, 后面单独说明"

"Set configurations"
builder->setMaxBatchSize(1);
builder->setMaxBatchSize(maxBatchSize);
"Set workspace size"
config->setMaxWorkspaceSize(1 << 20);
config->setMaxWorkspaceSize(16 << 20);
"Build CUDA Engine using network and configurations"
ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

(2) Build the network part

// mlp
IFullyConnectedLayer *fc1 = network->addFullyConnected(*data, 1, weightMap["linear.weight"], weightMap["linear.bias"]);
fc1->getOutput(0)->setName("out");
network->markOutput(*fc1->getOutput(0));

  The mlp structure is too simple, so let's look at lenet. First, recall lenet's structure:
  [1,1,32,32] → Conv2d(1,6,5,1,0)+relu → [1,6,28,28] → AvgPool2d(2,2,0) → [1,6,14,14] → Conv2d(6,16,5,1,0)+relu → [1,16,10,10] → AvgPool2d(2,2,0) → [1,16,5,5] → [1,400] → Linear(400,120)+relu → [1,120] → Linear(120,84)+relu → [1,84] → Linear(84,10)+softmax → [1,10]

  Matching these to the pytorch API, we can see how the TensorRT API is used:
Convolution: addConvolutionNd(input, number of output maps, kernel size, weights, bias)
Activation: addActivation(input, activation type)
Pooling: addPoolingNd(input, pooling type, kernel size)
setStrideNd: sets the stride of convolution and pooling layers
Fully connected: addFullyConnected(input, output size, weights, bias)

  Finally, we give the output of the last layer a name and let the network mark it as an output.

// lenet
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
conv1->setStrideNd(DimsHW{1, 1});
IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
pool1->setStrideNd(DimsHW{2, 2});

IConvolutionLayer* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
conv2->setStrideNd(DimsHW{1, 1});
IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
pool2->setStrideNd(DimsHW{2, 2});

IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);

ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));

2.3 Running inference

2.3.1 Main process

  Apart from doInference, the entire inference process just uses the TensorRT API, and as can be seen, only the ready-made logging.h and model.engine are needed.

// create a model using the API directly and serialize it to a stream
char *trtModelStream{nullptr};
size_t size{0};
// read model from the engine file
std::ifstream file("../model.engine", std::ios::binary);
if (file.good()) {
    file.seekg(0, file.end);
    size = file.tellg();
    file.seekg(0, file.beg);
    trtModelStream = new char[size];
    assert(trtModelStream);
    file.read(trtModelStream, size);
    file.close();
}
// create a runtime (required for deserialization of model) with NVIDIA's logger
IRuntime *runtime = createInferRuntime(gLogger);
// deserialize engine for using the char-stream
ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
// create execution context -- required for inference executions
IExecutionContext *context = engine->createExecutionContext();
// create the input and output buffers on the host
float data[INPUT_SIZE];		// mlp:1	lenet:H*W
float out[OUTPUT_SIZE];		// mlp:1	lenet:10

// time the execution
auto start = std::chrono::system_clock::now();
// do inference using the parameters
doInference(*context, data, out, 1);
// time the execution
auto end = std::chrono::system_clock::now();

2.3.2 doInference

  doInference is also the same for lenet and mlp. The actual inference is just one line, context.enqueue(batchSize, buffers, stream, nullptr); the rest of the work is allocating input/output buffers, creating a CUDA stream, and moving data between the CPU and GPU.
  Having also looked at the yolov5 code, running inference there is likewise just one line; the main work is preparing the data.

void doInference(IExecutionContext &context, float *input, float *output, int batchSize) {
    // Get engine from the context
    const ICudaEngine &engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex("data");
    const int outputIndex = engine.getBindingIndex("out");

    // Create GPU buffers on device -- allocate memory for input and output
    cudaMalloc(&buffers[inputIndex], batchSize * INPUT_SIZE * sizeof(float));
    cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float));

    // create CUDA stream for simultaneous CUDA operations
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // copy input from host (CPU) to device (GPU)  in stream
    cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_SIZE * sizeof(float), cudaMemcpyHostToDevice, stream);

    // execute inference using context provided by engine
    context.enqueue(batchSize, buffers, stream, nullptr);

    // copy output back from device (GPU) to host (CPU)
    cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,
                    stream);

    // synchronize the stream to prevent issues
    //      (block CUDA and wait for CUDA operations to be completed)
    cudaStreamSynchronize(stream);

    // Release stream and buffers (memory)
    cudaStreamDestroy(stream);
    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
}
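
  As a small usage note (a minimal sketch, not from the original repos): after doInference returns, out holds the network output on the host, so for lenet the predicted digit is simply the argmax of the 10 softmax scores.

#include <iostream>

// Pick the index of the largest score; for lenet, call as argmax(out, OUTPUT_SIZE).
static int argmax(const float* scores, int n) {
    int best = 0;
    for (int i = 1; i < n; ++i)
        if (scores[i] > scores[best]) best = i;
    return best;
}

// e.g. std::cout << "predicted digit: " << argmax(out, OUTPUT_SIZE) << std::endl;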
