Foreword
https://github.com/wang-xinyu/pytorchx
https://github.com/wang-xinyu/tensorrtx
This article uses two simple examples, lenet and mlp, to learn TensorRT's C++ API.
The code listings mostly skip the assert checks and the destroy calls that release resources, and keep only the core parts.
1. pytorchx
pytorchx contains two files, model.py and inference.py. model.py builds the PyTorch model and saves its weights to model.pth; inference.py converts model.pth to model.wts.
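model.wts is not complicated: it stores the model weights as text so that they can be read from C++. The first line is the number of tensors; each following line holds a tensor name, its element count, and the values as hexadecimal float32 bit patterns: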
# mlp
2
linear.weight 1 3fffdecf
linear.bias 1 3b16e580
# lenet
10
conv1.weight 150 be40ee1b bd20baba ...
conv1.bias 6 bd32705a 3e2182a8 ...
conv2.weight 2400 3c6f2224 3c69308f ...
conv2.bias 16 bd183967 bcb1ac89 ...
fc1.weight 48000 3c162c20 bd25196a ...
fc1.bias 120 3d3c3d4a bc64b947 ...
fc2.weight 10080 bce095a3 3d33b9dc ...
fc2.bias 84 bc71eaa0 3d9b276d ...
fc3.weight 840 3c25286d 3d855351 ...
fc3.bias 10 bdbe4bb8 3b119ed1 ...
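To make the format concrete, here is a minimal C++ sketch that parses this text layout. It is not the repository's loadWeights; parseWts is a hypothetical helper, and it assumes the file contains no comment lines (the "# mlp" and "# lenet" labels above are taken to be annotations of this article) and that each hex word is an IEEE-754 float32 bit pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not the repository's loadWeights): parses the text
// layout shown above into a map from tensor name to float values.
std::map<std::string, std::vector<float>> parseWts(std::istream& in) {
    std::map<std::string, std::vector<float>> weights;
    int count = 0;
    in >> count;  // first line: number of tensors
    assert(in && count >= 0);
    for (int i = 0; i < count; ++i) {
        std::string name;
        std::size_t size = 0;
        in >> name >> std::dec >> size;  // tensor name and element count
        std::vector<float> values(size);
        for (std::size_t j = 0; j < size; ++j) {
            std::uint32_t bits = 0;
            in >> std::hex >> bits;  // one hex word per float32 value
            // Reinterpret the 32-bit pattern as a float without UB.
            std::memcpy(&values[j], &bits, sizeof(float));
        }
        assert(in);  // fails if the file is shorter than announced
        weights[name] = values;
    }
    return weights;
}
```

Feeding the two mlp lines through parseWts via a std::istringstream gives a weight close to 2 and a bias close to 0, which suggests the mlp was trained to approximate y = 2x.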
That is all the Python side does. That code is simple, so it is not analyzed here; the more painful C++ part follows.
2. tensorrtx
2.1 cmake
We will not study the build process in depth for now. The CMakeLists.txt files of the models are all similar, and the real substance is in each model.cpp. Below we look at its two parts: ./model -s, which generates the engine model.engine, and ./model -d, which runs inference.
2.2 Generating Engine
2.2.1 Main process
The main process is the same for both models: create modelStream → build the model with APIToModel → write model.engine
IHostMemory* modelStream{nullptr};
APIToModel(1, &modelStream);
std::ofstream p("../model.engine", std::ios::binary);
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
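The listing above writes the serialized bytes without checking that the file opened or that the write succeeded. As a hedged illustration of the same file I/O using only the standard library (no TensorRT), the sketch below writes a byte buffer and reads it back. writeBlob and readBlob are hypothetical names, not part of the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: write `size` bytes to `path`; false on any failure.
bool writeBlob(const std::string& path, const char* data, std::size_t size) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;  // could not open the file
    out.write(data, static_cast<std::streamsize>(size));
    out.flush();             // surface write errors before returning
    return out.good();
}

// Hypothetical helper: read a whole binary file into memory.
// Returns an empty vector if the file cannot be opened.
std::vector<char> readBlob(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // start at end
    if (!in) return {};
    const std::streamsize size = in.tellg();  // position at end == file size
    if (size < 0) return {};
    std::vector<char> bytes(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(bytes.data(), size);
    return bytes;
}
```

With such helpers, the engine write would become a call like writeBlob("../model.engine", ptr, size) whose result can be checked, and the inference side could load the file into a std::vector<char> instead of a raw new[] buffer.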
2.2.2 APIToModel
The process is again the same for both models: take the logger (gLogger) from logging.h, create the builder and its config, and build the model. The author describes this header as "A logger file for using NVIDIA TRT API (mostly same for all models)", so it can be used as-is and is skipped for now. The core lies in createModelEngine.
void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream){
// Create builder with the help of logger
IBuilder *builder = createInferBuilder(gLogger);
// Create hardware configs
IBuilderConfig *config = builder->createBuilderConfig();
// Build an engine
ICudaEngine* engine = createModelEngine(maxBatchSize, builder, config, DataType::kFLOAT);
assert(engine != nullptr);
// serialize the engine into binary stream
(*modelStream) = engine->serialize();
// free up the memory
engine->destroy();
builder->destroy();
}
2.2.3 createModelEngine
(1)
Apart from the network construction, the two functions differ only in the workspace size. The official blog describes setMaxWorkspaceSize as, roughly, setting how much GPU memory may be allocated; the unit is bytes, so 1ULL << 30 is 1 GB.
"The two functions take the same inputs"
ICudaEngine* createMLPEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt)
ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)
"Load the weights; the wts format is the same for both (presumably generic); the result is a map; reading details are skipped"
std::map<std::string, Weights> weightMap = loadWeights("../mlp.wts");
std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
"Create an empty network"
INetworkDefinition* network = builder->createNetworkV2(0U);
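INetworkDefinition* network = builder->createNetworkV2(0U);
"Create the input; name and type are the same; lenet's dimensions are 1, INPUT_H, INPUT_W (1, 32, 32 according to the structure reviewed later)"
"Compared with the PyTorch input there is no batch dimension here, but setMaxBatchSize is called later"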
ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{1, 1, 1});
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
"Build the network (explained separately below)"
"Set configurations"
builder->setMaxBatchSize(1);
builder->setMaxBatchSize(maxBatchSize);
"Set workspace size"
config->setMaxWorkspaceSize(1 << 20);
config->setMaxWorkspaceSize(16 << 20);
"Build CUDA Engine using network and configurations"
ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
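The shift expressions passed to setMaxWorkspaceSize in the listing above are byte counts. The following sketch spells out the arithmetic; the constant names are made up for illustration, and the per-model values are simply the two literals from the listing.

```cpp
#include <cassert>
#include <cstddef>

// Shift-based sizes; every value here is a number of bytes.
constexpr std::size_t kMiB = std::size_t{1} << 20;  // 1,048,576
constexpr std::size_t kGiB = std::size_t{1} << 30;  // 1,073,741,824

// The two literals from the listing (names are illustrative only).
constexpr std::size_t kMlpWorkspace = std::size_t{1} << 20;     // mlp
constexpr std::size_t kLenetWorkspace = std::size_t{16} << 20;  // lenet

static_assert(kMlpWorkspace == 1 * kMiB, "1 << 20 is 1 MiB");
static_assert(kLenetWorkspace == 16 * kMiB, "16 << 20 is 16 MiB");
static_assert(kGiB == 1024 * kMiB, "1 << 30 is 1024 MiB");

// Whole MiB contained in a byte count, handy for log messages.
constexpr std::size_t toMiB(std::size_t bytes) { return bytes / kMiB; }
```

Writing the literal as 1ULL << 30 rather than 1 << 30 keeps the shift in a 64-bit unsigned type, which matters once the exponent reaches 31, where a 32-bit int would overflow.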
(2) Building the network
// mlp
IFullyConnectedLayer *fc1 = network->addFullyConnected(*data, 1, weightMap["linear.weight"], weightMap["linear.bias"]);
fc1->getOutput(0)->setName("out");
network->markOutput(*fc1->getOutput(0));
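For intuition, what the single fully connected layer above computes can be sketched in plain C++ without TensorRT. This is only an illustration of y = Wx + b; fullyConnected is a hypothetical function, and the weights are assumed to be stored row-major as [outputs x inputs], the layout PyTorch uses for nn.Linear.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ sketch of y = W x + b.
// w is row-major with shape [outputs x inputs]; b has one entry per output.
std::vector<float> fullyConnected(const std::vector<float>& x,
                                  const std::vector<float>& w,
                                  const std::vector<float>& b) {
    const std::size_t inputs = x.size();
    const std::size_t outputs = b.size();
    assert(w.size() == inputs * outputs);  // shapes must agree
    std::vector<float> y(outputs);
    for (std::size_t o = 0; o < outputs; ++o) {
        float acc = b[o];  // start from the bias
        for (std::size_t i = 0; i < inputs; ++i) {
            acc += w[o * inputs + i] * x[i];  // dot product with row o
        }
        y[o] = acc;
    }
    return y;
}
```

For the mlp there is one input and one output, so the whole network reduces to a single multiply and add.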
The mlp structure is trivial, so we focus on lenet. First, a review of lenet's structure:
[1,1,32,32] → Conv2d(1,6,5,1,0)+relu
→ [1,6,28,28] → AvgPool2d(2,2,0)
→ [1,6,14,14] → Conv2d(6,16,5,1,0)+relu
→ [1,16,10,10] → AvgPool2d(2,2,0)
→ [1,16,5,5] → [1,400] → Linear(400,120)+relu
→ [1,120] → Linear(120,84)+relu
→ [1,84] → Linear(84,10)+softmax
→ [1,10]
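The shapes in this chain can be checked with the usual output-size formula for convolution and pooling without dilation, floor((in + 2*pad - kernel) / stride) + 1. The sketch below is a compile-time check, with illustrative constant names, that also reproduces the element counts of the lenet entries in the .wts listing shown earlier.

```cpp
#include <cassert>

// Output size along one spatial axis for a convolution or pooling window
// without dilation: floor((in + 2*pad - kernel) / stride) + 1.
constexpr int outSize(int in, int kernel, int stride, int pad) {
    return (in + 2 * pad - kernel) / stride + 1;
}

// Spatial sizes along the lenet chain, starting from a 32x32 input.
constexpr int kConv1 = outSize(32, 5, 1, 0);      // Conv2d(1,6,5,1,0)
constexpr int kPool1 = outSize(kConv1, 2, 2, 0);  // AvgPool2d(2,2,0)
constexpr int kConv2 = outSize(kPool1, 5, 1, 0);  // Conv2d(6,16,5,1,0)
constexpr int kPool2 = outSize(kConv2, 2, 2, 0);  // AvgPool2d(2,2,0)
constexpr int kFlat = 16 * kPool2 * kPool2;       // flattened features

static_assert(kConv1 == 28 && kPool1 == 14, "first conv/pool stage");
static_assert(kConv2 == 10 && kPool2 == 5, "second conv/pool stage");
static_assert(kFlat == 400, "input width of the first Linear layer");

// Element counts: out_channels * in_channels * kH * kW, or out * in.
constexpr int kConv1Weights = 6 * 1 * 5 * 5;
constexpr int kConv2Weights = 16 * 6 * 5 * 5;
constexpr int kFc1Weights = 120 * kFlat;
constexpr int kFc2Weights = 84 * 120;
constexpr int kFc3Weights = 10 * 84;

static_assert(kConv1Weights == 150 && kConv2Weights == 2400, "conv weights");
static_assert(kFc1Weights == 48000 && kFc2Weights == 10080, "fc1/fc2 weights");
static_assert(kFc3Weights == 840, "fc3 weights");
```

That fc1.weight holds 48000 = 120 × 400 values is consistent with a 32×32 input: a 28×28 input would leave 16 × 4 × 4 = 256 features after the second pooling stage.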
Mapping these layers to their PyTorch counterparts shows how the TensorRT API is used:
Convolution: addConvolutionNd(input, number of output channels, kernel size, weights, bias)
Activation: addActivation(input, activation type)
Pooling: addPoolingNd(input, pooling type, kernel size)
Stride: setStrideNd sets the stride of a convolution or pooling layer
Fully connected: addFullyConnected(input, number of outputs, weights, bias)
Finally, we give the output of the last layer a name and let the network mark it as an output.
// lenet
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
conv1->setStrideNd(DimsHW{1, 1});
IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
pool1->setStrideNd(DimsHW{2, 2});
IConvolutionLayer* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
conv2->setStrideNd(DimsHW{1, 1});
IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
pool2->setStrideNd(DimsHW{2, 2});
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);
ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));
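The last layer in the listing, addSoftMax, turns the 10 outputs of fc3 into class probabilities. As a plain C++ illustration of that computation (not TensorRT's implementation), the sketch below uses the common max-subtraction trick for numerical stability; softmax and argmax are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtracting the maximum logit before exp()
// avoids overflow and does not change the result.
std::vector<float> softmax(const std::vector<float>& logits) {
    assert(!logits.empty());
    const float maxLogit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - maxLogit);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;  // normalize so the entries sum to 1
    return probs;
}

// Index of the largest entry, i.e. the predicted class.
std::size_t argmax(const std::vector<float>& values) {
    assert(!values.empty());
    return static_cast<std::size_t>(
        std::max_element(values.begin(), values.end()) - values.begin());
}
```

On the host side, applying argmax to the 10-element output array gives the predicted digit; softmax does not change which index is largest.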
2.3 Running inference
2.3.1 Main process
Everything in the inference process except doInference is plain TensorRT API usage, and only the ready-made logging.h and model.engine are needed.
// buffer for the serialized engine that is read back from disk
char *trtModelStream{nullptr};
size_t size{0};
// read model from the engine file
std::ifstream file("../model.engine", std::ios::binary);
if (file.good()) {
file.seekg(0, file.end);
size = file.tellg();
file.seekg(0, file.beg);
trtModelStream = new char[size];
assert(trtModelStream);
file.read(trtModelStream, size);
file.close();
}
// create a runtime (required for deserialization of model) with NVIDIA's logger
IRuntime *runtime = createInferRuntime(gLogger);
// deserialize the engine from the char stream
ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
// create execution context -- required for inference executions
IExecutionContext *context = engine->createExecutionContext();
// create the input and output buffers
float data[INPUT_SIZE]; // mlp:1 lenet:H*W
float out[OUTPUT_SIZE]; // mlp:1 lenet:10
// time the execution
auto start = std::chrono::system_clock::now();
// do inference using the parameters
doInference(*context, data, out, 1);
// time the execution
auto end = std::chrono::system_clock::now();
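The listing above brackets doInference with two std::chrono::system_clock readings. A small sketch of how such a measurement can be turned into milliseconds follows. Note the substitution: it uses std::chrono::steady_clock rather than system_clock, because steady_clock is monotonic and is not affected by wall-clock adjustments. elapsedMs is a hypothetical helper.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run `fn` once and return the elapsed time in ms.
// steady_clock is used here (the article's listing uses system_clock).
template <typename Fn>
double elapsedMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    // duration<double, std::milli> converts the tick count to fractional ms.
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

A call such as elapsedMs([&] { doInference(*context, data, out, 1); }) would then time one inference. Since doInference synchronizes the stream before returning, the measurement covers the buffer allocation and copies as well as the GPU work.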
2.3.2 doInference
doInference is also the same for lenet and mlp. The inference itself is a single statement, context.enqueue(batchSize, buffers, stream, nullptr); the rest of the work is allocating the input and output buffers, creating a CUDA stream, and moving data between CPU and GPU.
The yolov5 code is the same in this respect: running inference is a single statement, and most of the work is preparing the data.
void doInference(IExecutionContext &context, float *input, float *output, int batchSize) {
// Get engine from the context
const ICudaEngine &engine = context.getEngine();
// Pointers to input and output device buffers to pass to engine.
// Engine requires exactly IEngine::getNbBindings() number of buffers.
assert(engine.getNbBindings() == 2);
void *buffers[2];
// In order to bind the buffers, we need to know the names of the input and output tensors.
// Note that indices are guaranteed to be less than IEngine::getNbBindings()
const int inputIndex = engine.getBindingIndex("data");
const int outputIndex = engine.getBindingIndex("out");
// Create GPU buffers on device -- allocate memory for input and output
cudaMalloc(&buffers[inputIndex], batchSize * INPUT_SIZE * sizeof(float));
cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float));
// create CUDA stream for simultaneous CUDA operations
cudaStream_t stream;
cudaStreamCreate(&stream);
// copy input from host (CPU) to device (GPU) in stream
cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_SIZE * sizeof(float), cudaMemcpyHostToDevice, stream);
// execute inference using context provided by engine
context.enqueue(batchSize, buffers, stream, nullptr);
// copy output back from device (GPU) to host (CPU)
cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream);
// synchronize the stream to prevent issues
// (block CUDA and wait for CUDA operations to be completed)
cudaStreamSynchronize(stream);
// Release stream and buffers (memory)
cudaStreamDestroy(stream);
cudaFree(buffers[inputIndex]);
cudaFree(buffers[outputIndex]);
}
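The byte counts passed to cudaMalloc and cudaMemcpyAsync above all follow the pattern batchSize * elements * sizeof(float). The sketch below works through that arithmetic on the host only, with no CUDA calls. bindingBytes is a hypothetical helper, and the lenet sizes assume a 32×32 input and 10 outputs, as in the structure reviewed earlier.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Bytes needed for one float32 binding: batch * elements per sample * 4.
constexpr std::size_t bindingBytes(std::size_t batchSize,
                                   std::size_t elementsPerSample) {
    return batchSize * elementsPerSample * sizeof(float);
}

// Assumed lenet sizes (32x32 single-channel input, 10 class scores).
constexpr std::size_t kLenetInputElems = 1 * 32 * 32;
constexpr std::size_t kLenetOutputElems = 10;

static_assert(bindingBytes(1, kLenetInputElems) == 4096, "lenet input");
static_assert(bindingBytes(1, kLenetOutputElems) == 40, "lenet output");
static_assert(bindingBytes(1, 1) == 4, "mlp input and output");
```

Computing each size once and reusing it for both the allocation and the copy keeps the two from drifting apart if INPUT_SIZE or the data type changes.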