TensorRT Development

For basic usage of TensorRT, refer to the TensorRT deployment section in Model Deployment.

TensorRT basics

  1. The core of TensorRT is operator-level optimization of the model (fusing operators, choosing kernel functions tuned to the specific GPU's features, and other strategies); through TensorRT, the best performance can be obtained on NVIDIA GPUs.
  2. To do that, TensorRT must select the optimal algorithms and configuration by actually running candidates on the target GPU.
  3. Consequently, an engine generated by TensorRT only runs under the conditions it was built with: the same TensorRT version, CUDA version, and GPU model as at build time (see the version-check sketch after this list).
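
Point 3 in particular means an engine file should be treated as GPU- and version-specific. Before deserializing a saved engine, it is worth checking that the TensorRT library loaded at runtime matches the headers the program was compiled against. A minimal sketch (getInferLibVersion() and NV_TENSORRT_VERSION are real TensorRT APIs; the policy of simply refusing on mismatch is my own illustration):

#include <NvInfer.h>         // defines NV_TENSORRT_VERSION at compile time
#include <NvInferRuntime.h>  // declares getInferLibVersion()
#include <stdio.h>

// Returns true when the libnvinfer loaded at runtime matches the
// headers this program was compiled against.
bool trt_version_matches() {
    int compiled = NV_TENSORRT_VERSION;   // e.g. 7103 for TensorRT 7.1.3
    int loaded   = getInferLibVersion();  // version of the loaded library
    if (compiled != loaded) {
        printf("TensorRT version mismatch: compiled against %d, loaded %d\n",
               compiled, loaded);
        return false;
    }
    return true;
}

Note that this only guards the TensorRT version; the CUDA version and the GPU model must also match the build environment.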

The main topics covered are model structure definition, build-time configuration, inference implementation, plugin implementation, and understanding ONNX.

The figure above shows TensorRT's optimization process. On the left is the unoptimized network graph. TensorRT recognizes that the three layers inside the large ellipse (convolution, bias, ReLU) always appear together, so it fuses them into the single CBR block shown in the optimized graph on the right. The three convolution branches in the middle of the left graph (3x3, 5x5, 1x1) are likewise each fused into a CBR block. These fusions cut the number of kernel launches and the amount of intermediate data movement, yielding higher performance.

Now let's develop a simple neural network: image -> fully connected layer -> sigmoid activation -> output prob.

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <cuda_runtime.h>
#include <stdio.h>

using namespace nvinfer1;
// Logger class: receives and prints TensorRT's build- and run-time messages
class TRTLogger: public ILogger {
public:
    virtual void log(Severity severity, const char *msg) noexcept override {
        // kVERBOSE is the least severe level, so this prints every message
        if (severity <= Severity::kVERBOSE) {
            printf("%d: %s\n", (int)severity, msg);
        }
    }
};

// Wrap a raw float array in a TensorRT Weights descriptor
Weights make_weights(float* ptr, int n) {
    Weights w;
    w.count = n;
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}

int main() {
    TRTLogger logger;
    // Builder: constructs and optimizes the network; bound to the logger
    IBuilder *builder = createInferBuilder(logger);
    // Build configuration: tells TensorRT how to optimize the model;
    // the generated engine only runs under the configuration it was built with
    IBuilderConfig *config = builder->createBuilderConfig();
    // Network definition, created by the builder (flag 1 = explicit batch mode)
    INetworkDefinition *network = builder->createNetworkV2(1);
    // The input image has 3 channels
    const int num_input = 3;
    // The output prob has 2 classes (binary classification)
    const int num_output = 2;
    // FC weights: the first 3 values are the RGB weights for output 0, the last 3 for output 1
    float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
    // Bias values, one per output
    float layer1_bias_values[] = {0.3, 0.8};
    // Add the network input: a 3-channel image with NCHW shape 1x3x1x1
    ITensor *input = network->addInput("image", nvinfer1::DataType::kFLOAT, Dims4(1, num_input, 1, 1));
    // Wrap the raw arrays as TensorRT weights and biases
    Weights layer1_weight = make_weights(layer1_weight_values, 6);
    Weights layer1_bias = make_weights(layer1_bias_values, 2);
    // Add a fully connected layer: 3-channel input -> 2-channel output
    auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
    // Add an activation layer on the FC output, with sigmoid activation
    auto prob = network->addActivation(*layer1->getOutput(0), ActivationType::kSIGMOID);
    // Mark prob as the network output
    network->markOutput(*prob->getOutput(0));

    printf("workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
    config->setMaxWorkspaceSize(1 << 28); //256M
    builder->setMaxBatchSize(1);  //推理的batchsize为1
    //推理引擎,由创建者创建
    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
    if (engine == nullptr) {
        printf("Build engine failed.\n");
        return -1;
    }
    // Serialize the engine and write it to a file
    IHostMemory *model_data = engine->serialize();
    FILE *f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);
    // Destroy objects in the reverse order of construction
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Done.\n");

    return 0;
}
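
Before moving on, it helps to know what this network should output. The FC layer plus activation computes prob = sigmoid(W * x + b), with the weights laid out row-major as noted in the comments above. A plain CPU reference of the same computation (the all-ones test input is a hypothetical choice of mine), handy for checking the engine later:

#include <math.h>
#include <stdio.h>

// CPU reference for the 3-in / 2-out FC + sigmoid network defined above.
// Assumes the same row-major weight layout passed to addFullyConnected.
void reference_forward(const float x[3], float out[2]) {
    const float W[2][3] = {{1.0f, 2.0f, 0.5f}, {0.1f, 0.2f, 0.5f}};
    const float b[2] = {0.3f, 0.8f};
    for (int i = 0; i < 2; ++i) {
        float acc = b[i];
        for (int j = 0; j < 3; ++j)
            acc += W[i][j] * x[j];
        out[i] = 1.0f / (1.0f + expf(-acc));  // sigmoid
    }
}

int main() {
    float x[3] = {1.0f, 1.0f, 1.0f};  // hypothetical test input
    float out[2];
    reference_forward(x, out);
    // For all-ones input: sigmoid(3.8) ~= 0.978, sigmoid(1.6) ~= 0.832
    printf("prob = %f, %f\n", out[0], out[1]);
    return 0;
}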

Makefile (I am developing on an NVIDIA Jetson Nano with JetPack 4.5; the TensorRT version is 7.1):

EXE=main

INCLUDE=/usr/include/aarch64-linux-gnu/
LIBPATH=/usr/lib/aarch64-linux-gnu/
CFLAGS= -I$(INCLUDE) -I/usr/local/cuda-10.2/include -MMD -MP
LIBS= -L$(LIBPATH) -lnvinfer -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -lstdc++fs

CXX_OBJECTS := $(patsubst %.cpp,%.o,$(shell find . -name "*.cpp"))
DEP_FILES = $(patsubst %.o,%.d,$(CXX_OBJECTS))

$(EXE): $(CXX_OBJECTS)
	$(CXX) $(CXX_OBJECTS) -o $(EXE) $(LIBS)

%.o: %.cpp
	$(CXX) -c -o $@ $(CFLAGS) $<

# Pull in the compiler-generated dependency files (from -MMD) so that
# header changes trigger rebuilds
-include $(DEP_FILES)

clean:
	rm -rf $(CXX_OBJECTS) $(DEP_FILES) $(EXE)

test:
	echo $(CXX_OBJECTS)

Running make and then ./main produces the following output:

workspace Size = 256.00 MB
4: Applying generic optimizations to the graph for inference.
4: Original: 2 layers
4: After dead-layer removal: 2 layers
4: After Myelin optimization: 2 layers
4: After scale fusion: 2 layers
4: After vertical fusions: 2 layers
4: After final dead-layer removal: 2 layers
4: After tensor merging: 2 layers
4: After concat removal: 2 layers
4: Graph construction and optimization completed in 0.0724424 seconds.
4: Constructing optimization profile number 0 [1/1].
4: *************** Autotuning format combination: Float(1,1,1,3) -> Float(1,1,1,2) ***************
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: --------------- Timing Runner: (Unnamed Layer* 0) [Fully Connected] (CaskFullyConnected)
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: Tactic: 8883888914904656451 time 0.0325
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: Tactic: 5453137127347942357 time 0.028385
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: Tactic: 5373503982740029499 time 0.028333
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: Tactic: 4133936625481774016 time 0.016875
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: Tactic: 1933552664043962183 time 0.016927
4: Fastest Tactic: 4133936625481774016 Time: 0.016875
4: --------------- Timing Runner: (Unnamed Layer* 0) [Fully Connected] (CudaFullyConnected)
4: Tactic: 0 time 0.01974
4: Tactic: 1 time 0.023021
4: Tactic: 9 time 0.026927
4: Tactic: 26 time 0.019167
4: Tactic: 27 time 0.018907
4: Tactic: 48 time 0.019167
4: Tactic: 49 time 0.019844
4: Fastest Tactic: 27 Time: 0.018907
4: >>>>>>>>>>>>>>> Chose Runner Type: CaskFullyConnected Tactic: 4133936625481774016
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: 
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: *************** Autotuning format combination: Float(1,1,1,2) -> Float(1,1,1,2) ***************
4: --------------- Timing Runner: (Unnamed Layer* 1) [Activation] (Activation)
4: Tactic: 0 is the only option, timing skipped
4: Fastest Tactic: 0 Time: 0
4: Formats and tactics selection completed in 0.281916 seconds.
4: After reformat layers: 2 layers
4: Block size 268435456
4: Block size 512
4: Total Activation Memory: 268435968
3: Detected 1 inputs and 1 output network tensors.
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: Layer: (Unnamed Layer* 0) [Fully Connected] Weights: 24 HostPersistent: 384 DevicePersistent: 1536
4: Layer: (Unnamed Layer* 1) [Activation] Weights: 0 HostPersistent: 0 DevicePersistent: 0
4: Total Host Persistent Memory: 384
4: Total Device Persistent Memory: 1536
4: Total Weight Memory: 24
4: Builder timing cache: created 1 entries, 0 hit(s)
4: Engine generation completed in 11.306 seconds.
4: Engine Layer Information:
4: Layer(caskFullyConnectedFP32): (Unnamed Layer* 0) [Fully Connected], Tactic: 4133936625481774016, image[Float(3,1,1)] -> (Unnamed Layer* 0) [Fully Connected]_output[Float(2,1,1)]
4: Layer(Activation): (Unnamed Layer* 1) [Activation], Tactic: 0, (Unnamed Layer* 0) [Fully Connected]_output[Float(2,1,1)] -> (Unnamed Layer* 1) [Activation]_output[Float(2,1,1)]
Done.
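
With engine.trtmodel on disk, inference consists of deserializing the engine and executing it. Below is a minimal sketch against the TensorRT 7 runtime API (error handling is mostly omitted; the test input mirrors the CPU reference above, and the binding-order assumption is called out in the comments):

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

using namespace nvinfer1;

// Same minimal logger as in the build program, printing warnings and errors
class TRTLogger: public ILogger {
public:
    virtual void log(Severity severity, const char *msg) noexcept override {
        if (severity <= Severity::kWARNING) printf("%d: %s\n", (int)severity, msg);
    }
};

int main() {
    TRTLogger logger;

    // Load the serialized engine produced by the build program
    FILE *f = fopen("engine.trtmodel", "rb");
    if (f == nullptr) { printf("engine.trtmodel not found.\n"); return -1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *data = malloc(size);
    fread(data, 1, size, f);
    fclose(f);

    // Deserialize under the same TensorRT/CUDA/GPU as at build time
    IRuntime *runtime = createInferRuntime(logger);
    ICudaEngine *engine = runtime->deserializeCudaEngine(data, size, nullptr);
    IExecutionContext *context = engine->createExecutionContext();

    // Assumes binding 0 is the "image" input and binding 1 the output;
    // verify with engine->getBindingIndex("image") in real code
    float input[3] = {1.0f, 1.0f, 1.0f};  // hypothetical test input
    float output[2] = {0};
    void *bindings[2];
    cudaMalloc(&bindings[0], sizeof(input));
    cudaMalloc(&bindings[1], sizeof(output));
    cudaMemcpy(bindings[0], input, sizeof(input), cudaMemcpyHostToDevice);

    // Synchronous execution for an explicit-batch network
    context->executeV2(bindings);
    cudaMemcpy(output, bindings[1], sizeof(output), cudaMemcpyDeviceToHost);
    printf("prob = %f, %f\n", output[0], output[1]);  // expect ~0.978, ~0.832

    // Release in reverse order of creation (TensorRT 7 API; newer versions use delete)
    cudaFree(bindings[0]); cudaFree(bindings[1]);
    context->destroy();
    engine->destroy();
    runtime->destroy();
    free(data);
    return 0;
}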

 
