Deploying Deep Learning Models with TensorRT

1. Background

The mainstream deep learning frameworks (Caffe, MXNet, TensorFlow, PyTorch, etc.) are designed primarily for training; using them directly as inference frameworks when deploying models in real projects tends to be inefficient. Deploying a trained model with Nvidia's TensorRT can greatly speed up inference, often at least doubling the speed compared with the original framework while also using less device memory. So for anyone who needs to put models into production, mastering how to deploy deep learning models with TensorRT is very useful.

2. Related techniques

The figure above, taken from the official TensorRT site, lists some of the techniques TensorRT uses. Mature deployment techniques such as model quantization, dynamic memory optimization, and layer fusion are all integrated into TensorRT, which is why it can speed up model inference so much. Overall, TensorRT takes a trained model, applies a series of optimization techniques, and converts it into a high-performance inference engine of platform-specific execution code for a particular GPU. Other tools such as TVM and Tensor Comprehensions can achieve similar goals and effectively improve inference speed on particular platforms, but since most of the compute devices used by enterprises today are produced by Nvidia, TensorRT usually has a performance advantage over the other tools on those devices. In addition, TensorRT depends only on C++ and CUDA, making it leaner than many of the alternatives.

3. Deploying a TensorFlow model with TensorRT

Real projects usually deploy in C++, so this tutorial also uses the TensorRT C++ API, with TensorRT version 5.1.5. For installation, refer to the guide [deep learning] TensorRT installation and the installation instructions on the official website.

Model persistence

The first step in deploying a TensorFlow model is to persist it, i.e. save the model structure and weights into a .pb file.

# Run after the model is defined and its weights are loaded into `sess`;
# `outputs` is the list of the model's output tensors.
pb_graph = tf.graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), [v.op.name for v in outputs])
with tf.gfile.FastGFile('./pbmodel_name.pb', mode='wb') as f:
    f.write(pb_graph.SerializeToString())

The code above should only be run after the model has been defined and its weights loaded. The call to tf.graph_util.convert_variables_to_constants converts the weights into constants; its outputs argument is the list of output tensors. Finally, pb_graph.SerializeToString() serializes the graph, which is written to the .pb file, producing the pb model.

Generating the UFF model

With the pb model in hand, it needs to be converted into a UFF model that TensorRT can read. Simply call the conversion script that ships with the uff package:

python /usr/lib/python2.7/site-packages/uff/bin/convert_to_uff.py   pbmodel_name.pb

If the conversion succeeds, the script prints information such as the number of nodes, together with a summary of the inferred input and output nodes of the graph, as shown in the figure below.

Deploying the model with the TensorRT C++ API

To deploy the generated UFF model, TensorRT first parses the network structure and weights stored in the UFF file into a network definition, then runs its optimization algorithms on it to generate the corresponding inference engine. The code below shows the steps. First define an IBuilder* builder and a parser used to parse the UFF file; the parser reads the model parameters and network structure from the UFF file into the network created by the builder. Before parsing, the parser must be told the network's input and output nodes. Once parsing is done, the builder can create the engine from the parsed network definition. The maximum batch size must be specified before creating the engine; the batch size used at inference time must not exceed this value, otherwise an error occurs. Efficiency is highest when the actual batch size equals the configured maximum: for example, with the maximum batch size set to 10, feeding batches of 10 images gives an average inference time of around 4 ms per image, while feeding batches of fewer than 10 images results in an average time per image higher than 4 ms.

// Create the builder and the UFF parser
IBuilder* builder = createInferBuilder(gLogger.getTRTLogger());
auto parser = createUffParser();
// Tell the parser the input and output nodes of the UFF graph
parser->registerInput(inputtensor_name, Dims3(INPUT_C, INPUT_H, INPUT_W), UffInputOrder::kNCHW);
parser->registerOutput(outputtensor_name);
// Parse the UFF file into the network definition
INetworkDefinition* network = builder->createNetwork();
if (!parser->parse(uffFile, *network, nvinfer1::DataType::kFLOAT))
{
    gLogError << "Failure while parsing UFF file" << std::endl;
    return nullptr;
}
// Build the inference engine
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(MAX_WORKSPACE);
ICudaEngine* engine = builder->buildCudaEngine(*network);
if (!engine)
{
    gLogError << "Unable to create engine" << std::endl;
    return nullptr;
}

Once the engine has been generated, inference also needs an execution context, IExecutionContext* context, which is obtained from engine->createExecutionContext(). The core call that performs inference is

 context->execute(batchSize, &buffers[0]);  

Here buffers is a void* array holding the device addresses of the model's input and output tensors. Device memory for the inputs and outputs is allocated with cudaMalloc and the resulting pointers are stored in the buffers array; before calling execute, the input data (e.g. the input image) is copied into the corresponding input device buffer with cudaMemcpy, and after execute returns, the results are copied back from the output device buffers, again with cudaMemcpy.
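Putting these steps together, a minimal sketch of the buffer handling might look like the following (OUTPUT_SIZE, inputData, and outputData are illustrative names assumed here, not taken from the code above; the sketch also assumes float inputs and outputs):

// Byte sizes per batch; adjust to the actual model
size_t inputBytes  = batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float);
size_t outputBytes = batchSize * OUTPUT_SIZE * sizeof(float);

// One device buffer per binding, ordered by binding index
void* buffers[2];
int inputIndex  = engine->getBindingIndex(inputtensor_name);
int outputIndex = engine->getBindingIndex(outputtensor_name);
cudaMalloc(&buffers[inputIndex], inputBytes);
cudaMalloc(&buffers[outputIndex], outputBytes);

IExecutionContext* context = engine->createExecutionContext();

// Copy input to the device, run inference, copy the result back
cudaMemcpy(buffers[inputIndex], inputData, inputBytes, cudaMemcpyHostToDevice);
context->execute(batchSize, &buffers[0]);
cudaMemcpy(outputData, buffers[outputIndex], outputBytes, cudaMemcpyDeviceToHost);

cudaFree(buffers[inputIndex]);
cudaFree(buffers[outputIndex]);
context->destroy();

The asynchronous variants (context->enqueue together with cudaMemcpyAsync on a stream) follow the same pattern.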

A more detailed walkthrough can be found in the sampleUffMNIST code in the official TensorRT samples.

Speed-up results

In an actual project I have used TensorRT on a Tesla M40 to accelerate ResNet-50, Inception-ResNet-v2, and Google's image-retrieval model DELF (DEep Local Features); a comparison of single-image inference times with and without TensorRT (in ms) is shown in the figure below.

4. Deploying a Caffe model with TensorRT

Compared with TensorFlow, converting a Caffe model is simpler: there is no intermediate step like the TensorFlow-to-UFF conversion, because TensorRT can parse the prototxt and caffemodel files directly to obtain the network structure and weights. The procedure is otherwise the same as above, with two differences. First, the Caffe parser does not require the input layer to be registered in advance, since the input is already defined in the prototxt and the parser can recover it automatically. Second, the Caffe parser returns an IBlobNameToTensor* blobNameToTensor that records the mapping from blob names in the prototxt to tensors in the parsed network; after parsing, this mapping is used to look up, in order, each tensor named in the outputs list and mark it as an output with network->markOutput. The engine can then be generated as before.

// Create the builder, the network definition, and the Caffe parser
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
DataType modelDataType = DataType::kFLOAT;

// Parse prototxt + caffemodel directly into the network
const IBlobNameToTensor* blobNameToTensor = parser->parse(deployFile.c_str(),
                                                          modelFile.c_str(),
                                                          *network,
                                                          modelDataType);
assert(blobNameToTensor != nullptr);

// Look up each output blob by name and mark it as a network output
for (auto& s : outputs) network->markOutput(*blobNameToTensor->find(s.c_str()));

// Build the engine
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(1 << 30);
engine = builder->buildCudaEngine(*network);

Engine generation and execution after this point are exactly as described in the previous section; the official sampleMNIST routine can be consulted for details.

Speed-up results

In an actual project I have used TensorRT on a Tesla M40 to accelerate the Caffe models VGG19, SSD (about 1.6x faster), ResNet-50, and MobileNetV2; a comparison of single-image inference times with and without TensorRT (in ms) is shown in the figure below.

5. Adding a custom layer to TensorRT

TensorRT currently supports only fairly common operations; many operations, such as upsampling (Upsample), are not supported. In that case we have to write a plugin layer for TensorRT ourselves so that the unsupported operation can still be used inside TensorRT. Taking the definition of an Upsample layer as an example, we first define an Upsample class that inherits from TensorRT's plugin base class:

class Upsample: public IPluginExt

The class then has to implement a number of methods. First come two constructors: one builds the plugin from its parameters, the other rebuilds it from a serialized bit stream.

// Construct from parameters: mScale is the upsampling factor
Upsample(int scale = 2) : mScale(scale) {
    assert(mScale > 0);
}

// Construct from a serialized bit stream
Upsample(const void *data, size_t length) {
    const char *d = reinterpret_cast<const char *>(data), *a = d;
    mScale = read<int>(d);
    mDtype = read<DataType>(d);
    mCHW = read<DimsCHW>(d);
    assert(mScale > 0);
    assert(d == a + length);
}

~Upsample() {}

Next come the methods that describe the layer's outputs:

// Number of outputs of the layer
int getNbOutputs() const override {
    return 1;
}

// Shape of the layer's output, derived from the input shape
Dims getOutputDimensions(int index, const Dims *inputs, int nbInputDims) override {
    assert(nbInputDims == 1);
    assert(inputs[0].nbDims == 3);
    return DimsCHW(inputs[0].d[0], inputs[0].d[1] * mScale, inputs[0].d[2] * mScale);
}

Then the methods that check which data types and formats the layer supports, and that configure the layer's parameters from the shapes and types of its inputs and outputs:

// Check whether the layer supports the given data type and format
bool supportsFormat(DataType type, PluginFormat format) const override {
    return (type == DataType::kFLOAT || type == DataType::kHALF || type == DataType::kINT8)
           && format == PluginFormat::kNCHW;
}

// Configure the layer parameters from the chosen type/format and the input dimensions
void configureWithFormat(const Dims *inputDims, int nbInputs, const Dims *outputDims, int nbOutputs,
                         DataType type, PluginFormat format, int maxBatchSize) override
{
    mDtype = type;
    mCHW.c() = inputDims[0].d[0];
    mCHW.h() = inputDims[0].d[1];
    mCHW.w() = inputDims[0].d[2];
}

Next, the serialization methods:

// Number of bytes needed to serialize the layer
size_t getSerializationSize() override {
    return sizeof(mScale) + sizeof(mDtype) + sizeof(mCHW);
}

// Serialize the layer parameters into a bit stream
void serialize(void *buffer) override {
    char *d = reinterpret_cast<char *>(buffer), *a = d;
    write(d, mScale);
    write(d, mDtype);
    write(d, mCHW);
    assert(d == a + getSerializationSize());
}
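The read and write calls used in the serialization code above are small helper templates that are not shown in the original post; a minimal sketch of what they could look like (an assumption on my part, requiring <cstring> for memcpy):

// Read a value of type T from the buffer and advance the pointer
template <typename T>
T read(const char*& buffer) {
    T val;
    memcpy(&val, buffer, sizeof(T));
    buffer += sizeof(T);
    return val;
}

// Write a value of type T into the buffer and advance the pointer
template <typename T>
void write(char*& buffer, const T& val) {
    memcpy(buffer, &val, sizeof(T));
    buffer += sizeof(T);
}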

Finally, the methods involved in the layer's computation:

// Size of the temporary workspace the layer needs (none here)
size_t getWorkspaceSize(int maxBatchSize) const override {
    return 0;
}

// The actual computation performed by the layer
int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace,
            cudaStream_t stream) override;

Inside enqueue we call a CUDA kernel we have written ourselves to carry out the upsampling computation.
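As an illustration, a minimal sketch of what the enqueue implementation could look like, assuming float data and a hypothetical host-side launcher launchUpsampleKernel compiled from a separate .cu file (neither the launcher name nor its signature comes from the original post):

// Hypothetical host-side launcher for the CUDA upsampling kernel,
// implemented in a .cu file (not shown here)
void launchUpsampleKernel(const float* in, float* out,
                          int n, int c, int h, int w, int scale,
                          cudaStream_t stream);

int Upsample::enqueue(int batchSize, const void* const* inputs, void** outputs,
                      void* workspace, cudaStream_t stream) {
    // Input is N x C x H x W, output is N x C x (H*scale) x (W*scale);
    // for simplicity this sketch handles only kFLOAT data
    launchUpsampleKernel(static_cast<const float*>(inputs[0]),
                         static_cast<float*>(outputs[0]),
                         batchSize, mCHW.c(), mCHW.h(), mCHW.w(), mScale,
                         stream);
    return 0;
}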

This completes the definition of the Upsample class, and the plugin we have written can be added directly to the network. The statements below define an upsampling layer with a scale factor of 2. The first argument of addPluginExt has type ITensor** in order to support multiple inputs, the second is the number of inputs, and the third is the plugin object to be added.

Upsample up(2);
auto upsamplelayer = network->addPluginExt(&inputtensor, 1, up);

6. Adding custom layer support to the CaffeParser

If our custom layer is written into a Caffe prototxt, calling the caffeparser to parse the model at deployment time will produce an error.

Staying with the Upsample example, suppose the prototxt contains the following snippet that adds an upsample layer:

layer {
  name: "upsample0"
  type: "Upsample"
  bottom: "ReLU11"
  top: "Upsample1"
}

Then, when the following is called,

const IBlobNameToTensor *blobNameToTensor =	parser->parse(deployFile.c_str(),
                                                              modelFile.c_str(),
                                                              *network,
                                                              modelDataType);

an error will occur.

We have already written the Upsample plugin; how do we get TensorRT's Caffe parser to recognize the upsample layer in the prototxt and build it automatically from the plugin we wrote? We need to define a plugin factory class that inherits from the base classes nvinfer1::IPluginFactory and nvcaffeparser1::IPluginFactoryExt.

class PluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactoryExt

Among the methods that must be implemented is one that decides whether a given layer is a plugin. Its argument is the layer name from the prototxt, so the decision is made by checking whether the layer name is that of a registered plugin:

bool isPlugin(const char *name) override {
    return isPluginExt(name);
}

// Check whether the layer name is the name of an upsample layer
// by comparing its first 5 characters against "upsam"
bool isPluginExt(const char *name) override {
    return strncmp(name, "upsam", 5) == 0;
}

Next are the methods that create a plugin from the layer name. There are two: one builds the plugin from weights, the other from a serialized bit stream, matching the plugin's two constructors. The upsample layer has no weights, but for plugins that do, the weights can be passed in to initialize the layer. mPlugin is a vector used to hold all the plugin layers that have been created.

// Create a plugin from layer name and weights (upsample has no weights)
IPlugin *createPlugin(const char *layerName, const nvinfer1::Weights *weights, int nbWeights) override {
    assert(isPlugin(layerName));
    mPlugin.push_back(std::unique_ptr<Upsample>(new Upsample(2)));
    return mPlugin[mPlugin.size() - 1].get();
}

// Create a plugin from a serialized bit stream
IPlugin *createPlugin(const char *layerName, const void *serialData, size_t serialLength) override {
    assert(isPlugin(layerName));
    return new Upsample(serialData, serialLength);
}

std::vector<std::unique_ptr<Upsample>> mPlugin;

Finally, we define a destroyPlugin method to release all the plugin layers that were created.

void destroyPlugin() {
    for (unsigned int i = 0; i < mPlugin.size(); i++) {
        mPlugin[i].reset();
    }
}

If the prototxt contains several different kinds of plugin layer, you can add further conditional branches to isPlugin and createPlugin and create the plugin that matches each layer name, as sketched below.
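For illustration, a minimal sketch of such branching, with a second, hypothetical plugin class LeakyRelu used only to show the structure (the class and the "leaky" layer-name prefix are assumptions, not from the original post):

// Decide between two plugin types by layer-name prefix
bool isPluginExt(const char *name) override {
    return strncmp(name, "upsam", 5) == 0 || strncmp(name, "leaky", 5) == 0;
}

IPlugin *createPlugin(const char *layerName, const nvinfer1::Weights *weights, int nbWeights) override {
    assert(isPlugin(layerName));
    if (strncmp(layerName, "upsam", 5) == 0) {
        mUpsamplePlugins.push_back(std::unique_ptr<Upsample>(new Upsample(2)));
        return mUpsamplePlugins.back().get();
    }
    // hypothetical second plugin class, shown only to illustrate the branching
    mLeakyReluPlugins.push_back(std::unique_ptr<LeakyRelu>(new LeakyRelu()));
    return mLeakyReluPlugins.back().get();
}

std::vector<std::unique_ptr<Upsample>> mUpsamplePlugins;
std::vector<std::unique_ptr<LeakyRelu>> mLeakyReluPlugins;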

Once PluginFactory is implemented, the caffeparser has to be told to use it; add the following code before calling parser->parse:

PluginFactory pluginFactory;
parser->setPluginFactoryExt(&pluginFactory);

The parser will then create plugin layers according to the rules defined in pluginFactory, and the earlier error about the unparseable Upsample layer no longer appears.

The official sample samplePlugin, which adds a plugin layer, can be used as a reference.

7. Lessons learned (pitfalls encountered)

1. When converting a TensorFlow model, the input and output node names used when generating the pb model, when converting it to UFF, and when registering the inputs and outputs with the uffparser must all be consistent; otherwise the parser will eventually fail with errors such as being unable to find the input or output nodes.

2. Besides the IPluginExt illustrated here, TensorRT also has the plugin base classes IPlugin and IPluginV2. The methods that must be implemented when inheriting from each base class differ slightly; see the include/NvInfer.h file in your TensorRT installation for details. The functions for adding your own layer to a network, addPlugin, addPluginExt, and addPluginV2, correspond to IPlugin, IPluginExt, and IPluginV2 respectively and must not be mixed up, otherwise some methods of your class will not be called. For example, adding an IPluginExt layer with addPlugin means configureWithFormat will never be invoked, because that method does not exist on IPlugin. Likewise, the caffeparser's setPluginFactory and setPluginFactoryExt must not be mixed up.

3. A CUDA failure at runtime is usually caused by an illegal memory access when copying data between host and device; check that the size of the data being copied matches the size of the device buffer that was allocated (a sketch of one way to do this appears after this list).

4. Some operations that TensorRT does not support can be replaced by a combination of operations it does support, which saves the time of writing a custom layer.

5. TensorFlow's flatten/reduce operations default to keepdims=False, but the UFF converter assumes keepdims=True. As a result, transpose, expand_dims, and similar operations applied to the flattened vector in TensorFlow tend to cause errors when TensorRT parses the converted UFF, such as "Order size is not matching the number dimensions of TensorRT". It is better to set keepdims=True in TensorFlow's reduce/flatten operations and keep layer outputs four-dimensional throughout, which avoids all sorts of strange errors when moving to TensorRT.

6. TensorRT's slice layer also has problems. When I added a slice layer to the network with network->addSlice, buildCudaEngine failed with the error nvinfer1::builder::checkSanity(const nvinfer1::builder::Graph&): Assertion `tensors.size() == g.tensors.size()' failed. It is best to avoid the slice layer when constructing the network and instead implement the slice operation yourself in a custom layer.

7. TensorRT has open-sourced part of its code on GitHub, together with a rich set of sample code, which helps a lot in quickly getting to grips with TensorRT.
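Returning to point 3, one way to keep the copy sizes consistent is to compute every binding's byte size from the engine itself and reuse those sizes for both cudaMalloc and cudaMemcpy; a minimal sketch, assuming float bindings and a small volume helper defined here (the helper is not part of the TensorRT API):

// Hypothetical helper: number of elements in a Dims
inline size_t volume(const nvinfer1::Dims& d) {
    size_t v = 1;
    for (int i = 0; i < d.nbDims; i++) v *= d.d[i];
    return v;
}

// Allocate every binding from the dimensions the engine reports,
// and reuse the same byte counts for the cudaMemcpy calls later
std::vector<void*> buffers(engine->getNbBindings());
std::vector<size_t> bindingBytes(engine->getNbBindings());
for (int i = 0; i < engine->getNbBindings(); i++) {
    bindingBytes[i] = batchSize * volume(engine->getBindingDimensions(i)) * sizeof(float);
    cudaMalloc(&buffers[i], bindingBytes[i]);
}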

8. References

NVIDIA TensorRT Samples

TensorRT Developer Guide

TensorRT API Docs

TensorRT GitHub
