DeepStream User Guide

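Directory Layout

The DeepStream SDK consists of two main parts: the library and the workflow demonstration samples. The installed DeepStream package includes the directories /lib, /include, /doc, and /samples.

The dynamic library libdeepstream.so is in the /lib directory.
There are two header files: deepStream.h and module.h.
- deepStream.h contains the definitions of the decoded output, the supported data types, the inference parameters, the profiler class, and the DeepStream worker, as well as the function declarations.
- module.h is the header file for plug-in implementation. It is not needed for applications without plug-ins.
The /samples folder contains examples of decoding, decoding with inference, and plug-in implementation. More information can be found in the Samples section.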

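Workflow

Live video stream analysis requires real-time decoding and neural network inference. For decoding, multiple threads run in parallel and feed the various input streams to the GPU hardware decoder. For inference, one main thread handles all batched inference tasks by calling the TensorRT inference engine. The plug-in system lets users add more complex workflows to the pipeline.

Decoding and Inference Workflow

The input to DeepStream consists of multiple video channels from local video files (H.264, HEVC, and so on) or online video streams. The DeepStream workflow is straightforward and consists of the following steps.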

Figure 3-1: DeepStream workflow

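A DeviceWorker is created with the GPU device ID, the maximum number of channels, and so on.
The videos are fed into the DeviceWorker as a series of video packets.
Once it receives input packets, the DeviceWorker starts decoding and analyzes the decoded frames.
- The frame pool in the DeviceWorker collects the decoded frames from all decoders. The analysis pipeline in the DeviceWorker fetches a batch of frames for the next stage of processing.
- Typically, the analysis pipeline in the DeviceWorker consists of a pair of predefined modules, one for color space conversion and one for inference.
- All decoders share the same CUDA context (the primary context) across multiple host threads. The analysis pipeline runs in a single host thread.

From a programming point of view, the user builds the pipeline by defining and configuring a DeviceWorker, then runs the workflow by calling DeviceWorker->start(). The pipeline is stopped by calling DeviceWorker->stop().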
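
The hand-off between the decoder threads and the single analysis thread can be pictured with a small producer/consumer sketch. It is only an illustration under stated assumptions: SketchFrame, SketchFramePool, and runSketchPipeline are hypothetical names that do not exist in the SDK, the decoders are simulated by plain host threads, and no GPU work is performed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a decoded frame; only trace fields are kept.
struct SketchFrame {
    int videoIndex;
    int frameIndex;
};

// Hypothetical frame pool: decoder threads push, one analysis thread pops batches.
class SketchFramePool {
public:
    void push(const SketchFrame &frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            frames_.push_back(frame);
        }
        ready_.notify_one();
    }

    // Block until at least one frame is available, then take up to maxBatch frames.
    std::vector<SketchFrame> popBatch(std::size_t maxBatch) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !frames_.empty(); });
        std::vector<SketchFrame> batch;
        while (!frames_.empty() && batch.size() < maxBatch) {
            batch.push_back(frames_.front());
            frames_.pop_front();
        }
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<SketchFrame> frames_;
};

// Start one simulated decoder thread per channel and drain the pool in batches
// on the calling thread. Returns the number of frames consumed.
inline std::size_t runSketchPipeline(int channels, int framesPerChannel,
                                     std::size_t maxBatch) {
    SketchFramePool pool;
    std::vector<std::thread> decoders;
    for (int ch = 0; ch < channels; ++ch) {
        decoders.emplace_back([&pool, ch, framesPerChannel] {
            for (int i = 0; i < framesPerChannel; ++i) pool.push({ch, i});
        });
    }
    const std::size_t total =
        static_cast<std::size_t>(channels) * static_cast<std::size_t>(framesPerChannel);
    std::size_t consumed = 0;
    while (consumed < total) {
        consumed += pool.popBatch(maxBatch).size();
    }
    for (auto &t : decoders) t.join();
    return consumed;
}
```

A batch in this sketch may be smaller than maxBatch when few frames are waiting; whether the SDK waits for a full batch is not stated in this guide.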

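Plug-in Mechanism

In DeepStream, the workflow after decoding is called the analysis pipeline. In addition to the predefined color space converter and TensorRT-based inference modules, users can define their own modules. Two main classes are involved: IStreamTensor, which defines the input and output format of a module, and IModule, which implements the module mechanism.

IStreamTensor

All modules use the IStreamTensor format, which allows data shapes with up to four dimensions. The tensor type can be float, nv12_frame, object coordinates, or a user-defined type.

Code 3-1: Tensor type definition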

typedef enum {
FLOAT_TENSOR = 0,      //!< float tensor
NV12_FRAME  = 1,       //!< nv12 frames
OBJ_COORD = 2,         //!< coords of object
CUSTOMER_TYPE = 3      //!< user-defined type
} TENSOR_TYPE;

Data can be stored on the GPU or the CPU, as specified by MEMORY_TYPE.

Code 3-2: Memory type definition

typedef enum {
GPU_DATA = 0, //!< gpu data
CPU_DATA = 1 //!< cpu data
} MEMORY_TYPE;

Data and information are obtained from the previous module as an IStreamTensor. Code 3-3 lists the member functions of IStreamTensor.

Code 3-3: IStreamTensor member functions

virtual void* getGpuData() = 0;
virtual const void* getConstGpuData() = 0;
virtual void* getCpuData() = 0;
virtual const void* getConstCpuData() = 0;
virtual size_t getElemSize() const = 0;
virtual MEMORY_TYPE getMemoryType() const = 0;
virtual TENSOR_TYPE getTensorType() const =0;
virtual std::vector<int>& getShape() = 0;
virtual std::vector<TRACE_INFO > getTraceInfos() = 0;
virtual int getMaxBatch() const = 0;

virtual void setShape(const std::vector<int>& shape) = 0;
virtual void setShape(const int n, const int c, const int h, const int w) = 0;
virtual void setTraceInfo(std::vector<TRACE_INFO >& vTraceInfos) = 0;
virtual void destroy() = 0;

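For a module's input, use the functions with the "get" prefix to obtain information and data. For a module's output, use the functions with the "set" prefix to set that information. An IStreamTensor should be created with createStreamTensor.

Code 3-4: createStreamTensor global function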

inline IStreamTensor *createStreamTensor(const int nMaxLen,
                                         const size_t nElemSize,
                                         const TENSOR_TYPE Ttype,
                                         const MEMORY_TYPE Mtype,
                                         const int deviceId) {
    return reinterpret_cast<IStreamTensor *>(
        createStreamTensorInternal(nMaxLen, nElemSize, Ttype, Mtype, deviceId));
}

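Be sure to pass on or update the trace information, which includes the associated frame index, video index, and so on. It can be used to track the attributes of detected objects.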

Module

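Each module should implement initialize(), execute(), and destroy(). A module is added to the DeviceWorker through IDeviceWorker->addCustomerTask (defined in deepStream.h). DeepStream calls initialize() when the task is added. When DeviceWorker->start() runs, the execute() function of each module is called serially. destroy() is called when DeviceWorker->destroy() is invoked.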
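
The call order described above can be traced with a small stand-alone sketch. SketchModule, LoggingModule, SketchWorker, and runLifecycleSketch are hypothetical names used only for illustration; they are not the IModule or IDeviceWorker declarations from the SDK headers, and the sketch runs execute() once per module where a real pipeline would presumably run it for every batch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified module interface (not the IModule from module.h).
class SketchModule {
public:
    virtual ~SketchModule() = default;
    virtual void initialize() = 0;
    virtual void execute() = 0;
    virtual void destroy() = 0;
};

// Module that records each lifecycle call in a shared log.
class LoggingModule : public SketchModule {
public:
    LoggingModule(std::string name, std::vector<std::string> &log)
        : name_(std::move(name)), log_(log) {}
    void initialize() override { log_.push_back(name_ + ".initialize"); }
    void execute() override { log_.push_back(name_ + ".execute"); }
    void destroy() override { log_.push_back(name_ + ".destroy"); }

private:
    std::string name_;
    std::vector<std::string> &log_;
};

// Hypothetical worker that mirrors the call order described in the text.
class SketchWorker {
public:
    // initialize() runs when the task is added.
    void addTask(SketchModule *module) {
        module->initialize();
        modules_.push_back(module);
    }
    // execute() runs serially, in the order the modules were added.
    void start() {
        for (SketchModule *module : modules_) module->execute();
    }
    // destroy() runs when the worker is destroyed.
    void destroy() {
        for (SketchModule *module : modules_) module->destroy();
        modules_.clear();
    }

private:
    std::vector<SketchModule *> modules_;
};

// Build a two-module pipeline and return the recorded call sequence.
inline std::vector<std::string> runLifecycleSketch() {
    std::vector<std::string> log;
    LoggingModule convert("convert", log);
    LoggingModule infer("infer", log);
    SketchWorker worker;
    worker.addTask(&convert);
    worker.addTask(&infer);
    worker.start();
    worker.destroy();
    return log;
}
```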

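Samples

DeepStream provides three samples, located in the /samples directory:

decPerf
Uses DeepStream to test video decoding performance.
nvDecInfer_classification
Uses DeepStream to test video decoding and classification inference.
nvDecInfer_detection
Uses DeepStream to test video decoding and detection inference.

/common
Contains the header files shared by all samples.
/data/model
Contains the pretrained GoogleNet and Resnet18 models used for the classification and detection use cases, respectively.
/data/video
Contains the videos used by the samples.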

DECPERF

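The decPerf sample tests the performance of GPU hardware decoding and demonstrates the following pre-analysis workflow:

Feeding packets into the pipeline
Adding a decode task
Profiling decode performance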

DeviceWorker

DataProvider and FileDataProvider class definitions
Both classes are defined in the dataProvider.h header file. FileDataProvider loads packets from a video file. The dataProvider interface is shown in Code 4-1.

Code 4-1: dataProvider Interface

// Get data from the data provider. A return value of false means no
// more data can be loaded.
bool getData(uint8_t **ppBuf, int *pnBuf);
// Reset the dataProvider to reload the video file from the
// beginning.
void reload();
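
As a rough picture of how a provider with this interface might behave, the sketch below reads a file in fixed-size chunks. ChunkFileProvider is a hypothetical class, not the FileDataProvider from dataProvider.h, and the fixed-size chunking is an assumption: the real provider loads video packets, whose boundaries are not described in this guide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical provider that hands out a file in fixed-size chunks.
class ChunkFileProvider {
public:
    ChunkFileProvider(const std::string &path, int chunkBytes)
        : file_(path, std::ios::binary),
          buffer_(static_cast<std::size_t>(chunkBytes > 0 ? chunkBytes : 1)) {}

    // Point *ppBuf at the next chunk and store its length in *pnBuf.
    // Returns false when no more data can be loaded.
    bool getData(std::uint8_t **ppBuf, int *pnBuf) {
        if (!file_.is_open()) return false;
        file_.read(reinterpret_cast<char *>(buffer_.data()),
                   static_cast<std::streamsize>(buffer_.size()));
        const std::streamsize got = file_.gcount();
        if (got <= 0) return false;
        *ppBuf = buffer_.data();
        *pnBuf = static_cast<int>(got);
        return true;
    }

    // Rewind so that the next getData() starts from the beginning of the file.
    void reload() {
        file_.clear();  // clear the EOF/fail bits left by the last short read
        file_.seekg(0, std::ios::beg);
    }

private:
    std::ifstream file_;
    std::vector<std::uint8_t> buffer_;
};
```

The returned pointer refers to an internal buffer that is overwritten by the next getData() call, so a caller in this sketch must consume or copy each chunk before asking for the next one.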

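Creating a DeviceWorker
The DeviceWorker is responsible for the entire DeepStream workflow, including multi-channel decoding and maintenance of the analysis pipeline. A DeviceWorker is created by the createDeviceWorker function, which takes the number of channels (g_nChannels) and the GPU ID (g_devID) as parameters.

Code 4-2: Creating a DeviceWorker

// Create a DeviceWorker on a GPU device. The user needs to set the
// channel number.
IDeviceWorker *pDeviceWorker = createDeviceWorker(g_nChannels, g_devID);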

Adding a decode task to the DeviceWorker
DeviceWorker->addDecodeTask takes a single parameter, the codec format.

Code 4-3: Adding a decode task

// Add decode task, the parameter is the format of codec.
pDeviceWorker->addDecodeTask(cudaVideoCodec_H264);

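Running the DeviceWorker
After the decode task is added, the DeviceWorker creates N decoders (N == g_nChannels). Decoding work is submitted to the GPU decoder in parallel by multiple host threads.

Code 4-4: Starting and stopping the DeviceWorker (decoding and analysis pipeline)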

// Start and stop the DeviceWorker.
pDeviceWorker->start();
pDeviceWorker->stop();

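Pushing packets into the DeviceWorker
The DeviceWorker stays suspended until the user pushes video packets into it. decPerf demonstrates this with the userPushPacket() function defined in the sample.
Code 4-5: Pushing video packets to the DeviceWorker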

// User push video packets into a packet cache
std::vector<std::thread > vUserThreads;
for (int i = 0; i < g_nChannels; ++i) {
     vUserThreads.push_back( std::thread(userPushPacket, 
                             vpDataProviders[i],
                             pDeviceWorker,
                             i
                            ) );
}
// wait for user push threads
for (auto& th : vUserThreads) {
    th.join();
}

Profiler

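The decPerf sample illustrates how to profile decode operations during execution through the IDecodeProfiler interface.

The DecodeProfiler class implements the IDecodeProfiler interface, including the reportDecodeTime method needed to profile decode operations. One instance of the class is created and registered for the decoder instance associated with each channel.

Code 4-6: Setting the decode profiler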

pDeviceWorker->setDecodeProfiler(g_vpDecProfilers[i], i);

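For each decode channel, the reportDecodeTime() callback is invoked for every decoded frame. It records the frame index, the video channel, the device ID, and the time taken to decode the frame.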
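
A per-channel profiler of this kind can be sketched as a small accumulator that turns per-frame decode times into a frames/second figure. The class name SketchDecodeProfiler, the parameter list of reportDecodeTime, and the use of milliseconds are all assumptions made for illustration; the actual IDecodeProfiler signature is not shown in this guide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical profiler; parameter names and units are assumptions.
class SketchDecodeProfiler {
public:
    // Called once per decoded frame with the time spent decoding it.
    void reportDecodeTime(int frameIndex, int channel, int deviceId, double decodeMs) {
        (void)frameIndex;  // the identifiers are accepted but not used here
        (void)channel;
        (void)deviceId;
        ++frames_;
        totalMs_ += decodeMs;
    }

    std::size_t frames() const { return frames_; }

    // Average throughput in frames/second; 0 if nothing has been measured.
    double framesPerSecond() const {
        return totalMs_ > 0.0
                   ? 1000.0 * static_cast<double>(frames_) / totalMs_
                   : 0.0;
    }

private:
    std::size_t frames_ = 0;
    double totalMs_ = 0.0;
};
```

In this sketch, 500 frames reported at 2 ms each give a throughput of 500 frames/second.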

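Callback

In addition to the plug-in mechanism described in the previous section, DeepStream provides a callback mechanism for getting data from the decoder, intended for simple cases that do not need a plug-in. The callback function is defined by the user and is passed to the DeviceWorker through the setDecCallback function.

Decode callback

Code 4-7: Decode callback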

typedef struct {
int frameIndex_; //!< Frame index
int videoIndex_; //!< Video index
int nWidthPixels_;  //!< Frame width
int nHeightPixels_;  //!< Frame height
uint8_t *dpFrame_;  //!< Frame data (NV12 format)
size_t frameSize_;  //!< Frame size in bytes
cudaStream_t stream_;  //!< CUDA stream
} DEC_OUTPUT;

typedef void (*DECODER_CALLBACK)(void *pUserData, DEC_OUTPUT
*decOutput);

Setting the decode callback function in the DeviceWorker

Code 4-8: Setting the decode callback function in the DeviceWorker

/** \brief Set Decode callback function
*
* Users can define their own callback function to get the NV12 frame.
* \param pUserData The data defined by user.
* \param callback The callback function defined by user.
* \param channel The channel index of video.
*/
virtual void setDecCallback(void *pUserData, DECODER_CALLBACK
callback, const int channel) = 0;
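
To show how the opaque pUserData pointer is typically used with a callback of this shape, here is a stand-alone sketch. SketchDecOutput is a reduced, hypothetical copy of DEC_OUTPUT with the CUDA stream field left out so that it compiles without CUDA headers; FrameStats, nv12Bytes, and countFrames are likewise illustrative names. The size helper assumes a tightly packed NV12 frame with even dimensions, which may not match the pitch used by the decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reduced stand-in for DEC_OUTPUT (CUDA stream omitted).
struct SketchDecOutput {
    int frameIndex_;
    int videoIndex_;
    int nWidthPixels_;
    int nHeightPixels_;
    std::uint8_t *dpFrame_;
    std::size_t frameSize_;
};

typedef void (*SKETCH_DECODER_CALLBACK)(void *pUserData, SketchDecOutput *decOutput);

// User-defined state that travels through the opaque pointer.
struct FrameStats {
    int frames = 0;
    std::size_t bytes = 0;
};

// Tightly packed NV12: a full-resolution Y plane followed by an interleaved
// UV plane at half resolution, that is, width * height * 3 / 2 bytes.
inline std::size_t nv12Bytes(int width, int height) {
    return static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * 3 / 2;
}

// Callback body: recover the user data and update the counters.
inline void countFrames(void *pUserData, SketchDecOutput *decOutput) {
    FrameStats *stats = static_cast<FrameStats *>(pUserData);
    stats->frames += 1;
    stats->bytes += decOutput->frameSize_;
}
```

With the real API, a function like countFrames would be registered per channel through setDecCallback together with a pointer to the user's state. Note that dpFrame_ presumably refers to device memory, so a real callback would need CUDA calls to read the pixels.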

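Running the Script

Build the sample by running make in the /decPerf directory.

Before building, complete the following steps:

Set the correct Video SDK and display driver installation paths for your system by editing the VIDEOSDK_INSTALL_PATH and NVIDIA_DISPLAY_DRIVER_PATH variables in the Makefile.sample_decPerf file.
Install the dependencies listed in System Requirements.

To run the sample, execute the run.sh script in the /decPerf directory. The configuration options in the script are shown below: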

----------------------------------------------------------------------
-devID: The device ID of GPU
-channels: The number of video channels
-fileList: The file path list, format: file1,file2,file3,…
-endlessLoop: If the value equals 1, the application reloads the video when it reaches the end.
../bin/sample_decPerf -devID=${DEV_ID}  -channels=${CHANNELS} \
-fileList=${FILE_LIST} -endlessLoop=1;
-----------------------------------------------------------------------
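
The -fileList option above takes a comma-separated list of paths. A minimal way to split such a value is sketched below; splitFileList is a hypothetical helper, not the parser used by the sample, and skipping empty entries is an assumption.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a "-fileList" value of the form "file1,file2,file3" into paths.
// Empty entries (for example from a trailing comma) are skipped.
inline std::vector<std::string> splitFileList(const std::string &value) {
    std::vector<std::string> files;
    std::stringstream stream(value);
    std::string item;
    while (std::getline(stream, item, ',')) {
        if (!item.empty()) files.push_back(item);
    }
    return files;
}
```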

Run Log

The result of running the sample is shown in the log below. Key points in the log are marked with "←-" annotations.

-----------------------------------------------------------------------
./run.sh
[DEBUG][11:51:32] Device ID: 0
[DEBUG][11:51:32] Video channels: 2
[DEBUG][11:51:32] Endless Loop: 1
[DEBUG][11:51:32] Device name: TITAN X (Pascal)
[DEBUG][11:51:32] =========== Video Parameters Begin =============
[DEBUG][11:51:32] Video codec : AVC/H.264
[DEBUG][11:51:32] Frame rate : 30/1 = 30 fps
[DEBUG][11:51:32] Sequence format : Progressive
[DEBUG][11:51:32] Coded frame size: [1280, 720]
[DEBUG][11:51:32] Display area : [0, 0, 1280, 720]
[DEBUG][11:51:32] Chroma format : YUV 420
[DEBUG][11:51:32] =========== Video Parameters End =============
[DEBUG][11:51:32] =========== Video Parameters Begin =============
[DEBUG][11:51:32] Video codec : AVC/H.264
[DEBUG][11:51:32] Frame rate : 30/1 = 30 fps
[DEBUG][11:51:32] Sequence format : Progressive
[DEBUG][11:51:32] Coded frame size: [1280, 720]
[DEBUG][11:51:32] Display area : [0, 0, 1280, 720]
[DEBUG][11:51:32] Chroma format : YUV 420
[DEBUG][11:51:32] =========== Video Parameters End =============
[DEBUG][11:51:33] Video [0]: Decode Performance: 718.89 frames/second || Decoded Frames: 500
[DEBUG][11:51:33] Video [1]: Decode Performance: 711.68 frames/second || Decoded Frames: 500 ←- decode performance for each channel
[DEBUG][11:51:33] Video [0]: Decode Performance: 762.77 frames/second || Decoded Frames: 1000
[DEBUG][11:51:33] Video [1]: Decode Performance: 748.20 frames/second || Decoded Frames: 1000
[DEBUG][11:51:34] Video [0]: Decode Performance: 770.27 frames/second || Decoded Frames: 1500
[DEBUG][11:51:34] Video [1]: Decode Performance: 738.86 frames/second || Decoded Frames: 1500
[DEBUG][11:51:35] Video [0]: Decode Performance: 758.35 frames/second || Decoded Frames: 2000
[DEBUG][11:51:35] Video [1]: Decode Performance: 766.09 frames/second || Decoded Frames: 2000
-----------------------------------------------------------------------

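The SDK provides two versions of the nvDecInfer sample to show how to build a decode-plus-inference workflow with the SDK. The two samples, nvDecInfer_classification and nvDecInfer_detection, cover the classification and detection use cases respectively. Their basic design and architecture are largely the same. The nvDecInfer_classification sample is described in detail here; the detection sample is implemented in much the same way.

The nvDecInfer_classification sample demonstrates typical usage of video decoding and inference. Decoded frames are converted to BGR planar format, and inference is performed with GoogleNet and TensorRT. A user-defined plug-in prints the top-5 probabilities to a log file.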
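
Selecting the top-5 entries from a vector of class probabilities can be sketched as follows. The topK helper is hypothetical and operates on a plain host-side vector; how the sample's plug-in actually reads the inference output and formats its log is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the indices of the k largest probabilities, highest first.
inline std::vector<int> topK(const std::vector<float> &prob, std::size_t k) {
    std::vector<int> idx(prob.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    k = std::min(k, idx.size());
    // Only the first k positions need to be in sorted order.
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k),
                      idx.end(),
                      [&prob](int a, int b) { return prob[a] > prob[b]; });
    idx.resize(k);
    return idx;
}
```

For the GoogleNet output in this sample, prob would hold 1000 values per frame and k would be 5.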

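Adding Modules to the Analysis Pipeline

Module: color space convertor

Decoded frames are in NV12 (YUV 4:2:0) format and are converted to BGR planar for the inference model.

Code 4-9: Color space convertor

// Add frame parser (color space convertor)
IModule* pConvertor = pDeviceWorker->addColorSpaceConvertorTask(BGR_PLANAR);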
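
For readers unfamiliar with the two layouts, the following CPU reference sketch converts a tightly packed NV12 frame to BGR planar bytes. It is not the SDK's converter, which runs on the GPU. The function names are hypothetical, and the BT.601 limited-range coefficients, the 8-bit output, and the even frame dimensions are assumptions, since this guide does not say which conversion the predefined module applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round to the nearest integer and clamp to the 0..255 range.
inline std::uint8_t clampToByte(float v) {
    return static_cast<std::uint8_t>(
        std::lround(std::min(255.0f, std::max(0.0f, v))));
}

// Convert tightly packed NV12 (Y plane, then interleaved U,V at half
// resolution) to BGR planar: all B values, then all G, then all R.
inline std::vector<std::uint8_t> nv12ToBgrPlanar(
        const std::vector<std::uint8_t> &nv12, int width, int height) {
    const std::size_t plane =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    assert(nv12.size() >= plane * 3 / 2);
    std::vector<std::uint8_t> bgr(plane * 3);
    const std::uint8_t *yPlane = nv12.data();
    const std::uint8_t *uvPlane = nv12.data() + plane;
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            const std::size_t i = static_cast<std::size_t>(row) * width + col;
            // Each 2x2 block of pixels shares one (U, V) pair.
            const std::size_t uv =
                static_cast<std::size_t>(row / 2) * width + (col / 2) * 2;
            const float y = 1.164f * (yPlane[i] - 16);
            const float u = uvPlane[uv] - 128.0f;
            const float v = uvPlane[uv + 1] - 128.0f;
            bgr[i] = clampToByte(y + 2.017f * u);                       // B plane
            bgr[plane + i] = clampToByte(y - 0.392f * u - 0.813f * v);  // G plane
            bgr[2 * plane + i] = clampToByte(y + 1.596f * v);           // R plane
        }
    }
    return bgr;
}
```

The planar output places each color channel in its own contiguous block, which matches the channel-first input layout (3 x height x width) reported for the network in the run log.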

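Module: inference

The inference module takes a network description file (prototxt), a trained weights file (caffemodel), and the input and output layer names as parameters.

This part is also an example of connecting modules. The first parameter of addInferenceTask is a pair consisting of the previous module and the index of that module's output.
Code 4-10: Adding an inference task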

// Add inference task
std::string inputLayerName = "data";
std::vector<std::string > outputLayerNames(1, "prob");
IModule *pInferModule = pDeviceWorker->addInferenceTask(
        std::make_pair(pConvertor, 0),
        g_deployFile,
        g_modelFile,
        g_meanFile,
        inputLayerName,
        outputLayerNames,
        g_nChannels);

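User-defined accuracy check module

This module is an example of a user-defined module. The probabilities of the top-5 results are written to a log file. A user-defined module inherits from the IModule class in the module.h file. When the module is added to the pipeline through DeviceWorker->addCustomerTask, the preceding module must be specified.

Code 4-11: Defining and adding a user-defined module to the pipeline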

// user-defined module inherits from IModule
class UserDefinedModule : public IModule {
…
};
// adding module into pipeline
PRE_MODULE_LIST preModules;
preModules.push_back(std::make_pair(pInferModule, 0));
UserDefinedModule *pAccuracyModule = new UserDefinedModule(
    preModules, g_validationFile, g_synsetFile, g_nChannels, logger);
assert(nullptr != pAccuracyModule);
pDeviceWorker->addCustomerTask(pAccuracyModule);

Profiler

Each module, predefined or user-defined, can define its own profiler, which is called during DeepStream execution.

Code 4-12: Defining and setting module profilers

// define module profiler
class ModuleProfiler : public IModuleProfiler {…};
// setup module profiler
pConvertor->setProfiler(g_pConvertorProfiler);
pInferModule->setProfiler(g_pInferProfiler);
pAccuracyModule->setProfiler(g_pAccuracyProfiler);

Callback

Each module can have its own callback function for retrieving results.

Code 4-13: Module callback function

typedef void (*MODULE_CALLBACK)(void *pUserData,
std::vector<IStreamTensor *>& out);
virtual void setCallback(void *pUserData, MODULE_CALLBACK callback)
= 0;

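Running the Sample

Build the sample by running make in the /nvDecInfer_classification directory.
Before building, complete the following steps:

Set the correct Video SDK and display driver installation paths for your system by editing the VIDEOSDK_INSTALL_PATH and NVIDIA_DISPLAY_DRIVER_PATH variables in the Makefile.sample_classification file.
Install the dependencies listed in System Requirements.

To run the sample, the user needs to assemble a demo video. NVIDIA provides a script (generate_video.sh) in the samples/data/video directory that generates a video by downloading images from the ImageNet dataset and stitching them together. For the terms of use of the video and these images, see http://image-net.org/download-faq.

Run the sample as follows:

1. Install ffmpeg if it is not already installed.

sudo apt-get update
sudo apt-get install ffmpeg

2. Run the generate_video.sh script in the samples/data/video directory to generate a sample video (named sample_224x224.h264).

3. Run the run.sh script.

The important configuration options that the user can set in the script are shown below.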

-----------------------------------------------------------------------
-channels: The number of video channels
-fileList: The file path list, format: file1,file2,file3,…
-deployFile: The path to the deploy file
-modelFile: The path to the model file
-meanFile: The path to the mean file
-synsetFile: The synset file
-validationFile: The label file
-endlessLoop: If the value equals 1, the application reloads the video when it reaches the end (use Ctrl-C to end)
../bin/sample_classification -nChannels=${CHANNELS} \
-fileList=${FILE_LIST} \
-deployFile=${DEPLOY} \
-modelFile=${MODEL}  \
-meanFile=${MEAN}  \
-synsetFile=${SYNSET} \
-validationFile=${VAL} \
-endlessLoop=0
---------------------------------------------------------------------

Run Log

The result of running the sample is shown in the log below. Key points in the log are marked with "←-" annotations.

-----------------------------------------------------------------------
./run.sh
[DEBUG][12:03:58] Video channels: 2
[DEBUG][12:03:58] Endless Loop: 0
[DEBUG][12:03:58] Device name: TITAN X (Pascal)
[DEBUG][12:03:59] Use FP32 data type.
[DEBUG][12:04:01] =========== Network Parameters Begin ===========
[DEBUG][12:04:01] Network Input:
[DEBUG][12:04:01] >Batch :2
[DEBUG][12:04:01] >Channel :3
[DEBUG][12:04:01] >Height :224
[DEBUG][12:04:01] >Width :224
[DEBUG][12:04:01] Network Output [0]
[DEBUG][12:04:01] >Channel :1000
[DEBUG][12:04:01] >Height :1
[DEBUG][12:04:01] >Width :1
[DEBUG][12:04:01] Mean values = [103.907, 116.572,122.602]
[DEBUG][12:04:01] =========== Network Parameters End ===========
[DEBUG][12:04:01] =========== Video Parameters Begin =============
[DEBUG][12:04:01] Video codec : AVC/H.264
[DEBUG][12:04:01] Frame rate : 25/1 = 25 fps
[DEBUG][12:04:01] Sequence format : Progressive
[DEBUG][12:04:01] Coded frame size: [224, 224]
[DEBUG][12:04:01] Display area : [0, 0, 224, 224]
[DEBUG][12:04:01] Chroma format : YUV 420
[DEBUG][12:04:01] =========== Video Parameters End =============
[DEBUG][12:04:01] =========== Video Parameters Begin =============
[DEBUG][12:04:01] Video codec : AVC/H.264
[DEBUG][12:04:01] Frame rate : 25/1 = 25 fps
[DEBUG][12:04:01] Sequence format : Progressive
[DEBUG][12:04:01] Coded frame size: [224, 224]
[DEBUG][12:04:01] Display area : [0, 0, 224, 224]
[DEBUG][12:04:01] Chroma format : YUV 420
[DEBUG][12:04:01] =========== Video Parameters End =============
[DEBUG][12:04:05] Video[1] Decoding Performance: 31.03 frames/second || Total Frames: 100 ←- Decode performance for each channel
[DEBUG][12:04:05] Video[0] Decoding Performance: 31.04 frames/second || Total Frames: 100
[DEBUG][12:04:05] Analysis Pipeline Performance: 62.02 frames/second || Total Frames: 200 ←- Combined end to end decode+inference performance across all channels
[DEBUG][12:04:08] Video[0] Decoding Performance: 30.69 frames/second || Total Frames: 200
[DEBUG][12:04:08] Video[1] Decoding Performance: 30.69 frames/second || Total Frames: 200
[DEBUG][12:04:08] Analysis Pipeline Performance: 61.39 frames/second || Total Frames: 400
[DEBUG][12:04:11] Video[1] Decoding Performance: 30.13 frames/second || Total Frames: 300
[DEBUG][12:04:11] Video[0] Decoding Performance: 30.08 frames/second || Total Frames: 300
[DEBUG][12:04:11] Analysis Pipeline Performance: 60.15 frames/second || Total Frames: 600
-----------------------------------------------------------------------

NVDECINFER_DETECTION

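The nvDecInfer_detection sample demonstrates how to implement a detection use case on DeepStream with a ResNet-18 network. The network detects three classes of objects: cars, people, and two-wheelers. With TensorRT, the trained network is optimized to reduced INT8 precision and then deployed on an NVIDIA Tesla P4 GPU for higher efficiency. Note that the network is unmodified and is provided for illustration only, with no guarantee of accuracy or performance.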

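Building the Sample

Before building, complete the following steps:

Set the correct Video SDK and display driver installation paths for your system by editing the VIDEOSDK_INSTALL_PATH and NVIDIA_DISPLAY_DRIVER_PATH variables in the Makefile.sample_detection file.
Install the dependencies listed in System Requirements.

The detection sample can optionally draw bounding boxes around detected objects in a GUI. To support this, the user needs to install the following dependency packages: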

Mesa-dev packages

sudo apt-get install build-essential
sudo apt-get install libgl1-mesa-dev

libglu

sudo apt-get install libglu1-mesa-dev

freeglut

sudo apt-get install freeglut3-dev

openCV

sudo apt-get install libopencv-dev python-opencv

glew

Install from project webpage: http://glew.sourceforge.net/index.html

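After installing the dependencies, build the sample by running make in the /nvDecInfer_detection directory.

Running the Sample

The sample is executed by running the run.sh script and uses the sample_720p.h264 video in the samples/data/video directory as input. The sample video and the other input parameters can be configured in the run.sh script as needed.

By default, information about detected objects is written in KITTI format to a per-channel log file in the /logs directory. Only the object type and bounding box coordinate fields are populated in the log. GUI visualization of the results is disabled by default and can be enabled with the "-gui 1" option. Note that a window manager must be running to support this use case.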
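
For orientation, a KITTI label line has 15 space-separated fields: type, truncated, occluded, alpha, the 2D box (left, top, right, bottom), the 3D dimensions (height, width, length), the 3D location (x, y, z), and rotation_y. The sketch below formats one detection in that layout. kittiLine is a hypothetical helper, and writing zeros into the ten unpopulated fields is an assumption; the guide does not say which placeholder values the sample writes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format one detection as a KITTI-style label line. Only the type and the
// 2D box carry information; the other ten fields are written as zeros.
inline std::string kittiLine(const std::string &type, float left, float top,
                             float right, float bottom) {
    char buf[256];
    std::snprintf(buf, sizeof(buf),
                  "%s 0.00 0 0.00 %.2f %.2f %.2f %.2f "
                  "0.00 0.00 0.00 0.00 0.00 0.00 0.00",
                  type.c_str(), left, top, right, bottom);
    return std::string(buf);
}
```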

The important options that the user can configure in the script are shown below.

-----------------------------------------------------------------------
-channels: The number of video channels
-fileList: The file path list, format: file1,file2,file3,…
-deployFile: The path to the deploy file
-modelFile: The path to the model file
-meanFile: The path to the mean file
-synsetFile: The synset file
-validationFile: The label file
-gui: enable gui (outputs kitti logs by default)
-endlessLoop: If the value equals 1, the application reloads the video when it reaches the end (use Ctrl-C to end)
../bin/sample_detection  -nChannels=${CHANNELS} \
                         -fileList=${FILE_LIST} \
                         -deployFile=${DEPLOY} \
                         -modelFile=${MODEL}  \
                         -meanFile=${MEAN}  \
                         -synsetFile=${SYNSET} \
                         -validationFile=${VAL} \
                         -endlessLoop=0
---------------------------------------------------------------------