In my first year of graduate school I mainly worked on two things: the cash register project and a 6D pose estimation competition. Neither went well.
I came from a different major and had only taken a bit of Andrew Ng's online course after the graduate entrance exam, and on arriving I was assigned straight to the project.

Cash Register Project

Process Overview

First optimize with TensorRT, then port to edge devices.

TensorRT

To state the result first: I did not get it working, and a pruned model was used instead.

TensorRT requires the weight file to be converted to ONNX format, and the ONNX file is then converted to TRT format.

After taking over the TensorRT work, I ran the pruned YOLO model through the model-to-ONNX and ONNX-to-TRT programs provided by a previous student. The first step, model to ONNX, went smoothly, but the second step, ONNX to TRT, failed at the ONNX serialization stage.

Problems converting the model to ONNX

At first I thought the ONNX-to-TRT program was at fault. After looking online, however, I found that the ONNX-to-TRT programs for PyTorch models are nearly identical everywhere, so I began to suspect the ONNX file produced by the model-to-ONNX program instead.

To check this suspicion, I downloaded an ONNX file of a trained MNIST model from the internet, and it converted to TRT successfully.

My preliminary view was that the model-to-ONNX program was at fault. To confirm this, I tested with the official yolov3.onnx file, but it raised an error:

After searching, I found the cause was an excessive batch_size. The official batch_size is 64, as shown:

We usually use a batch_size of 1, and our machine cannot handle a batch_size that large, so I could not verify whether the official ONNX file converts to TRT.

At this point I changed approach and used Netron to visualize and compare the official ONNX file with the one we exported ourselves. The differences are as follows:
Final layers of our own ONNX (exported from the original cfg and best.pt, both unpruned, so the same resources as the official one):

Final layers of the official ONNX:

This pointed to the model-to-ONNX program as the problem. I went looking for information again, but found that PyTorch export code is written basically the same way everywhere, and other people get all the way through to the final conversion without trouble.
I then tried converting the model with the official TensorRT hand-written parser, but it raised an error (Python 2.7 version, copied directly, with only the cfg and weights paths changed), and I have not found a fix.

At this point I started to get anxious about how to solve the problem. The ideas I had so far:

Read through the official TensorRT parser for yolov3 (Python 2.7, nearly 800 lines) and improve it. (I do not know how much this differs from working directly with a framework; the time cost is high and the payoff may be poor.)
Use official TensorRT acceleration on the original YOLO directly. If the accelerated unpruned YOLO performs better than the pruned but unaccelerated YOLO, this may be good enough.
Convert with another framework such as TensorFlow. TensorFlow does not need the ONNX step and can use TensorRT for inference directly, but it may bring new problems of its own. (Most examples online accelerate PyTorch implementations of YOLO with TensorRT. That reflects well on PyTorch, but does it also suggest that other frameworks fare worse under TensorRT?)

To solve the problem, I found a new open-source project that modifies the official conversion sample (the hand-written parser) by adding code for batch_size 1 and for multiple input sizes. This solved last week's problem of being stuck with batch_size 64.

Its cfg and weights parser is the same as the official one, but this time I found out why reading the cfg file had failed: the pruned cfg file was missing the final \n.
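The missing-newline problem described above can be guarded against before the cfg is handed to the parser. The following is a minimal sketch, not code from either project; the function name `ensure_trailing_newline` is made up for illustration.

```python
from pathlib import Path


def ensure_trailing_newline(cfg_path):
    """Append a final newline to the file at cfg_path if it is missing.

    Returns True when the file was modified, False when it was left alone.
    (Hypothetical helper, not part of the TensorRT sample or Project 1.)
    """
    path = Path(cfg_path)
    data = path.read_bytes()
    # Leave empty files and files that already end with a newline untouched.
    if not data or data.endswith(b"\n"):
        return False
    path.write_bytes(data + b"\n")
    return True
```

Calling it on the pruned cfg before conversion would rewrite the file only when the last line lacks its newline.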

Project 1: https://github.com/Cw-zero/TensorRT_yolo3_module

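What this project brings: it solves the problem that the official code can only handle batch_size 64 and not batch_size 1.

With this project's code, both the original and the pruned cfg and weights can be converted end to end: cfg and weights -> ONNX -> TRT (this uses code derived from the official hand-written YOLO parser, not the PyTorch framework).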

Accelerating the pruned model:

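Next I tried this project's TensorRT inference. I did not measure speed, only checked whether it would run, and it ran smoothly.
I then ported the project's TensorRT inference code to the pruned YOLO and found that it drew no bounding boxes at all.
When I put the original YOLO's TRT file into the code used for the pruned model, accuracy was low but bounding boxes were drawn.
Printing the detections produced by the pruned TRT showed they were all 0, while those from the original TRT were not.
Going back to look for the cause, I visualized the pruned ONNX and found a few nodes disconnected from the rest of the graph. A closer look at the pruned cfg showed that it contains maxpool layers, which the original YOLO does not have,
so the official parser could not handle the maxpool layers and simply skipped them.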
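A quick scan of the cfg before conversion would have surfaced the maxpool layers early. The sketch below counts the section headers in a Darknet cfg and reports any type outside a known set. The set lists the layer types found in the standard yolov3.cfg; treating it as "what the parser supports" is an assumption for illustration, and the names `KNOWN_YOLOV3_LAYERS`, `count_layer_types`, `unexpected_layers`, and `SAMPLE_CFG` are hypothetical.

```python
import re
from collections import Counter

# Section types used by the standard yolov3.cfg. Using this as the list of
# layers the hand-written parser handles is an assumption.
KNOWN_YOLOV3_LAYERS = {"net", "convolutional", "shortcut", "route", "upsample", "yolo"}

SECTION_RE = re.compile(r"^\[(\w+)\]$")


def count_layer_types(cfg_text):
    """Count the [section] headers that appear in a Darknet cfg string."""
    counts = Counter()
    for line in cfg_text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def unexpected_layers(cfg_text, known=KNOWN_YOLOV3_LAYERS):
    """Return the sorted layer types in cfg_text that are not in `known`."""
    return sorted(set(count_layer_types(cfg_text)) - set(known))


# Tiny made-up cfg fragment, only to show the call.
SAMPLE_CFG = """[net]
width=416
height=416

[convolutional]
filters=32
size=3

[maxpool]
size=2
stride=2

[yolo]
classes=1
"""

print(unexpected_layers(SAMPLE_CFG))
```

For the fragment above the only reported type is `maxpool`, which is the kind of layer the parser silently skipped.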

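The root cause was that I did not understand the pruning algorithm. I had assumed it only removed some layers and merged a few others without adding any other kind of layer. At this point I started looking for a new approach.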

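Two ideas:
1. Export ONNX with the PyTorch framework, then convert to TRT with Project 1's code. (The reason for not doing the whole conversion through PyTorch is that TensorRT 5.0 does not support the upsample layer, and both the pruned and the original YOLO need it. The official framework-free code parses upsample layers well, whereas with PyTorch I would have to define the upsample parsing myself.) After several days of reading, I found that the earlier failure really was a PyTorch version problem: TensorRT 5.0 was released against PyTorch 0.4.1, and between 0.4.1 and 1.0 the ONNX bundled with PyTorch was upgraded, so TensorRT may not be able to read the newer ONNX.
The drawback: two CUDA versions would have to be installed. The currently installed version is 10.0, while PyTorch 0.4.1 requires CUDA 9.0.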

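2. Find a new pruned model that has the same structure as the original.
Project 2: https://github.com/Lam1360/YOLOv3-model-pruning
I found the model above: a pruned version of the original with fewer layers and no maxpool layers. It also happens to detect hands, and it is already trained and ready to use.
The drawback: it needs a Python 3.6 environment and PyTorch 1.0 or later.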

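I started with the first idea and tried installing PyTorch 0.4.1 and CUDA in a virtual machine on my own computer, but the CUDA installation failed. After some reading I found that CUDA cannot be installed in a virtual machine. There are ways to force it, but they are troublesome, and so is installing two CUDA versions side by side, so I decided to try the second idea first.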

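I put Project 2's cfg and pth files into Project 1's conversion code and got the ONNX and TRT files without trouble. Visualizing the ONNX file showed no strange layers and no nodes detached from the graph this time, which was a relief.
I then set up a Python 3.6 virtual environment, but hit a bug while installing PyTorch: the server's CUDA 10.0 symlink was gone, possibly broken earlier when I switched PyTorch versions.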

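This week I worked on TensorRT acceleration for Project 2. I tried several methods, and even after changing the shapes I could not get correct TensorRT inference; several days of effort led nowhere. I then looked at the paper the pruned model is based on and found that, compared with the original, many CNN channels have been removed. That may be why the TensorRT code I was using fails: it was written for the original YOLO, so perhaps it cannot handle the modified layers?
Stuck at this point, I decided to test Project 2's speed on the TX2 and found that it reaches up to 9 fps, so I did not continue with TensorRT after that.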

Edge Devices

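Most of my time on this project went into porting. I worked with four boards in total: Nano, TX1, TX2, and rk3399pro. The first three were fine, since they support PyTorch models (our code is written in PyTorch), but the fourth was a letdown, and it was the one the company wanted to use (cheap, has an NPU, slightly faster).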

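The company proposed the rk3399pro. My research showed that with its GPU and NPU running together it can reach about twice the speed of the Nano. I tried the new model on the Nano and got 4 fps, so if the 3399 really were twice as fast as the Nano it would be worth a try. In a test, the rk3399pro could run a TensorFlow YOLO at about 7 fps, so I started on the model conversion.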

The problem is that the fourth board supports only a limited set of frameworks (parts of TensorFlow, Caffe, and Darknet; not PyTorch).

My first idea was to convert the PyTorch model to TensorFlow directly with code from the internet. After some reading I found two ways:

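1. pytorch -> onnx -> keras -> tensorflow
Since I had used many ONNX files while working on TensorRT and still had them, I tried this method first.
Code used for this method: https://github.com/nerox8664/onnx2keras
The conversion hit a problem: only the layers could be converted, and it got stuck when converting the weights, so no Keras model was produced.
Error message: TypeError: unhashable type: 'google.protobuf.pyext._message.RepeatedScalarContainer'
I tried several protobuf versions without success. For now I think something may be wrong with the ONNX file, but I have not found a fix, so I moved on to the other method.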

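2. pytorch -> keras -> tensorflow
Code used for this method: https://github.com/nerox8664/pytorch2keras
After more reading, I found that this code does not support converting the yolo layer, so this method does not work either.
Ideas for solving the problem: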

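Reimplement the pruning and fine-tuning code in TensorFlow (we are trying this now, but I think it is unlikely to succeed at our current level)
Find TensorFlow YOLO pruning code online (this was the first thing I tried, but I found nothing; I will look more carefully next week)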

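I tried both an ONNX model and a Darknet model to get the pruned YOLO running on the rk3399pro. The problems:
ONNX model:
I fed the ONNX files from the TensorRT work into the ONNX-to-RKNN conversion code provided for the rk3399, but the converted model was unusable. I tried both the pruned and the original model; since the original fails too, pruning is not the cause. According to posts online, the rk3399 has full support only for TensorFlow and incomplete support for other models, which may be the reason.
Darknet framework:
The weights I had been using are a .pth file, while Darknet needs a .weights file. I converted them with the conversion function that ships with the earlier code:
import torch
# Darknet is the model class that comes with the earlier project code
model = Darknet('prune_yolov3-hand.cfg')
weights = 'prune_yolov3_ckpt.pth'
model.load_state_dict(torch.load(weights, map_location='cpu'))
model.save_darknet_weights(path='converted.weights')
But this also raised an error:
IndexError: index 1 is out of range
I have not yet found where it goes wrong. These conversions do not look very reliable.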

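I tried removing the unused small scales from YOLO to increase speed, and later found that whichever single scale is kept, both near and far hands are still detected.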
[Figure: hand detections with only the large scale kept and with only the small scale kept]

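These two hand images were detected with only the large scale kept and with only the small scale kept, respectively. In theory this should not happen, but in practice the three scales show almost no difference apart from the size of the largest box.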
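For reference, the three YOLOv3 scales differ in grid resolution: the standard network predicts at strides 32, 16, and 8 with 3 anchors per cell. The sketch below computes the grid size and the number of candidate boxes per scale. The 416 input size is an assumption (the post does not state the input size of the hand model), and the function names are hypothetical.

```python
def yolo_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Grid width (and height) of each detection scale for a square input."""
    return [input_size // stride for stride in strides]


def predictions_per_scale(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Number of candidate boxes each scale produces before thresholding."""
    return [
        anchors_per_scale * grid * grid
        for grid in yolo_grid_sizes(input_size, strides)
    ]


if __name__ == "__main__":
    for stride, grid, boxes in zip(
        (32, 16, 8), yolo_grid_sizes(), predictions_per_scale()
    ):
        print(f"stride {stride}: {grid}x{grid} grid, {boxes} boxes")
```

Under these assumptions the finest scale contributes most of the candidate boxes, which is also why dropping scales is a plausible way to save computation.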

I then started on the Darknet route, relying mainly on two sources:
ToyBrick community: http://t.rock-chips.com/forum.php?mod=viewthread&tid=184
YOLOv3 official site: https://pjreddie.com/darknet/train-cifar/

Process:

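At first I edited the cfg directly and used the original weights. The PyTorch YOLOv3 can load such files and run predictions, but in the Darknet framework this does not work.
Next, still starting from the original weights, I modified the cfg, extracted the parameters of the corresponding convolutional layers, and trained from them. I used only 3 images for 8 epochs with 3 classes, just to test whether the approach is feasible.
After training succeeded, I put the result into the Darknet pipeline on the rk3399. The RKNN inference ran without error and returned the outputs of the final convolutional layers.
I then adjusted the handling of the output dimensions after inference and tested with an image. The detection program ran to completion (though without any actual detections), which shows the Darknet modifications are basically sound.
Tested on video, it reaches 10 FPS.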
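The post does not say how the frame rate was measured. One simple way, sketched below under that caveat, is to time the per-frame processing function over a batch of frames and divide. `measure_fps` and its parameters are hypothetical names, and `process_frame` stands in for whatever runs inference plus post-processing on the board.

```python
import time


def measure_fps(process_frame, frames, warmup=0):
    """Return the average frames per second of process_frame over frames.

    The first `warmup` frames are processed but not timed, since the first
    inferences on an accelerator are often slower than steady state.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        process_frame(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("no frames left to time after warmup")
    start = time.perf_counter()
    for frame in timed:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval when the workload is trivially fast.
    return len(timed) / elapsed if elapsed > 0 else float("inf")
```

Whether video decoding and drawing are included in `process_frame` changes the number considerably, so it helps to state that alongside any FPS figure.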

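I first trained a single-scale yolov3, but the results were terrible and it detected nothing. I am now training a full single-class yolov3; if that works well, I will extract the first 82 layers (single scale) and try again.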
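Extracting the first layers of the network on the cfg side can be sketched as below. This only truncates the cfg text; the matching weights would have to be cut separately, which is not shown. The function names are hypothetical. Note that in the standard yolov3.cfg the first [yolo] section is, as far as I know, layer index 82 counting from 0, so keeping that detection head means keeping 83 layer sections rather than 82.

```python
def split_cfg_sections(cfg_text):
    """Split a Darknet cfg string into sections, each a list of lines."""
    sections = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            sections.append([stripped])
        elif sections and stripped and not stripped.startswith("#"):
            # Keep option lines; drop blank lines and comments.
            sections[-1].append(stripped)
    return sections


def keep_first_layers(cfg_text, num_layers):
    """Return a cfg with the [net] header plus the first num_layers layers."""
    sections = split_cfg_sections(cfg_text)
    if not sections or sections[0][0] != "[net]":
        raise ValueError("cfg must start with a [net] section")
    kept = sections[: 1 + num_layers]
    return "\n\n".join("\n".join(section) for section in kept) + "\n"
```

Any [route] or [shortcut] section that survives the cut still has to point at layers that exist, so the truncated cfg should be checked before training.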

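The full-scale yolov3 finished training, but it still detected nothing. At first I thought the training was poor, but when I put the weights on the server they predicted normally with decent accuracy.
So I went back to the official yolo_demo code, and it does not look right to me: the single-scale weights I trained produce nothing when run with it, and the object threshold is computed with a formula I had not seen before, unlike the yolov3 source code I had read:
obj_thresh = -np.log(1/OBJ_THRESH - 1)
I then tried the TensorFlow code, but I do not know TensorFlow well enough and modifying it was a struggle. Instead I tried rewriting the PyTorch parts of the server code in NumPy, but I am not very familiar with either, so progress is slow and time-consuming.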
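The threshold expression quoted above, -log(1/p - 1), is the inverse of the sigmoid function (the logit). Because the sigmoid is monotonic, comparing a raw network output against logit(p) gives the same result as applying the sigmoid first and comparing against p. Presumably the demo uses this to filter raw outputs without computing a sigmoid for every cell, though that reading of the demo is an assumption. The sketch below checks the equivalence with the standard library only; the threshold value 0.6 is made up.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid, written the same way as the demo's expression."""
    return -math.log(1.0 / p - 1.0)


OBJ_THRESH = 0.6  # hypothetical probability threshold
raw_thresh = logit(OBJ_THRESH)

# For any raw output x, both tests agree on whether it passes the threshold.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    assert (sigmoid(x) > OBJ_THRESH) == (x > raw_thresh)
```

If post-processing applies the sigmoid to the objectness first and then compares against this transformed threshold, the two steps no longer match, which is one thing worth checking when a ported model detects nothing.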

The PyTorch-to-NumPy rewrite is now finished, but the predictions are still poor.
