Relation Distillation Networks for Video Object Detection

A video object detection paper from ICCV 2019.

Paper link: https://arxiv.org/pdf/1908.09511v1.pdf

According to the paper, the code is written in PyTorch 1.0, but it has not been open-sourced yet...

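Main contribution: the RDN (Relation Distillation Networks) module. In essence, it uses two stages in which the proposals from the support frames progressively enhance the proposal features of the reference frame, so that the features absorb more of the relations between proposals. The paper calls the two stages the basic stage and the advanced stage, and the features coming out of the advanced stage are used for the final classification and regression. The paper also introduces a post-processing method, Box Linking with Relations, that further improves performance.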

Panel (b) in the figure below is RDN. The relation in it looks to me like the relation module covered in https://blog.csdn.net/breeze_blows/article/details/104677799. The description of panel (a) seems to correspond to the method in that same post, and if the relation module in (a) were replaced by the attention module designed in SELSA, it would pretty much become Sequence Level Semantics Aggregation for Video Object Detection (ICCV 2019), covered in https://blog.csdn.net/breeze_blows/article/details/104533004.

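The detailed RDN pipeline is shown in the figure below. Given the reference frame It, two frames are randomly sampled from [t-T, t+T] as support frames (T = 9 in the experiments). The three frames then go through Faster R-CNN's RoI pooling and the fc layer after it, which yields a set of proposal features (the ROIs in the figure). For choosing Rs and Rr, the paper says: "top-K object proposals from reference frame as the reference object set Rr and pack all the top-K object proposals from support frames into the supportive pool Rs", with K = 75. This seems slightly inconsistent with the figure: why is there also an arrow from It to Rs? The basic stage takes Rs and Rr and outputs Rr1, the proposal features after a first round of relation reasoning. Then r% (r = 20) of the proposals in Rs are selected to form the advanced supportive pool Rsa, and Rsa, Rs and Rr1 are fed into the advanced stage, which outputs Rr2 for the final classification and regression.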
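To make the data flow above easier to follow, here is a minimal NumPy sketch of the two stages. It is only an illustration under several assumptions, not the paper's implementation: `relation` is a simplified stand-in (single-head scaled dot-product attention with a residual connection, no learned projections and no geometry term), the function names are hypothetical, and the wiring inside the advanced stage (Rsa refined against Rs first, then Rr1 enhanced with the refined Rsa) is one plausible reading of "Rsa, Rs and Rr1 go into the advanced stage".

```python
import numpy as np

def relation(queries, keys):
    """Simplified stand-in for a relation module (assumption, not the paper's):
    scaled dot-product attention from queries to keys, added back as a residual."""
    d = queries.shape[1]
    logits = queries @ keys.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return queries + weights @ keys

def rdn_two_stage(R_r, R_s, scores_s, r=0.2):
    """R_r: (K, D) reference proposal features; R_s: (N, D) supportive pool;
    scores_s: (N,) objectness scores of the proposals in R_s."""
    # Basic stage: enhance reference proposals with the whole supportive pool.
    R_r1 = relation(R_r, R_s)
    # Pick the top r% of R_s by objectness to form the advanced pool R_sa.
    n_adv = max(1, int(round(r * len(R_s))))
    top = np.argsort(-scores_s)[:n_adv]
    # Advanced stage (assumed wiring): refine R_sa against R_s,
    # then enhance R_r1 with the refined R_sa.
    R_sa = relation(R_s[top], R_s)
    R_r2 = relation(R_r1, R_sa)
    return R_r1, R_r2

rng = np.random.default_rng(0)
K, D = 75, 16                        # K = 75 as in the paper; D is arbitrary here
R_r = rng.normal(size=(K, D))        # reference frame proposals
R_s = rng.normal(size=(2 * K, D))    # proposals from two support frames
scores_s = rng.random(2 * K)
R_r1, R_r2 = rdn_two_stage(R_r, R_s, scores_s)
print(R_r1.shape, R_r2.shape)
```

With two support frames and K = 75, Rs holds 150 proposals here, so r = 20% gives an Rsa of 30 proposals. Rr2 keeps the same shape as Rr and is what would be passed on to classification and regression.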

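The figure below shows the inference flow at test time. It is basically the same as training, except that 2T frames are used at this point. You can see that RDN's training and test flows are very similar to SELSA's.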
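The difference between the two settings comes down to how support frames are picked, which can be sketched in a few lines of plain Python. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the paper: the reference frame itself is excluded from the candidates, and the window is clipped at the video boundaries.

```python
import random

def sample_support_train(t, num_frames, T=9, n_support=2, rng=random):
    """Training: randomly pick n_support frames from [t-T, t+T].
    Assumes the reference frame t is excluded and the window is clipped."""
    lo, hi = max(0, t - T), min(num_frames - 1, t + T)
    candidates = [i for i in range(lo, hi + 1) if i != t]
    return sorted(rng.sample(candidates, n_support))

def support_window_test(t, num_frames, T=9):
    """Inference: use every neighbouring frame in [t-T, t+T],
    i.e. 2T frames when the window is not cut off by the video boundary."""
    lo, hi = max(0, t - T), min(num_frames - 1, t + T)
    return [i for i in range(lo, hi + 1) if i != t]

print(sample_support_train(50, 100))           # two random frames from 41..59
print(len(support_window_test(50, 100)))       # 18
print(support_window_test(3, 100, T=2))        # [1, 2, 4, 5]
```

Note that `rng.sample` raises a `ValueError` if a very short video leaves fewer candidates than `n_support`, so real code would need a fallback for that case.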

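Here is a results figure from the paper. These are the results without post-processing.

The paper also compares how different post-processing methods affect RDN. With their own Box Linking with Relations added, accuracy reaches 83.8 with a ResNet-101 backbone. That seems to be the best accuracy I have seen so far, though it may just be that I have not seen enough...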

Other thoughts:

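It feels like a lot of recent object detection papers follow this pattern: design an attention mechanism, find the relations between proposals or between objects in the video, and then fuse features to get better detection results. If SELSA also adopted RDN's multi-stage cascade in some way, its accuracy might rise from 80.2 to the 81.8 reported for RDN.
When RDN selects "r% supportive proposals in Rs with high objectness scores to form the advanced supportive pool Rsa", I wonder whether simply taking the top proposals by their current RPN scores is quite right. Other ways of choosing the top r% seem worth trying.
In the Ablation Study the paper experiments with the "Relation Module Number Nb in Basic Stage", and the results show that more is not always better. The paper's explanation is: "We speculate that this may be the result of unnecessary information repeat from support frames and that double proves the motivation of designing the advanced stage in RDN." That feels like a bit of a stretch to me...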
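To make the second point above concrete, here is a small NumPy sketch that contrasts the selection described in the paper (top r% of Rs by objectness score) with one purely hypothetical alternative: ranking support proposals by their best cosine similarity to any reference proposal. The alternative is only an illustration of "another way to pick the top r%". It is not from the paper, has not been evaluated, and both function names are made up for this sketch.

```python
import numpy as np

def select_by_objectness(scores, r=0.2):
    """Selection as described in the paper: keep the r% of support
    proposals with the highest objectness scores."""
    n = max(1, int(round(r * len(scores))))
    return np.argsort(-scores)[:n]

def select_by_similarity(R_s, R_r, r=0.2):
    """Hypothetical alternative (not from the paper): keep the r% of support
    proposals whose features are most similar to some reference proposal.
    Assumes no feature vector is all zeros."""
    n = max(1, int(round(r * len(R_s))))
    s = R_s / np.linalg.norm(R_s, axis=1, keepdims=True)
    q = R_r / np.linalg.norm(R_r, axis=1, keepdims=True)
    best = (s @ q.T).max(axis=1)   # best cosine similarity per support proposal
    return np.argsort(-best)[:n]

scores = np.array([0.1, 0.9, 0.5, 0.7, 0.3])
print(select_by_objectness(scores, r=0.4))     # [1 3]

R_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R_r = np.array([[1.0, 0.0]])
print(select_by_similarity(R_s, R_r, r=0.34))  # [0]
```

The objectness-based version depends only on the RPN scores, while the similarity-based one depends on the current reference frame, so it would select a different Rsa for each reference frame even with the same support frames.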
