Temporal Context Enhanced Feature Aggregation for Video Object Detection

Paper link: https://www.aaai.org/Papers/AAAI/2020GB/AAAI-HeF.1752.pdf (an AAAI 2020 paper)

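Main contributions: according to the paper there are three. First, a Temporal context enhanced aggregation module (TCEA) aggregates spatio-temporal information across the frames of a video sequence. Second, a DeformAlign module aligns spatial information between frames; this looks modeled on the approach in STSN: Object Detection in Video with Spatiotemporal Sampling Networks (https://blog.csdn.net/breeze_blows/article/details/105323491). Third, a temporal stride predictor is trained to learn automatically which frames to aggregate, instead of the usual practice of picking support frames at a fixed stride around the reference frame.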

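The overall framework is shown in the figure below. The stride predictor predicts the stride s(t) used to pick support frames around the reference frame, and DeformAlign aligns the spatial information of the support features f_{t+s(t)} and f_{t-s(t)} with the reference feature f_t. Finally, f_t and the two features aligned by DeformAlign are passed to TCEA together for spatio-temporal aggregation, and the aggregated feature is used by the RPN and by the final object classification and regression.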
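The frame selection step can be sketched in a few lines of Python. This is only an illustration: `support_frame_indices` is a hypothetical helper of mine, and clamping to the video boundaries is my own assumption, since the paper's handling of the first and last frames is not covered in this note.

```python
def support_frame_indices(t, stride, num_frames):
    """Return the indices of the two support frames for reference frame t.

    The frames t - stride and t + stride are used; indices that fall
    outside the video are clamped to [0, num_frames - 1] (an assumption).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    last = num_frames - 1
    prev_idx = max(0, t - stride)
    next_idx = min(last, t + stride)
    return prev_idx, next_idx
```

For example, with a predicted stride of 9 on a 100-frame video, reference frame 50 would be paired with frames 41 and 59.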

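First, TCEA, shown in the figure below. You can clearly see traces of non-local and CBAM in it (https://blog.csdn.net/breeze_blows/article/details/104834567). Temporal aggregation is done first, then spatial aggregation, to find "when" and "where" the useful information is, and the result is aggregated into the feature of the reference frame. The flow is easiest to follow from the figure. In the spatial attention step, the features after max pool and avg pool are concatenated directly rather than summed (summation is what CBAM does in its channel attention branch; CBAM's own spatial attention also concatenates). Upsampling uses bilinear interpolation.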
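The pooling-and-concat step of the spatial attention can be sketched as follows. This is a minimal pure-Python illustration under my own assumptions: the feature is a nested list of shape C x H x W, `channel_pool` is a hypothetical name, and the convolution, sigmoid and upsampling that follow in the real module are left out.

```python
def channel_pool(feature):
    """Pool a C x H x W feature over the channel axis.

    Returns a 2 x H x W result: the channel-wise max map followed by the
    channel-wise average map, stacked (concatenated) rather than summed.
    """
    num_channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    max_map = [
        [max(feature[c][y][x] for c in range(num_channels)) for x in range(width)]
        for y in range(height)
    ]
    avg_map = [
        [sum(feature[c][y][x] for c in range(num_channels)) / num_channels
         for x in range(width)]
        for y in range(height)
    ]
    return [max_map, avg_map]
```

Keeping the two maps as separate channels lets the following convolution weigh them differently, which a plain sum would not allow.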

As for the DeformAlign module proposed in the paper, it feels very similar to the approach in STSN: Object Detection in Video with Spatiotemporal Sampling Networks (https://blog.csdn.net/breeze_blows/article/details/105323491): the feature obtained by fusing f_t and f_i supplies the offsets for a deformable convolution (Deformable Convolutional Networks, https://blog.csdn.net/breeze_blows/article/details/104998875), and the deformable conv then produces the final aligned feature.

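Judging from the formula, $\theta$ here should be the offset shown in the figure above, $\Delta p_n$ is the sampling offset of the deformable conv, and $w$ should be the convolution kernel.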
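For reference, the standard deformable convolution from the Deformable Convolutional Networks paper has the form below. The paper's own equation is not reproduced in this note, so the notation here ($x$ for the input feature, $y$ for the output, $\mathcal{R}$ for the regular sampling grid of the kernel) is the generic DCN notation and only assumed to match it.

$$
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \, x(p_0 + p_n + \Delta p_n)
$$

Read this way, $x$ would be the support-frame feature being aligned, and the offsets $\Delta p_n$ would be the ones predicted from the fused feature of f_t and f_i.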

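As for the Temporal Stride Predictor: earlier video object detection methods fix a stride when choosing support frames around the reference frame during training. For example, with a stride s0 and t as the current reference frame, the support frames are chosen from [t − s0, t, t + s0]. The paper argues that the stride should instead be chosen according to the video sequence. The Temporal Stride Predictor consists of: two convolutional layers with 3 × 3 kernel and 256 channels, a global pooling, a fully connected layer and a sigmoid function. It takes the reference frame and one support frame as input and outputs a deviation score ("The deviation score is formally defined as the motion IoU"); during training the ground truth should be the IoU between the gt boxes in the two frames. If score < 0.7, the object is considered fast-moving and the stride is set to 9; for score ∈ [0.7, 0.9] the stride is 24; for score > 0.9 the stride is 38. A larger score means a larger IoU and therefore slower motion, so the stride is larger. At test time the score is computed from the current frame and the frame ten frames before it to judge the object speed and pick the stride ("In runtime, at reference frame t, ft and ft−10 are fed to this network to predict the motion speed of frame t.").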
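To make the thresholds concrete, here is a minimal Python sketch. `box_iou` and `stride_from_score` are hypothetical names of my own, not from the paper. The IoU between two ground-truth boxes stands in for the motion IoU target as I read it above; in the real model the score at test time comes from the predictor network, not from boxes.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def stride_from_score(score):
    """Map a deviation score to a stride using the thresholds quoted above."""
    if score < 0.7:    # fast motion: stay close to the reference frame
        return 9
    if score <= 0.9:   # medium motion
        return 24
    return 38          # slow motion: frames far apart still overlap well
```

For example, a box that shifts by half its width between the two frames has an IoU of 1/3, which falls in the fast-motion bin and gives a stride of 9.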

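According to the paper, training has two stages. The first stage trains the DeformAlign module and TCEA: three frames are sampled from the dataset each time, with the stride fixed at 9. In the second stage everything except the temporal stride predictor is frozen, i.e. no parameters are updated in the backward pass, and two frames are sampled per iteration, with the range [5, 15] (presumably the frame interval between them).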
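The two sampling schemes could look roughly like this. It is a sketch under my own assumptions, not the paper's code: `sample_stage1` and `sample_stage2` are hypothetical names, frames are sampled uniformly within one video, and [5, 15] is taken to be the gap in frames between the two stage-2 inputs.

```python
import random


def sample_stage1(num_frames, stride=9, rng=random):
    """Stage 1: a reference frame plus two support frames at a fixed stride."""
    if num_frames <= 2 * stride:
        raise ValueError("video too short for the given stride")
    t = rng.randrange(stride, num_frames - stride)
    return t - stride, t, t + stride


def sample_stage2(num_frames, min_gap=5, max_gap=15, rng=random):
    """Stage 2: a (support, reference) pair whose gap lies in [min_gap, max_gap]."""
    if num_frames <= max_gap:
        raise ValueError("video too short for the given gap range")
    gap = rng.randint(min_gap, max_gap)   # both endpoints included
    t = rng.randrange(gap, num_frames)
    return t - gap, t
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.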

Finally, the paper runs ablation experiments to show the effect of each module; they suggest that the stride predictor contributes the most.

Comparison with other SOTA methods. Oddly, there is no comparison with methods from 2019.

The temporal post-processing (temp post-proc) method used by TCENet is Seq-NMS.
