Recognize Actions by Disentangling Components of Dynamics

Action recognition: notes on the paper Recognize Actions by Disentangling Components of Dynamics

Supplementary material for the paper: http://openaccess.thecvf.com/content_cvpr_2018/Supplemental/1067-supp.pdf

Abstract

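This paper proposes a new ConvNet architecture that extracts the components of dynamics entirely from raw video frames, with no optical flow estimation. The learned features have three parts: static appearance, apparent motion, and appearance changes. 3D pooling, cost volume processing, and warped feature differences are used to extract these three components respectively, so the network has three branches.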

Datasets used in the experiments: UCF101, Kinetics

Introduction

Key problem in action recognition: finding an efficient and effective way to capture the dynamics in videos.

We regard a video as a combination of short-term dynamics and long-term temporal structures. TSN is chosen as the tool for long-term video modeling. For the short-term representation, three features are extracted: static appearance (the overall scene appearance, which is stable over time), apparent motion (changes caused by camera or object movement), and appearance change (changes caused by other factors).

Overall model: follows BN-Inception.

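In detail: given a video, the network first extracts low-level features with a few convolutional layers. These features are then fed into three branches that extract the three kinds of features, and finally the modules are put together for end-to-end prediction. (How are they fused? The paper later says by averaging; in the supplementary material a convolutional layer is appended at the end of the three branches.)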

Key contribution: a unified network architecture for learning video representations.

Model: [figure: overall network architecture]

Method

Primary goal: develop an efficient and effective representation of short-term dynamics in videos.

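Overall structure:

1) Network input: an input sequence of frames. Given a short video clip, the network first produces 64-channel low-level feature maps for the individual frames, which serve as the basis for everything that follows.

2) These basic features are fed into the three branches.

3) The component-specific predictions are combined into the final prediction by averaging.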
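The averaging in step 3) can be pictured with a minimal sketch. It assumes each branch outputs a vector of class scores of the same length; the function name `fuse_by_average` and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def fuse_by_average(branch_scores):
    """Average per-branch class scores into one prediction.

    branch_scores: list of 1-D arrays, one per branch, all the same length
    (a hypothetical layout, not the paper's actual tensors).
    """
    stacked = np.stack(branch_scores, axis=0)   # (num_branches, num_classes)
    return stacked.mean(axis=0)                 # (num_classes,)

# Made-up scores for 3 classes from the three branches.
static_scores = np.array([0.2, 0.5, 0.3])
motion_scores = np.array([0.1, 0.7, 0.2])
change_scores = np.array([0.3, 0.3, 0.4])

fused = fuse_by_average([static_scores, motion_scores, change_scores])
predicted_class = int(np.argmax(fused))
```

Averaging keeps the branches independent up to the score level, so a single weak branch lowers the fused score but cannot veto the other two.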

The three branches:

1. The static appearance branch

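Earlier papers usually compute an appearance feature per frame, but such features are sensitive to nuances such as motion blur and sudden camera movement, so the problem is alleviated by selecting the highest response across neighboring frames. Approach: this branch introduces convolutional layers and a temporal pooling layer; the former extract visual patterns, and the latter stabilizes the features across neighboring frames. Structure: 2D convolution + 2D spatial pooling + 1D temporal pooling, where the combination of 2D spatial pooling and 1D temporal pooling makes up the 3D pooling. Input: 64-d features; output: 1024-d features.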

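Why use 2D rather than 3D convolution for spatial features? Because 1) 2D is enough to extract spatial features; 2) it has fewer parameters and therefore needs less training data; 3) parameters pretrained on image data can be reused. (In effect, 2D convolution extracts the spatial features, and 3D pooling picks the max at corresponding positions across frames.)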
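The stabilizing effect of temporal pooling can be illustrated with a minimal NumPy sketch. It assumes per-frame feature maps of shape (T, C, H, W) have already been produced by 2D convolutions, and takes the maximum over a sliding temporal window. The function name, the window size, the edge padding, and the toy values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def temporal_max_pool(feats, window=3):
    """Max over a sliding temporal window (stride 1, odd window size).

    feats: (T, C, H, W) per-frame feature maps.
    The first and last frames are repeated so every frame has a full window.
    """
    T = feats.shape[0]
    half = window // 2
    padded = np.concatenate(
        [np.repeat(feats[:1], half, axis=0),
         feats,
         np.repeat(feats[-1:], half, axis=0)],
        axis=0,
    )
    # For frame t, take the element-wise max over frames t-half .. t+half.
    return np.stack([padded[t:t + window].max(axis=0) for t in range(T)], axis=0)

# One feature location over 4 frames; frame 1 has a weak response,
# as might happen with motion blur (toy numbers).
feats = np.array([1.0, 0.0, 5.0, 2.0]).reshape(4, 1, 1, 1)
pooled = temporal_max_pool(feats, window=3)
```

In this toy input the weak response at frame 1 is replaced by the strong response of its neighbor, which is the sense in which the highest response across neighboring frames is selected.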

2. The apparent motion branch

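Apparent motion generally refers to the spatial displacement of feature points across video frames. Previous work usually represents apparent motion with dense optical flow, but its computation is expensive. So here we use cost volumes; this is the first time cost volumes are used directly to extract motion representations for action recognition. (It does not seem to work quite as well as TV-L1 optical flow, though.)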

How is the cost volume constructed?

1) Cost volumes are built on the low-level feature maps of consecutive frames.

2) Given a pair of feature maps, Ft and Ft+1, build a cost volume:

Window size: (2∆H + 1) × (2∆W + 1)

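Then compute the cosine similarity between the features of adjacent frames, which gives Ct(i, j, δi, δj): the similarity between Ft at position (i, j) and Ft+1 at position (i+δi, j+δj). (Why is this enough to represent motion?)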

3) On top of Ct, derive a lower-dimensional representation. (I did not quite understand why this is done.)

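Compute a displacement map Vt (shape: H × W × 2) to capture the motion from time t to t+1. Each position (i, j) has a 2-d vector v_{i,j} = (v^y_{i,j}, v^x_{i,j}) giving the displacement along the y and x axes respectively. The defining equation for Vt and the expression for its coefficients appeared as formula images and are not preserved in this text.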

4) Once Vt is obtained, it is fed as input into the subsequent convolutional layers.
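The steps above can be sketched in NumPy to show why a cost volume carries motion information: for each position, the offset with the highest similarity indicates where that feature went in the next frame. The cosine-similarity cost volume follows the description in these notes. The reduction to a displacement map is an assumption: the formulas for Vt are not reproduced here, so the sketch uses a softmax-weighted average of the candidate displacements (a soft-argmax), which is one common differentiable choice and is not claimed to be the paper's exact formula. The names `cost_volume`, `soft_displacement`, the `temperature` parameter, and the zero padding at the borders are all hypothetical.

```python
import numpy as np

def cost_volume(f_t, f_next, dh=1, dw=1, eps=1e-8):
    """Cosine-similarity cost volume between two (H, W, C) feature maps.

    Returns C_t of shape (H, W, 2*dh+1, 2*dw+1), where C_t[i, j, a, b]
    compares f_t[i, j] with f_next[i + a - dh, j + b - dw].
    Out-of-bounds positions are treated as zero vectors (an assumption).
    """
    H, W, _ = f_t.shape
    # L2-normalize so that a dot product equals the cosine similarity.
    a_n = f_t / (np.linalg.norm(f_t, axis=-1, keepdims=True) + eps)
    b_n = f_next / (np.linalg.norm(f_next, axis=-1, keepdims=True) + eps)
    b_pad = np.pad(b_n, ((dh, dh), (dw, dw), (0, 0)), mode="constant")
    cost = np.zeros((H, W, 2 * dh + 1, 2 * dw + 1))
    for a in range(2 * dh + 1):
        for b in range(2 * dw + 1):
            shifted = b_pad[a:a + H, b:b + W]   # f_next shifted by (a - dh, b - dw)
            cost[:, :, a, b] = (a_n * shifted).sum(axis=-1)
    return cost

def soft_displacement(cost, temperature=0.05):
    """Soft-argmax over the search window: (H, W, kh, kw) -> (H, W, 2) as (v_y, v_x).

    This reduction is an assumed stand-in, not the paper's stated formula.
    """
    H, W, kh, kw = cost.shape
    dh, dw = kh // 2, kw // 2
    logits = cost.reshape(H, W, -1) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = (weights / weights.sum(axis=-1, keepdims=True)).reshape(H, W, kh, kw)
    dy = np.arange(-dh, dh + 1).reshape(1, 1, kh, 1)
    dx = np.arange(-dw, dw + 1).reshape(1, 1, 1, kw)
    v_y = (weights * dy).sum(axis=(2, 3))
    v_x = (weights * dx).sum(axis=(2, 3))
    return np.stack([v_y, v_x], axis=-1)

# Toy example: a distinctive feature moves one pixel to the right.
f_t = np.zeros((4, 4, 2))
f_t[..., 0] = 1.0                 # background feature [1, 0]
f_next = f_t.copy()
f_t[1, 1] = [0.0, 1.0]            # object at (1, 1) in frame t
f_next[1, 2] = [0.0, 1.0]         # same object at (1, 2) in frame t+1

cv = cost_volume(f_t, f_next)
v = soft_displacement(cv)         # v[1, 1] is close to (0, 1)
```

The 4-D cost volume has (2∆H + 1) × (2∆W + 1) values per position, while the displacement map has only two, which is one way to read the "lower-dimensional representation" in step 3).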

3. The appearance change branch

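Not every change in a video can be explained by apparent motion; for example, changes in an object's appearance or variation in illumination also cause changes between frames. (The point here is that appearance change and apparent motion are different things.) Given a pair of feature maps from adjacent frames, Ft and Ft+1, compute a warped feature map F't+1 = W(Ft, Vt), where Vt is the estimated motion field from the apparent motion branch and the warp operation uses bilinear interpolation. Finally compute Ft+1 - F't+1, called the warped differences, as the representation of the appearance changes: a 1024-d feature.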
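A minimal NumPy sketch of the warped difference follows. The direction convention of the warp is an assumption: here the content of Ft is moved along Vt by sampling Ft at (i - v_y, j - v_x) with bilinear interpolation, and coordinates are clamped at the border. The function name `warp_bilinear` and the toy feature maps are hypothetical.

```python
import numpy as np

def warp_bilinear(feat, flow):
    """Move the content of feat along flow using bilinear sampling.

    feat: (H, W, C); flow: (H, W, 2) holding (v_y, v_x).
    Output at (i, j) is feat sampled at (i - v_y, j - v_x); this sign
    convention and the border clamping are assumptions.
    """
    H, W, _ = feat.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    y = np.clip(ii - flow[..., 0], 0, H - 1)
    x = np.clip(jj - flow[..., 1], 0, W - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (y - y0)[..., None]      # fractional parts act as interpolation weights
    wx = (x - x0)[..., None]
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

# Toy frames: an object moves one pixel right, and one location brightens.
f_t = np.zeros((3, 4, 1))
f_t[1, 1, 0] = 1.0                # object at (1, 1) in frame t
f_next = np.zeros((3, 4, 1))
f_next[1, 2, 0] = 1.0             # object at (1, 2) in frame t+1
f_next[0, 0, 0] = 0.5             # change not caused by motion

v_t = np.zeros((3, 4, 2))
v_t[..., 1] = 1.0                 # uniform motion of one pixel along x

warped = warp_bilinear(f_t, v_t)  # F'_{t+1} = W(F_t, V_t)
warped_diff = f_next - warped     # F_{t+1} - F'_{t+1}
```

In this toy case the moved object cancels out in the difference and only the brightened location remains, which is the part of the change that apparent motion does not explain.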

Experiment

Datasets: UCF101 & Kinetics

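Ablation studies: 1) first examine the performance of each of the three branches; 2) then examine whether learning the short-term dynamics representation jointly with long-term information improves performance; 3) examine whether combining the features of the three branches improves performance. All ablation experiments are run on split 1 of UCF101.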

...and many more groups of experiments.

Personal summary:

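1) This paper does not improve the network architecture itself; instead it explores several kinds of features as input, in particular using apparent motion and appearance changes in place of optical flow to represent object motion.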

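2) For RGB features, the authors argue that appearance features computed per frame may be sensitive to nuances such as motion blur and sudden camera movement, so 3D pooling is used to select the highest response across neighboring frames to alleviate this. (Kind of interesting.)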

One open question:

The warped feature part feels a lot like TSN... TSN also extracted so-called warped features, so it may be worth going back to reread TSN.

karen17
