Paper: Action Genome: Actions as Composition of Spatio-temporal Scene Graphs

Action Genome is the first large-scale video database to provide both action labels and spatiotemporal scene graph labels.

Paper address: https://arxiv.org/pdf/1912.06992.pdf

GitHub address: https://github.com/JingweiJ/ActionGenome

Main contributions

  • Provides the Action Genome dataset, a representation that decomposes actions into a spatiotemporal scene graph that explains how objects and their relationships change as actions occur, as shown in the figure above;

  • Uses Action Genome to obtain new state-of-the-art results on action recognition and few-shot action recognition;

  • Proposes a new task, spatio-temporal scene graph prediction, building on existing scene graph models.

Related background

  1. Video understanding tasks, typified by action recognition, usually treat actions and activities as single events occurring in a video; that is, the whole video is analyzed as one action. Correspondingly, many datasets annotate each video with a single action label end to end, without explicitly decomposing the event over time into a series of interactions between objects.

  2. In the image domain, structured representations such as scene graphs have been shown to improve model performance on many tasks. In the video domain, however, decomposing events over time into objects and their relationships remains largely unexplored.

  3. Meanwhile, research in cognitive science supports the view that humans understand long videos by dividing them into segments, actively encoding ongoing activity into hierarchical part structures. Inspired by this, Action Genome provides a framework for studying the dynamics of actions as the relationships between people and objects change. It can improve action recognition, enable few-shot action recognition, and introduces the task of spatio-temporal scene graph prediction.

Main implementation

  1. In the video domain, Action Genome is proposed to decompose actions into spatio-temporal scene graphs, improving temporal understanding.

Taking "person sitting on a sofa" as an example, Action Genome annotates object and relation on its corresponding frame:

objects: person, sofa

relationships: person next to sofa, person in front of sofa, person sitting on sofa
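
To make this concrete, here is a minimal sketch of how one annotated frame could be represented in Python. The dictionary layout and field names are illustrative assumptions, not the dataset's actual annotation schema (see the GitHub repository for the real format).

```python
# A minimal, illustrative per-frame annotation for "person sitting on a sofa".
frame_annotation = {
    "video_id": "XXXXX",          # hypothetical Charades video id
    "frame_index": 42,            # sampled frame inside the action interval
    "objects": [
        {"category": "person", "bbox": [120, 60, 310, 470]},   # [x1, y1, x2, y2]
        {"category": "sofa",   "bbox": [80, 250, 520, 480]},
    ],
    "relationships": [
        ("person", "next to",     "sofa"),
        ("person", "in front of", "sofa"),
        ("person", "sitting on",  "sofa"),
    ],
}
```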

  2. A dataset containing scene graphs, Action Genome, is built on top of Charades: following the example above, videos are annotated with spatio-temporal scene graphs, specifically objects and relationships.

The final dataset contains:

157 action categories;

234K video frames;

476K object bounding boxes;

1.72M relationships.

  3. The benefit of spatio-temporal scene graphs for video understanding is demonstrated on three tasks:

action recognition;

few-shot action recognition;

spatio-temporal scene graph prediction.

Implementation

  1. Dataset

First, let’s briefly introduce the scene graph:

  • node: object (each object corresponds to a node in the graph)

  • edge: relationship (each relationship between objects corresponds to an edge between nodes in the graph)

Annotation and construction of the dataset:

  • The entire dataset is built on top of Charades;

  • The annotation method is based on the actions in the video.

  2. Modeling spatio-temporal visual concepts

As shown in the figure, for each action in the video (the differently colored segments), 5 frames are uniformly sampled within the action's time range and annotated. If a video contains 4 actions (actions may contain or overlap one another), a total of 4 × 5 = 20 video frames will be annotated.
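
A minimal sketch of this uniform sampling, assuming each action is given as a (start, end) interval in seconds together with a fixed frame rate; the interval values below are hypothetical.

```python
import numpy as np

def sample_frames(start_sec: float, end_sec: float, fps: float, n: int = 5) -> list[int]:
    """Uniformly sample n frame indices inside one action interval."""
    times = np.linspace(start_sec, end_sec, num=n)
    return [int(round(t * fps)) for t in times]

# Example: 4 (possibly overlapping) actions -> 4 * 5 = 20 annotated frames.
actions = [(0.0, 8.2), (3.5, 12.0), (10.1, 15.7), (14.0, 21.3)]  # hypothetical intervals
frames_to_annotate = [f for (s, e) in actions for f in sample_frames(s, e, fps=24.0)]
print(len(frames_to_annotate))  # 20
```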

  3. Annotators first draw bounding boxes for the objects involved in the action, and then label the relationships.

  4. There are three types of relationships in total (an illustrative sketch follows this list):

attention (whether the person is looking at the object)

spatial (relative spatial position)

contact (how the person interacts with the object)
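
An illustrative grouping of the three relationship types, using only a few example predicates; the dataset's full vocabulary of 25 relationship classes is not reproduced here.

```python
# Illustrative examples only, not the complete 25-class relationship vocabulary.
RELATIONSHIP_TYPES = {
    "attention": ["looking at", "not looking at"],       # is the person looking at the object?
    "spatial":   ["next to", "in front of", "behind"],   # relative spatial position
    "contact":   ["sitting on", "holding", "touching"],  # how the person interacts with the object
}
```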

  5. Final dataset statistics:

234,253 annotated frames;

35 object classes;

476,229 object bounding boxes;

25 relationship classes;

1,715,568 relationship instances.

Most objects participate fairly evenly in all three types of relationships.

  6. Unlike Visual Genome, where dataset bias provides a strong baseline for predicting relationships from the object categories alone, Action Genome has no such bias.

  7. Method

  8. A Scene Graph Feature Bank (SGFB) method is proposed to incorporate spatio-temporal scene graphs into action recognition. It uses a feature bank to represent information such as which object categories appear in the video, and even where those objects are located.

  9. The SGFB model consists of two components: the first generates spatio-temporal scene graphs; the second encodes the graphs to predict action labels.

  10. SGFB generates a spatio-temporal scene graph for every frame of a given video sequence. The predicted per-frame scene graphs are encoded into feature representations and combined, using a method similar to the long-term feature bank, into a spatio-temporal scene graph feature bank. This bank is finally merged with 3D CNN features to predict action labels.

  11. Detailed analysis, as shown in the figure below:

  • Following the two colored paths (blue vs. green), the final features come from two sources: the spatio-temporal scene graph and the 3D CNN.

  • Spatio-temporal scene graph branch: for each frame of the video, a scene graph is built via spatio-temporal scene graph prediction (Faster R-CNN for object detection, then RelDN for relationship detection); a method similar to that of the long-term feature bank is then used to obtain the graph's feature representation.

  • Specifically, the feature map shown in the figure has size |O| × |R|, where |O| is the number of object classes (person included) and |R| is the number of relationship classes; each entry equals the confidence of the corresponding object multiplied by the confidence of the corresponding relationship. For each frame, this map is flattened into the frame's feature, and the features of different frames are then aggregated to form the scene-graph branch's feature (see the sketch after this list).

  • The 3D CNN branch passes short clips of the video through a network dominated by 3D convolutions to obtain features, so that short-range and long-range information can be combined. The 3D CNN embeds the short-term information into S, providing context for F_SG.
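
Below is a minimal sketch of how such an |O| × |R| scene graph feature map and feature bank could be assembled, assuming per-frame object and relationship confidences are already available from the detector and RelDN; tensor shapes and the final fusion step are simplified placeholders rather than the paper's exact implementation.

```python
import torch

def scene_graph_feature(obj_conf: torch.Tensor, rel_conf: torch.Tensor) -> torch.Tensor:
    """Flattened |O| x |R| map: object confidence times relationship confidence.

    obj_conf: (|O|,) confidence of each object class (person included) in this frame.
    rel_conf: (|O|, |R|) confidence of each relationship class for each object class.
    """
    fmap = obj_conf.unsqueeze(1) * rel_conf   # (|O|, |R|)
    return fmap.flatten()                     # per-frame feature of length |O| * |R|

# Stack per-frame features into a scene graph feature bank, then fuse with 3D CNN features.
num_obj, num_rel, num_frames = 35, 25, 20     # class counts from the dataset statistics above
bank = torch.stack([
    scene_graph_feature(torch.rand(num_obj), torch.rand(num_obj, num_rel))  # random stand-ins
    for _ in range(num_frames)
])                                            # (T, |O| * |R|) feature bank
short_term = torch.rand(2048)                 # placeholder for the 3D CNN clip feature S
# Naive fusion by concatenation; the paper combines them with an LFB-style feature bank operator.
fused = torch.cat([bank.mean(dim=0), short_term])
```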

Experimental results

  1. Results on action recognition

  • On the Charades dataset, replacing the feature bank of LFB (long-term feature bank) with spatio-temporal scene graph features improves the state-of-the-art LFB by 1.8% mAP, and the scene graph features, though smaller in size, capture more information for recognizing actions.

  • Details: a scene graph is predicted on each frame and then used to build a spatio-temporal scene graph feature bank for action recognition. Faster R-CNN with a ResNet-101 backbone is used for region proposals and object detection, and RelDN is used to predict visual relationships. Scene graph prediction is trained on Action Genome, following the same train/validation video split as the Charades dataset. Action recognition uses the same feature extractor, hyperparameters, and solver schedule as LFB.

  • Oracle object detector: assuming ground-truth scene graphs are available, the scene graphs are built directly from the manually annotated ground truth; the spatio-temporal scene graph feature bank then encodes feature vectors directly from the ground-truth objects and annotated visual relationships. Plugging these feature banks into the SGFB model yields a 16% improvement in mAP.

  • As the figure shows, the performance gain when the video-based scene graph model uses spatio-temporal scene graph prediction demonstrates the potential of Action Genome and of compositional action understanding.

  2. Results on few-shot action recognition

  • Setup: 137 base classes and 20 novel classes.

  • First, the backbone feature extractor (R101-I3D-NL) is trained on all video examples of the base classes; it is shared by LFB, SGFB, and SGFB oracle. Then each model is trained with only k examples from each novel class, with k = 1, 5, 10 and 50 training epochs (a sketch of this sampling follows this list).

  • Results: as shown in the figure below, SGFB outperforms LFB.

  • This indicates that SGFB better captures the dynamics of actions involving objects and relationships.
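
A minimal sketch of the few-shot sampling described above, assuming the training videos are already grouped by action class; the variable names and data layout are illustrative.

```python
import random

def few_shot_subset(videos_by_class: dict[str, list[str]],
                    novel_classes: set[str], k: int) -> dict[str, list[str]]:
    """Keep only k randomly chosen training videos for each novel class."""
    subset = {}
    for cls in novel_classes:
        clips = videos_by_class[cls]
        subset[cls] = random.sample(clips, k=min(k, len(clips)))
    return subset

# The backbone is pretrained on all base-class videos (137 classes); each model is then
# trained with k = 1, 5, or 10 examples per novel class (20 classes) for 50 epochs.
```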

  3. Results on spatio-temporal scene graph prediction

  • Image-based scene graph prediction takes a single image as input, whereas spatio-temporal scene graph prediction takes a video and can exploit temporal information from neighboring frames to strengthen its predictions.

  • Three standard evaluation modes are used:

Scene graph detection (SGDet): takes an image as input and predicts bounding box locations, object categories, and predicate labels;

Scene graph classification (SGCls): given ground-truth boxes, predicts object categories and predicate labels;

Predicate classification (PredCls): given ground-truth boxes and object categories, predicts predicate labels.

  • The per-frame results are averaged over each video to give the final result on the test set, as shown in the figure below (see the sketch after this list):

(1) IMP outperforms many recently proposed methods, and RelDN outperforms IMP, indicating that modeling the similarity between object and relationship classes improves performance on this task;

(2) The performance gap between the PredCls and SGCls tasks is small, suggesting that these models are unable to accurately detect objects in video frames.

  • Takeaway: improving object detectors designed specifically for video should improve performance.

  • The models here were trained only on Action Genome without fine-tuning; adding fine-tuning is expected to bring further improvement.
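
A minimal sketch of the per-frame aggregation mentioned above, assuming predictions and ground truth are given as relationship triplets and using Recall@K as the per-frame metric; the function names and inputs are illustrative, not the paper's exact evaluation code.

```python
def recall_at_k(pred_triplets: list, gt_triplets: list, k: int = 20) -> float:
    """Fraction of ground-truth triplets recovered among the top-k predictions
    (predictions assumed sorted by confidence, triplets assumed hashable tuples)."""
    topk = set(pred_triplets[:k])
    if not gt_triplets:
        return 1.0
    return sum(t in topk for t in gt_triplets) / len(gt_triplets)

def video_score(per_frame_preds: list[list], per_frame_gts: list[list], k: int = 20) -> float:
    """Average the per-frame recall over all annotated frames of a video;
    averaging these scores over the test set gives the final number."""
    scores = [recall_at_k(p, g, k) for p, g in zip(per_frame_preds, per_frame_gts)]
    return sum(scores) / len(scores)
```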

Future directions

  1. Spatio-temporal action localization

Most spatio-temporal action localization methods focus on localizing the person performing the action and ignore the objects the person interacts with. Action Genome makes it possible to study the localization of both actors and objects, forming a more comprehensive, grounded action localization task.

  2. Explainable action models

Action Genome provides frame-level attention labels in the form of the objects that the person performing the action is either looking at or interacting with. These labels can be used to further train explainable models.

  3. Video generation from spatio-temporal scene graphs

Recent work has explored generating images from scene graphs. With a structured video representation, the hope is to study generating videos from spatio-temporal scene graphs.

Summary

Action Genome decomposes actions into spatio-temporal scene graphs that explain how objects and their relationships change as an action unfolds. Its usefulness is demonstrated by collecting a large dataset of spatio-temporal scene graphs and using it to improve the state of the art on action recognition and few-shot action recognition. Finally, results are benchmarked on the new task of spatio-temporal scene graph prediction, achieving some performance improvement.

The hope is that Action Genome will inspire a new research direction in decomposable and generalizable video understanding.


Source: blog.csdn.net/Mr___WQ/article/details/129056540