DIN: Dynamic spatiotemporal inference network for group behavior recognition

In this work, the authors propose a dynamic spatio-temporal inference network for group activity recognition in videos (Spatio-Temporal Dynamic Inference Network for Group Activity Recognition), which brings the idea of deformable convolution into reasoning on spatio-temporal graphs. By predicting a global interaction graph for each central person from within a local spatio-temporal interaction domain and then updating features on it, the method addresses the over-smoothing and heavy computation that earlier group behavior recognition methods could suffer from. Under the same experimental settings, the inference module needs less than 10% of the computation and parameters of previous models to achieve the best results on two widely used benchmark datasets.

Paper link: 2108.11743.pdf (arxiv.org)

General group behavior recognition framework:

1 Related background

Group behavior recognition (group activity recognition, GAR) is a sub-problem of human behavior recognition. A group behavior consists of the individual behaviors of the people involved and the interactions between them. The task is to infer the overall behavior of the group of people in a scene. GAR has many application scenarios, including surveillance video analysis, sports video analysis, and social scene understanding. The key issue in GAR is to combine spatiotemporal interaction factors to obtain a refined behavioral feature representation for a given video clip.

Recently proposed inference modules mainly combine spatiotemporal interaction factors to obtain refined activity representations. The most commonly used approaches are recurrent neural networks, attention mechanisms, and graph neural networks (GNNs). GNNs are frequently adopted in GAR: they perform message passing on a constructed semantic graph and obtain competitive results on public datasets. However, previous GNN-based methods model the interactions between individuals only on a predefined graph, which has the following shortcomings:

  1. The interaction pattern for a given person is predefined rather than derived from the visual spatiotemporal context of that person, and a single predefined graph is not suitable for updating every person's features;

  2. Predefined fully connected or cross-connected graphs easily lead to over-smoothing, which makes features indistinguishable and degrades performance.

Furthermore, such predefined graphs incur more computational overhead when scaled to long video clips or to scenes with many people.
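To make the scaling argument concrete, here is a rough sketch that counts how many feature pairs are touched by fully connected reasoning versus reasoning inside a fixed-size interaction domain of K features (K = 9 for a 3×3 domain). The function names and the values of T and N are illustrative assumptions, and the counts are pair counts, not the FLOPs reported in the paper.

```python
def fully_connected_pairs(T, N):
    """Feature pairs touched when every feature interacts with every feature."""
    nodes = T * N
    return nodes * nodes


def local_domain_pairs(T, N, K=9):
    """Feature pairs touched when each feature only looks at K features."""
    return T * N * K


# Illustrative clip lengths T and person counts N (assumed values).
for T, N in [(10, 12), (20, 12), (10, 24)]:
    full = fully_connected_pairs(T, N)
    local = local_domain_pairs(T, N)
    print(f"T={T:2d} N={N:2d}  fully connected: {full:6d}  local (K=9): {local:5d}")
```

Doubling T or N quadruples the fully connected count but only doubles the local one, which is the sense in which a fixed-size domain scales better.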

Figure 1 Visualization of three GNN-based reasoning schemes in the spatiotemporal domain for GAR. Green nodes are the features to be updated; purple nodes are the features involved in updating the green nodes. (a) Fully connected graph reasoning; (b) cross-connected (vertical and horizontal) graph reasoning; (c) the proposed person-specific dynamic graph reasoning, in which the graph of each green node is unique. The dashed box is an example of an initialized interaction domain.

To address these shortcomings, and inspired by "Deformable convolutional networks" and "Dynamic graph message passing networks", the paper proposes the Dynamic Inference Network (DIN), which consists of two modules: Dynamic Relation (DR) and Dynamic Walk (DW). The two modules are combined to predict person-specific interaction graphs that model interactions better, as shown in Figure 1(c). For a given person feature on the spatio-temporal graph, a spatio-temporal interaction domain is set around it as initialization and is shared by DR and DW. The size of this interaction domain does not grow with the spatial or temporal extent of the graph, which reduces computation.

Within this initialized interaction domain, DR predicts the relation matrix of the central feature (a specific person), which represents the interactions between people. Then, to model long-range temporal and spatial dependencies, DW predicts a dynamic walk offset for each feature within the domain. Dynamic walks allow a locally initialized interaction domain to update features over the global spatiotemporal graph. DR and DW are simple to implement and can be easily deployed on any widely used backbone network. The authors call the entire spatiotemporal reasoning framework DIN.

In addition, previous methods rarely report computational complexity, which is an important criterion for evaluating a module's design. The paper therefore includes a computational complexity analysis and shows that the proposed module achieves better results with lower computational overhead.

2 Network structure diagram

The basic framework of DIN proposed in the paper is shown in the figure below. The input of DIN is a short video clip, which is fed into the selected backbone network to extract visual features. For the backbone, the authors mainly experiment with ResNet-18 and VGG-16, then apply RoIAlign to extract person features aligned with the bounding boxes and embed them into a D-dimensional space. The authors first construct an initialized spatio-temporal graph whose connections are the spatio-temporal neighbors of each person feature (along the spatial dimension, people are sorted by their coordinates). On this initialized graph, dynamic relation and dynamic walk predictions are made within the defined interaction domain, yielding an interaction graph for each central feature (T × N interaction graphs in total). Each central feature is then updated on its own interaction graph. Finally, DIN obtains the feature representation of the video through global spatiotemporal pooling.

The DIN network can be logically divided into two parts: spatiotemporal feature extraction and the inference module. The first part produces a set of individual person features, one per person per frame, where T is the number of time steps (the temporal dimension) and N is the number of people annotated in each frame. The DR and DW modules proposed in the paper dynamically predict a specific interaction graph for each feature, and each feature is updated on its own graph.
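The initialized spatio-temporal graph described above can be pictured as a T × N grid in which each person feature looks at a small window of neighbors. The sketch below is only an illustration: sorting by the x-coordinate of the box center, clamping at the grid border, and the helper names `sort_people_by_x` and `interaction_domain` are assumptions, not details confirmed by the paper.

```python
def sort_people_by_x(boxes_per_frame):
    """For each frame, return person indices ordered by box-center x-coordinate.

    boxes_per_frame: list over T frames of lists of (x1, y1, x2, y2) boxes.
    Sorting by x is an assumption; the text only says "sorted by coordinates".
    """
    order = []
    for boxes in boxes_per_frame:
        centers = [(b[0] + b[2]) / 2.0 for b in boxes]
        order.append(sorted(range(len(boxes)), key=lambda n: centers[n]))
    return order


def interaction_domain(t, n, T, N, kt=3, kn=3):
    """Grid positions (time, slot) of the kt x kn domain centered on (t, n).

    Positions outside the T x N grid are clamped to the border, which is an
    assumption of this sketch (zero padding would be another option).
    """
    domain = []
    for dt in range(-(kt // 2), kt // 2 + 1):
        for dn in range(-(kn // 2), kn // 2 + 1):
            tt = min(max(t + dt, 0), T - 1)
            nn = min(max(n + dn, 0), N - 1)
            domain.append((tt, nn))
    return domain


boxes = [
    [(50, 0, 70, 40), (10, 0, 30, 40), (30, 0, 50, 40)],  # frame 0
    [(12, 0, 32, 40), (52, 0, 72, 40), (31, 0, 51, 40)],  # frame 1
]
print(sort_people_by_x(boxes))              # left-to-right person order per frame
print(interaction_domain(1, 1, T=4, N=5))   # the K = 9 positions around (1, 1)
```

Whatever the clip length or number of people, each feature's domain holds the same K positions, which is what keeps the later per-feature predictions cheap.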

3 Core highlights

Dynamic Relation

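Here, "dynamic" means that the relation matrix depends only on the features inside the initialized interaction domain. The features it depends on are not fixed but change dynamically.

For the i-th feature selected on the original spatio-temporal graph, we denote by uᵢ the stacked features within its interaction domain and by K the size of the interaction domain (for example, K = 9 if the interaction domain is 3×3). We rewrite the convolution in matrix form:

The expression above computes the relation matrix of the i-th feature. This algorithm differs considerably from ARG. It is probably a form of deformable convolution; I will read the related papers later to understand it. We then normalize this matrix with softmax:

Unlike earlier methods that update features on a global graph, we update features as follows, using only a single graph:

In form, the feature update expression above is essentially the same as the graph convolution layer in ARG; the main difference is the range, which is K in one case and T×N in the other.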
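As a rough illustration of the three steps just described (predict relation scores from the stacked features, normalize with softmax, update over the K features of the domain), here is a minimal sketch in plain Python. The shape of the hypothetical projection `W` (K rows of length K·D), the bias `b`, and the plain weighted sum without any further projection or residual are assumptions of this sketch, not the paper's exact formulation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_relation_update(u_i, W, b):
    """Update one central feature from the K features stacked in its domain.

    u_i: K feature vectors of length D (the interaction domain of feature i).
    W:   K rows of length K*D, a hypothetical projection predicting K scores.
    b:   K biases (also hypothetical).
    """
    flat = [v for feat in u_i for v in feat]            # stack the domain into one vector
    scores = [sum(w * x for w, x in zip(row, flat)) + bk
              for row, bk in zip(W, b)]                 # one relation score per neighbor
    r_i = softmax(scores)                               # relation weights sum to 1
    D = len(u_i[0])
    updated = [sum(r_i[k] * u_i[k][d] for k in range(len(u_i))) for d in range(D)]
    return r_i, updated


random.seed(0)
K, D = 9, 4
u_i = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
W = [[random.gauss(0, 0.1) for _ in range(K * D)] for _ in range(K)]
b = [0.0] * K
r_i, x_new = dynamic_relation_update(u_i, W, b)
print(len(r_i), len(x_new))
```

The sum runs over K entries rather than T×N, which is the difference from a global graph convolution noted above.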

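Dynamic Walk

Although DR can infer the relations among person features within the initialized interaction domain, it still follows a predefined message-passing route and cannot model long-range spatio-temporal interactions. We propose the DW module, which lets the features in the interaction domain perform dynamic walks on the main spatio-temporal graph, as shown in the lower branch of the figure above. With DW, we aim to model complex global spatio-temporal dependencies using an interaction domain of limited size. "Dynamic" in DW means that the interaction graph depends on the features in the initialized interaction domain and is no longer predefined.

To allow dynamic walks, we need to predict their spatio-temporal dynamic walk offsets. For the selected i-th person feature, the dynamic walk offsets of all features within its interaction domain are likewise computed by convolution:

Here, a linear projection matrix predicts the dynamic walk offsets from the stacked feature vectors of all features in the interaction domain.

Given the walk offsets, the dynamically walked features are computed as follows:

Here the base position is the coordinate of the k-th feature in the i-th interaction domain.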
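The walk step can be sketched as "base coordinate plus predicted offset, then read the feature at the new position". The sketch below assumes fractional offsets that are resolved by bilinear interpolation on the T × N grid, as in deformable convolution; that choice, the clamping at the border, and the helper names are assumptions rather than details taken from the paper. The offsets are hard-coded here, whereas in the method they are predicted from the stacked features of the domain.

```python
import math


def bilinear_sample(grid, t, n):
    """Sample a T x N grid of D-dim features at a fractional position (t, n).

    Positions are clamped to the grid; bilinear interpolation is an assumption
    of this sketch, borrowed from deformable convolution.
    """
    T, N = len(grid), len(grid[0])
    t = min(max(t, 0.0), T - 1.0)
    n = min(max(n, 0.0), N - 1.0)
    t0, n0 = int(math.floor(t)), int(math.floor(n))
    t1, n1 = min(t0 + 1, T - 1), min(n0 + 1, N - 1)
    wt, wn = t - t0, n - n0
    D = len(grid[0][0])
    return [
        (1 - wt) * (1 - wn) * grid[t0][n0][d]
        + (1 - wt) * wn * grid[t0][n1][d]
        + wt * (1 - wn) * grid[t1][n0][d]
        + wt * wn * grid[t1][n1][d]
        for d in range(D)
    ]


def dynamic_walk(grid, domain_positions, offsets):
    """Move each domain position by its offset and resample the feature there."""
    walked = []
    for (t, n), (dt, dn) in zip(domain_positions, offsets):
        walked.append(bilinear_sample(grid, t + dt, n + dn))
    return walked


# Toy 3 x 4 grid whose 1-dim feature equals 10*t + n, so samples are easy to read.
grid = [[[10.0 * t + n] for n in range(4)] for t in range(3)]
positions = [(1, 1), (1, 2)]
offsets = [(0.0, 0.0), (0.5, -1.5)]   # hard-coded stand-ins for predicted offsets
print(dynamic_walk(grid, positions, offsets))
```

A zero offset returns the original neighbor, so the fixed interaction domain is the special case in which no walk happens; non-zero offsets let the same K slots reach features anywhere on the grid.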

Combining DR and DW

Based on the walked features, the feature update formula is restated:

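4 Experimental results

1 Ablation study

The paper runs ablation experiments to demonstrate the effectiveness of the proposed method. MCA and MPCA denote classification accuracy and mean per-class accuracy, respectively.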
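The two metrics differ only in how they average: MCA averages over samples, MPCA averages the per-class accuracies. The small sketch below uses made-up labels and predictions purely for illustration; the function names are not from the paper.

```python
from collections import defaultdict


def mca(y_true, y_pred):
    """Classification accuracy: fraction of samples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mpca(y_true, y_pred):
    """Mean per-class accuracy: average of the accuracy within each true class."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


# Made-up, imbalanced example: 8 clips of one class, 2 of another.
y_true = ["spike"] * 8 + ["set"] * 2
y_pred = ["spike"] * 8 + ["spike", "set"]
print(mca(y_true, y_pred), mpca(y_true, y_pred))
```

On imbalanced data the two can diverge noticeably, since MPCA gives a rare class the same weight as a frequent one.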

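The experiments mainly show the method's effectiveness for GAR. DR and DW can be used together, each brings some degree of performance improvement, and the order in which they are applied makes little difference.

2 Comparison of performance, parameters, and computation with work published at top conferences and journals in recent years

The authors use a 3×3 initialized interaction domain. For a fair comparison, the paper sets the backbone of all compared methods to ResNet-18. The statistics of the backbone network and the feature embedding layer are also given: for 720×1280 images, 24.8M parameters and 674.6 GFLOPs; for 480×720 images, 24.8M parameters and 254.9 GFLOPs. The results show that the proposed module is more efficient.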

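3 Performance of three models as the interaction domain changes

ST factorised means that the interaction domain is factorized into a separate temporal part and a separate spatial part. The Lite model uses a lower-dimensional embedding space (the paper does not use the Lite model). Both variants reduce computational complexity, from different angles, under long spatio-temporal conditions, while still improving performance over previous models.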

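4 Comparison with state-of-the-art methods

The comparison is carried out mainly on two widely used benchmark datasets, the Volleyball dataset and the Collective Activity dataset. The proposed method achieves better performance.

The dynamic spatio-temporal inference network proposed in the paper predicts relations and global walks within an initialized interaction domain. It achieves SOTA results in group behavior recognition while significantly reducing the computational cost of the inference module. There is still much to explore in group behavior recognition, including more challenging datasets and methods that dynamically incorporate global context.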

Origin blog.csdn.net/Mr___WQ/article/details/129367997