Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Summary:

Dynamic human skeletons carry important information for action recognition. Conventional methods for skeleton modeling typically rely on hand-crafted features or manual traversal rules, which limits their expressive power and makes them hard to generalize.

The authors propose ST-GCN, a novel model of dynamic skeletons that automatically learns spatial and temporal patterns from data, giving it strong expressive power and generalization ability.

ST-GCN achieves substantial improvements over mainstream methods on two datasets, Kinetics and NTU-RGB+D.

Introduction:

(The introduction is an expanded version of the summary. It lays out the problem to be studied, surveys other solutions and points out the shortcomings of current methods, identifies the problems that predecessors left unsolved, and then introduces the authors' own approach.)

Action recognition can work from many modalities of the human body, such as appearance, depth, optical flow, and body skeletons. Appearance and optical flow are studied the most, while dynamic skeletons, which carry a large amount of information, have received less research attention. The authors propose a principled and efficient way of modeling dynamic skeletons.

State of research on dynamic human skeletons: early algorithms use the temporal information of skeleton sequences while ignoring the spatial information; most later algorithms rely on hand-designed rules to analyze the spatial patterns of the skeleton, so they are tied to a particular design and hard to generalize.

What is needed is a method that can automatically capture, within the model, both the spatial structure of the joints and their temporal dynamics.

GCNs have been applied to tasks such as image classification, document classification, and semi-supervised learning, but many of these tasks take a fixed graph as input. Applying GCNs to model dynamic graphs over large-scale datasets, such as human skeleton sequences, has not been studied.

The authors extend graph convolutional networks to a spatial-temporal graph model, designed as a generic representation of skeleton sequences for action recognition; the network is called the Spatial Temporal Graph Convolutional Network (ST-GCN).

The model is built on a graph constructed from the skeleton sequence, where each node corresponds to a joint of the human body. There are two types of edges: spatial edges, which connect the joints within each frame according to the natural connectivity of the human body, and temporal edges, which connect the same joint across two consecutive frames. Multiple layers of spatial-temporal graph convolution are built on top of this graph, so that information is integrated along both the spatial and the temporal dimension.

The hierarchical design of ST-GCN avoids hand-crafted traversal rules, which not only makes the model more expressive and better performing, but also makes it easy to generalize to different scenarios. Building on the generic formulation of GCNs and inspired by models on images, the authors design new strategies for the convolution kernel.

To sum up:

  1. Proposed ST-GCN, a generic graph-based framework for modeling dynamic human skeletons.
  2. Proposed several principles for designing convolution kernels in ST-GCN to meet the specific demands of skeleton modeling.
  3. Compared with previous methods based on hand-crafted parts and traversal rules, the proposed method achieves superior performance on two large-scale skeleton-based action recognition datasets, while greatly reducing manual design.

Related work:

There are two mainstream approaches to graph convolutional networks:

  1. The spectral perspective: graph convolution is viewed through spectral analysis of the graph.
  2. The spatial perspective: the convolution kernel is applied directly to the graph nodes and their neighbors.
  3. The authors follow the second approach, restricting each filter to apply only to the neighborhood of a node.

Skeleton-based action recognition:

  1. Hand-crafted feature based methods: features are designed by hand to capture the motion information of the joints, e.g., covariance matrices of joint trajectories.
  2. Deep learning based methods: recurrent neural networks that recognize actions end-to-end. (Among these approaches, many have emphasized the importance of modeling the joints within parts of human bodies. But these parts are usually explicitly assigned using domain knowledge.)
  3. This is the first application of graph convolutional networks to skeleton-based action recognition. Unlike previous methods, the graph convolutional network implicitly combines the spatial configuration of the joints with their temporal dynamics.

Spatial Temporal Graph Convolutional Network

Conventional skeleton-based recognition methods have shown that information about body parts is very effective for skeleton-based action recognition. The authors attribute the improvement mainly to body parts having more local features than the whole skeleton; a hierarchy of local skeleton-sequence representations is therefore desirable, which is what ST-GCN provides.

1. Pipeline

Given the skeleton sequence of an action video, first construct the spatial-temporal graph representing the skeleton sequence. The input to ST-GCN is the joint coordinate vectors on the graph nodes; a series of spatial-temporal graph convolutions then extracts higher-level features, and finally a SoftMax classifier yields the corresponding action category. The whole pipeline is trained end-to-end.

2. Construction of the skeleton graph

A skeleton sequence with \(N\) joints and \(T\) frames is denoted by the spatial-temporal graph \(G=(V, E)\), with node set \(V=\left\{v_{t i} \mid t=1, \ldots, T,\ i=1, \ldots, N\right\}\). The feature vector \(F\left(v_{t i}\right)\) of node \(i\) in frame \(t\) consists of the coordinate vector of the joint and its estimation confidence.

The graph structure consists of two parts (a construction sketch follows the list):

  • Spatial edges: within each frame, joints are connected according to the natural structure of the human body, \(E_{S}=\left\{v_{t i} v_{t j} \mid (i, j) \in H\right\}\), where \(H\) is the set of naturally connected human joints.
  • Temporal edges: the same joint in two consecutive frames is connected, \(E_{F}=\left\{v_{t i} v_{(t+1) i}\right\}\).
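A minimal sketch of this construction, assuming an illustrative subset of joint pairs in `H` (the real edge list depends on the pose estimator's joint layout):

```python
import numpy as np

N, T = 18, 300  # joints per frame, frames per sequence

# H: naturally connected joint pairs (an illustrative subset only)
H = [(0, 1), (1, 2), (2, 3), (1, 5), (5, 6)]

# Spatial edges: connections within each frame
spatial_edges = [((t, i), (t, j)) for t in range(T) for (i, j) in H]

# Temporal edges: the same joint across consecutive frames
temporal_edges = [((t, i), (t + 1, i)) for t in range(T - 1) for i in range(N)]

# Per-frame adjacency matrix, used by the convolution formulas later
A = np.zeros((N, N))
for i, j in H:
    A[i, j] = A[j, i] = 1
```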

3. Spatial graph convolution

The graph convolution discussed here operates on a single frame only.

In an ordinary two-dimensional convolution on images, for example, the convolution output at position \(\mathbf{x}\) can be written as
\[ f_{out}(\mathbf{x})=\sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}(\mathbf{p}(\mathbf{x}, h, w)) \cdot \mathbf{w}(h, w) \]
where \(f_{in}\) is the input feature map with \(c\) channels, the kernel size is \(K \times K\), the sampling function is \(\mathbf{p}(\mathbf{x}, h, w)=\mathbf{x}+\mathbf{p}^{\prime}(h, w)\), and the weight function \(\mathbf{w}\) also has \(c\) channels.

(1) Sampling function

On images, the sampling function \(\mathbf{p}(h, w)\) enumerates the neighboring pixels of the center pixel \(\mathbf{x}\). On the graph, the neighbor set of a node is defined as \(B\left(v_{t i}\right)=\left\{v_{t j} \mid d\left(v_{t j}, v_{t i}\right) \leq D\right\}\), where \(d\left(v_{t j}, v_{t i}\right)\) is the shortest-path distance from \(v_{t j}\) to \(v_{t i}\). The sampling function can thus be written as \(\mathbf{p}\left(v_{t i}, v_{t j}\right)=v_{t j}\).
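A small sketch of computing \(B(v_{ti})\) by breadth-first search over a per-frame adjacency list (names are my own; the paper uses \(D=1\)):

```python
from collections import deque

def neighbor_set(adj, i, D=1):
    """adj: dict mapping a node to its adjacent nodes; returns
    {j : d(v_j, v_i) <= D}, i.e. B(v_i) including the node itself."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == D:
            continue  # do not expand past distance D
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# e.g. a 4-node chain 0-1-2-3: the 1-hop neighborhood of node 1
print(neighbor_set({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, 1))  # {0, 1, 2}
```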

(2) Weight function

In 2D convolution, the neighboring pixels are regularly arranged around the center pixel, so the convolution can be performed with a kernel indexed by spatial order. By analogy, on the graph, the neighbor set produced by the sampling function is divided into different subsets, each with a numeric label; thus \(l_{t i}: B\left(v_{t i}\right) \rightarrow\{0, \ldots, K-1\}\) maps a neighbor node to its subset label, and the weight function is \(\mathbf{w}\left(v_{t i}, v_{t j}\right)=\mathbf{w}^{\prime}\left(l_{t i}\left(v_{t j}\right)\right)\).

(3) Spatial graph convolution
\[ f_{o u t}\left(v_{t i}\right)=\sum_{v_{t j} \in B\left(v_{t i}\right)} \frac{1}{Z_{t i}\left(v_{t j}\right)} f_{i n}\left(\mathbf{p}\left(v_{t i}, v_{t j}\right)\right) \cdot \mathbf{w}\left(v_{t i}, v_{t j}\right) \]
where the normalizing term \(Z_{t i}\left(v_{t j}\right)=\left|\left\{v_{t k} \mid l_{t i}\left(v_{t k}\right)=l_{t i}\left(v_{t j}\right)\right\}\right|\) is the cardinality of the corresponding subset. Substituting the sampling and weight functions into the formula above gives:
\[ f_{o u t}\left(v_{t i}\right)=\sum_{v_{t j} \in B\left(v_{t i}\right)} \frac{1}{Z_{t i}\left(v_{t j}\right)} f_{i n}\left(v_{t j}\right) \cdot \mathbf{w}\left(l_{t i}\left(v_{t j}\right)\right) \]
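A direct, unvectorized sketch of this normalized formula for one frame; `neighbors`, `labels`, and the shapes are my own conventions, not the authors' code:

```python
import numpy as np

def spatial_graph_conv(f_in, neighbors, labels, W):
    """f_in: (N, C_in) node features; neighbors[i]: list of j in B(v_i);
    labels[i][j]: subset label l_ti(v_tj); W: (K, C_in, C_out) weights."""
    N, _ = f_in.shape
    K = W.shape[0]
    f_out = np.zeros((N, W.shape[2]))
    for i in range(N):
        # Z_ti(v_tj): cardinality of the subset that v_tj falls into
        counts = {k: sum(1 for j in neighbors[i] if labels[i][j] == k)
                  for k in range(K)}
        for j in neighbors[i]:
            k = labels[i][j]
            f_out[i] += (f_in[j] @ W[k]) / counts[k]
    return f_out
```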
(4) Spatial-temporal model

Extending the model from the spatial domain to the temporal domain, the neighbor set becomes \(B\left(v_{t i}\right)=\left\{v_{q j} \mid d\left(v_{t j}, v_{t i}\right) \leq K,\ |q-t| \leq\lfloor\Gamma / 2\rfloor\right\}\), where \(\Gamma\) controls the temporal kernel size, and the label map becomes \(l_{S T}\left(v_{q j}\right)=l_{t i}\left(v_{t j}\right)+(q-t+\lfloor\Gamma / 2\rfloor) \times K\).
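In other words, each (frame offset, spatial label) pair indexes a distinct kernel weight. A one-line sketch:

```python
def st_label(spatial_label, q, t, Gamma, K):
    """l_ST(v_qj) = l_ti(v_tj) + (q - t + floor(Gamma/2)) * K."""
    return spatial_label + (q - t + Gamma // 2) * K

# e.g. with Gamma = 9 frames and K = 3 spatial subsets there are 27 labels;
# a neighbor with spatial label 2 in the center frame gets 2 + 4 * 3 = 14
assert st_label(2, q=5, t=5, Gamma=9, K=3) == 14
```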

4. Partition strategies

(1) Uni-labeling: all neighbor nodes form a single subset.

(2) Distance partitioning: the neighbor set is divided into two subsets, the node itself and the adjacent nodes.

(3) Spatial configuration partitioning: the neighbor set is divided into three subsets: the first contains the neighbor nodes farther from the gravity center of the whole skeleton than the root node, the second contains the neighbor nodes closer to the gravity center, and the third is the root node itself; they capture centrifugal motion, centripetal motion, and stationary features respectively (see the sketch below).
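A sketch of the spatial configuration partitioning for 1-hop neighborhoods, assuming the gravity center is the mean of the joint coordinates in a frame (names and the label numbering are mine):

```python
import numpy as np

def spatial_config_labels(A, coords):
    """A: (N, N) per-frame adjacency; coords: (N, 2) joint coordinates.
    Returns labels[i][j] in {0: root itself, 1: centripetal, 2: centrifugal}."""
    center = coords.mean(axis=0)                 # gravity center of the skeleton
    r = np.linalg.norm(coords - center, axis=1)  # each joint's distance to it
    labels = {}
    for i in range(len(A)):
        labels[i] = {i: 0}                       # the root node itself
        for j in np.nonzero(A[i])[0]:
            labels[i][j] = 1 if r[j] < r[i] else 2
    return labels
```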

5. Attention mechanism

During a movement, different body parts differ in importance. For example, leg movements may matter more than those of the neck: from the legs alone we can often tell running, walking, and jumping apart, while the motion of the neck may not contain much useful information.

Therefore, ST-GCN weights different body parts differently (each ST-GCN unit has its own weighting parameters that are learned during training).

6. ST-GCN implementation

The graph convolution formula used in the actual implementation is
\[ \text{aggregate}(X)=D^{-1} A X \]
Expanding it row by row:
\[ \begin{aligned} \text {aggregate}\left(X_{i}\right) &=\left(D^{-1} A X\right)_{i} \\ &=\sum_{k=1}^{N} D_{i k}^{-1} \sum_{j=1}^{N} A_{k j} X_{j} \\ &=\sum_{j=1}^{N} D_{i i}^{-1} A_{i j} X_{j} \\ &=\sum_{j=1}^{N} \frac{A_{i j}}{D_{i i}} X_{j} \\ &=\sum_{j=1}^{N} \frac{A_{i j}}{\sum_{k=1}^{N} A_{i k}} X_{j} \end{aligned} \]
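So each node's output is simply the mean of its neighbors' features. A tiny numeric check:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # node 0 connected to nodes 1 and 2
X = np.array([[1.0], [2.0], [4.0]])      # one feature per node

D_inv = np.diag(1.0 / A.sum(axis=1))     # D_ii = row sums of A
print(D_inv @ A @ X)                     # node 0: (2 + 4) / 2 = 3; nodes 1, 2: 1
```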
The formula in the paper (which I don't fully understand):
\[ \mathbf{f}_{o u t}=\mathbf{\Lambda}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}) \mathbf{\Lambda}^{-\frac{1}{2}} \mathbf{f}_{i n} \mathbf{W} \]
where \(\Lambda^{i i}=\sum_{j}\left(A^{i j}+I^{i j}\right)\).

When the second or third partition strategy is used, the adjacency matrix is decomposed into several matrices, \(\mathbf{A}+\mathbf{I}=\sum_{j} \mathbf{A}_{j}\), and the formula becomes
\[ \mathbf{f}_{o u t}=\sum_{j} \mathbf{\Lambda}_{j}^{-\frac{1}{2}} \mathbf{A}_{j} \mathbf{\Lambda}_{j}^{-\frac{1}{2}} \mathbf{f}_{i n} \mathbf{W}_{j} \]
where \(\Lambda_{j}^{i i}=\sum_{k}\left(A_{j}^{i k}\right)+\alpha\), with \(\alpha=0.001\) to avoid empty rows.

With the attention mechanism added, \(\mathbf{A}_{j}\) in the formula above is replaced by \(\mathbf{A}_{j} \otimes \mathbf{M}\), where \(\otimes\) denotes the element-wise product.
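A PyTorch sketch of this per-partition formula with the learnable mask \(\mathbf{M}\); the shapes and class layout are my assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class PartitionedGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, A_stack, alpha=0.001):
        super().__init__()
        K, N, _ = A_stack.shape                        # K partition matrices A_j
        self.register_buffer("A", A_stack)
        self.M = nn.Parameter(torch.ones(K, N, N))     # attention mask M
        self.W = nn.ModuleList([nn.Linear(in_ch, out_ch, bias=False)
                                for _ in range(K)])
        self.alpha = alpha

    def forward(self, x):                              # x: (batch, N, in_ch)
        out = 0
        for j, W in enumerate(self.W):
            Aj = self.A[j] * self.M[j]                 # A_j ⊗ M (element-wise)
            lam = Aj.sum(dim=1) + self.alpha           # Λ_j^{ii} = Σ_k A_j^{ik} + α
            norm = lam.rsqrt()
            Aj_hat = norm[:, None] * Aj * norm[None, :]  # Λ^{-1/2} A_j Λ^{-1/2}
            out = out + Aj_hat @ W(x)                  # sum over partitions
        return out
```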

Network architecture and training

The input data first go through batch normalization and then through 9 ST-GCN units, followed by global pooling, which yields a 256-dimensional feature vector for each sequence; finally a SoftMax function performs the classification and outputs the label. Each ST-GCN unit uses a ResNet-style structure: the first three units output 64 channels, the middle three 128 channels, and the last three 256 channels. After each ST-GCN unit, features are randomly dropped with probability 0.5, and the temporal convolution layers of the 4th and 7th units use stride 2. Training uses SGD with a learning rate of 0.01, decayed by a factor of 0.1 every 10 epochs.
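A sketch of that configuration (the unit itself is abstracted away; `cfg` and the head below are my reading of the described architecture, not the released code):

```python
import torch.nn as nn

# (in_channels, out_channels, temporal stride) for the 9 ST-GCN units;
# stride 2 at the 4th and 7th units halves the temporal resolution
cfg = [(64, 64, 1), (64, 64, 1), (64, 64, 1),
       (64, 128, 2), (128, 128, 1), (128, 128, 1),
       (128, 256, 2), (256, 256, 1), (256, 256, 1)]

# Classification head: global pooling over the (T, N) axes to a 256-d
# vector per sequence, then a linear layer scored by SoftMax
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                     nn.Linear(256, 400))  # 400 = number of Kinetics classes
```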

When training on the Kinetics dataset, two strategies are used in place of the dropout layers: 1. random moving: apply random affine transformations (fixed angle, translation, and scaling factors) to the skeleton sequences of all frames; 2. randomly sample fragments of the original skeleton sequence during training and use all frames during testing (this part is not fully clear to me).
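A sketch of the first strategy under my reading of the text: one random affine transform (rotation, translation, scale) applied to the 2D joint coordinates; the parameter ranges are assumptions:

```python
import numpy as np

def random_moving(seq, max_angle=0.3, max_shift=0.1, max_scale=0.2):
    """seq: (T, N, 2) joint coordinates; returns a transformed copy."""
    a = np.random.uniform(-max_angle, max_angle)           # rotation angle
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    s = 1.0 + np.random.uniform(-max_scale, max_scale)     # scaling factor
    t = np.random.uniform(-max_shift, max_shift, size=2)   # translation
    return seq @ (s * R).T + t
```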

Experiments

Datasets:

Kinetics human action dataset and NTU-RGB+D

Environment:

8 TITAN X GPUs and PyTorch

Kinetics:

300,000 video sequences, 400 action classes, each video lasting around 10 seconds, unconstrained

Data processing pipeline: resize (340×256) --> 30 fps --> OpenPose --> 2D coordinates of 18 joints + confidence --> (3, T, 18, 2)

where T = 300; 3 is the 2D coordinates plus the confidence, 18 is the number of joints, and 2 is the two people with the highest confidence.
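A one-line sketch of assembling this input tensor (the (C, T, V, M) axis order here is a common convention, an assumption on my part):

```python
import numpy as np

T = 300
x = np.zeros((3, T, 18, 2))  # (x, y, confidence) × frames × joints × 2 people
```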

Evaluation metrics: top-1 and top-5 accuracy

240,000 videos for training, 20,000 for validation

NTU-RGB+D

56,000 videos, 60 action classes, performed by 40 volunteers, constrained

25 joints, each represented by 3D coordinates; at most 2 subjects per clip

Evaluation protocols: cross-subject (40,320 training / 16,560 test) and cross-view (37,920 training / 18,960 test), top-1 accuracy

Ablation Study

Baseline TCN: "Interpretable 3D human action analysis with temporal convolutional networks" (equivalent to a fully-connected spatial-temporal graph network without shared parameters; its graph differs from ST-GCN's)

Local Convolution: (ST-GCN without shared parameters; the graph is the same as ST-GCN's)

Distance partitioning*: bind the weights of the two subsets in distance partitioning to differ only by a scaling factor of -1, i.e., w0 = -w1.

ST-GCN+Imp: ST-GCN with the attention (edge importance) mechanism

Comparison with the state of the art

Kinetics:

NTU-RGB+D:

Discussion

Table 4: videos of interactions with objects and the environment are removed, leaving only videos of human body movements for testing

Table 5: two-stream style action recognition, testing different input features

TSN: "Temporal segment networks: Towards good practices for deep action recognition"

Reference blogs:
https://blog.csdn.net/qq_36893052/article/details/79860328

https://www.zhihu.com/question/276101856/answer/385251705

A simple implementation of graph networks

Questions to keep in mind when reading the code: how the graph data are organized, the convolution kernel and its size, channels, padding, stride, the traversal rules, the weight values, and the forward/backward propagation rules.
