Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

1 Introduction

This paper proposes a solution to the problem of human action recognition. A spatio-temporal graph is constructed from the human skeleton, and ST-GCN (Spatial Temporal GCN) is proposed to learn a representation of this graph and classify the action.

2 Problem definition

Given a video, determine the category of the human action it shows. Human action can be represented in many ways, such as optical flow or skeleton keypoints. This paper represents it with skeleton keypoints. Within a video, the keypoints in each frame form a graph (the spatial graph), and the same keypoint across frames forms another graph (the temporal graph). The goal of the paper is to classify human actions based on the spatial temporal graph composed of the skeleton keypoints in a video.

3 ST-GCN ideas


3.1 Skeleton Graph Construction

The spatial temporal graph is $G = (V, E)$, with node set $V = \{v_{ti} \mid t = 1, \dots, T,\ i = 1, \dots, N\}$. The nodes come from the keypoints of the human skeleton, where $N$ is the number of keypoints and $T$ is the number of video frames. Within each frame, the graph follows the natural connections of the human body; between frames, the same keypoint is connected across adjacent frames. There are therefore two types of edges: $E_S = \{v_{ti}v_{tj} \mid (i, j) \in H\}$ and $E_F = \{v_{ti}v_{(t+1)i}\}$, where $H$ is the set of naturally connected keypoint pairs in the human body.
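To make the construction concrete, here is a minimal sketch of how the node set and the two edge sets could be enumerated. The function name `build_st_graph`, the 0-based indexing, and the 4-joint toy skeleton are assumptions for illustration only, not the authors' code or a real dataset layout.

```python
from itertools import product


def build_st_graph(num_frames, num_joints, bones):
    """Enumerate nodes and edges of a spatial temporal skeleton graph.

    `bones` plays the role of H: a list of (i, j) joint pairs that are
    naturally connected in the body. Nodes are (t, i) tuples, 0-based.
    """
    nodes = list(product(range(num_frames), range(num_joints)))
    # E_S: edges inside one frame, following the natural body connections
    spatial_edges = [((t, i), (t, j))
                     for t in range(num_frames)
                     for (i, j) in bones]
    # E_F: edges linking the same joint in two consecutive frames
    temporal_edges = [((t, i), (t + 1, i))
                      for t in range(num_frames - 1)
                      for i in range(num_joints)]
    return nodes, spatial_edges, temporal_edges


# Hypothetical toy skeleton: a chain of 4 joints (0-1-2-3)
TOY_BONES = [(0, 1), (1, 2), (2, 3)]
nodes, e_s, e_f = build_st_graph(num_frames=3, num_joints=4, bones=TOY_BONES)
print(len(nodes), len(e_s), len(e_f))  # 12 9 8
```

With T frames, N joints and |H| bones, this yields T·N nodes, T·|H| spatial edges and (T−1)·N temporal edges.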

3.2 Spatial Graph Convolutional Neural Network

The authors define a spatial temporal convolution on top of the spatial temporal graph. The operation is very similar to the convolution in GCN but adds a time dimension: the neighborhood of each node includes not only its neighboring nodes in the current frame but also its connections to adjacent frames. The feature of each node in the graph is the position of the skeleton keypoint in the frame.
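The idea can be sketched in plain Python as a mean aggregation over a spatio-temporal neighborhood followed by a shared linear map. This is a simplified illustration, not the paper's exact operator: the neighborhood definition (a joint plus its spatial neighbors, within `window` frames), the mean normalization, the ReLU, and all names below are assumptions.

```python
def st_neighbors(t, i, num_frames, adjacency, window=1):
    """Nodes within `window` frames of t whose joint is i or a spatial neighbor of i."""
    joints = [i] + adjacency[i]
    frames = range(max(0, t - window), min(num_frames, t + window + 1))
    return [(s, j) for s in frames for j in joints]


def st_conv_layer(x, adjacency, weight, window=1):
    """One simplified spatial temporal graph convolution.

    x[t][i] is the feature list of joint i in frame t (length C_in);
    weight is a C_in x C_out matrix shared by all nodes.
    Returns a T x N x C_out nested list.
    """
    num_frames, num_joints = len(x), len(x[0])
    c_in, c_out = len(weight), len(weight[0])
    out = []
    for t in range(num_frames):
        row = []
        for i in range(num_joints):
            nbrs = st_neighbors(t, i, num_frames, adjacency, window)
            # Average the neighbor features (a simple normalization choice)
            mean = [sum(x[s][j][c] for s, j in nbrs) / len(nbrs)
                    for c in range(c_in)]
            # Shared linear map followed by ReLU
            row.append([max(0.0, sum(mean[c] * weight[c][k] for c in range(c_in)))
                        for k in range(c_out)])
        out.append(row)
    return out


# Toy input: 2 frames, 3 joints in a chain, features are 2-D positions
adjacency = [[1], [0, 2], [1]]
x = [[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
     [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = st_conv_layer(x, adjacency, identity, window=1)
print(out[0][1])  # [1.0, 0.5]
```

With the identity weight, the output for the middle joint in the first frame is just the average position of all six nodes in its neighborhood, which shows how information from both neighboring joints and the adjacent frame is mixed.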

After multiple layers of spatial temporal convolution, the features are fed into an MLP for action classification.
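The classification step might look like the following sketch. It assumes the per-node features are first averaged over all frames and joints into one vector, then passed through a one-hidden-layer MLP with a softmax output. The pooling choice, layer sizes, and weights here are illustrative assumptions, not values from the paper.

```python
import math


def global_average_pool(h):
    """Average T x N x C node features over all frames and joints."""
    num_nodes = len(h) * len(h[0])
    channels = len(h[0][0])
    return [sum(node[c] for frame in h for node in frame) / num_nodes
            for c in range(channels)]


def linear(v, weight, bias):
    """Compute v @ weight + bias for a vector v and a len(v) x len(bias) matrix."""
    return [sum(v[c] * weight[c][k] for c in range(len(v))) + bias[k]
            for k in range(len(bias))]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_classify(h, w1, b1, w2, b2):
    """Pool node features, apply one hidden ReLU layer, return class probabilities."""
    pooled = global_average_pool(h)
    hidden = [max(0.0, a) for a in linear(pooled, w1, b1)]
    return softmax(linear(hidden, w2, b2))


# Toy features: 2 frames, 2 joints, 2 channels; 3 hypothetical action classes
h = [[[1.0, 0.0], [3.0, 0.0]],
     [[1.0, 2.0], [3.0, 2.0]]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]], [0.0, 0.0, 0.0]
probs = mlp_classify(h, w1, b1, w2, b2)
print(probs.index(max(probs)))  # 0
```

Pooling over all nodes gives a fixed-size vector regardless of the number of frames, which is one simple way to let a single classifier handle videos of different lengths.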

4 Advantages and limitations of the method

4.1 Advantages

  • Applies GCN to action recognition for the first time
  • Considers not only the keypoints within a frame but also the keypoints across frames

4.2 Limitations

  • The attention mechanism mentioned in the paper has no clear interpretation
  • Only local features are considered


Origin blog.csdn.net/Miha_Singh/article/details/114377278