Spatial-Temporal Graph Convolution (ST-GCN): Theory and Code Explained

1. Introduction

Skeleton-based action recognition aims to recognize the action being performed from a temporally continuous sequence of skeleton key points (2D or 3D). Because the input skeleton is naturally a graph, GCN-based methods have gradually become mainstream and have achieved good results.

Before studying ST-GCN, I went through some GCN tutorials and articles online. The ones I recommend are listed below for further reading:

An easy-to-follow introduction to GCN:
https://www.zhihu.com/question/54504471/answer/611222866
A more complete analysis of GCN:
https://zhuanlan.zhihu.com/p/90470499

Here is a brief summary of the basic GCN steps, assuming the graph input is $X$:

Extract features from the graph input with learnable parameters $\theta$, giving $X\theta$. Viewed node by node, this extracts features from each node of the graph separately, changing its feature dimension from input_dim to output_dim;
Build an adjacency matrix from the graph structure and normalize it (or symmetrically normalize it) to obtain $L$;
Aggregate the extracted features with the normalized adjacency matrix; the aggregated result is $L X \theta$.

These three steps make up the basic graph convolution operation. An implementation is as follows:

import torch
import torch.nn as nn
import torch.nn.init as init


class GraphConvolution(nn.Module):
    def __init__(self, input_dim, output_dim, use_bias=True):
        """Graph convolution: L * X * theta.

        Args:
            input_dim (int): dimension of the input node features.
            output_dim (int): dimension of the output features.
            use_bias (bool, optional): whether to add a bias term.
        """
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.use_bias = use_bias
        self.weight = nn.Parameter(torch.Tensor(input_dim, output_dim))
        if self.use_bias:
            self.bias = nn.Parameter(torch.Tensor(output_dim))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight)
        if self.use_bias:
            init.zeros_(self.bias)

    def forward(self, adjacency, input_feature):
        """The adjacency matrix is sparse, so sparse matrix multiplication is used.

        Args:
            adjacency (torch.sparse.FloatTensor): adjacency matrix.
            input_feature (torch.Tensor): input node features.
        """
        # Parameters already live on the module's device; move the whole
        # module with .to(device) rather than moving tensors inside forward().
        support = torch.mm(input_feature, self.weight)
        output = torch.sparse.mm(adjacency, support)
        if self.use_bias:
            output += self.bias
        return output
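
To make the three steps concrete, here is a minimal NumPy sketch on a tiny 3-node path graph. It is an illustration, not the ST-GCN code: the helper names `normalize_adjacency` and `graph_convolution` are hypothetical, and adding self-loops before the symmetric normalization is an assumption (a common choice, but not stated in the text above).

```python
import numpy as np


def normalize_adjacency(adjacency):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}.

    Adding the identity (self-loops) is an assumption of this sketch.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])
    degree = a_hat.sum(axis=1)              # always >= 1 thanks to self-loops
    d_inv_sqrt = np.diag(degree ** -0.5)
    return d_inv_sqrt @ a_hat @ d_inv_sqrt


def graph_convolution(adjacency, features, weight):
    support = features @ weight                 # step 1: X @ theta, per node
    norm_adj = normalize_adjacency(adjacency)   # step 2: normalized adjacency L
    return norm_adj @ support                   # step 3: aggregate, L @ X @ theta


# Path graph 0 - 1 - 2, four input features per node, two output features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.arange(12, dtype=float).reshape(3, 4)
theta = np.ones((4, 2))

out = graph_convolution(A, X, theta)
print(out.shape)  # (3, 2)
```

Each row of the output mixes the transformed features of a node with those of its neighbors, which is exactly the aggregation described in the third step.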

Back to the main topic: ST-GCN. The paper title and code link are as follows:

Paper title: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Code address: https://github.com/yysijie/st-gcn

Others have also written analyses of ST-GCN online; I found the following one quite good:

https://www.zhihu.com/question/276101856/answer/638672980

Next, we analyze ST-GCN by reading the paper alongside the code, covering (1) the data input and (2) the network structure.

2. Data input

2.1 Data Structure

The general input for skeleton-based action recognition methods is temporally continuous human skeleton keypoints, as shown in Figure 1 below.

Figure 1

These key points can be obtained by pose estimation with OpenPose, or annotated manually. The data generally has dimensions (N, C, T, V, M), where (quoting the article linked above):

N is the number of videos. A batch usually contains 256 videos (the value is arbitrary in practice, but preferably a power of 2).

C is the number of features per joint. A joint usually has 3 features, x, y and acc (4 for a three-dimensional skeleton), where x and y are the position coordinates of the joint and acc is the confidence score.

T is the number of key frames; a video generally has 150 frames.

V is the number of joints; usually 18 joints are labeled per person.

M is the number of people in a frame; generally the 2 people with the highest average confidence are selected.

The dimensions to pay attention to are C (features), T (time) and V (space).

2.2 Data preprocessing

In practice, the input data (N, C, T, V, M) is normalized before it enters the ST-GCN network.

The normalization runs along the time dimension: for each joint and each feature, the values across all T key frames (and across the batch) are normalized together. The implementation is as follows:

# data normalization
N, C, T, V, M = x.size()
x = x.permute(0, 4, 3, 1, 2).contiguous()
x = x.view(N * M, V * C, T)
x = self.data_bn(x)
x = x.view(N, M, V, C, T)
x = x.permute(0, 1, 3, 4, 2).contiguous()
x = x.view(N * M, C, T, V)

The data_bn layer is defined as follows, where A.size(1) is the number of joints V:

self.data_bn = nn.BatchNorm1d(in_channels * A.size(1))
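
The permute/view sequence is easy to get wrong, so the following NumPy sketch replays it on a small random array and normalizes by hand. It only mimics what a batch-norm layer computes in training mode before its learnable scale and shift; the tensor sizes, the `eps` value and all variable names are assumptions chosen for illustration.

```python
import numpy as np

N, C, T, V, M = 2, 3, 4, 5, 2
x = np.random.default_rng(0).normal(size=(N, C, T, V, M))

# (N, C, T, V, M) -> (N, M, V, C, T) -> (N*M, V*C, T)
flat = x.transpose(0, 4, 3, 1, 2).reshape(N * M, V * C, T)

# One mean and variance per (joint, feature) channel, computed over the
# batch axis and the time axis.
eps = 1e-5
mean = flat.mean(axis=(0, 2), keepdims=True)
var = flat.var(axis=(0, 2), keepdims=True)
normed = (flat - mean) / np.sqrt(var + eps)

# (N*M, V*C, T) -> (N, M, V, C, T) -> (N, M, C, T, V) -> (N*M, C, T, V)
out = (normed.reshape(N, M, V, C, T)
             .transpose(0, 1, 3, 4, 2)
             .reshape(N * M, C, T, V))
print(out.shape)  # (4, 3, 4, 5)
```

After the first reshape, channel `v * C + c` of sample `n * M + m` holds feature `c` of joint `v` for person `m` in video `n`, which is why the layer needs `C * V` channels.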

2.3 Graph partition strategy

In the ST-GCN paper, another major contribution is a graph partition strategy derived from analyzing body motion: multiple adjacency matrices are built, each reflecting a different kind of motion (stationary, centripetal and centrifugal). The paper describes three strategies:

Uni-labeling: the root node and all of its adjacent nodes share the same label, as shown in Figure 2(b).
Distance partitioning: the root node itself is labeled 0 and its adjacent nodes are labeled 1, as shown in Figure 2(c).
Spatial configuration partitioning: the strategy proposed in the paper. The root node itself gets label 0. The distance from the root node to the skeleton's center of gravity serves as the reference value: adjacent nodes that are closer to the center of gravity than the root are centripetal nodes (label=1), and those that are farther away are centrifugal nodes (label=2).

Figure 2

The specific code implementation is as follows:

A = []
for hop in valid_hop:
    # one matrix per subset, filled from the normalized adjacency matrix
    a_root = np.zeros((self.num_node, self.num_node))
    a_close = np.zeros((self.num_node, self.num_node))
    a_further = np.zeros((self.num_node, self.num_node))
    for i in range(self.num_node):
        for j in range(self.num_node):
            # only consider pairs (j, i) that are exactly `hop` apart
            if self.hop_dis[j, i] == hop:
                # compare the distances of j and i to the center node
                if self.hop_dis[j, self.center] == self.hop_dis[i, self.center]:
                    a_root[j, i] = normalize_adjacency[j, i]
                elif self.hop_dis[j, self.center] > self.hop_dis[i, self.center]:
                    a_close[j, i] = normalize_adjacency[j, i]
                else:
                    a_further[j, i] = normalize_adjacency[j, i]
    if hop == 0:
        A.append(a_root)
    else:
        A.append(a_root + a_close)
        A.append(a_further)
# stack the subsets into an array of shape (K, V, V)
A = np.stack(A)

It is worth noting that hop plays a role similar to the kernel size in a CNN. hop=0 selects the root node itself, and hop=1 selects the nodes at distance 1 from the root node, i.e. the adjacent nodes inside the red dashed box in Figure 2(a).

To make the code easier to follow, it helps to think of i as the root node in the two loops above. The condition if self.hop_dis[j, i] == hop then restricts j to the root node itself (hop=0) or to its adjacent nodes (hop=1).
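
The partition loop can be tried out on a toy skeleton. The sketch below is self-contained and makes several assumptions that are not taken from the repository: the 5-joint graph and its center are made up, the helper names `hop_distances` and `spatial_partition` are hypothetical, hop distances are full shortest-path lengths computed by breadth-first search, and the adjacency matrix is normalized column by column.

```python
from collections import deque

import numpy as np


def hop_distances(num_node, edges):
    """All-pairs shortest hop distances via BFS (assumption of this sketch)."""
    neighbors = [[] for _ in range(num_node)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = np.full((num_node, num_node), np.inf)
    for start in range(num_node):
        dist[start, start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in neighbors[u]:
                if np.isinf(dist[start, w]):
                    dist[start, w] = dist[start, u] + 1
                    queue.append(w)
    return dist


def spatial_partition(hop_dis, center, max_hop=1):
    num_node = hop_dis.shape[0]
    adjacency = (hop_dis <= max_hop).astype(float)   # includes self-connections
    norm_adj = adjacency / adjacency.sum(axis=0, keepdims=True)  # columns sum to 1
    subsets = []
    for hop in range(max_hop + 1):
        a_root = np.zeros((num_node, num_node))
        a_close = np.zeros((num_node, num_node))
        a_further = np.zeros((num_node, num_node))
        for i in range(num_node):
            for j in range(num_node):
                if hop_dis[j, i] != hop:
                    continue
                if hop_dis[j, center] == hop_dis[i, center]:
                    a_root[j, i] = norm_adj[j, i]
                elif hop_dis[j, center] > hop_dis[i, center]:
                    a_close[j, i] = norm_adj[j, i]
                else:
                    a_further[j, i] = norm_adj[j, i]
        if hop == 0:
            subsets.append(a_root)
        else:
            subsets.append(a_root + a_close)
            subsets.append(a_further)
    return np.stack(subsets), norm_adj


# Toy skeleton: joint 0 is the center, with two limbs 0-1-2 and 0-3-4.
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
hop_dis = hop_distances(5, edges)
A, norm_adj = spatial_partition(hop_dis, center=0)
print(A.shape)  # (3, 5, 5)
```

Every nonzero entry of the normalized adjacency matrix lands in exactly one subset, so summing the stacked matrices over the first axis recovers the normalized adjacency matrix.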

3. Network structure

Skeleton input data has both temporal and spatial structure, and both are critical for action recognition. ST-GCN therefore needs to extract features along the spatial and temporal dimensions; in GCN terms, it aggregates information along both dimensions, as shown in the figure below.

Figure 3

The detailed structure of an ST-GCN block is shown in the figure below.

Figure 4

It can be divided into the following steps:

Step 1: Introduce a learnable weight matrix of the same size as the adjacency matrix and multiply the two element-wise. This matrix is called the "learnable edge importance weight"; it gives larger weight to important edges (nodes) in the adjacency matrix and suppresses unimportant ones.
Step 2: Feed the weighted adjacency matrix and the input into the GCN, which aggregates information along the spatial dimension. In parallel, a residual branch (a 1x1 convolution followed by BN) computes Res from the block input.
Step 3: Use the TCN (in fact an ordinary CNN whose kernel size is greater than 1 along the time dimension) to aggregate information along the time dimension. In the code below, Res is added element-wise to the TCN output.

The code implementation of the above ST-GCN module is as follows:

def forward(self, x, A):

    res = self.residual(x)
    x, A = self.gcn(x, A)
    x = self.tcn(x) + res

    return self.relu(x), A

The residual structure self.residual is defined as follows:

self.residual = nn.Sequential(
    nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size=1,
        stride=(stride, 1)),
    nn.BatchNorm2d(out_channels),
)

GCN is defined as follows:

self.conv = nn.Conv2d(
        in_channels,
        out_channels * kernel_size,
        kernel_size=(t_kernel_size, 1),
        padding=(t_padding, 0),
        stride=(t_stride, 1),
        dilation=(t_dilation, 1),
        bias=bias)

def forward(self, x, A):
    assert A.size(0) == self.kernel_size

    x = self.conv(x)

    n, kc, t, v = x.size()
    x = x.view(n, self.kernel_size, kc//self.kernel_size, t, v)
    x = torch.einsum('nkctv,kvw->nctw', (x, A))

    return x.contiguous(), A
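
The einsum string in the forward pass does two things at once: it multiplies the features of each partition subset by that subset's adjacency matrix, and it sums the results over the subsets. A small NumPy check of that reading, with made-up sizes and random values standing in for the real tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, c, t, v = 2, 3, 4, 5, 6
x = rng.normal(size=(n, k, c, t, v))   # conv output split into k groups
A = rng.normal(size=(k, v, v))         # one (V, V) matrix per subset

out = np.einsum('nkctv,kvw->nctw', x, A)

# The same computation written out: one matrix product per subset,
# then a sum over the k subsets.
expected = np.zeros((n, c, t, v))
for idx in range(k):
    expected += x[:, idx] @ A[idx]     # (n, c, t, v) @ (v, w) -> (n, c, t, w)

print(out.shape, np.allclose(out, expected))  # (2, 4, 5, 6) True
```

This is why the convolution produces out_channels * kernel_size channels: each subset gets its own group of output channels before the sum.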

TCN is defined as follows:

self.tcn = nn.Sequential(
    nn.BatchNorm2d(out_channels),
    nn.ReLU(inplace=True),
    nn.Conv2d(
        out_channels,
        out_channels,
        (kernel_size[0], 1),
        (stride, 1),
        padding,
    ),
    nn.BatchNorm2d(out_channels),
    nn.Dropout(dropout, inplace=True),
)

The paper stacks ST-GCN blocks one after another to extract progressively higher-level semantic features from the graph-structured input, as follows:

self.st_gcn_networks = nn.ModuleList((
    st_gcn(in_channels, 64, kernel_size, 1, residual=False, **kwargs0),
    st_gcn(64, 64, kernel_size, 1, **kwargs),
    st_gcn(64, 64, kernel_size, 1, **kwargs),
    st_gcn(64, 64, kernel_size, 1, **kwargs),
    st_gcn(64, 128, kernel_size, 2, **kwargs),
    st_gcn(128, 128, kernel_size, 1, **kwargs),
    st_gcn(128, 128, kernel_size, 1, **kwargs),
    st_gcn(128, 256, kernel_size, 2, **kwargs),
    st_gcn(256, 256, kernel_size, 1, **kwargs),
    st_gcn(256, 256, kernel_size, 1, **kwargs),
))

# initialize parameters for edge importance weighting
if edge_importance_weighting:
    self.edge_importance = nn.ParameterList([
        nn.Parameter(torch.ones(self.A.size()))
        for i in self.st_gcn_networks
    ])
else:
    self.edge_importance = [1] * len(self.st_gcn_networks)

# apply the stacked ST-GCN blocks in turn, each with its own learnable weight matrix
for gcn, importance in zip(self.st_gcn_networks, self.edge_importance):
    x, _ = gcn(x, self.A * importance)
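
To get a feel for how the stack changes the tensor, the sketch below traces the channel count and the temporal length through the ten blocks. The temporal kernel size of 9 and the padding of (kernel - 1) // 2 are assumptions made for illustration, as is the helper name `conv_out_length`; T = 150 is the frame count mentioned in section 2.1.

```python
def conv_out_length(length, kernel, stride, padding):
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


# (out_channels, temporal stride) of the ten blocks listed above
blocks = [(64, 1), (64, 1), (64, 1), (64, 1), (128, 2),
          (128, 1), (128, 1), (256, 2), (256, 1), (256, 1)]

temporal_kernel = 9                     # assumption
padding = (temporal_kernel - 1) // 2    # assumption: keeps T when stride is 1
T = 150                                 # frames per clip, as in section 2.1

shapes = []
for channels, stride in blocks:
    T = conv_out_length(T, temporal_kernel, stride, padding)
    shapes.append((channels, T))

print(shapes[-1])  # (256, 38)
```

Under these assumptions the two stride-2 blocks shrink the time axis from 150 to 75 and then to 38, while the number of joints V is unchanged because the stride along that axis is always 1.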

After that, as in ordinary classification tasks, the author adds global average pooling followed by a fully convolutional prediction layer, as follows:

# global pooling
x = F.avg_pool2d(x, x.size()[2:])
x = x.view(N, M, -1, 1, 1).mean(dim=1)

# prediction
x = self.fcn(x)
x = x.view(x.size(0), -1)
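
The shape bookkeeping of the pooling and prediction head can be replayed in NumPy. This sketch assumes that self.fcn is a 1x1 convolution, which on a 1x1 feature map reduces to a matrix multiply over channels; the sizes, the random weights and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, C, T, V, num_class = 2, 2, 256, 38, 18, 10
x = rng.normal(size=(N * M, C, T, V))

# Global average pooling over the time and joint axes.
pooled = x.mean(axis=(2, 3), keepdims=True)             # (N*M, C, 1, 1)
# Average the M people that belong to the same video.
pooled = pooled.reshape(N, M, -1, 1, 1).mean(axis=1)    # (N, C, 1, 1)

# Assumed 1x1 convolution: a linear map over channels.
weight = rng.normal(size=(num_class, C))
bias = np.zeros(num_class)
logits = pooled.reshape(N, C) @ weight.T + bias         # (N, num_class)
print(logits.shape)  # (2, 10)
```

The reshape to (N, M, ...) works because the earlier preprocessing flattened the batch as N * M with the person index varying fastest.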

With the code in hand, the network structure of ST-GCN is easy to follow.

4. Summary

This concludes the analysis of ST-GCN. I hope it helps!