Deep learning: STGCN study notes

Graph Classification Problem Based on Graph Neural Network

  • The essential work of the graph neural network is feature extraction, and graph embedding is implemented at the end of the graph neural network (converting the graph into a feature vector).
  • Process: first, a graph neural network extracts features from the graph; next, a readout operation converts the graph's information into an n-dimensional feature vector that can be used directly for classification; finally, that vector is fed to softmax to classify the graph. (This is easy to follow if you have experience with classification using traditional convolutional neural networks.)
    [Figure: general graph classification pipeline]
    In this article, graph classification takes the specific form shown in the following figure (comparing it with the figure above helps understanding).

The input video is converted into a graph built from the key points of the human body, which serves as the input data;
the ST-GCN layers in the middle are responsible for feature extraction;
and finally the readout result is sent to softmax for classification.
[Figure: ST-GCN pipeline, from input video to action classification]

Application of GCN in the field of behavior recognition

Study article: GCN Behavior Recognition Application
Content: a selection of representative papers that have combined GCN and behavior recognition since 2018, discussing and analyzing the core ideas of these works as well as ideas that can be tried on this basis.

ST-GCN explanation + combined code
stgcn interpretation in detail

main mission

The main task of behavior recognition is classification: given a piece of action information (such as a video, a picture, a 2D skeleton sequence or a 3D skeleton sequence), its category is predicted through feature extraction and classification. At present (since 2018), the mainstream method for video and RGB images is the two-stream network, and the mainstream method for skeleton data is the graph convolutional network.

Research ideas

The skeleton of the human body is itself a topological graph, so applying GCN to action recognition is a very natural idea.
Unlike traditional graph-structured data, however, human motion data is a time series, with spatial characteristics at each time point and temporal characteristics between frames. How to comprehensively exploit the spatiotemporal characteristics of motion with graph convolutional networks is a current research hotspot in behavior recognition.

The author selected representative works combining GCN and behavior recognition since 2018 to discuss and analyze their core ideas, as well as ideas that can be tried on this basis.

ST-GCN (Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition) explained

original paper

This algorithm models dynamic skeletons based on the time-series representation of human joint positions, and extends graph convolution to a spatiotemporal graph convolutional network to capture the spatiotemporal relationships.

The author proposes a novel dynamic skeleton model, ST-GCN, which automatically learns spatial and temporal patterns from data, giving the model strong expressive power and generalization ability.

Solve the problem

It addresses human action recognition based on the key points of the human skeleton.

1. Use OpenPose to process the videos and build a dataset
2. Combine GCN and TCN into a single model, which performs well on the dataset

main contribution

  • 1. Extend the graph convolutional network to the spatiotemporal domain, called the spatiotemporal graph convolutional network (ST-GCN).
    For each joint, not only its adjacent joints in space but also its adjacent joints in time are considered; in other words, the concept of neighborhood is extended to time.

  • 2. A new weight distribution strategy. The article describes three different strategies:

  • [Figure: (a) input skeleton, (b) uni-labeling partition, (c) distance partition, (d) spatial configuration partition]

Figure (b), uni-labeling partition: the node and its 1-neighborhood nodes are placed in the same subset, so they share one label and therefore the same weight. In this case the weight in each kernel is a 1 × N vector, where N is the feature dimension of a node.

Figure (c), distance partition: the node itself forms one subset and its 1-neighborhood forms another. The weight of each kernel is a 2 × N matrix.

Figure (d), spatial configuration partition: nodes are divided by their distance to the center of gravity. The 1-neighborhood nodes closer to the center of gravity than the central node form one subset, the 1-neighborhood nodes farther away form another, and the central node itself is a third. The weight of each kernel is a 3 × N matrix.

Experiments show that the third strategy works best, because it implicitly pays more attention to the joints of the extremities: usually, the closer a joint is to the center of gravity, the smaller its range of motion. It also distinguishes better between centripetal and centrifugal motion.

main idea

1. Extend the graph convolution to the time domain, so as to better explore the motion characteristics of the action, not just the spatial characteristics.

2. A new weight distribution strategy is designed so that the features of different nodes are learned in a more differentiated way.

3. Prior knowledge is used sensibly: more attention is paid to the joints with a large range of motion, which is implicitly reflected in the weight distribution strategy.

Introduction

The model is formulated on a sequence of skeleton graphs, where each node corresponds to a joint of the human body. There are two types of edges: spatial edges that follow the natural connections of the joints, and temporal edges that connect the same joint in consecutive time steps. On this basis, multiple layers of spatiotemporal graph convolution are constructed, which allows information to be integrated along both the spatial and the temporal dimension.

OpenPose preprocessing

OpenPose is an open-source human key point detection tool. It marks the joints of the human body (neck, shoulders, elbows, etc.), connects them into bones, and then estimates the pose of the human body. Since it is only used for video preprocessing here, we only need to pay attention to its output.
ST-GCN directly uses OpenPose for human key point extraction: the input video is split into frames, key point detection is performed on each frame, and the results are packaged for subsequent operations.

In general, the skeleton annotation of a video is high-dimensional. A video contains many frames; each frame may contain several people; each person has many joints; and each joint has several features (position, confidence).
For a batch of videos, we can use a 5-dimensional tensor (N, C, T, V, M) to represent them.

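	N is the number of videos. A batch usually contains 256 videos (the value is arbitrary, but a power of 2 is preferable).
	C is the number of features per joint. A joint usually has 3 features, x, y and acc (confidence); for a 3D skeleton there are 4.
	T is the number of key frames; a video usually has 150 frames.
	V is the number of joints; usually 18 joints are annotated per person.
	M is the number of people in a frame; usually the 2 people with the highest average confidence are kept.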

Therefore, the output of OpenPose is the input of ST-GCN
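
To make the (N, C, T, V, M) layout concrete, here is a minimal NumPy sketch. The batch size of 4 and the random values are assumptions for illustration only; real data would come from OpenPose.

```python
import numpy as np

# Hypothetical sizes: 4 videos (instead of 256), 3 features, 150 frames,
# 18 joints, 2 people.
N, C, T, V, M = 4, 3, 150, 18, 2

# Random stand-in for the packaged OpenPose output.
rng = np.random.default_rng(0)
batch = rng.random((N, C, T, V, M), dtype=np.float32)

# Feature vector (x, y, confidence) of joint 5 of person 0
# in frame 10 of video 0.
joint_feature = batch[0, :, 10, 5, 0]
print(batch.shape, joint_feature.shape)
```

Indexing one value on every axis except C returns the per-joint feature vector, which is exactly what a graph node carries.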

Construct graph based on key points of human body

Now we have the key points of the human body in every frame of a video. On this skeleton sequence we construct an undirected spatiotemporal graph G = (V, E) with N joints and T frames, containing both intra-body and inter-frame connections.

Construction: each key point of the human body in a frame is a node, and the natural connections between key points together with the time-domain connections are the edges. The result can be pictured as a three-dimensional graph, as shown in the figure below.
[Figure: spatiotemporal graph of a skeleton sequence]

Construct a single frame graph (spatial domain)

A graph can be expressed as G = (V, E), where V is the set of nodes and E is the set of edges,
with $V = \{ v_{ti} \mid t = 1, \dots, T,\ i = 1, \dots, N \}$.
Here $v_{ti}$ is the node of joint i in frame t: t indexes the frame (the time domain) and i indexes the key point within a frame. The feature vector of $v_{ti}$ is (x, y, confidence), where x and y are the coordinates of the key point and confidence is its detection confidence.

From the node set V, a single-frame graph (the dark blue part of the figure above) can be formed.

Construct inter-frame graph (time domain)

Find the same nodes in consecutive frames and connect them to form the time-domain information (edge information).

Edge information consists of two subsets.

The first subset, $E_S = \{ v_{ti} v_{tj} \mid (i, j) \in H \}$, contains the connections between key points within the same frame, where H is the set of naturally connected joint pairs (i and j are different key points in the same frame).
The second subset, $E_F = \{ v_{ti} v_{(t+1)i} \}$, contains the connections of the same joint between consecutive frames.
So far, a graph containing spatial and temporal information is formed based on the key point information of the human body in different frames in a video.
Among them, the node information is (abscissa, ordinate, confidence level); the edge information is (ES, EF).
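
The construction of the two edge subsets can be sketched in a few lines of plain Python. The 5-joint skeleton and its bone list below are hypothetical and only serve as an illustration; they are not the OpenPose layout.

```python
# Hypothetical 5-joint skeleton: 0 = torso, 1 = head, 2/3 = hands, 4 = feet.
num_joints = 5
num_frames = 3
bones = [(0, 1), (0, 2), (0, 3), (0, 4)]  # naturally connected joint pairs


def node_id(t, i):
    """Flatten (frame t, joint i) into a single node index."""
    return t * num_joints + i


# Spatial edges: connect joints inside every frame along the bones.
spatial_edges = [(node_id(t, i), node_id(t, j))
                 for t in range(num_frames) for (i, j) in bones]

# Temporal edges: connect each joint to itself in the next frame.
temporal_edges = [(node_id(t, i), node_id(t + 1, i))
                  for t in range(num_frames - 1) for i in range(num_joints)]

print(len(spatial_edges), len(temporal_edges))
```

With 3 frames and 4 bones there are 12 spatial edges, and with 5 joints across 2 frame transitions there are 10 temporal edges.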

ST-GCN model

The ST-GCN model describes how to compute on the graph constructed above.
The formula of the traditional convolutional network is
$f_{out}(x) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}(p(x, h, w)) \cdot \mathbf{w}(h, w)$
where p is the sampling function, which samples the neighboring pixels at offsets (h, w) within the K × K window around pixel x, and $\mathbf{w}$ is the weight function, which provides the weight vector whose inner product is taken with the sampled input feature.

Extending this method to graphs,
the pixel x corresponds to a node of the graph;
a pixel position carrying an n-dimensional feature corresponds to a node carrying an n-dimensional feature.
The feature map on the graph is defined as $f_{in}^{t} : V_t \to R^c$, that is, every node in $V_t$ (the nodes of frame t) carries a c-dimensional feature vector on which the convolution operates.
The sampling function p and the weight function w must then be redefined on the graph.

sampling function

In a traditional convolutional neural network, the sampling function can be understood as the size of the convolution kernel, that is, the range covered by each convolution operation (feature extraction). For example, with a 3 × 3 kernel, convolving at a pixel aggregates the information of that pixel and its 8 adjacent pixels.

In ST-GCN, nodes play the role of image pixels, and the sampling function specifies which adjacent nodes take part in the graph convolution at each node. The neighbor set is
$B(v_{ti}) = \{ v_{tj} \mid d(v_{tj}, v_{ti}) \le D \}$
where d is the distance between two nodes. In this paper D = 1, that is, the first-order neighbors (directly connected nodes) together with the node itself.
For D = 1 the sampling function p can be written as
$p(v_{ti}, v_{tj}) = v_{tj}$
so only directly adjacent nodes are sampled.
This can be visualized with the partition strategy figure of the original paper, which also suits the sampling function: if the red node is the center node of the graph convolution, the sampling range is the nodes inside the red dotted line, i.e. the adjacent nodes with D = 1.
[Figure: sampling range (red dotted line) around the red center node]
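
As a rough illustration of the sampling range, the sketch below computes hop distances on a small graph and reads off the D = 1 neighbor set of one node. The 4-joint chain and the helper name `hop_distance` are assumptions made for this example.

```python
import numpy as np


def hop_distance(num_node, edges, max_hop=1):
    """Shortest hop count between nodes; inf if farther than max_hop."""
    A = np.zeros((num_node, num_node))
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    dist = np.full((num_node, num_node), np.inf)
    # (A^d > 0) marks node pairs joined by a walk of exactly d steps.
    reach = [np.linalg.matrix_power(A, d) > 0 for d in range(max_hop + 1)]
    # Write larger hops first so that smaller hop counts overwrite them.
    for d in range(max_hop, -1, -1):
        dist[reach[d]] = d
    return dist


edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical chain of 4 joints
dist = hop_distance(4, edges, max_hop=1)

D = 1
# Neighbor set of node 1: the node itself plus its direct neighbors.
neighbors_of_1 = np.flatnonzero(dist[1] <= D).tolist()
print(neighbors_of_1)
```

Node 3 is two hops away from node 1, so it stays outside the sampling range.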

weight function

  • In a traditional convolutional network, the weight function can be implemented by indexing a (c, K, K) tensor (that is, a c × K × K convolution kernel) in a fixed spatial order (for example, left to right and top to bottom).
  • For a graph there is no such default spatial arrangement, so an arrangement has to be defined.
    The method adopted in this article is to partition the adjacent nodes of a node into K subsets, each with a label, which realizes the mapping
    $l_{ti} : B(v_{ti}) \to \{0, \dots, K - 1\}$

This maps each adjacent node to the label of the subset it belongs to. The specific partition rules are introduced in detail later.
The weight function $\mathbf{w}(v_{ti}, v_{tj}) : B(v_{ti}) \to R^c$, which gives the weight vector in $R^c$ for a node in the neighborhood of $v_{ti}$, can then be implemented by directly indexing a (c, K) tensor:
$\mathbf{w}(v_{ti}, v_{tj}) = \mathbf{w}'(l_{ti}(v_{tj}))$
where $l_{ti}(v_{tj})$ is the label of the subset that $v_{tj}$ falls into when $v_{ti}$ is the center node.

Spatial graph convolution

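$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(p(v_{ti}, v_{tj})) \cdot \mathbf{w}(v_{ti}, v_{tj})$
Here $Z_{ti}(v_{tj}) = |\{ v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \}|$ is the size of the subset that $v_{tj}$ belongs to; it balances the contributions of the different subsets.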

space-time model

In the previous section we obtained the convolution formula for a general graph. We now extend it to the spatiotemporal graph used in ST-GCN.
The adjacent nodes on the spatiotemporal graph are defined as
$B(v_{ti}) = \{ v_{qj} \mid d(v_{tj}, v_{ti}) \le K,\ |q - t| \le \lfloor \Gamma / 2 \rfloor \}$
That is, a neighbor must be within spatial distance K and within frame distance ⌊Γ/2⌋: a time constraint is added to the definition of the spatial neighborhood.

The sampling and weight functions introduced above are for the spatial graph; the temporal graph convolution needs its own. The principle is the same: only the label mapping is redefined, and the rest of the calculation is unchanged:
$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + (q - t + \lfloor \Gamma / 2 \rfloor) \times K$

partition strategy

The article describes three partitioning strategies, shown in the figure below. My understanding of a partition label is that, during convolution, all nodes in the same partition take the inner product with the same weight vector; there are as many weight vectors as partitions.
[Figure: (a) input skeleton, (b) uni-labeling, (c) distance partitioning, (d) spatial configuration partitioning]
(a) Schematic of the input skeleton sequence. The red node is the center node of this convolution, and the blue nodes inside the red dotted line are the adjacent nodes that are sampled.

(b) Uni-labeling: all neighborhood nodes of a node (including itself) form a single subset.
Disadvantage: every neighborhood node is multiplied by the same weight, so local differential properties cannot be captured.

(c) Distance partitioning: the center node is one subset, and the adjacent nodes (excluding the center) are another.

(d) Spatial configuration partitioning (the method actually used in the paper; it can better represent the centripetal and centrifugal movements of the key points): nodes are divided according to their centripetal or centrifugal relation to the root. $r_i$ denotes the average distance from joint i to the center of gravity of the skeleton. For the convolution at a node, the weight matrix then contains three weight vectors.
$l_{ti}(v_{tj}) = 0$ if $r_j = r_i$; $1$ if $r_j < r_i$; $2$ if $r_j > r_i$
The 1-neighborhood of the node is thus divided into 3 subsets: the root node itself, the neighbors closer to the center of gravity than the root (centripetal group), and the neighbors farther from it (centrifugal group). They represent static, centripetal and centrifugal motion characteristics respectively.
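
The spatial configuration rule can be sketched as a small labelling function. The label numbering (0 = root, 1 = centripetal, 2 = centrifugal), the function name `spatial_labels` and the distances in `r` are assumptions chosen for this illustration.

```python
def spatial_labels(root, neighbors, r):
    """Label sampled nodes: 0 = root, 1 = centripetal, 2 = centrifugal."""
    labels = {}
    for j in neighbors:
        if r[j] == r[root]:
            labels[j] = 0      # same distance as the root (the root itself)
        elif r[j] < r[root]:
            labels[j] = 1      # closer to the center of gravity
        else:
            labels[j] = 2      # farther from the center of gravity
    return labels


# Hypothetical distances to the center of gravity for a 4-joint chain.
r = [0.0, 1.0, 2.0, 3.0]
labels = spatial_labels(root=1, neighbors=[0, 1, 2], r=r)
print(labels)
```

Each label selects one of the three weight vectors, so the neighbor at index 0 and the neighbor at index 2 are multiplied by different weights even though both are one hop from the root.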

Learnable Edge Importance Weights

When the human body is in motion, certain joints often move in groups (such as wrists and elbows), and a joint may belong to several such groups, so the same joint should be given different importance when modeling different body parts.

To this end, ST-GCN adds a learnable mask M to each layer. Based on the importance weights learned for the edges of the skeleton graph, it scales the contribution of a node's feature to its adjacent nodes.

That is, ST-GCN assigns a value to each edge of the graph formed by the human key points, which measures the mutual influence between the two nodes connected by that edge; this value is learned during training.

This improves the performance of ST-GCN. The end of the paper notes that an attention mechanism could also be considered.

TCN

GCN helps us learn local features of adjacent joints in space.
On this basis, we also need to learn the local features of joint changes over time.
How to add temporal features to a graph is one of the open problems of graph networks. There are two main approaches: temporal convolution (TCN) and sequence models (LSTM).

ST-GCN uses TCN. Since the shape is fixed, ordinary convolutional layers can perform the temporal convolution. For ease of understanding, compare it with image convolution. The last three dimensions of an ST-GCN feature map are (C, T, V), corresponding to the shape (C, H, W) of an image feature map.
The channel number C of the image corresponds to the feature number C of the joint.
The height H of the image corresponds to the number T of key frames.
The width W of the image corresponds to the number V of joints.

In image convolution, if the kernel size is w × 1, each step convolves w rows and 1 column of pixels. If the stride is s, the kernel moves s pixels at a time, and after finishing one column it continues with the next.
In temporal convolution, the kernel size is temporal_kernel_size × 1, so each step convolves temporal_kernel_size key frames of one joint. If the stride is 1, it moves 1 frame at a time, and after finishing one joint it continues with the next joint.
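
The following NumPy sketch spells out a temporal convolution by hand, assuming a single sample with feature map layout (C, T, V) and zero padding of (Γ - 1) / 2 frames on each side. The function name `temporal_conv` and the all-ones inputs are illustrative assumptions; a real model would use a 2D convolution layer.

```python
import numpy as np


def temporal_conv(x, kernel, stride=1):
    """x: (C_in, T, V); kernel: (C_out, C_in, Gamma); zero padding in time."""
    c_out, c_in, gamma = kernel.shape
    pad = (gamma - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    t_out = (x.shape[1] + 2 * pad - gamma) // stride + 1
    out = np.zeros((c_out, t_out, x.shape[2]))
    for t in range(t_out):
        window = xp[:, t * stride:t * stride + gamma, :]  # (C_in, Gamma, V)
        # Mix channels and frames; every joint v is handled independently.
        out[:, t, :] = np.einsum('oik,ikv->ov', kernel, window)
    return out


x = np.ones((3, 150, 18))      # 3 features, 150 frames, 18 joints
kernel = np.ones((8, 3, 9))    # 8 output channels, Gamma = 9
y = temporal_conv(x, kernel, stride=1)
print(y.shape)
```

Because the kernel has width 1 along the joint axis, no information is exchanged between joints here; that mixing is left to the graph convolution.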

Implementation of the ST-GCN model

Graph convolution classic formula

Here is the classic graph convolution formula derived from the frequency (spectral) domain:
$f_{out} = \Lambda^{-1/2} (A + I) \Lambda^{-1/2} f_{in} W$, with $\Lambda^{ii} = \sum_j (A^{ij} + I^{ij})$
where A is the adjacency matrix of the graph,
I is the identity matrix, and
A + I adds a self-loop to every node so that a node's own feature is also propagated.
W is formed by stacking the weight vectors of the output channels.
$f_{in}$ is the input feature map with dimensions (c, V, T), where V is the number of nodes and T is the number of frames.

Graph convolution implementation

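Graph convolution formula based on the frequency domain:
$f_{out} = \Lambda^{-1/2} (A + I) \Lambda^{-1/2} f_{in} W$
Implementation process: first perform a standard 1 × Γ 2D convolution to obtain $f_{in} W$, then multiply the result by the normalized adjacency matrix $\Lambda^{-1/2} (A + I) \Lambda^{-1/2}$ along the node dimension.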
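
For a single frame, the normalized propagation can be sketched with NumPy as below. The 3-node chain, the (V, c_in) layout of `f_in` and the random weights are assumptions made for this example.

```python
import numpy as np


def normalize_adjacency(A):
    """Return Lambda^{-1/2} (A + I) Lambda^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1)                 # diagonal of Lambda
    d_inv_sqrt = np.diag(deg ** -0.5)
    return d_inv_sqrt @ A_hat @ d_inv_sqrt


# Hypothetical chain graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_norm = normalize_adjacency(A)

f_in = np.random.default_rng(0).random((3, 4))   # (V, c_in), one frame
W = np.random.default_rng(1).random((4, 2))      # (c_in, c_out)
f_out = A_norm @ f_in @ W                         # (V, c_out)
print(A_norm.round(3))
```

The self-loop guarantees that every degree is at least 1, so the inverse square root is always defined in this single-matrix case.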

For processing with multiple subsetting strategies

The formula above only applies to the single-subset partition, where all neighbors share the same W. For partition strategies with several subsets, the adjacency matrix is split into several matrices $A_j$ with $A + I = \sum_j A_j$, and the frequency-domain graph convolution formula becomes
$f_{out} = \sum_j \Lambda_j^{-1/2} A_j \Lambda_j^{-1/2} f_{in} W_j$, with $\Lambda_j^{ii} = \sum_k A_j^{ik} + \alpha$
α (0.001 in the paper) is added so that $\Lambda_j$ has no all-zero row; otherwise $\Lambda_j^{-1/2}$ could not be computed.
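
A minimal NumPy sketch of the multi-subset formula follows, using distance partitioning on a hypothetical 3-node chain (one matrix for the root, one for the neighbors). The helper name `normalize_subset`, the (V, c_in) layout and the random weights are assumptions for illustration.

```python
import numpy as np


def normalize_subset(A_j, alpha=0.001):
    """Lambda_j^{-1/2} A_j Lambda_j^{-1/2}; alpha avoids division by zero."""
    deg = A_j.sum(axis=1) + alpha
    d = np.diag(deg ** -0.5)
    return d @ A_j @ d


A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Distance partitioning: A_0 = I (root), A_1 = A (neighbors).
subsets = [np.eye(3), A]

rng = np.random.default_rng(0)
f_in = rng.random((3, 4))                          # (V, c_in), one frame
weights = [rng.random((4, 2)) for _ in subsets]    # one W_j per subset

f_out = sum(normalize_subset(A_j) @ f_in @ W_j
            for A_j, W_j in zip(subsets, weights))
print(f_out.shape)
```

Each subset gets its own weight matrix, which is what lets the root and its neighbors be treated differently.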

Learnable Edge Importance Weights Implementation

ST-GCN attaches a learnable matrix M to each adjacency matrix (which represents the internal connections of the graph).

In the formulas of the two preceding subsections, A + I and $A_j$ are each multiplied element-wise by M:
$A + I \to (A + I) \otimes M$, $A_j \to A_j \otimes M$
where ⊗ denotes element-wise multiplication. M is initialized as an all-ones matrix.
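
The effect of the mask can be illustrated with a tiny NumPy example. The graph and the value 2.0 written into `M` are made up; in the real model M is a trainable parameter updated by gradient descent.

```python
import numpy as np

# Hypothetical chain graph 0 - 1 - 2 with self-loops (A + I).
A_hat = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 1, 1]], dtype=float)

# The mask starts as all ones, so the graph is initially unchanged.
M = np.ones_like(A_hat)

# Pretend training has doubled the importance of the edge (0, 1).
M[0, 1] = 2.0

masked = A_hat * M   # element-wise product, not a matrix product
print(masked)
```

Because the product is element-wise, entries that are zero in the adjacency matrix stay zero: the mask reweights existing edges but cannot create new ones.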

ST-GCN network architecture and training

So far, the key points of the human body have been extracted based on openpose, the time-space graph is established according to the key points, and the convolution method on the time-space graph is redefined. Finally, the network architecture design and training design are carried out.

1. Since different nodes share weight matrices within a GCN layer, the scale of the input data must be kept consistent. Therefore, batch normalization is applied to the input data first.

2. ST-GCN consists of 9 ST-GCN layers. The first three output 64 channels, the middle three 128 channels, and the last three 256 channels. The temporal kernel size Γ of every layer is 9, and every layer has a residual (ResNet-style) connection.
3. To avoid overfitting, dropout with probability 0.5 is applied after each layer, and the temporal convolutions of the 4th and 7th layers use stride 2, acting as pooling layers.

4. Finally, a readout is performed on the graph: the graph is embedded, i.e. converted into an n-dimensional vector, and sent to a softmax classifier for graph classification.

5. Training uses stochastic gradient descent with an initial learning rate of 0.01, which is multiplied by 0.1 every 10 epochs.

6. To prevent overfitting, the data is augmented. First, random affine transformations are applied to the skeleton sequence (simulating camera movement); then fragments are randomly sampled from the original sequence for training.
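
The step-decay learning rate from item 5 can be written as a one-line function. The function name and its parameter names are assumptions for this sketch; a training framework's built-in scheduler would normally be used instead.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.1, step=10):
    """Step decay: multiply the rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch))
```

Epochs 0 to 9 train at 0.01, epochs 10 to 19 at 0.001, and so on.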

Combining code analysis structure

network structure

Normalization

First, the input tensor is normalized. The specific implementation is as follows:

N, C, T, V, M = x.size()
# after permuting dimensions, call contiguous() before view() so the memory layout is contiguous
x = x.permute(0, 4, 3, 1, 2).contiguous()
x = x.view(N * M, V * C, T)
x = self.data_bn(x)
x = x.view(N, M, V, C, T)
x = x.permute(0, 1, 3, 4, 2).contiguous()
x = x.view(N * M, C, T, V)

Batch normalization is applied with V × C channels, so the statistics are computed over the batch and time dimensions. In other words, each position feature (x, y and acc) of each joint is normalized across frames.

The advantages of this operation far outweigh the disadvantages:

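		Joint positions vary greatly from frame to frame; without normalization the algorithm converges poorly.
		
		Across batches and frames, joint positions are roughly randomly distributed, so the normalization results of different batches do not differ much and accuracy does not fluctuate.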
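
The reshaping around the batch normalization can be reproduced in NumPy to check which axes are normalized. This is only an analogue of the PyTorch snippet above: the small sizes are made up, and the learnable scale and shift of a real batch-norm layer are left out.

```python
import numpy as np

N, C, T, V, M = 2, 3, 5, 4, 2   # hypothetical small sizes
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(N, C, T, V, M))

# (N, C, T, V, M) -> (N, M, V, C, T) -> (N*M, V*C, T)
h = x.transpose(0, 4, 3, 1, 2).reshape(N * M, V * C, T)

# One mean/variance per (joint, feature) channel, computed over the
# batch axis and the time axis.
mean = h.mean(axis=(0, 2), keepdims=True)
var = h.var(axis=(0, 2), keepdims=True)
h = (h - mean) / np.sqrt(var + 1e-5)

# Back to (N*M, C, T, V), the layout consumed by the ST-GCN blocks.
out = h.reshape(N, M, V, C, T).transpose(0, 1, 3, 4, 2)
out = out.reshape(N * M, C, T, V)
print(out.shape)
```

After the round trip, every (feature, joint) pair has approximately zero mean over samples and frames, which is the property the two points above rely on.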

space-time transformation

Through the ST-GCN unit, alternately use GCN and TCN to transform the time and space dimensions:

# N*M(256*2)/C(3)/T(150)/V(18)
Input:[512, 3, 150, 18]
ST-GCN-1[512, 64, 150, 18]
ST-GCN-2[512, 64, 150, 18]
ST-GCN-3[512, 64, 150, 18]
ST-GCN-4[512, 64, 150, 18]
ST-GCN-5[512, 128, 75, 18]
ST-GCN-6[512, 128, 75, 18]
ST-GCN-7[512, 128, 75, 18]
ST-GCN-8[512, 256, 38, 18]
ST-GCN-9[512, 256, 38, 18]

The channel dimension holds the joint features (3 at the input) and the temporal dimension is the number of key frames (150 at the input). After the spatiotemporal convolutions of all ST-GCN units, the feature dimension of each joint has grown to 256 and the number of key frames has shrunk to 38.
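
The frame counts in the table follow from the usual convolution output-length formula. The sketch below assumes a temporal kernel of 9 with padding 4 and reads the per-unit strides off the shape table above; the function name is hypothetical.

```python
def conv_out_len(t, kernel=9, stride=1, pad=4):
    """Output length of a 1D convolution along the time axis."""
    return (t + 2 * pad - kernel) // stride + 1


t = 150
lengths = []
# Strides implied by the table: units 5 and 8 halve the frame count.
for stride in [1, 1, 1, 1, 2, 1, 1, 2, 1]:
    t = conv_out_len(t, stride=stride)
    lengths.append(t)
print(lengths)
```

With stride 2 the length goes from 150 to 75 and then from 75 to 38, because 75 is odd and the formula rounds the half-length up.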

I feel that this design is because there are not many human action stages, but the actions in each stage are more complicated. For example, an action of swinging a golf club may only need to be decomposed into 5 steps, but each step requires more movements of the hands, waist and feet.

output

Finally, use the average pooling and fully connected layer (or FCN) to classify the features. The specific implementation is as follows:

# self.fcn = nn.Conv2d(256, num_class, kernel_size=1)

# global pooling
x = F.avg_pool2d(x, x.size()[2:])
x = x.view(N, M, -1, 1, 1).mean(dim=1)
# prediction
x = self.fcn(x)
x = x.view(x.size(0), -1)

Average pooling on the graph can be understood as the readout of the graph, that is, summarizing the node features to represent the whole graph.
Here the readout summarizes the joint features to represent the action.
Readout usually uses a statistic over the nodes, such as max, sum or mean. Mean is more robust, so it is used here.
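
A NumPy analogue of the readout and classification step is sketched below. The class count of 5, the random features and the random classifier weights are assumptions for illustration; the 1 × 1 convolution of the original code is replaced by an equivalent matrix product.

```python
import numpy as np

N, M, C, T, V = 2, 2, 256, 38, 18   # batch, people, channels, frames, joints
num_class = 5                        # hypothetical number of actions
rng = np.random.default_rng(0)
x = rng.random((N * M, C, T, V))     # stand-in for the last ST-GCN output

# Readout: average over frames and joints, then over the people.
pooled = x.mean(axis=(2, 3))                       # (N*M, C)
pooled = pooled.reshape(N, M, C).mean(axis=1)      # (N, C)

# Linear classifier followed by a numerically stable softmax.
W = rng.normal(size=(C, num_class))
logits = pooled @ W
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)
```

Whatever the number of frames, joints or people, the readout always yields one C-dimensional vector per video.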

GCN-convolution kernel

From the results, the simplest graph convolution seems to have achieved good results. The specific implementation is as follows:

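import numpy as np


def normalize_digraph(A):
    # column sums of A: the degree of each node
    Dl = np.sum(A, 0)
    num_node = A.shape[0]
    # Dn is the inverse degree matrix D^{-1}; isolated nodes keep 0
    Dn = np.zeros((num_node, num_node))
    for i in range(num_node):
        if Dl[i] > 0:
            Dn[i, i] = Dl[i]**(-1)
    # scale column i of A by 1 / degree(i)
    AD = np.dot(A, Dn)
    return AD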

The graph convolution formula used by the author in the actual code is
$aggregate(X) = D^{-1} A X$
which can be written per node as
$aggregate(X)_i = \sum_j \frac{A_{ij}}{D_{ii}} X_j$
In other words, the edges serve as weights for a weighted average of the neighboring node features. $D^{-1} A$ can be understood as the convolution kernel; for a symmetric A it is the transpose of the $A D^{-1}$ returned by normalize_digraph.
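
The weighted-average interpretation can be checked on a tiny example. The sketch assumes a 3-node chain with self-loops and one feature per node; `normalize_columns` is a vectorized stand-in written for this example, not the original function.

```python
import numpy as np


def normalize_columns(A):
    """Scale column j of A by 1 / degree(j); zero-degree columns stay 0."""
    deg = A.sum(axis=0)
    inv = np.zeros_like(deg)
    nonzero = deg > 0
    inv[nonzero] = 1.0 / deg[nonzero]
    return A * inv          # broadcasting scales each column


# Hypothetical chain 0 - 1 - 2 with self-loops.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
AD = normalize_columns(A)

X = np.array([[0.0], [3.0], [6.0]])   # one feature per node
out = AD.T @ X                         # each node averages its neighbors
print(out.ravel())
```

Node 1 is connected to all three nodes, so its output is the mean of 0, 3 and 6; the end nodes average only two values each.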

Multi-Kernel

[Figure: one skeleton graph decomposed into three subgraphs]

Drawing on motion analysis research, the author divides the graph into three subgraphs, which express centripetal, centrifugal and static motion characteristics respectively.
[Figure: the edges of a root node split into centrifugal (yellow), centripetal (blue) and static (green) parts]
For a root node, the edges connected to it can be divided into 3 parts. Part 1 connects the neighbor nodes (yellow) that are farther from the center of gravity of the skeleton than the root and carries the centrifugal motion features. Part 2 connects the neighbor nodes (blue) that are closer to the center of gravity and carries the centripetal motion features. Part 3 connects the root node to itself (green) and carries the static features.

With this decomposition, 1 graph becomes 3 subgraphs, and the number of convolution kernels changes from 1 to 3, i.e. the kernel shape changes from (1, 18, 18) to (3, 18, 18). The results of the three kernels express action features of different scales. To obtain the final result, convolve with each kernel separately and take the weighted average of the results (as in image convolution).



A = []
for hop in valid_hop:
    a_root = np.zeros((self.num_node, self.num_node))
    a_close = np.zeros((self.num_node, self.num_node))
    a_further = np.zeros((self.num_node, self.num_node))
    for i in range(self.num_node):
        for j in range(self.num_node):
            if self.hop_dis[j, i] == hop:
                if self.hop_dis[j, self.center] == self.hop_dis[i, self.center]:
                    a_root[j, i] = normalize_adjacency[j, i]
                elif self.hop_dis[j, self.center] > self.hop_dis[i, self.center]:
                    a_close[j, i] = normalize_adjacency[j, i]
                else:
                    a_further[j, i] = normalize_adjacency[j, i]
    if hop == 0:
        A.append(a_root)
    else:
        A.append(a_root + a_close)
        A.append(a_further)
A = np.stack(A)
self.A = A

Multi-Kernel GCN

[Figure: multi-kernel graph convolution]

TCN

The specific implementation is as follows:

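# fragment from the ST-GCN block's __init__; temporal_kernel_size is assumed odd
# "same" padding along the time axis, no padding along the joint axis
padding = ((temporal_kernel_size - 1) // 2, 0)

self.tcn = nn.Sequential(
    nn.BatchNorm2d(out_channels),
    nn.ReLU(inplace=True),
    # kernel (temporal_kernel_size, 1): convolve over frames, one joint at a time
    nn.Conv2d(
        out_channels,
        out_channels,
        (temporal_kernel_size, 1),
        (1, 1),
        padding,
    ),
    nn.BatchNorm2d(out_channels),
    nn.Dropout(dropout, inplace=True),
)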

Attention

Before the graph convolution, the author also designed a simple attention model (ATT): a learnable importance weight for every edge.

# attention parameters (in __init__)
# each st-gcn unit has its own trainable edge weights
self.edge_importance = nn.ParameterList([
    nn.Parameter(torch.ones(self.A.size()))
    for i in self.st_gcn_networks
])
# st-gcn convolution (in forward)
for gcn, importance in zip(self.st_gcn_networks, self.edge_importance):
    print(x.shape)
    # emphasize the important edges
    x, _ = gcn(x, self.A * importance)

During a movement, different body parts differ in importance. For example, the movement of the legs may matter more than that of the neck: running, walking and jumping can be told apart from the legs alone, while the movement of the neck may carry little useful information.

Therefore, ST-GCN weights the different body parts (each st-gcn unit has its own weight parameters, learned in training).

code analysis

Source code interpretation

Summary

The last part of the paper presents a variety of comparative experiments. ST-GCN performs very well on the NTU-RGB+D dataset but only moderately on the Kinetics dataset. The paper suggests that this may be because ST-GCN does not take the interaction between people and the background into account.

Origin blog.csdn.net/zhe470719/article/details/121224076