【GNN】A TensorFlow Implementation of Spatio-Temporal Graph Convolutional Networks

Based on blogger shyern's blog post, this article introduces the principles of STGCN and implements it in TensorFlow 2.

By extending graph convolutional networks to a spatio-temporal graph model, the authors designed a general representation of skeleton sequences for action recognition, called the Spatio-Temporal Graph Convolutional Network (STGCN).

Graph Construction

Denote the spatio-temporal graph of a skeleton sequence with $N$ nodes and $T$ frames as $G=(V, E)$. Its node set is $V=\left\{v_{ti} \mid t=1, \ldots, T,\ i=1, \ldots, N\right\}$, and the feature vector $F\left(v_{ti}\right)$ of the $i$-th node in frame $t$ consists of the node's coordinate vector and its estimated confidence.

The graph structure consists of two parts:

  • According to the human body structure, the joints within each frame are connected by edges; these form the spatial edges

    $$E_{S}=\left\{v_{ti} v_{tj} \mid (i, j) \in H\right\}$$

    where $H$ is the set of naturally connected human joints.

  • Each joint is connected to the same joint in the consecutive frame; these edges form the temporal edges

    $$E_{F}=\left\{v_{ti} v_{(t+1)i}\right\}$$

*Figure: the spatio-temporal graph built from the joint points.*

Spatial Graph Convolutional Neural Network

For the spatial graph convolution, we first discuss the graph convolution operation within a single frame. Taking the ordinary two-dimensional convolution on images as an analogy, the convolution output at a position $\mathbf{x}$ can be written as follows

$$f_{out}(\mathbf{x})=\sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}(\mathbf{p}(\mathbf{x}, h, w)) \cdot \mathbf{w}(h, w)$$

where $f_{in}$ is the input feature map with $c$ channels, the convolution kernel size is $K \times K$, the sampling function is $\mathbf{p}(\mathbf{x}, h, w)=\mathbf{x}+\mathbf{p}^{\prime}(h, w)$, and the weight function $\mathbf{w}(h, w)$ gives a $c$-dimensional weight vector.

  • Sampling function

On the image, the sampling function $\mathbf{p}(h, w)$ enumerates the neighboring pixels around the center pixel $\mathbf{x}$. On the graph, the set of neighbor nodes is defined as:

$$B\left(v_{ti}\right)=\left\{v_{tj} \mid d\left(v_{tj}, v_{ti}\right) \leq D\right\}$$

where $d\left(v_{tj}, v_{ti}\right)$ is the shortest-path distance from $v_{tj}$ to $v_{ti}$. The sampling function can therefore be written as

$$\mathbf{p}\left(v_{ti}, v_{tj}\right)=v_{tj}$$

  • Weight function

In 2D convolution, the neighboring pixels are regularly arranged around the central pixel, so a regular convolution kernel can be applied to them in spatial order. By analogy, on the graph the neighbor nodes obtained by the sampling function are divided into different subsets, each with a numeric label, so there is a mapping

$$l_{ti}: B\left(v_{ti}\right) \rightarrow\{0, \ldots, K-1\}$$

which maps each neighbor node to its subset label, and the weight function becomes

$$\mathbf{w}\left(v_{ti}, v_{tj}\right)=\mathbf{w}^{\prime}\left(l_{ti}\left(v_{tj}\right)\right)$$

  • Spatial graph convolution

$$f_{out}\left(v_{ti}\right)=\sum_{v_{tj} \in B\left(v_{ti}\right)} \frac{1}{Z_{ti}\left(v_{tj}\right)} f_{in}\left(\mathbf{p}\left(v_{ti}, v_{tj}\right)\right) \cdot \mathbf{w}\left(v_{ti}, v_{tj}\right)$$

where the normalization term

$$Z_{ti}\left(v_{tj}\right)=\left|\left\{v_{tk} \mid l_{ti}\left(v_{tk}\right)=l_{ti}\left(v_{tj}\right)\right\}\right|$$

is the cardinality of the corresponding subset, balancing the contribution of each subset. Substituting the sampling function and the weight function into the convolution gives:

$$f_{out}\left(v_{ti}\right)=\sum_{v_{tj} \in B\left(v_{ti}\right)} \frac{1}{Z_{ti}\left(v_{tj}\right)} f_{in}\left(v_{tj}\right) \cdot \mathbf{w}\left(l_{ti}\left(v_{tj}\right)\right)$$

  • Spatial Temporal Modelling

Extending the spatial model to the temporal domain, the neighbor set becomes

$$B\left(v_{ti}\right)=\left\{v_{qj} \mid d\left(v_{tj}, v_{ti}\right) \leq K,\ |q-t| \leq\lfloor\Gamma / 2\rfloor\right\}$$

where $\Gamma$ controls the size of the convolution kernel in the temporal domain.

The label map used by the weight function becomes

$$l_{ST}\left(v_{qj}\right)=l_{ti}\left(v_{tj}\right)+(q-t+\lfloor\Gamma / 2\rfloor) \times K$$

where $l_{ti}$ is the label map for the single-frame case.

With this, the convolution operation on the constructed spatio-temporal graph is well defined.

Partition Strategies

  • Uni-labeling: the entire 1-neighborhood of a node forms a single subset
  • Distance partitioning: the 1-neighborhood of a node is divided into two subsets, the root node itself (distance 0) and its adjacent nodes (distance 1)
  • Spatial configuration partitioning: the 1-neighborhood of a node is divided into three subsets: neighbor nodes farther from the skeleton's center of gravity than the root node, neighbor nodes closer to the center, and the root node itself; these represent centrifugal motion, centripetal motion, and static features, respectively (see the code sketch below the figure)

*Figure: illustration of the partition strategies.*
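To make these strategies concrete, here is a minimal sketch (my own illustration, not code from the original post) that builds the $(K, V, V)$ adjacency stack for the distance partitioning strategy, which is the adjacency format the `gcn` module below expects:

# Sketch: distance partitioning of a binary skeleton adjacency matrix.
import numpy as np

def distance_partition(A):
    """Build the (K=2, V, V) adjacency stack for distance partitioning:
    subset 0 is the root node itself, subset 1 its 1-hop neighbors.
    Each subset is normalized by its node degrees."""
    V = A.shape[0]
    root = np.eye(V, dtype=np.float32)   # distance 0: the node itself
    neigh = (A > 0).astype(np.float32)   # distance 1: natural connections
    np.fill_diagonal(neigh, 0.0)         # keep the root out of subset 1

    stack = []
    for a in (root, neigh):
        d = a.sum(axis=0)
        d[d == 0] = 1.0                  # guard against isolated nodes
        stack.append(a / d)              # normalize each column by its degree
    return np.stack(stack)               # shape (2, V, V)

# Toy example: a 3-joint chain 0-1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=np.float32)
print(distance_partition(A).shape)       # (2, 3, 3)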

Learnable edge importance weighting

During a movement, different body parts differ in importance. For example, leg movements may matter more than those of the neck: we can distinguish running, walking, and jumping from the legs alone, while neck movements may carry little useful information.

Therefore, STGCN weights the edges associated with different body parts (each STGCN unit has its own trainable weight parameters). After adding this attention mechanism, the adjacency matrix becomes

$$\mathbf{A}_{j}=\mathbf{A}_{j} \otimes \mathbf{M}$$

where $\otimes$ denotes the element-wise product, and the mask $\mathbf{M}$ is initialized with `tf.ones`.
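As a minimal sketch of this mechanism (the layer name `EdgeImportance` is my own, and the adjacency stack is assumed to have shape $(K, V, V)$), the mask can be implemented as a trainable variable initialized to ones and multiplied element-wise with the adjacency matrix:

# Sketch: learnable edge-importance mask, applied as A * M.
import tensorflow as tf

class EdgeImportance(tf.keras.layers.Layer):
    """Learnable edge-importance weighting: one trainable weight per edge
    and per partition subset, initialized to ones."""

    def build(self, A_shape):
        self.M = self.add_weight(name='edge_importance',
                                 shape=A_shape,
                                 initializer='ones',
                                 trainable=True)

    def call(self, A):
        # Element-wise product, not a matrix multiplication.
        return A * self.M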

Implementing STGCN

GCN lets us learn the local features of adjacent joints in space. On this basis, we also need to learn the local features of how joints change over time. **How to superimpose temporal features on a graph is one of the key problems faced by graph networks.** Research in this area follows two main ideas: temporal convolution (TCN) and sequence models (LSTM).

STGCN uses TCN. Since the feature-map shape is fixed, we can use ordinary convolutional layers to perform the temporal convolution; for intuition, compare it with the convolution operation on images.

$$\mathbf{f}_{out}=\mathbf{\Lambda}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}) \mathbf{\Lambda}^{-\frac{1}{2}} \mathbf{f}_{in} \mathbf{W}$$

where $\Lambda^{ii}=\sum_{j}\left(A^{ij}+I^{ij}\right)$.
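For reference, a small NumPy sketch of this normalization (an illustration of the formula above, not code from the original post):

# Sketch: symmetric normalization of the adjacency matrix.
import numpy as np

def normalize_adjacency(A):
    """Compute Lambda^{-1/2} (A + I) Lambda^{-1/2},
    with Lambda_ii = sum_j (A_ij + I_ij)."""
    A_hat = A + np.eye(A.shape[0], dtype=A.dtype)
    d_inv_sqrt = np.diag(np.power(A_hat.sum(axis=1), -0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt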

Note: the author's reference implementation uses PyTorch, where the tensor shape is $(N, C, H, W)$, while in TensorFlow the tensor shape is $(N, H, W, C)$. Interested readers can compare the two; the latter layout is used in the following.

The last three dimensions of STGCN's feature map have shape $(T, V, C)$, corresponding to the image feature-map shape $(H, W, C)$:

  • The image channel count $C$ corresponds to the feature dimension $C$ of a joint
  • The image height $H$ corresponds to the number of key frames $T$
  • The image width $W$ corresponds to the number of joints $V$

Spatio-temporal graph convolution

Spatial graph convolution model

In the spatial graph convolution, the convolution kernel has size $w \times 1$: each step convolves $w$ rows by 1 column of pixels. With stride $s$, the kernel moves $s$ pixels at a time; after one row is finished, it convolves the next row of pixels.

# The basic unit of graph convolutional networks.
import tensorflow as tf

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Conv2D, Reshape

class gcn(Model):
    r"""The basic module for applying a graph convolution.

    Args:
        filters (int): Number of channels produced by the convolution
        t_kernel_size (int): Size of the temporal convolution kernel; the
            spatial kernel size ``K`` is inferred from the adjacency matrix
            in ``build``
        t_stride (int, optional): Stride of the temporal convolution. Default: 1
        t_padding (int, optional): Temporal zero-padding added to both sides of
            the input. Default: 0
        t_dilation (int, optional): Spacing between temporal kernel elements.
            Default: 1
        bias (bool, optional): If ``True``, adds a learnable bias to the output.
            Default: ``True``

    Shape:
        - Input[0]: Input graph sequence in :math:`(N, T_{in}, V, in_channels)` format
        - Input[1]: Input graph adjacency matrix in :math:`(K, V, V)` format
        - Output[0]: Output graph sequence in :math:`(N, T_{out}, V, out_channels)` format
        - Output[1]: Graph adjacency matrix for output data in :math:`(K, V, V)` format

        where
            :math:`N` is a batch size,
            :math:`T_{in}/T_{out}` is a length of input/output sequence,
            :math:`V` is the number of graph nodes,
            :math:`K` is the spatial kernel size, as :math:`K == kernel_size[1]`.
    """

    def __init__(self,
                 filters,
                 t_kernel_size=1,
                 t_stride=1,
                 t_padding=0,
                 t_dilation=1,
                 bias=True):
        super(gcn, self).__init__(dynamic=True)

        self.filters = filters
        # Force an odd temporal kernel size so the convolution stays centered.
        self.t_kernel_size = t_kernel_size // 2 * 2 + 1
        self.t_padding = t_padding
        self.t_stride = t_stride
        self.t_dilation = t_dilation
        self.bias = bias
        self.k_size = None

        self.conv = None
        self.reshape = None

    def build(self, input_shape):
        x_shape, A_shape = input_shape

        # The spatial kernel size K equals the number of partition subsets.
        self.k_size = A_shape[0]
        self.conv = Conv2D(
            filters=self.filters * self.k_size,
            kernel_size=(self.t_kernel_size, 1),
            padding='same' if self.t_padding else 'valid',
            strides=(self.t_stride, 1),
            dilation_rate=(self.t_dilation, 1),
            use_bias=self.bias,
            input_shape=x_shape)

        n, t, v, c = self.conv.compute_output_shape(x_shape)
        self.reshape = Reshape([t, v, self.k_size, c // self.k_size])

    def call(self, inputs, training=None, mask=None):
        x, A = inputs

        # (N, T, V, C) -> (N, T, V, K * C') -> (N, T, V, K, C')
        h = self.conv(x)
        h = self.reshape(h)
        # Aggregate over the K partition subsets and the V neighbor nodes.
        y = tf.einsum('ntvkc,kvw->ntwc', h, A)

        return y
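A quick smoke test of the module (the shapes follow the docstring; the random inputs are placeholders only):

# Hypothetical smoke test for the gcn module.
import numpy as np

N, T, V, C, K = 2, 16, 18, 3, 2   # batch, frames, joints, features, subsets
x = np.random.randn(N, T, V, C).astype(np.float32)
A = np.random.rand(K, V, V).astype(np.float32)

layer = gcn(filters=64)
print(layer([x, A]).shape)        # expected: (2, 16, 18, 64)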

Temporal convolution model

In the temporal convolution, the convolution kernel has size $temporal\_kernel\_size \times 1$: each step convolves `temporal_kernel_size` key frames of one joint. With stride $s = 1$, the kernel moves one frame at a time; after finishing one joint, it convolves the next joint.

# The basic unit of temporal convolution networks.

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Conv2D, Dropout, BatchNormalization, Lambda, Activation

class tcn(Model):
    r"""The basic module for applying a temporal convolution.

    Args:
        filters (int): Number of channels produced by the convolution
        t_kernel_size (int): Size of the temporal convolution kernel
        t_stride (int, optional): Stride of the temporal convolution. Default: 1
        t_padding (int, optional): Temporal zero-padding added to both sides of
            the input; a nonzero value selects 'same' padding. Default: 0
        in_batchnorm (bool, optional): If ``True``, applies batch normalization
            to the input. Default: ``True``
        out_batchnorm (bool, optional): If ``True``, applies batch normalization
            to the output. Default: ``True``
        t_dilation (int, optional): Spacing between temporal kernel elements.
            Default: 1
        bias (bool, optional): If ``True``, adds a learnable bias to the output.
            Default: ``True``
        dropout (float, optional): Dropout rate of the final output. Default: 0

    Shape:
        - Input[0]: Input graph sequence in :math:`(N, T_{in}, V, in_channels)` format
        - Output[0]: Output graph sequence in :math:`(N, T_{out}, V, out_channels)` format

        where
            :math:`N` is a batch size,
            :math:`T_{in}/T_{out}` is a length of input/output sequence,
            :math:`V` is the number of graph nodes.
    """

    def __init__(self,
                 filters,
                 t_kernel_size=1,
                 t_stride=1,
                 t_padding=0,
                 in_batchnorm=True,
                 out_batchnorm=True,
                 t_dilation=1,
                 bias=True,
                 dropout=0):
        super(tcn, self).__init__()

        self.filters = filters
        self.t_kernel_size = t_kernel_size
        self.t_padding = t_padding
        self.t_stride = t_stride
        self.t_dilation = t_dilation
        self.dropout_rate = dropout  # the Dropout layer itself is created below
        self.bias = bias

        if in_batchnorm:
            self.batch_1 = BatchNormalization()
        else:
            self.batch_1 = Lambda(lambda x: x)

        self.conv = None
        self.a = Activation('relu')

        if out_batchnorm:
            self.batch_2 = BatchNormalization()
        else:
            self.batch_2 = Lambda(lambda x: x)

        self.dropout = Dropout(dropout)

    def build(self, input_shape):
        self.conv = Conv2D(filters=self.filters,
                           kernel_size=(self.t_kernel_size, 1),
                           padding='same' if self.t_padding else 'valid',
                           strides=(self.t_stride, 1),
                           dilation_rate=(self.t_dilation, 1),
                           use_bias=self.bias,
                           input_shape=input_shape)


    def call(self, inputs, training=None, mask=None):
        x = inputs

        # BN -> ReLU -> temporal convolution -> BN -> dropout
        h = self.batch_1(x)
        h = self.a(h)
        h = self.conv(h)
        h = self.batch_2(h)
        y = self.dropout(h)

        return y
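A similar smoke test for the temporal module (placeholder shapes again; a nonzero `t_padding` makes the layer use 'same' padding, so the number of frames is preserved):

# Hypothetical smoke test for the tcn module.
import numpy as np

x = np.random.randn(2, 16, 18, 64).astype(np.float32)  # (N, T, V, C)

layer = tcn(filters=64, t_kernel_size=9, t_stride=1, t_padding=4)
print(layer(x).shape)   # expected: (2, 16, 18, 64)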

Network architecture

The input data is first batch-normalized, then passed through 9 STGCN units, followed by global pooling to obtain a 256-dimensional feature vector for each sequence; finally, a SoftMax classifier produces the label. Each STGCN unit adopts a ResNet-style structure: the first three units output 64 channels, the middle three 128 channels, and the last three 256 channels. After each STGCN unit, features are randomly dropped with probability 0.5. The temporal convolution layers of the 4th and 7th units use stride 2. Training uses SGD with a learning rate of 0.01, decayed by a factor of 0.1 every 10 epochs. A sketch of this stack is given after the stgcn unit below.

This article implements a basic STGCN unit, which interested readers can use to design spatio-temporal graph networks and apply them to practical tasks.
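Note that the `stgcn` module below imports a `res` helper from `models.res` that the original post does not show. Here is a minimal sketch of what it plausibly contains, assuming it mirrors the residual branch of the PyTorch reference (a 1×1 convolution with temporal stride, followed by batch normalization):

# models/res.py -- hypothetical sketch of the residual branch (not from the original post).
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Conv2D, BatchNormalization

class res(Model):
    """Residual branch: a 1x1 convolution (with temporal stride) plus batch
    normalization, used to match the channel count and frame count of the
    main branch."""

    def __init__(self, filters, kernel_size=1, stride=1):
        super(res, self).__init__()
        self.conv = Conv2D(filters=filters,
                           kernel_size=(kernel_size, 1),
                           strides=(stride, 1))
        self.bn = BatchNormalization()

    def call(self, inputs, training=None, mask=None):
        return self.bn(self.conv(inputs), training=training)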

# The basic unit of the spatial-temporal module.

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Activation, Lambda

from models.gcn import gcn
from models.tcn import tcn
from models.res import res

class stgcn(Model):
    r"""Applies a spatial temporal graph convolution over an input graph sequence.

    Args:
        filters (int): Number of channels produced by the convolution
        kernel_size (tuple): Size of the temporal convolution kernel
                                & graph convolution kernel
        stride (int, optional): Stride of the temporal convolution. Default: 1
        dropout (int, optional): Dropout rate of the final output. Default: 0
        residual (bool, optional): If ``True``, applies a residual mechanism.
                                Default: ``True``

    Shape:
        - Input[0]: Input graph sequence in :math:`(N, T_{in}, V, in_channels)` format
        - Input[1]: Input graph adjacency matrix in :math:`(K, V, V)` format
        - Output[0]: Output graph sequence in :math:`(N, T_{out}, V, out_channels)` format

        where
            :math:`N` is a batch size,
            :math:`T_{in}/T_{out}` is a length of input/output sequence,
            :math:`V` is the number of graph nodes,
            :math:`K` is the spatial kernel size, as :math:`K == kernel_size[1]`.

    """
    def __init__(self,
                 filters,
                 kernel_size,
                 stride=1,
                 dropout=0,
                 residual=True):
        super(stgcn, self).__init__(dynamic=True)

        assert len(kernel_size) == 2
        assert kernel_size[0] % 2 == 1
        padding = (kernel_size[0] - 1) // 2

        self.residual = residual
        self.filters = filters
        self.kernel_size = kernel_size
        self.res = None
        self.stride = stride

        # The spatial kernel size K is inferred from the adjacency matrix in
        # gcn.build; passing kernel_size[1] as a temporal kernel size would
        # shrink the frame axis and break the residual sum.
        self.gcn = gcn(filters=filters)

        self.tcn = tcn(filters=filters,
                       t_kernel_size=kernel_size[0],
                       t_stride=stride,
                       t_padding=padding,
                       dropout=dropout)

        self.a = Activation('relu')

    def build(self, input_shape):
        x_shape, _ = input_shape
        c = x_shape[-1]

        if not self.residual:
            # No residual connection: the shortcut contributes nothing.
            self.res = Lambda(lambda x: 0)
        elif c == self.filters and self.stride == 1:
            # Shapes already match: identity shortcut.
            self.res = Lambda(lambda x: x)
        else:
            # Match channels and temporal stride with a 1x1 convolution.
            self.res = res(filters=self.filters,
                           kernel_size=1,
                           stride=self.stride)

    def call(self, inputs, training=None, mask=None):
        x, A = inputs

        res = self.res(x)
        x = self.gcn([x, A])    # spatial graph convolution
        x = self.tcn(x) + res   # temporal convolution plus residual
        y = self.a(x)

        return y
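Finally, here is a sketch of the 9-unit backbone described under "Network architecture" above. This assembly is my own reconstruction, not the author's code: `num_classes`, the adjacency handling, and the placement of the initial batch normalization are assumptions.

# Hypothetical assembly of the full STGCN backbone (sketch only).
import tensorflow as tf
from tensorflow.keras.layers import BatchNormalization, GlobalAveragePooling2D, Dense

from models.stgcn import stgcn

class stgcn_net(tf.keras.Model):
    """9 STGCN units (64/128/256 channels), global pooling, softmax head."""

    def __init__(self, A, num_classes, kernel_size=(9, 2), dropout=0.5):
        super(stgcn_net, self).__init__(dynamic=True)
        self.A = tf.constant(A, dtype=tf.float32)   # (K, V, V) adjacency stack
        self.bn = BatchNormalization()
        self.blocks = [
            # Following the PyTorch reference, no residual on the first unit.
            stgcn(64, kernel_size, residual=False),
            stgcn(64, kernel_size, dropout=dropout),
            stgcn(64, kernel_size, dropout=dropout),
            stgcn(128, kernel_size, stride=2, dropout=dropout),   # 4th unit
            stgcn(128, kernel_size, dropout=dropout),
            stgcn(128, kernel_size, dropout=dropout),
            stgcn(256, kernel_size, stride=2, dropout=dropout),   # 7th unit
            stgcn(256, kernel_size, dropout=dropout),
            stgcn(256, kernel_size, dropout=dropout),
        ]
        self.pool = GlobalAveragePooling2D()        # pools over (T, V)
        self.fc = Dense(num_classes, activation='softmax')

    def call(self, x, training=None, mask=None):
        x = self.bn(x, training=training)
        for block in self.blocks:
            x = block([x, self.A], training=training)
        return self.fc(self.pool(x))

The model can then be trained with the SGD settings quoted above, e.g. `tf.keras.optimizers.SGD(learning_rate=0.01)` together with a learning-rate schedule that decays by a factor of 0.1 every 10 epochs.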

Summary

The original author implemented this spatio-temporal extension of GCN in PyTorch, and the idea is well worth learning from. This article draws on that work to implement the basic STGCN unit in TensorFlow 2, in the hope that readers can develop more versions of STGCN based on the author's ideas and apply them to problems in specific domains.
