[Paper Reading Notes 24] Social-STGCNN: A Social Spatio-Temporal GCNN for Human Traj. Pred.


Paper: paper address

Code: code address

In this paper, the authors use a GNN to model the spatio-temporal characteristics of pedestrian trajectories directly and use a temporal CNN to make predictions, replacing RNN-based methods, which are harder to train and slower.


0. Abstract

Pedestrian trajectory prediction is a challenging task with many applications. A pedestrian's trajectory is determined not only by the pedestrian but also by surrounding objects. Previous methods learn the motion state of each pedestrian separately, whereas this paper uses a GNN to model the interactions among all pedestrians in the scene. The proposed method, Social-STGCNN, builds on STGCNN (a skeleton-based action recognition method) and extends it to the trajectory prediction task.

1. Introduction

Some earlier methods (such as Social-LSTM) assign a recurrent neural network (LSTM) to each pedestrian to predict its trajectory. Others use a GAN to generate future trajectories. The authors argue that these methods are expensive to train, and ask whether a single unified network could model the interactions between pedestrians instead.

The authors also devote a paragraph to why earlier networks are suboptimal in principle, mainly for two reasons:

  1. They use a separate network per pedestrian and rely on pooling to capture the interactions between pedestrians, which lacks interpretability. In contrast, this paper uses a graph, which naturally represents the relationships between nodes and is intuitively interpretable.
  2. Pooling loses information.

Therefore, the authors propose Social-STGCNN to address both problems. They use a GNN with spatio-temporal information to capture interactions, explicitly model the influence between pedestrians as an adjacency matrix, and then apply graph convolution for further feature extraction. Finally, a temporal CNN predicts the trajectories.

2. Related Work

This part covers three areas: prior work on trajectory prediction, on graph convolution, and on temporal CNNs.

3. Method

Social-STGCNN consists of two parts: an STGCNN that extracts spatio-temporal features, and a temporal CNN (TXP-CNN) that predicts trajectories.

3.1. Spatial Graph Construction

For frame $t$, we build a graph $G_t = (V_t, E_t)$. The coordinates of each pedestrian in the frame serve as node features:

$$V_t = \{v_t^i\}\big|_{i=1}^{N}, \quad v_t^i = (x_t^i, y_t^i)$$

$e_t^{ij}$ only indicates whether nodes $i$ and $j$ are connected. The adjacency matrix $A_t = [a_{sim,t}^{ij}]$, however, is defined by the Euclidean distance between nodes:

[Image: definition of $a_{sim,t}^{ij}$]
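
To make the adjacency construction concrete, here is a minimal Python sketch for a single frame. The equation image is missing from these notes, so the kernel used here is an assumption: the inverse of the Euclidean distance between two pedestrians, with 0 when the two positions coincide (and therefore on the diagonal). The function name `adjacency_from_positions` is hypothetical.

```python
import math

def adjacency_from_positions(positions):
    """Build a weighted adjacency matrix for one frame.

    positions: list of (x, y) tuples, one per pedestrian.
    Assumed kernel: a_ij = 1 / ||v_i - v_j||_2, or 0 when the distance is 0.
    """
    n = len(positions)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = math.dist(positions[i], positions[j])
            if dist > 0.0:
                adj[i][j] = 1.0 / dist  # closer pedestrians get larger weights
    return adj

# Three pedestrians in one frame.
frame = [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0)]
A = adjacency_from_positions(frame)
```

With this kernel, nearby pedestrians influence each other more strongly than distant ones, and the matrix is symmetric because the distance is.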

After building the graph, we can get updated node features through the graph convolution layer. The formula of graph convolution is as follows:

[Image: graph convolution formula]
where $B(\cdot)$ is the set of neighbor nodes, $p(\cdot)$ is the aggregation function, and $\mathbf{w}(\cdot)$ is the convolution kernel.

Note that $B(\cdot)$ is defined by the shortest path:

$$B(v^i) = \{v^j \mid d(v^i, v^j) \le D\}$$

where $d$ denotes the shortest-path length.
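
The neighbor set can be computed with a breadth-first search that stops after $D$ hops. The sketch below assumes $d$ counts hops on an unweighted graph stored as a dictionary of neighbor lists; the name `neighbor_set` and the data layout are illustrative choices, not taken from the paper's code.

```python
from collections import deque

def neighbor_set(edges, source, max_hops):
    """Return all nodes whose shortest-path length from `source` is <= max_hops.

    edges: dict mapping each node to an iterable of directly connected nodes.
    The source itself is included, since its distance to itself is 0.
    """
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond D hops
        for nxt in edges.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return set(hops)

# A chain 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
one_hop = neighbor_set(chain, 0, 1)
two_hops = neighbor_set(chain, 0, 2)
```

With $D = 1$ the set reduces to a node and its direct neighbors, which is the usual setting for a single graph convolution layer.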

3.2. Temporal Graph Construction

We build the graph above for every frame. For $T$ frames, this yields a spatio-temporal graph $G = (V, E)$, where $V = \{v^i\}$ and $v^i = \{v_t^i\}|_t$. Edges and adjacency matrices are defined in the same way.
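
A small Python sketch of this stacking step follows. It regroups per-frame positions into per-pedestrian tracks ($v^i = \{v_t^i\}|_t$) and builds one adjacency matrix per frame. Two assumptions are made for illustration: every pedestrian is present in every frame, and the edge weight is the inverse Euclidean distance (0 for coincident positions). The function name `build_spatio_temporal_graph` is hypothetical.

```python
import math

def build_spatio_temporal_graph(frames):
    """frames: list of T frames, each a list of N (x, y) positions.

    Returns (tracks, adjacency) where
      tracks[i][t]       is the position of pedestrian i at frame t,
      adjacency[t][i][j] is the edge weight between i and j at frame t.
    """
    num_nodes = len(frames[0])
    if any(len(f) != num_nodes for f in frames):
        raise ValueError("all frames must contain the same pedestrians")
    # Transpose from [frame][pedestrian] to [pedestrian][frame].
    tracks = [list(track) for track in zip(*frames)]
    adjacency = []
    for frame in frames:
        a_t = [[0.0] * num_nodes for _ in range(num_nodes)]
        for i in range(num_nodes):
            for j in range(num_nodes):
                d = math.dist(frame[i], frame[j])
                a_t[i][j] = 1.0 / d if d > 0.0 else 0.0  # assumed kernel
        adjacency.append(a_t)
    return tracks, adjacency

frames = [
    [(0.0, 0.0), (2.0, 0.0)],  # t = 0
    [(0.0, 1.0), (1.0, 1.0)],  # t = 1
    [(0.0, 2.0), (0.5, 2.0)],  # t = 2
]
tracks, adjacency = build_spatio_temporal_graph(frames)
```

In this toy example the two pedestrians move closer together over time, so the edge weight between them grows from frame to frame.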

3.3. Trajectory Prediction

After the spatio-temporal node embeddings are obtained, the temporal CNN extracts features from the embedding along the time dimension to predict future trajectories.
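
To illustrate how a convolution over the time dimension can produce all future steps at once, without recurrence, here is a deliberately simplified Python sketch. It is not the authors' TXP-CNN: it assumes the time axis is treated like a channel axis and uses a single 1x1 convolution, that is, a learned linear mix of the observed steps for each output step. The name `expand_time` and the hand-picked weights are purely illustrative; in a trained model the weights would be learned.

```python
def expand_time(embedding, weights, bias):
    """Map T_obs observed steps to T_pred predicted steps in one pass.

    embedding: [T_obs][N][C] nested lists of node features.
    weights:   [T_pred][T_obs] mixing coefficients (a 1x1 convolution over time).
    bias:      [T_pred] offsets.
    Returns    [T_pred][N][C].
    """
    t_obs = len(embedding)
    num_nodes = len(embedding[0])
    num_feats = len(embedding[0][0])
    output = []
    for w_row, b in zip(weights, bias):
        step = [
            [
                b + sum(w_row[t] * embedding[t][i][c] for t in range(t_obs))
                for c in range(num_feats)
            ]
            for i in range(num_nodes)
        ]
        output.append(step)
    return output

# Two observed steps for one pedestrian moving +1 in x per step.
observed = [[[0.0, 0.0]], [[1.0, 0.0]]]
# Hand-picked weights that linearly extrapolate:
# x(2) = 2*x(1) - x(0), x(3) = 3*x(1) - 2*x(0).
weights = [[-1.0, 2.0], [-2.0, 3.0]]
predicted = expand_time(observed, weights, bias=[0.0, 0.0])
```

Because every output step is computed directly from the observed steps, all predictions can be produced in parallel, which is one reason a CNN-style predictor can be faster than a step-by-step recurrent one.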

The whole block diagram is as follows:

[Image: block diagram of Social-STGCNN]

3.4. Implementation

In the implementation, the adjacency matrix is normalized using the graph Laplacian before convolution. This is standard practice:

$$A_t = \Lambda_t^{-1/2}(A_t + I)\Lambda_t^{-1/2}$$

where $\Lambda_t$ is the diagonal degree matrix of $A_t + I$.
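
The normalization can be sketched in a few lines of Python. This follows the standard symmetric form, with the exponent $-1/2$ on both sides and the degree matrix computed from $A_t + I$; it assumes non-negative edge weights, so every degree is at least 1. The function name `normalize_adjacency` is hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during aggregation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]  # d >= 1 for non-negative weights
    return [[inv_sqrt[i] * a_hat[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

# Two pedestrians connected with weight 1.
A = [[0.0, 1.0], [1.0, 0.0]]
A_norm = normalize_adjacency(A)
```

The normalized matrix stays symmetric, and scaling by the degrees keeps the magnitude of aggregated features from growing with the number and strength of a node's connections.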

4. Experiment

In the ablation study, the authors compare three ways of constructing the adjacency matrix and find that the simple Euclidean-distance-based kernel works best:

[Image: ablation results for the adjacency matrix kernels]
The following experimental comparison shows that the method is indeed much faster than previous methods:
[Image: speed comparison with previous methods]

Origin blog.csdn.net/wjpwjpwjp0831/article/details/131860348