[Paper Reading Notes 15] Applications of GNNs in Multi-Object Tracking


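Put simply, a GNN extracts information from a graph by fusing vertex and edge features. An intuitive idea for MOT is to let vertices represent target features and edges represent relations between targets. The resulting graph is a natural entry point for the association problem, and a GNN becomes the tool for solving it.

This post summarizes several classic papers that apply GNNs to MOT. I will try to keep it updated.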


1. Learning a Neural Solver for Multiple Object Tracking(MOTSolv, Offline, CVPR2020)

Offline MOT is mostly solved with min-cost (max) flow algorithms. This paper essentially uses a GNN to solve the min-cost flow problem.

1.1 Abstract

The abstract notes that most MOT work focuses on feature extraction, whereas this paper focuses on data association. It proposes a message passing network to solve the network flow problem. In other words, it makes the network flow optimization problem differentiable.

1.2 Formulating tracking as a graph problem

We first set up the model and clarify what vertices and edges mean. Suppose all detections are given, and define the detection set $\mathcal{O}=\{o_i\}_{i=1}^{n}$, where each $o_i$ is one detection. A detection carries appearance, position and time information, so let $o_i=\{a_i,p_i,t_i\}$, where $a_i$ denotes the raw pixels (under the tracking-by-detection paradigm the bounding boxes are given), $p_i$ the position, and $t_i$ the timestamp of the detection. A trajectory can then be represented as a sequence of detections: $T_i=\{o_{i_k}\}_{k=1}^{n_i}$.

We now construct the graph $G=(V,E)$ with $V=\{f_i\}$ and $E\subset V\times V$, where $f_i$ is the feature of a target and an edge indicates whether two detections belong to the same trajectory. Describing a trajectory by $T_i=\{o_{i_k}\}_{k=1}^{n_i}$ is equivalent to describing it by the edge set $\{(i_1,i_2),(i_2,i_3),\dots\}$. By the nature of trajectories, a detection can belong to at most one trajectory. Solving the min-cost flow problem with a GNN therefore amounts to classifying edges, i.e. deciding whether each edge belongs to a trajectory.

If $y(i,j)\in\{0,1\}$ denotes the classification result for edge $(i,j)$, the min-cost flow problem reads:
$$\min\sum_{(i,j)\in E}c(i,j)\,y(i,j) \quad \text{s.t.} \quad y(i,j)\in\{0,1\}$$

1.3 Design of the message passing network

During forward propagation a GNN repeatedly aggregates vertex and edge information, which gives an effect similar to the receptive field of a CNN. How to design the aggregation strategy is therefore a question worth studying.

As mentioned before, a vertex represents a target and an edge represents the connection between two targets. At initialization, each vertex is encoded with the target's appearance feature, extracted by an off-the-shelf Re-ID network.

For edges, geometry and appearance are both used to measure the connection between targets. Suppose the boxes of two targets are $(x_i,y_i,h_i,w_i)$ and $(x_j,y_j,h_j,w_j)$ and their appearance features are $f_i, f_j$. We also need the time difference to measure temporal distance. The initial feature vector of the edge is then:
$$h_{i,j}^{0}=\Big(\frac{2(x_j-x_i)}{h_i+h_j},\ \frac{2(y_j-y_i)}{h_i+h_j},\ \log\frac{h_i}{h_j},\ \log\frac{w_i}{w_j},\ t_j-t_i,\ \|f_i-f_j\|_2\Big)$$
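As a minimal sketch of the initial edge feature above, the function below computes the six components in plain Python. The function name, the `(x, y, h, w)` tuple layout and the toy inputs are assumptions made for illustration; they are not taken from the paper's code.

```python
import math

def initial_edge_feature(box_i, box_j, t_i, t_j, f_i, f_j):
    """Hypothetical helper: box = (x, y, h, w), f = appearance embedding."""
    x_i, y_i, h_i, w_i = box_i
    x_j, y_j, h_j, w_j = box_j
    # Euclidean distance between the two appearance embeddings
    appearance_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    return (
        2 * (x_j - x_i) / (h_i + h_j),  # normalized horizontal offset
        2 * (y_j - y_i) / (h_i + h_j),  # normalized vertical offset
        math.log(h_i / h_j),            # log height ratio
        math.log(w_i / w_j),            # log width ratio
        t_j - t_i,                      # time difference
        appearance_dist,
    )

# Toy example: same-sized boxes, shifted 4 pixels, one frame apart
feat = initial_edge_feature((10, 20, 40, 20), (14, 20, 40, 20), 0, 1,
                            [1.0, 0.0], [0.0, 1.0])
print(feat)
```

Dividing the offsets by the mean height makes the geometric terms roughly scale-invariant, which is presumably why the formula normalizes this way.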

In each GNN layer, information must flow from vertices to edges and from edges back to vertices: first a vertex-to-edge update, then an edge-to-vertex update, as illustrated below:

[Figure: one message passing step, a vertex-to-edge update followed by an edge-to-vertex update]

Specifically, in the vertex-to-edge update, the features of the two endpoint vertices and the edge feature are concatenated and fed into an MLP $\mathcal{N}_e$. In the edge-to-vertex update, for each edge incident to the vertex, the edge feature computed in the previous step is concatenated with the vertex feature and fed into an MLP $\mathcal{N}_v$, and the results are summed. That is:
$$h_{i,j}^{l}=\mathcal{N}_e\big([h_i^{l-1},h_j^{l-1},h_{i,j}^{l-1}]\big)$$
$$m_{i,j}^{l}=\mathcal{N}_v\big([h_i^{l-1},h_{i,j}^{l}]\big),\qquad h_i^{l}=\sum_{j\in N_i}m_{i,j}^{l}$$
There is one more detail. To inject a temporal prior into the MLPs, the edge-to-vertex update is computed separately depending on whether the neighboring vertex lies in the past or in the future, and it always also uses the initial feature:
$$m_{i,j}^{l}=\begin{cases}\mathcal{N}_v^{past}\big([h_i^{l-1},h_{i,j}^{l},h_i^{0}]\big), & j\in N_i^{past}\\ \mathcal{N}_v^{fut}\big([h_i^{l-1},h_{i,j}^{l},h_i^{0}]\big), & j\in N_i^{fut}\end{cases}$$
$$h_{i,past}^{l}=\sum_{j\in N_i^{past}}m_{i,j}^{l},\qquad h_{i,fut}^{l}=\sum_{j\in N_i^{fut}}m_{i,j}^{l},\qquad h_i^{l}=\mathcal{N}_v\big([h_{i,past}^{l},h_{i,fut}^{l}]\big)$$
For example, to compute the updated feature of the middle node in the figure below, which is connected to two past nodes and two future nodes, two MLPs compute the messages for the past and the future nodes respectively. The two aggregated results are concatenated and passed through one more MLP.

[Figure: time-aware update of a node with two past neighbors and two future neighbors]
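The time-aware node update described above can be sketched as follows. This is an illustrative sketch only: the MLPs are passed in as plain callables, and the toy lambdas used in the example are stand-ins chosen so the arithmetic is easy to follow, not learned networks. All names (`time_aware_node_update`, `edge_feat`, `times`) are hypothetical.

```python
def concat(*vectors):
    """Concatenate several feature vectors (lists of floats)."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def vec_sum(vectors, dim):
    """Element-wise sum of a list of vectors; zero vector if the list is empty."""
    total = [0.0] * dim
    for v in vectors:
        total = [a + b for a, b in zip(total, v)]
    return total

def time_aware_node_update(i, h, h0, edge_feat, times,
                           mlp_past, mlp_fut, mlp_node, msg_dim):
    """Update node i from its incident edges, split into past and future."""
    past_msgs, fut_msgs = [], []
    for (a, b), e in edge_feat.items():
        if i not in (a, b):
            continue
        j = b if a == i else a              # the neighbor on this edge
        msg_in = concat(h[i], e, h0[i])     # current, edge, and initial feature
        if times[j] < times[i]:
            past_msgs.append(mlp_past(msg_in))
        else:
            fut_msgs.append(mlp_fut(msg_in))
    return mlp_node(concat(vec_sum(past_msgs, msg_dim),
                           vec_sum(fut_msgs, msg_dim)))

# Toy graph: node 1 (t=1) with one past neighbor (node 0) and one future neighbor (node 2)
h = {0: [1.0], 1: [2.0], 2: [3.0]}
edge_feat = {(0, 1): [0.5], (1, 2): [0.25]}
times = {0: 0, 1: 1, 2: 2}
updated = time_aware_node_update(
    1, h, h, edge_feat, times,
    mlp_past=lambda v: [sum(v)],        # stand-in for the "past" MLP
    mlp_fut=lambda v: [2 * sum(v)],     # stand-in for the "future" MLP
    mlp_node=lambda v: v,               # stand-in for the final MLP
    msg_dim=1,
)
print(updated)
```

Keeping separate past and future aggregations lets the final MLP see how much evidence comes from each temporal direction instead of a single mixed sum.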

1.4 Evaluation

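This is an offline method. It follows a common GNN strategy and turns the matching problem into edge classification on a graph. Its small twist is to split the neighboring vertices into past and future. One thing that puzzles me is how the constraints are satisfied after the edges have been classified.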

2. GSM: Graph Similarity Model for Multi-Object Tracking(GSM, Online, IJCAI2020)

2.1 Abstract

The paper targets the instability of matching directly on appearance features and proposes graph matching as a way to bring in topological information. Graph matching is again used to match targets between two frames: a directed graph is built for every target, with vertices carrying appearance features and edges carrying positional relation features, and the graph similarity between targets in the two frames is then computed for matching. Because of FP and FN, the topology may be unstable. To address this the paper proposes soft matching, which puts more weight on appearance features.

2.2 Method

Frame $t-1$ has $M$ targets and frame $t$ has $N$ targets. The task is data association, for which we build a cost matrix $C\in\mathbb{R}^{M\times N}$. Each element of $C$ is measured by the similarity between target graphs:
$$c_{i,j}=C_G(d_i^{t-1},d_j^{t})=1-s_{i,j}$$
where $d$ denotes a detection and $s_{i,j}$ the (normalized) graph similarity.

Now we construct the graph. For the $i$-th target $d_i^t$ of frame $t$, consider its $K$ nearest targets and construct a directed graph $G=(V,E)$ with $|V|=|E|=K+1$. Each element of $V$ is the appearance feature vector of the corresponding target, and each element of $E$ is the topological relation feature with respect to $d_i^t$, defined as:
$$e_{i,k}^t=f(r_{i,k}^t)\in\mathbb{R}^{256},$$
$$r_{i,k}^t=\Big[\frac{x_i-x_k}{w_i},\ \frac{y_i-y_k}{h_i},\ \log(h_k/h_i),\ \log(w_k/w_i),\ \frac{x_i-x_k}{w_0},\ \frac{y_i-y_k}{h_0},\ \log((w_k-w)/w_0),\ \log((h_k-h)/h_0)\Big]\in\mathbb{R}^{8}$$
where $f:\mathbb{R}^{8}\rightarrow\mathbb{R}^{256}$ is a non-linear function that lifts the dimension, and $w_0,h_0$ denote the size of the original image.

In this way, two targets from the two frames each have $K$ nearest neighbors, giving two graphs $G_i, G_j$. We now need to compute the similarity of these two graphs.

First, similarity is defined as a fusion of vertex similarity and edge similarity. For the $m$-th vertex of $G_i$ and the $n$-th vertex of $G_j$ (vertex 0 is the target of interest, $i$ or $j$ itself), the similarity is:

$$s_{i,j}^{m,n}=f\big(\text{concat}[\,\|v_i^{m,t-1}-v_j^{n,t}\|_2^2,\ \|e_i^{m,t-1}-e_j^{n,t}\|_2^2\,]\big)$$

where $f$ is a binary classifier that maps the vector to $[0,1]$.

The values $s_{i,j}^{m,n}$ form the similarity matrix $S_{i,j}\in[0,1]^{(K+1)\times(K+1)}$. Apart from vertex 0, which is by definition the target itself, the vertices of the two graphs are not in one-to-one correspondence. We therefore run a linear assignment on the similarities of the remaining vertices, which fixes the correspondence and gives a better estimate of the similarity. The final similarity of $G_i$ and $G_j$ is defined as:

$$\hat{s}_{i,j}=\frac{1}{K+1}\big(s_{i,j}^{0,0}+f_{LA}(\hat{S}_{i,j})\big)$$

Here $f_{LA}$ sums the similarity scores of the vertex pairs matched by linear assignment, so the total is normalized by $\frac{1}{K+1}$. $\hat{S}_{i,j}\in[0,1]^{K\times K}$ is what remains of $S_{i,j}$ after removing the row and column of element $(0,0)$.
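As a small worked sketch of the final graph similarity, the function below takes a $(K+1)\times(K+1)$ similarity matrix and applies the formula above. To stay dependency-free it solves the linear assignment by brute force over permutations, which is only practical for small $K$; this is a substitution made for illustration (a Hungarian solver would be the usual choice), and the function name and toy matrix are assumptions.

```python
from itertools import permutations

def graph_similarity(S):
    """S[m][n] compares vertex m of G_i with vertex n of G_j; index 0 is the anchor."""
    k = len(S) - 1
    # Drop row 0 and column 0 to get the K x K neighbor block
    sub = [row[1:] for row in S[1:]]
    # Brute-force linear assignment: best total similarity over all one-to-one matchings
    best = max(
        (sum(sub[m][p[m]] for m in range(k)) for p in permutations(range(k))),
        default=0.0,
    )
    return (S[0][0] + best) / (k + 1)

# Toy example with K = 2 neighbors
S = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.3],
]
sim = graph_similarity(S)
print(sim)  # anchor pair 0.9 plus best neighbor matching (0.8 + 0.7), divided by 3
```

The cost matrix entry for this pair would then be $1-\hat{s}_{i,j}$, matching the definition of $c_{i,j}$ in section 2.2.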

One advantage of this design is that the linear assignment partially removes the influence of FP and FN on the overall graph.

The overall pipeline is as follows:

[Figure: overall pipeline of GSM]

2.3 Evaluation

The merit of this paper is that it builds a graph for every target, so topology is modeled at a fine-grained, per-target level. The downside is the relatively high computational cost.

3. Graph Networks for Multiple Object Tracking(GNMOT, Online, WACV2020)


3.1 Abstract

Some existing works that use graph networks for MOT rely on static graphs. This paper proposes dynamic graphs and designs an appearance graph and a motion graph to compute the relationship between trajectories and detections. Edges, vertices and a global variable are all used to update the graph features.

3.2 Brief introduction

This paper also uses a graph network for multi-object tracking, but its method is relatively simple. Overall, message passing updates the edge features, and the Hungarian algorithm then performs the matching.

The overall flow of the algorithm is as follows:

1. From the trajectories of past frames and the detections of the current frame, build two graphs: the appearance graph and the motion graph. Edges connect only trajectories to detections. The purpose of the graphs is to build the cost matrix. The matching score between the $i$-th trajectory and the $j$-th detection is computed as:
$$F(i,j)=\alpha\,\mathrm{AppearanceGraphNetwork}(i,j)+(1-\alpha)\,\mathrm{MotionGraphNetwork}(i,j)$$
$F(i,j)$ then forms the matching matrix.

2. So how are the two networks built?

Appearance graph network:
[Figure: appearance graph network]

Each round of message passing consists of four update steps. For two vertices $v$ and the edge $e$ connecting them, a global variable $u$ represents the whole graph.

(1) First, an edge update. Call this network $\phi_{e,1}$. Its inputs are the two vertex features, the edge feature and the global variable feature, and it outputs the updated edge feature.
(2) Next, a vertex update. Call this network $\phi_{v}$. Its inputs are the two vertex features, the edge feature updated in (1), and the global variable feature. This step only updates the detection vertices, so that the trajectory's history is fused into the detection.
(3) Then, a global variable update. Call this network $\phi_{u}$. The features of all vertices and all edges are averaged, and these averages are fed to the network together with $u$ to obtain the updated $u$.
(4) Finally, a second edge update. Call this network $\phi_{e,2}$. Its inputs are the updated edge, the trajectory vertex, the updated detection vertex and the updated global variable.

Motion graph network:
[Figure: motion graph network]

It works the same way, except that steps (1) and (2) are dropped and only (3) and (4) remain.

3. How to train:

For the appearance graph, (1) is trained separately from (2), (3) and (4); the motion graph is trained as a whole. The loss function is a cross entropy on whether the matching is correct.
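The score fusion $F(i,j)$ from step 1 can be sketched in a few lines. The two input matrices stand for the outputs of the appearance and motion graph networks; the function name and the value of `alpha` are illustrative assumptions.

```python
def fuse_scores(appearance, motion, alpha=0.5):
    """Blend two trajectory-by-detection score matrices element-wise.

    alpha weights the appearance score, (1 - alpha) the motion score.
    """
    return [
        [alpha * a + (1 - alpha) * m for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(appearance, motion)
    ]

# One trajectory, two candidate detections
fused = fuse_scores([[0.8, 0.2]], [[0.4, 0.6]], alpha=0.5)
print(fused)
```

The fused matrix is what would be handed to the Hungarian algorithm for the final assignment.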

3.3 Evaluation

This work uses a fairly conventional form of message passing and combines the trajectory-detection scores computed by the two networks into a matching matrix. It also adds trajectory management: unmatched trajectories are kept for matching in later frames, unmatched targets are handled separately with a single object tracker (SOT), and so on.

4. TrackMPNN: A Message Passing Graph Neural Architecture for Multi-Object Tracking(TrackMPNN, Online, arxiv2101)

4.1 Abstract

This work uses an undirected dynamic graph to represent the association problem in multi-object tracking. The message passing network uses only the position and category of each target and no appearance information, so it is very fast. It can also handle missed detections.

4.2 Method

First define two kinds of nodes in the message passing network:

  1. Detection node: represents a detection in a past frame or the current frame. Its feature is built from the bounding box and the one-hot category code.
  2. Association node: sits between two detection nodes and represents the likelihood that they are connected, i.e. belong to the same trajectory (it plays the role of an edge).

Let's look at the pseudocode of the training and inference phases.

First, a few functions:

  • initialize_graph(): Initialize the graph. Build a bipartite graph over the detections of two adjacent frames, and add an association node between each pair of detection nodes.
  • update_graph(): Update the graph. Add the new detections of the current frame, and add association nodes between these new detections and past detection nodes that are still unmatched (this is my reading of the description; the code should be checked to confirm). Old vertices and edges are removed at the same time.
  • prune_graph(): Remove low-probability vertices and edges to reduce memory usage.
  • decode_graph(): Solve for the current matching with a greedy or Hungarian algorithm.
  • TrackMPNN(): Constructor.
  • TrackMPNN().forward(): One forward step.
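To make the role of `decode_graph()` concrete, here is a minimal sketch of the greedy variant. It is not the TrackMPNN implementation: the function name `decode_greedy`, the dictionary input format and the `threshold` value are all assumptions for illustration. Each key is a (past detection, current detection) pair and each value is the probability predicted by the corresponding association node.

```python
def decode_greedy(assoc_probs, threshold=0.5):
    """Greedily pick one-to-one matches from association-node probabilities."""
    matches, used_past, used_cur = [], set(), set()
    # Visit candidate pairs from most to least confident
    for (past, cur), prob in sorted(assoc_probs.items(),
                                    key=lambda kv: kv[1], reverse=True):
        if prob < threshold:
            break  # all remaining pairs are even less confident
        if past in used_past or cur in used_cur:
            continue  # one of the two detections is already matched
        matches.append((past, cur))
        used_past.add(past)
        used_cur.add(cur)
    return matches

assoc_probs = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "y"): 0.6, ("b", "x"): 0.3}
matches = decode_greedy(assoc_probs)
print(matches)  # [('a', 'x'), ('b', 'y')]
```

Greedy decoding is cheaper than the Hungarian algorithm but can miss the globally best assignment when high-probability pairs conflict.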

Pseudocode for the training phase:

[Figure: training pseudocode]

Pseudocode for the inference phase:

[Figure: inference pseudocode]

Node feature update:

1. Detection nodes:
A detection node is initialized by passing the bounding box position and the one-hot category encoding through a fully connected layer. The resulting vector is low-dimensional. At each update step, a detection node aggregates its own previous-layer feature with the previous-layer features of its neighboring nodes:

[Equation: detection node update]

A fully connected layer then predicts the confidence of the detection:

[Equation: detection node confidence]

2. Association nodes:

An association node is initialized as a zero vector. At each update step, its previous feature is aggregated with the previous features of the two detection nodes it connects. The features of those two detection nodes can either be concatenated or combined through a fully connected layer:

[Equations: association node update]

Similarly, a fully connected layer predicts the confidence of an association node, which represents the probability that the two detection nodes belong to the same target.

Loss function:

Cross entropy is applied directly to the confidences of the detection nodes and the association nodes (this is essentially a classification problem: do two detections belong to the same target?). Interestingly, videos are split into short clips for training. One part of the loss is a cross entropy computed, for each detection node, over its association nodes toward past-frame neighbors and over its association nodes toward future-frame neighbors:

[Equation: cross-entropy loss over past and future association nodes]

where $\mathcal{N}^+$ denotes the future-frame neighbors and $\mathcal{N}^-$ the past-frame neighbors.

The other terms are binary cross entropies, one for detection nodes and one for association nodes:

[Equations: binary cross-entropy losses for detection nodes and association nodes]

The final loss function is the combination of the three terms above.

5. Learnable Graph Matching: Incorporating Graph Partitioning with Deep Feature Learning for MOT(GMTracker, Online, CVPR2021)

5.1 Abstract

The paper mainly addresses three problems:

  1. Existing methods ignore the context between trajectories and detections, which makes them fragile under occlusion.
  2. End-to-end data association methods rely too heavily on DNNs and do not exploit the strengths of optimization-based methods. In other words, whether a DNN should be used to solve the optimization problem is questionable.
  3. Graph-based methods often need a separate GNN to extract features, which is inconvenient.

The proposed solution is to describe the relationships among trajectories and among detections with undirected graphs and to cast the association problem as a graph matching problem. To keep the whole pipeline differentiable (end-to-end), the original graph matching is relaxed to a continuous quadratic program, which is then trained as part of a deep graph network using the implicit function theorem.

The overall idea is to first use a GCN to enhance the feature representation of the graphs, and then solve the association problem as graph matching.

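5.2 Casting MOT as graph matching

Graph matching looks for a mapping between the vertices of two graphs such that, after matching, corresponding vertices and the edges between them are as similar as possible. If we build one graph from the existing trajectories and another from the detections of the current frame, the MOT matching problem becomes the problem of finding this vertex mapping.

As before, we start by building the graphs. Similar to MOTSolv, for the detection graph $G_D=(V_D,E_D)$, each vertex in $V_D$ represents a detection and each edge represents the similarity between two detections. For the trajectory graph $G_T=(V_T,E_T)$, each vertex in $V_T$ represents a trajectory (its set of detections) and each edge represents the similarity between two trajectories. Both the detection graph and the trajectory graph are complete graphs. How similarity is defined is explained later.

Graph matching is a quadratic assignment problem (QAP) and can be written in Koopmans-Beckmann (K-B) form. (My math is weak, so I do not understand the theory behind this.)
We must first assume that the two graphs to be matched have the same number of vertices. This is required by the graph matching formulation.

[Equation (1)]

Equation (1):
expresses graph matching as an optimization problem. $A$ denotes the weighted adjacency matrix of a graph, $\Pi$ the matching relation, and the constraints guarantee that the matching is one-to-one (bijective; $1_n$ is the all-ones vector). $B$ is the affinity matrix between vertices.

Because $\Pi$ is an orthogonal matrix in the K-B form, the problem can be rewritten as:

[Equation (2)]

Here $\|\cdot\|_F$ denotes the Frobenius norm of a matrix.
This form is more intuitive. The first term measures the difference between matched edges, and the second term measures the similarity between matched vertices. We want to minimize the edge difference and maximize the vertex similarity.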

Furthermore, it can be shown that the convex hull of the permutation matrices is the set of doubly stochastic matrices, so the search range of $\Pi$ can be relaxed accordingly, which gives the following optimization problem:

[Equation (3)]

However, in MOT the weighted adjacency matrix should be three-dimensional rather than two-dimensional, because each element is a feature vector. Assuming the features are normalized, we expand along the third dimension and write the optimization problem as:

[Equation (4)]

and rewrite it as (why?):

[Equation (5)]

In equation (5),
all lowercase letters are the vectorized (flattened) forms of the original matrices, and $M$ is an $n^2\times n^2$ matrix representing all possible vertex matches.

Then, using the conclusion of (3), this becomes:

[Equation (6)]

Next, the similarity between edges and between vertices is measured by cosine similarity (normalization is guaranteed), as in the following two formulas:

[Equations: cosine similarity between edge features and between vertex features]

By a procedure omitted here, $M_e^{u,v}$ from (7) is mapped to $M$ in (6).

5.3 GMTracker and Graph Matching Networks

5.3.1 Cross-graph GCN for Feature Enhancement

Why is it called cross-graph? Because when enhancing the features, the graph convolution is performed across the detection graph and the trajectory graph, so the relationship between detections and trajectories is taken into account.

A vertex represents a target, and its initial feature is the appearance feature from the Re-ID network. An edge represents the similarity between two targets. For an edge between two vertices inside the detection graph or inside the trajectory graph, the edge feature is the concatenation of the features of its two endpoint vertices, followed by L2 normalization:
$$h_{i,i'}=l_2([h_i,h_{i'}])$$
where $h$ denotes a feature.

The aggregation function is a weighted sum. For the $i$-th vertex at layer $l$:
$$m_i^{(l)}=\sum_{j\in G_T}w_{i,j}^{(l)}h_j^{(l)}$$
$$h_i^{(l+1)}=\mathrm{MLP}\Big(h_i^{(l)}+\frac{\|h_i^{(l)}\|_2\,m_i^{(l)}}{\|m_i^{(l)}\|_2}\Big),\qquad i\in G_D,\ j\in G_T$$

How are the weights defined? Appearance and motion are measured jointly, with a Kalman filter used to predict the trajectory position:
$$w_{i,j}^{(l)}=\cos(h_i^{(l)},h_j^{(l)})+\mathrm{IoU}(g_i,g_j)$$
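The cross-graph aggregation step can be sketched in plain Python as below. This is an illustration under stated assumptions: the weights are passed in directly rather than computed from cosine similarity and IoU, the MLP defaults to the identity, and the function names are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cross_graph_aggregate(h_i, tracklet_feats, weights, mlp=lambda v: v):
    """Update one detection feature h_i from all trajectory features.

    m is the weighted sum of trajectory features; it is rescaled to the
    norm of h_i before being added, then passed through the MLP.
    Assumes m is not the zero vector.
    """
    dim = len(h_i)
    m = [sum(w * h_j[d] for w, h_j in zip(weights, tracklet_feats))
         for d in range(dim)]
    scale = l2_norm(h_i) / l2_norm(m)
    return mlp([a + scale * b for a, b in zip(h_i, m)])

# Detection feature with norm 5, two trajectories, all weight on the first one
out = cross_graph_aggregate([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(out)  # [8.0, 4.0]
```

Rescaling the message to the norm of $h_i$ keeps the aggregated term from dominating or vanishing relative to the vertex's own feature.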

5.3.2 Differentiable graph matching layers

The previous section described graph matching in MOT mathematically. The authors want to make this step differentiable as well, so that it becomes part of the network, as in DeepMOT.

The authors borrow from OptNet, a method that embeds the solution of optimization problems into a network. They say the detailed proof is in the appendix, but I could not find the appendix!

5.4 Evaluation

The merit of this paper is the cross-graph GCN, which updates features while considering both detections and trajectories. It also turns the matching problem into a graph matching problem (no longer limited to the bipartite matching pattern of the Hungarian algorithm) and makes the graph matching algorithm differentiable.

I have a few doubts. One is how the differentiable graph matching is actually implemented. The other is that graph matching should require the two graphs to have the same number of vertices, so how are unequal numbers handled?

6. Joint Detection and Multi-Object Tracking with Graph Neural Networks(GSDT, Online, arxiv 2021.4)

6.1 Abstract

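The highlights of this paper are (1) joint detection and tracking (JDT) and (2) the GNN. The selling point of JDT is that data association can be trained together with detection, giving a one-shot pipeline.

Much previous JDT work does not consider object-object relations. And much previous graph network work is still tracking-by-detection (the first two papers above, for example). This work uses the GNN structure to model object-object relations naturally and is also JDT, as shown below.
[Figure: GSDT compared with earlier JDT and graph-based trackers]

6.2 Method

The principle of GSDT is fairly simple. To achieve JDT, the most direct way is to predict detections and target features from the same feature map. To be online, past trajectories must be associated with the detections of the current frame.

The overall architecture is shown below:
[Figure: overall architecture of GSDT]

Suppose at frame $t$ we have the feature maps of frame $t-1$ and frame $t$. Given the existing trajectories, RoIAlign crops the corresponding locations from the frame $t-1$ feature map, which yields the trajectory features. We want to learn object-object relations between past and present, but for the current frame we only have a feature map and no detections. So the whole current feature map is treated as potential detections: a feature map in $\mathbb{R}^{c\times h\times w}$ is flattened pixel by pixel into $hw$ vectors of dimension $c$.

The trajectory features and the features flattened from the current feature map become the vertices of the graph. Targets in the same frame cannot match each other, so we only need edges from frame $t-1$ vertices to frame $t$ vertices. In addition, only nearby targets are strongly related, so edges only connect a trajectory to nearby pixels. The GNN then updates the features. However, with a single GNN layer there is only temporal information and no spatial information. With multiple layers, features propagate farther over the iterations, which brings in spatial information. The GNN part is shown below:

[Figure: GNN feature update in GSDT]
After the GNN, the $hw$ vectors of dimension $c$ have been updated. We reshape them back to $\mathbb{R}^{c\times h\times w}$ and predict the different tasks on this new feature map: location, shape and appearance.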
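The flatten-and-reshape step around the GNN can be sketched with nested lists. This is only an illustration of the indexing; a real implementation would use tensor reshapes. The function names and the row-major pixel order are assumptions.

```python
def flatten_feature_map(fmap):
    """fmap[c][y][x] -> list of h*w node vectors, each of length c (row-major)."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[k][y][x] for k in range(c)]
            for y in range(h) for x in range(w)]

def unflatten_nodes(nodes, h, w):
    """Inverse of flatten_feature_map: h*w vectors of length c -> fmap[c][y][x]."""
    c = len(nodes[0])
    return [[[nodes[y * w + x][k] for x in range(w)] for y in range(h)]
            for k in range(c)]

# Toy feature map with c=2 channels, h=1, w=2
fmap = [[[1, 2]], [[3, 4]]]
nodes = flatten_feature_map(fmap)      # one 2-d vector per pixel
restored = unflatten_nodes(nodes, 1, 2)
print(nodes, restored)
```

Each entry of `nodes` is one candidate-detection vertex of the graph; after the GNN updates them, the inverse mapping restores the spatial layout for the prediction heads.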

[Figure: prediction heads on the updated feature map]
Once the features are obtained, the data association stage still computes the affinity between detection embeddings and trajectory embeddings and matches them with the Hungarian algorithm.

6.3 Evaluation

GSDT uses a GNN to update the feature map of the current frame, fusing past trajectory features and spatial information. However, the past features come only from frame $t-1$, so if a target is blurred, for instance, its appearance information may be inaccurate. Averaging over a number of past frames would probably work better. Also, the GNN only uses vertex information; assigning features to the edges might make the representation richer.

7. Detection Recovery in Online Multi-Object Tracking with Sparse Graph Tracker(SGT, Online, arxiv 2022.5)

7.1 Abstract

The paper starts from the problem of missed detections and points out that a major limiting factor of MOT performance is missed detections caused by occlusion, blur and so on. It therefore uses a GNN to recover missed detections. Nodes represent detections, and edges represent the similarity between two detections (such as position or appearance). Then, as in MOTSolv, deciding whether two detections belong to the same target becomes an edge classification problem.

The model is online and operates on two adjacent frames at a time. It needs neither motion features nor an additional Re-ID network. It is a JDT model (like GSDT).

7.2 Introduction & Related Work

In these two parts the authors make an interesting point: "It is very important to find the detection threshold that achieves the best trade-off between FP and TP." There are two main earlier works on handling missed detections. One is ByteTrack and the other is OMC, which I have not read. ByteTrack improves performance by lowering the threshold and matching multiple times. However, this increases FP while reducing FN. Re-associating through features, as SGT does, should in principle work better.

7.3 Method

SGT is based on FairMOT, and what it actually does is the optimization of the matching stage.

SGT recovers detections that would otherwise be discarded by means of edge classification. Edges represent the similarity between nodes (detections), measured by three indicators: position, appearance and IoU:

[Equation: edge features built from position, appearance and IoU]

Unlike the methods above, vertices are not initialized with Re-ID features. Instead, the features extracted by the backbone network are used to initialize the vertices, and they are shared by all vertices.

Let's walk through the SGT pipeline.

[Figure: SGT pipeline]

  1. Step 1. Take two (presumably adjacent) frames $t_1, t_2$ as input, select the top-$K$ detections by confidence ($K$ should be fairly large), and extract their features. Among these $K$ detections, some have high confidence and some low. Earlier methods would usually discard the low-confidence ones; here the GNN is used to recover them.
  2. Step 2. Build the graph. Note that the green nodes are detections lost before $t_1$; they are added to the vertices on the $t_1$ side. Lost detections also have a limited lifetime.
  3. Step 3. Run the GNN forward and update the features of vertices and edges.
  4. Step 4. Perform binary classification on the edges to decide whether two detections are the same target, and use the Hungarian algorithm to enforce one-to-one matching. At this point, some detections that would have been discarded may be recovered because their edge is classified as positive. To suppress FP, the recovered vertices are then classified as well, and a vertex is actually recovered only if its score exceeds a confidence threshold.
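The recovery rule in Step 4 can be sketched as a simple filter. This is an illustrative reading of the steps above, not SGT's code: the function name, the two thresholds and the input format (a set of detection indices that received a positive edge match, plus per-vertex classifier scores) are all assumptions.

```python
def select_detections(det_scores, matched, node_scores,
                      det_thresh=0.5, node_thresh=0.4):
    """Return indices of detections to keep in the current frame.

    High-confidence detections are kept directly. A low-confidence detection
    is recovered only if an edge matched it to a previous target AND its
    vertex classification score passes node_thresh (to suppress FP).
    """
    kept = []
    for idx, score in enumerate(det_scores):
        if score >= det_thresh:
            kept.append(idx)
        elif idx in matched and node_scores[idx] >= node_thresh:
            kept.append(idx)
    return kept

det_scores = [0.9, 0.3, 0.2, 0.1]      # detector confidence of the top-K candidates
matched = {1, 2}                       # low-score candidates matched via edge classification
node_scores = [0.9, 0.6, 0.1, 0.8]     # vertex classifier output
kept = select_detections(det_scores, matched, node_scores)
print(kept)  # [0, 1]
```

Candidate 2 is matched but fails the vertex check, and candidate 3 passes the vertex check but has no match, so neither is recovered.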

7.4 Evaluation

SGT is essentially an improvement on the problem of discarding low-confidence detections. Compared with ByteTrack it is less brute-force, and it deliberately suppresses FP through vertex classification and similar means.

8. Multiplex Labeling Graph for Near-Online Tracking in Crowded Scenes(MLG, Nearly-Online, IEEE IoTJ 2020)

8.1 Abstract

Almost all MOT methods treat each target separately. In graph network methods, for instance, one node represents one target. The authors argue, however, that physical scenes are 3D and a 2D image cannot fully represent 3D information. A pixel in the video does not necessarily contain only one target: when occlusion happens in the physical world, that pixel should stand for several targets. Based on this assumption, the paper builds a nearly-online tracker (only a few future frames are used each time, so the result for the current frame is given a few frames later) in which each node can contain multiple targets in order to handle occlusion.

8.2 Method

Because the method is nearly-online, a sliding window is used to build the graph. For example, given the trajectories before frame $t$, frames $t-1$ to $t+2$ are used to build the graph. The graph is directed, with edges pointing only from the past to the future.

Vertex meaning: target
Edge meaning: trajectory

As mentioned above, a vertex can contain multiple targets to handle occlusion. Determining target trajectories becomes the problem of finding optimal paths in the graph: the paths with the maximum total weight are taken as the trajectories, as follows:

[Equation: path selection objective with per-vertex capacity constraint]

where $E$ denotes the final set of trajectories, $S_i$ the score (confidence) of path $i$, and $x_i\in\{0,1\}$ indicates whether path $i$ is selected. The constraint is enforced over all vertices: $D(v_i)$ is the number of targets a vertex represents, which must not exceed $d_{max}$.

In this way, instead of the one-to-one matching of the Hungarian algorithm, one node can represent multiple targets, and more accurate trajectories can be obtained under occlusion, as shown in the following figure:

[Figure: trajectories through a shared node under occlusion]
The edge weight (whether two nodes belong to the same target) is obtained by classification with an LSTM. Many works do something similar, so I will not repeat it here.

The MLG pseudocode outlines the algorithm. After the window slides, edges are built between the previous end nodes (out-degree 0) and all current start nodes (in-degree 0). Edges with low confidence are filtered out first, and then edges that would push a node's degree above $d_{max}$ (that is, the node would represent more than $d_{max}$ targets) are removed, which approaches the optimum of the formula above greedily.
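A greedy selection under the per-vertex capacity $d_{max}$ can be sketched as follows. This is a simplified illustration of the idea, not the paper's pseudocode: it selects whole candidate paths by score, assumes each vertex appears at most once per path, and uses hypothetical names throughout.

```python
def select_paths(paths, d_max):
    """Greedily keep high-score paths while no vertex is used more than d_max times.

    paths: list of (score, [vertex ids]) candidates.
    """
    usage = {}      # how many selected paths pass through each vertex
    chosen = []
    for score, verts in sorted(paths, key=lambda p: p[0], reverse=True):
        if all(usage.get(v, 0) < d_max for v in verts):
            chosen.append((score, verts))
            for v in verts:
                usage[v] = usage.get(v, 0) + 1
    return chosen

# Three candidate paths share vertex "b" (an occluded region); it may hold 2 targets
paths = [(0.9, ["a", "b"]), (0.8, ["c", "b"]), (0.5, ["d", "b"])]
chosen = select_paths(paths, d_max=2)
print(chosen)  # the two highest-scoring paths through "b"
```

With `d_max=1` this reduces to ordinary one-target-per-node selection, which is the case the Hungarian-style methods above are limited to.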

[Figure: MLG pseudocode]

9. Unifying Short and Long-Term Tracking with Graph Hierarchies(SUSHI, Offline, CVPR2023)

Overview

This paper improves on Learning a Neural Solver for Multiple Object Tracking by merging trajectories recursively, as shown in the following figure:

[Figure: recursive, hierarchical merging of tracklets]

After short tracklets are merged into longer ones, each new node represents a merged trajectory, as shown in the figure below:

[Figure: nodes representing merged trajectories at the next level]
This reduces the amount of computation. Each SUSHI block is a GNN in which nodes represent trajectories and edges represent matching relationships. After several layers of message passing update the features, an MLP scores the likelihood of the relationship represented by each edge. Node features are initialized as zero vectors, and edge features are built from association cues (Re-ID features, position and time relations, etc.). One design choice worth noting is that all SUSHI blocks share weights. To distinguish blocks at different levels, the authors add a level embedding, which lets edge features encode the feature differences expected at each level. Another benefit of sharing weights is that it effectively increases the amount of training data.

In terms of implementation details, the authors use four association levels, covering 5, 25, 75 and 150 frames respectively. When building the graph, each node is connected to its 15 nearest neighbors, where the nearest-neighbor distance combines geometry, appearance and motion similarity.
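To visualize how the four levels cover a clip, the sketch below partitions a frame range into windows of each level's length. Splitting into non-overlapping windows is an assumption made for illustration, as are the function name and the 150-frame example; only the window lengths 5, 25, 75 and 150 come from the text above.

```python
def clips_for_level(num_frames, window):
    """Split frames [0, num_frames) into consecutive windows of the given length."""
    return [(start, min(start + window, num_frames))
            for start in range(0, num_frames, window)]

levels = (5, 25, 75, 150)
num_frames = 150
counts = [len(clips_for_level(num_frames, w)) for w in levels]
print(counts)  # number of windows handled at each hierarchy level
```

At the lowest level many short windows are processed, each with few nodes; higher levels see fewer, longer windows whose nodes are already-merged tracklets, which is where the computational saving comes from.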

10. Comparison

| Algorithm | Online/Offline | Paradigm | Overall idea | Graph form | Vertex meaning | Edge meaning |
| --- | --- | --- | --- | --- | --- | --- |
| MOTSolv | Offline | TBD | GNN solves the min-cost flow problem | Complete graph | Target Re-ID feature | Target similarity |
| GSM | Online | TBD | Build a graph for each target and measure similarity between targets by graph similarity | (Directed) sparse graph | Target Re-ID feature | Topological (position) similarity |
| GNMOT | Online | TBD | Build appearance and motion graphs between detections and trajectories, score each candidate pair, then solve for the best matching | Bipartite graph | Appearance and motion features of the target | Likelihood of belonging to the same target |
| TrackMPNN | Online | TBD | Build detection nodes and association nodes; association nodes represent matching relations and their scores are used for matching | Bipartite graph | Detection nodes: position features and category; association nodes: likelihood of belonging to the same target | - |
| GMTracker | Online | TBD | Cast matching as graph matching (vertex mapping) and use the implicit function theorem to make the matching layer differentiable | Complete graph | Target Re-ID feature | Target similarity |
| GSDT | Online | JDT | Use a GNN to update the current feature map; different heads perform different tasks | Sparse graph | One side: Re-ID features of trajectories; other side: pixel features of the feature map | - |
| SGT | Online | JDT | Use edge classification to decide whether two detections belong to the same target, so as to recover low-confidence detections | Sparse graph | The entire feature map | Target similarity |
| MLG | Nearly-Online | TBD | A node can represent multiple targets; matching becomes a maximum flow problem | Directed sparse graph | Target | Path |
| SUSHI | Offline | TBD | Hierarchical recursive association | Undirected graph | Target (trajectory) | Target similarity |


Origin blog.csdn.net/wjpwjpwjp0831/article/details/125470648