[Paper Reading Notes 12] MeMOT: Multi-Object Tracking with Memory, an MOT Algorithm with Memory


MeMOT is a CVPR 2022 paper. Its biggest highlight is that it stores information from all previous frames of each target and encodes that history as tracking cues, so that detection and tracking are performed jointly. The approach feels brute-force: all three parts of the network are Transformer structures (with a CNN for feature extraction). Still, the idea of exploiting all previous frame information (perhaps borrowing something from batch methods to help online methods) is worth thinking about.

Paper Address: Paper

Currently not open source.

As usual, let's go through the paper in order.


1. Introduction

In the introduction, the authors criticize recent Transformer-based MOT work such as TrackFormer and MOTR. These methods couple detection and tracking (a query-key mechanism in which the query vectors represent existing trajectories and new detections), but this coupling forces unnecessary simplifications in the association module when modeling how targets change over time.

What this means is that it is hard for TrackFormer-style methods to reflect how a target evolves inside the association module (only the current frame's features are used); in effect, it sets up the argument for using features from all past time steps.

To address this, MeMOT proposes a long-term spatio-temporal memory that stores the features of each tracked target over all past frames. MeMOT consists of three parts:

  1. Hypothesis Generation Module:
    extracts features and produces region proposals for the possible locations of targets in the current frame. This is essentially the detection part, implemented with (Deformable) DETR.
  2. Memory Encoding Module:
    encodes each tracked target into a vector, called the track embedding.
  3. Memory Decoding Module:
    as the name suggests, the last stage. It takes the region proposals from the hypothesis generation module and the track embeddings from the memory encoding module as input and performs detection and association jointly.

2. Related Work

The last part of Related Work is probably where the inspiration came from: in sequential reasoning tasks in NLP, such as dialogue systems, there is work with similar memory mechanisms. Action recognition and video instance segmentation also use extra memory to store temporal features, and memory appears in some recent single-object tracking algorithms, but no one had used memory in MOT before.

Let's talk about how each part is done.

3. Multi-Object Tracking with Memory

3.1 Hypothesis Generation

The hypothesis generation module is straightforward: either DETR or Deformable DETR can be used. (Deformable) DETR extracts image features with a CNN, flattens them into a sequence of vectors, and feeds them to the Transformer encoder. On the decoder side, the inputs are query vectors representing targets; each decoder layer performs cross-attention against the encoder output, and the decoder finally outputs a set of vectors, each representing one target's features. A bounding box and class are then predicted from each vector.

MeMOT treats the output of (Deformable) DETR as region proposals, denoted $\textbf{Q}_{pro}^t \in \mathbb{R}^{N_{pro}^t \times d}$, where "pro" stands for proposal, $t$ is the frame index, $N_{pro}^t$ is the number of proposals, and $d$ is the feature dimension.

This is similar to Tracktor. Tracktor uses the region proposals learned by Faster R-CNN directly as predictions of the current-frame targets and matches them with past trajectories, which is why that paper is called "Tracking without bells and whistles": a plain detector can be used for tracking.
MeMOT borrows this practice and treats the output of (Deformable) DETR as proposals.
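As a rough illustration (not the authors' code; the module and variable names here are made up), treating the decoder outputs of a DETR-style detector as per-frame proposal embeddings could look like this:

```python
import torch
import torch.nn as nn

# Toy sketch: a DETR-style detector whose decoder outputs are kept as
# proposal embeddings Q_pro^t (all shapes and names are assumptions).
d, n_queries = 256, 300

class ToyDETR(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, d, kernel_size=16, stride=16)  # stand-in for the CNN
        self.transformer = nn.Transformer(d_model=d, batch_first=True)
        self.object_queries = nn.Parameter(torch.randn(n_queries, d))

    def forward(self, images):                                       # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)     # (B, HW', d) token sequence
        queries = self.object_queries.unsqueeze(0).expand(images.size(0), -1, -1)
        q_pro = self.transformer(src=feats, tgt=queries)             # (B, N_pro, d)
        return q_pro                       # each row is one region-proposal embedding

q_pro_t = ToyDETR()(torch.randn(1, 3, 512, 512))   # Q_pro^t in R^{N_pro x d}
```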

3.2 Spatial-Temporal Memory

The paper keeps saying that something stores the features of every frame of each trajectory. In practice it is not literally all frames, since that would be too long and not very useful; keeping the features of a fixed number of recent frames is enough. The author defines a FIFO (first-in, first-out) buffer, denoted $\textbf{X} \in \mathbb{R}^{N\times T \times d}$, where $N$ is a number large enough to exceed the total number of targets in the video, e.g. $N = 600$, and $T$ is the temporal window to store, e.g. $T = 24$.

Therefore, $\textbf{X}$ stores the feature of each target at each stored frame.
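A minimal sketch of such a FIFO buffer, assuming the $N \times T \times d$ layout above (the feature dimension $d$ and the update function are my own assumptions, not the paper's code):

```python
import torch

# Spatio-temporal memory sketch: for each of N track slots, keep the last T
# per-frame embeddings of dimension d.
N, T, d = 600, 24, 256            # N and T are the example values above; d is assumed
memory = torch.zeros(N, T, d)     # X in R^{N x T x d}

def update_memory(memory, track_ids, new_embeddings):
    """FIFO update: shift out the oldest frame slot and append the newest one."""
    # new_embeddings: (len(track_ids), d) embeddings of the active tracks at frame t
    memory[track_ids] = torch.roll(memory[track_ids], shifts=-1, dims=1)
    memory[track_ids, -1] = new_embeddings
    return memory

# usage: two tracks (ids 0 and 7) get their newest frame features written in
memory = update_memory(memory, torch.tensor([0, 7]), torch.randn(2, d))
```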

3.3 Memory Encoding

As mentioned earlier, the memory encoding module needs to encode each target into a single vector. How? We now have a pile of features for many targets sitting in the spatio-temporal memory; how do we turn each target into one track embedding?

The author designs three attention modules: 1) a short-term block, which attends over adjacent frames to smooth out noise; 2) a long-term block, which associates features across many frames to extract longer-range information; and 3) a fusion block, which mixes the outputs of the short-term and long-term blocks and produces the final track embedding. They are explained separately below.

1. Short-term block

For each trajectory (the paper gives the impression that each trajectory is processed separately), the short-term block only attends over the most recent $T_s$ frames of that trajectory. Specifically, given the latest feature $\textbf{X}^{t-1} \in \mathbb{R}^{d}$ and the features of the last $T_s$ frames $\textbf{X}^{t-1-T_s:t-1} \in \mathbb{R}^{T_s \times d}$, it uses $\textbf{X}^{t-1}$ as $Q$ and $\textbf{X}^{t-1-T_s:t-1}$ as $K, V$ and computes cross-attention.

After computing the attention result for every trajectory, the results are aggregated as the output of the short-term block, called the aggregated short-term token and denoted $\textbf{Q}_{AST}^t$.

The exact dimension of $\textbf{Q}_{AST}^t$ is unclear; the paper gives no hint.
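A minimal sketch of the short-term block for one trajectory, using the shapes defined above (the head count and feature dimension are assumptions; this is not the authors' implementation):

```python
import torch
import torch.nn as nn

# Short-term block sketch: the latest token of a track attends over its last T_s tokens.
d, T_s, n_heads = 256, 3, 8    # example values; d and n_heads are assumptions
attn = nn.MultiheadAttention(embed_dim=d, num_heads=n_heads, batch_first=True)

x_last = torch.randn(1, 1, d)        # X^{t-1}, the query
x_recent = torch.randn(1, T_s, d)    # X^{t-1-T_s : t-1}, the keys/values
short_term_token, _ = attn(query=x_last, key=x_recent, value=x_recent)
# Doing this for every track and stacking the results gives Q_AST^t.
```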

2. Long-term block

The long-term block is computed in essentially the same way as the short-term block, except that it attends over more frames. Let the number of frames it attends to be $T_l$, with $T_l > T_s$. The target's features over these frames are $\textbf{X}^{t-1-T_l:t-1} \in \mathbb{R}^{T_l \times d}$, and, as in the short-term block, $\textbf{X}^{t-1-T_l:t-1}$ serves as the $K, V$ of the cross-attention.

Then what serves as $Q$?

Here the author uses a recurrent structure (similar to LSTM/RNN): $Q$ is the output of the entire memory encoding module from the previous frame, i.e. the track embeddings. The author gives it a new name, the dynamic memory aggregation token (DMAT), denoted $\textbf{Q}_{DMAT}^{t-1}=\{q_{k}^{t-1}\}|_{k=1}^{N}\in \mathbb{R}^{d\times N}$, where $N$ is the number of tracks.

Therefore, the long-term block uses $\textbf{X}^{t-1-T_l:t-1}$ as $K, V$ and $\textbf{Q}_{DMAT}^{t-1}$ as $Q$ to compute cross-attention. As in the short-term block, the outputs are aggregated over targets and denoted $\textbf{Q}_{ALT}^t$.

3. Fusion module

Then the short-term output $\textbf{Q}_{AST}^t$ and the long-term output $\textbf{Q}_{ALT}^t$ are concatenated and self-attention is computed over them. The output is the track embedding, which serves as the DMAT (i.e. the long-term block's $Q$) for the next frame.
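Putting the long-term block and the fusion block together, a hedged sketch could look like the following (shapes, the head count, and the final reduction to one token per track are my assumptions; the paper's exact wiring may differ):

```python
import torch
import torch.nn as nn

# Long-term + fusion block sketch; each track carries its own memory window.
d, T_l, n_heads, N = 256, 24, 8, 600          # example values; d and n_heads assumed
long_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
fuse_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

q_dmat_prev = torch.randn(N, 1, d)    # DMAT from the previous frame: one recurrent query per track
mem_long    = torch.randn(N, T_l, d)  # X^{t-1-T_l : t-1} for every track
q_alt, _ = long_attn(query=q_dmat_prev, key=mem_long, value=mem_long)   # Q_ALT^t

q_ast = torch.randn(N, 1, d)          # Q_AST^t from the short-term block
tokens = torch.cat([q_ast, q_alt], dim=1)        # concatenate short- and long-term tokens
fused, _ = fuse_attn(tokens, tokens, tokens)     # self-attention over both
q_dmat = fused.mean(dim=1, keepdim=True)         # assumed reduction to the new DMAT / track embedding
```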

The encoder module is shown in the figure below.

[Figure: architecture of the memory encoding module]
The ablation study on the choice of the short-term and long-term window lengths $T_s$ and $T_l$ is shown in the figure below:

[Figure: ablation on the memory window lengths $T_s$ and $T_l$]

3.4 Memory Decoding

Earlier we obtained the region proposals $\textbf{Q}_{pro}^t$ from (Deformable) DETR, and the track embeddings from the memory encoder, denoted $\textbf{Q}_{tck}^t$.

(In fact, $\textbf{Q}_{tck}^t$ and $\textbf{Q}_{DMAT}^t$ should be the same thing.)

The concatenation $[\textbf{Q}_{pro}^t, \textbf{Q}_{tck}^t]$ is used as the decoder's $Q$, the image features $z_t\in \mathbb{R}^{d\times HW}$ as the decoder's $K, V$, and cross-attention is computed. Denote the decoder output as $[\hat{\textbf{Q}}_{pro}^t, \hat{\textbf{Q}}_{tck}^t]$.

What does the memory decoder output? Unlike some previous methods, it outputs not only the target's position estimate and its confidence, but also a probability estimating the degree of occlusion of the target.

In this paper, the score measuring the degree of occlusion is called the objectness score, and the confidence score is called the uniqueness score. For the $i$-th target at frame $t$ they are denoted $o_i^t$ and $u_i^t$ respectively.

The article defines the final confidence, which is the product of objectness score and uniqueness score:
$s_i^t = o_i^t u_i^t$

Therefore, the decoder's prediction consists of the two confidences plus the position. For the input $\hat{\textbf{Q}}_{pro}^t$ the output confidences are $\textbf{S}_{pro}^t$, and similarly for the input $\hat{\textbf{Q}}_{tck}^t$ the output confidences are $\textbf{S}_{tck}^t$. A bounding box $\textbf{b}_i^t\in\mathbb{R}^{4}$ is predicted for each target.

Like TrackFormer and other algorithms, the output corresponding to the two parts of the input is processed separately.
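A minimal sketch of this decoding step, assuming simple linear heads on top of a single cross-attention layer and a made-up threshold (none of the names, shapes, or the value of $\epsilon$ come from the paper):

```python
import torch
import torch.nn as nn

# Memory decoding sketch: proposals + track embeddings attend over image features,
# then separate heads predict objectness, uniqueness, and boxes.
d, n_heads, N_pro, N_tck, HW = 256, 8, 300, 50, 1024
decoder_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
objectness_head = nn.Linear(d, 1)   # o_i^t: how visible / un-occluded the target is
uniqueness_head = nn.Linear(d, 1)   # u_i^t: confidence that this is a distinct target
bbox_head = nn.Linear(d, 4)         # b_i^t

q_pro = torch.randn(1, N_pro, d)    # Q_pro^t from hypothesis generation
q_tck = torch.randn(1, N_tck, d)    # Q_tck^t from the memory encoder
z_t   = torch.randn(1, HW, d)       # image features as keys/values

q = torch.cat([q_pro, q_tck], dim=1)               # [Q_pro^t, Q_tck^t] as decoder queries
q_hat, _ = decoder_attn(query=q, key=z_t, value=z_t)

o = objectness_head(q_hat).sigmoid()               # objectness score
u = uniqueness_head(q_hat).sigmoid()               # uniqueness score
s = o * u                                          # final confidence s_i^t = o_i^t * u_i^t
boxes = bbox_head(q_hat).sigmoid()                 # normalized boxes (format assumed)
keep = s.squeeze(-1) >= 0.5                        # inference filtering s_i^t >= epsilon (epsilon assumed)
```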

So the key question is what is the use of defining the new confidence in this way?

The author made two points.

  1. At inference time, detections and tracks are kept only if their confidence satisfies $s_i^t \ge \epsilon$.
  2. Assignment of ground-truth confidences. Objectness scores are assigned to the existing trajectories $\hat{\textbf{Q}}_{tck}^t$, and uniqueness scores to the hypothesis generation outputs $\hat{\textbf{Q}}_{pro}^t$; note that the hypothesis generation module outputs not only new targets but also old trajectories. Bipartite matching is used to match the proposals with all output vectors. The rules for assigning the ground-truth confidences are as follows:

[Figure: rules for assigning ground-truth objectness and uniqueness scores]

3.5 Loss function

The overall loss function is similar to that in MOTR, and the loss of each track query is calculated and averaged.
[Equation: overall loss, averaged over track queries]
$L_{tck}^{i,t}$ denotes the tracking loss of the $i$-th target at frame $t$, defined as:

[Equation: per-track tracking loss $L_{tck}^{i,t}$]
where $L_{obj}'$ and $L_{uni}'$ are focal losses on the objectness and uniqueness scores, the bbox term is an L1 loss, and the IoU term is the generalized IoU loss.

Similarly, $L_{det}^{j,t}$ denotes the detection loss of the $j$-th target at frame $t$, defined as:

[Equation: per-proposal detection loss $L_{det}^{j,t}$]
Note that when computing this loss, the author adds an extra linear decoding layer that projects the proposals output by the hypothesis generation module into confidences and bboxes (in effect, a separate (Deformable) DETR loss).
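A rough sketch of what one per-track loss term might look like, given the loss types named above (focal loss for the two scores, L1 for the box, generalized IoU). The weights, the exact formula, and the use of torchvision's loss utilities (available in recent torchvision) are my assumptions, not the paper's values:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

# Per-track loss sketch; o_pred / u_pred are raw logits, boxes are (x1, y1, x2, y2).
def track_loss(o_pred, u_pred, box_pred, o_gt, u_gt, box_gt,
               w_obj=1.0, w_uni=1.0, w_l1=5.0, w_iou=2.0):   # weights are assumed
    l_obj = sigmoid_focal_loss(o_pred, o_gt, reduction="mean")            # objectness score
    l_uni = sigmoid_focal_loss(u_pred, u_gt, reduction="mean")            # uniqueness score
    l_l1  = F.l1_loss(box_pred, box_gt)                                   # bbox L1
    l_iou = generalized_box_iou_loss(box_pred, box_gt, reduction="mean")  # generalized IoU
    return w_obj * l_obj + w_uni * l_uni + w_l1 * l_l1 + w_iou * l_iou

# example call with dummy tensors
gt_boxes = torch.tensor([[0.0, 0.0, 1.0, 1.0]] * 4)
loss = track_loss(torch.randn(4, 1), torch.randn(4, 1), gt_boxes,
                  torch.rand(4, 1), torch.rand(4, 1), gt_boxes)
```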

4. Evaluation

The advantage of this method is that it deliberately uses information from many past frames in the attention computation, essentially brute-forcing the model into fusing multi-frame information. But it uses a lot of attention structures and is trained on 8 A100 GPUs. The model is clearly very heavy, and the results are not even as good as FairMOT's.

My doubt concerns the two proposed scores. The score representing visibility is quite novel, but I don't see what role it plays beyond forming a new kind of confidence and adding a term to the loss. Since visibility can be predicted, I think it could be explored further, for example using a low matching threshold for targets predicted to have low visibility and a high threshold for highly visible ones; compared with a global low-threshold matching scheme (as in ByteTrack), this could reduce false positives.
