[Thesis Reading Notes 14] Online MOT with Unsupervised Re-ID Learning and Occlusion Estimation


Paper Address: Paper

The reason I want to take notes on this article is that it is a rare work aimed specifically at occlusion. In addition, more and more recent algorithms use self-supervised (unsupervised) methods (this paper, and Wang et al.'s Multiple Object Tracking with Correlation Learning) or contrastive learning (QDTrack) to learn target features. The advantage of this approach is that feature learning does not need to be cast as a large-scale ID-classification problem as in JDE or FairMOT: when the dataset is very large, the number of identities becomes very large, and classifying that many identities with finite-dimensional vectors may not work well.

In addition, this paper proposes a very interesting strategy for occlusion. Using a heatmap-style representation, a separate branch predicts the likelihood that targets occlude each other at each position in the image; then, in the association stage, an extra matching step is added to recover occluded targets.

The Related Works section of this article is well written, so I will mainly cover that section and the method.


1. Related Works

Building on FairMOT, the article addresses two major issues: unsupervised Re-ID and occlusion estimation. The Related Works section therefore focuses on work related to Re-ID and occlusion handling.

1.1 Re-ID related work

Re-ID can be roughly divided into supervised and unsupervised. Among supervised methods, some crop the detected target and then extract features in a second pass, which wastes time and computation. To solve this, other methods share backbone features between detection and Re-ID, such as JDE and FairMOT.

For unsupervised Re-ID, there are roughly two approaches. One is to use pseudo IDs, obtained by clustering or tracking. However, the number of such IDs is not necessarily accurate, and errors accumulate. The second is ID-free: there is no need to estimate the number of IDs at all. This paper adopts the latter.

1.2 Related work on occlusion handling

The current MOT algorithm can be divided into two ways to deal with occlusion:
1. Deal with occlusion in the detection stage. For example, detect multiple targets in one proposal box, or detect a pedestrian's head and body separately. However, this approach requires careful design of the NMS algorithm.

2. Deal with occlusion in the tracking stage. The most direct way is to use SOT; for example, some methods create an SOT tracker for each target. Another way is to assume that the topological relations between adjacent frames stay unchanged, and use the neighborhood topology to find missed targets.

2. Method

This work mainly has two aspects: 1. unsupervised Re-ID; 2. occlusion estimation. They are explained separately below.

2.1 Design of Unsupervised Re-ID

The overall unsupervised optimization goal is roughly to make the feature representations of the same target as similar as possible, and those of different targets as dissimilar as possible.

Suppose there are two adjacent frames $I^{t-1}, I^t$, containing $N^{t-1}$ and $N^t$ targets respectively. Of course, there is a high probability that many targets appear in both frames. Ignoring this for now, we concatenate the targets of the two frames, giving $N^{t-1}+N^t$ targets in total. How do we design an unsupervised learning objective? The following two principles should be followed:

  1. Targets in the same frame cannot match each other, which is a strong constraint.
  2. There are many duplicate targets in the two frames, so the target in one frame has a high probability to find a match in the other frame. This is a weak constraint.

Regarding the matching relationship, it is natural to express it as a matrix. Writing the pairwise matching relations of all $N^{t-1}+N^t$ targets as a matrix gives the figure below:
(Figure: the matching matrix of all targets from the two frames)

In the figure above, the yellow area indicates intra-frame matching, which is forbidden by the strong constraint. The blue area indicates matching between the two frames, governed by the weak constraint. Obviously this matrix is symmetric.

We denote this matrix as $S\in\mathbb{R}^{(N^{t-1}+N^t)\times(N^{t-1}+N^t)}$. All diagonal elements are set to negative infinity (by the strong constraint, a target cannot match itself), and the remaining elements are cosine similarities of the feature vectors, namely:

$$S_{i,j}=\begin{cases}\cos\langle f_i, f_j\rangle, & \text{if } i\ne j\\ -\infty, & \text{otherwise}\end{cases}$$

Obviously, we want $S_{i,j}$ to be as large as possible for the same target and as small as possible for different targets. We can equivalently convert this to a probability representation by applying a row-wise softmax to $S$:

$$M=\operatorname{softmax}(S),\qquad M_{i,j}=\frac{e^{S_{i,j}T}}{\sum_j e^{S_{i,j}T}}$$

where $T=2\log(N^{t-1}+N^t+1)$ is the temperature parameter of the softmax.
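To make the construction concrete, here is a minimal numpy sketch of $S$ and $M$ (the function name `match_matrix` and the assumption of pre-normalized features are mine, not the paper's):

```python
import numpy as np

def match_matrix(feats, T):
    """feats: (N, d) L2-normalized embeddings of all N^{t-1}+N^t targets."""
    S = feats @ feats.T               # cosine similarity for i != j
    np.fill_diagonal(S, -np.inf)      # strong constraint: no self-match
    E = np.exp(S * T)                 # exp(-inf) -> 0, so the diagonal of M is 0
    return E / E.sum(axis=1, keepdims=True)   # row-wise softmax

# temperature from the paper: T = 2 * log(N^{t-1} + N^t + 1)
```

Each row of `M` is then a probability distribution over possible matches for that target.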

According to the strong constraint, targets within the same frame should not match each other, so the values in the yellow area should be as small as possible, which is equivalent to making their sum small. Hence the intra-frame loss is defined as:

$$L_{id}^{intra}=\sum_{0\le i,j\le N^{t-1}}M_{i,j}+\sum_{N^{t-1}\le i,j\le N^{t-1}+N^t}M_{i,j}$$
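With the block layout above (previous-frame targets first, current-frame targets second), the intra-frame loss is just the sum of the two diagonal blocks. A minimal sketch (the helper name and argument layout are my assumptions):

```python
import numpy as np

def intra_loss(M, n_prev):
    """n_prev = N^{t-1}; rows/cols 0..n_prev-1 belong to the previous frame."""
    # sum of the two intra-frame (yellow) blocks, which training pushes toward 0
    return M[:n_prev, :n_prev].sum() + M[n_prev:, n_prev:].sum()
```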

For the weak constraint, we want each match to be as confident as possible. That is, for target $i$, the most likely match is $j^*=\arg\max_j M_{i,j}$. We want the confidence $M_{i,j^*}$ to be as large as possible and the remaining confidences as small as possible. Therefore, the inter-frame loss is defined as:

$$L_{id}^{inter}=\sum_i\max_{j'\ne j^*}\max\{M_{i,j'}+m-M_{i,j^*},\,0\}$$

The meaning of this formula is to push $M_{i,j'}$ down and $M_{i,j^*}$ up so that the whole loss approaches 0, where $m=0.5$ is a margin constant.
However, there is a problem: when a target appears or disappears, no suitable match may exist. Here the paper borrows a trick from DeepMOT (CVPR 2020, How To Train Your Deep Multi-Object Tracker): append an extra column filled with a small positive constant, so that unmatched targets have a slot to fall into.
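A sketch of this hinge-style inter-frame loss with the dummy column appended (the dummy value 0.05 is an illustrative choice, not the paper's):

```python
import numpy as np

def inter_loss(M, m=0.5, dummy=0.05):
    # append a constant "no-match" column so appearing/disappearing
    # targets have somewhere to match
    Md = np.concatenate([M, np.full((M.shape[0], 1), dummy)], axis=1)
    total = 0.0
    for row in Md:
        order = np.sort(row)
        best, runner_up = order[-1], order[-2]   # M_{i,j*} and the largest other entry
        total += max(runner_up + m - best, 0.0)  # hinge with margin m
    return total
```

The per-row loss is zero once the best match exceeds every other entry by at least the margin.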

As mentioned before, $M$ should in theory be symmetric, so the difference between symmetric entries should be as small as possible, leading to the cycle loss term:

$$L_{id}^{cycle}=\sum_{i,j}|M_{i,j}-M_{j,i}|$$

The final loss function is a linear combination of the above three terms.
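The cycle term is a one-line penalty on asymmetry; a minimal sketch:

```python
import numpy as np

def cycle_loss(M):
    # matching i -> j and j -> i should agree, so penalize |M - M^T|
    return np.abs(M - M.T).sum()
```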

2.2 Occlusion Estimation

To estimate occlusion, we must first define when occlusion occurs. The ground-truth file contains bounding-box labels for each target. Suppose two bounding boxes $b_i, b_j$ overlap, with overlapping region $o_{i,j}$. If the area of $o_{i,j}$ is large enough, we consider that occlusion occurs between the two targets. Specifically, define the function $H$ that decides whether occlusion occurs:
$$H(o_{i,j})=\begin{cases}1, & \text{if }\dfrac{Area(o_{i,j})}{\min\{Area(b_i),Area(b_j)\}}>\tau\\ 0, & \text{otherwise}\end{cases}$$
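This test can be implemented directly from box coordinates; a sketch assuming (x1, y1, x2, y2) boxes and an illustrative threshold value (the paper's $\tau$ is not reproduced here):

```python
def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def is_occluded(bi, bj, tau=0.7):
    """H(o_{i,j}): 1 if the overlap covers more than tau of the smaller box."""
    ox1, oy1 = max(bi[0], bj[0]), max(bi[1], bj[1])
    ox2, oy2 = min(bi[2], bj[2]), min(bi[3], bj[3])
    inter = max(0.0, ox2 - ox1) * max(0.0, oy2 - oy1)   # Area(o_{i,j})
    if inter == 0.0:
        return 0
    return 1 if inter / min(box_area(bi), box_area(bj)) > tau else 0
```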

A heatmap indicates positions where occlusion may occur; in the implementation it is predicted by a CNN branch. As in TransCenter and similar methods, we learn it by generating ground truth with Gaussian functions: in each region we identify as occluded, the ground-truth value is the maximum of the Gaussian functions, whose variance depends on the region size. Specifically:
$$Y_{x,y}=\max\{G(o_{i,j},(x,y))\},\quad \text{s.t. } H(o_{i,j})=1$$
where $G=\exp\{-\frac{((x,y)-p_{i,j})^2}{2\sigma_{i,j}^2}\}$ is a Gaussian function whose mean is the center point $p_{i,j}$ of the occluded region, and whose variance $\sigma_{i,j}^2$ depends on the region size.
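Ground-truth generation can be sketched as a pixel-wise max over one Gaussian per occluded pair (the grid size and the isotropic per-region sigma are illustrative assumptions):

```python
import numpy as np

def occlusion_heatmap(centers, sigmas, h, w):
    """centers: (cx, cy) of each occluded region o_{i,j}; sigmas: matching stds."""
    ys, xs = np.mgrid[0:h, 0:w]
    Y = np.zeros((h, w))
    for (cx, cy), s in zip(centers, sigmas):
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * s ** 2))
        Y = np.maximum(Y, g)   # Y_{x,y} = max over all pairs with H = 1
    return Y
```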

This part of the loss is the same as similar methods, using focal loss, which is in the form of cross entropy.

The key point is how to use the estimated occlusion map in data association.

In the association stage, the normal detection-to-trajectory association is performed first. A trajectory that fails to match any detection was most likely occluded in this frame and therefore missed. Suppose the target's box in the previous frame is $b^{t-1}$ and there is no box in the current frame; we first use a Kalman filter to predict the box $b^t$.

If the target is occluded, the center point of the occluded region must lie within $b^t$. We need to find not only this center point, but also the detection box $b_j$ that occludes the target.

To this end, from the predicted occlusion map we take the occlusion center points $p_{i,k}$ inside $b^t$, and from the detection results we compute the overlap region $o_{i,j}$ with each remaining detection box. To find the best $j,k$, let:
$$j^*,k^*=\arg\max_{j,k}G(o_{i,j},p_{i,k})$$

In fact, the essence of the above formula is to find the detection box closest to a predicted occlusion center point. With the matched occluding box and center point, the position of the occluded box follows easily from the geometric relationship.
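The recovery step can be sketched as a brute-force search over (detection, peak) pairs; everything here (function names, the isotropic sigma) is an illustrative assumption rather than the paper's implementation:

```python
import numpy as np

def overlap_center(a, b):
    """Center of the overlap region of boxes a, b (x1, y1, x2, y2), or None."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def recover_occluded(pred_box, detections, peaks, sigma=8.0):
    """pred_box: Kalman-predicted box b^t of the unmatched track.
    detections: candidate boxes b_j; peaks: occlusion centers p_{i,k} inside b^t.
    Returns (j*, k*) maximizing G(o_{i,j}, p_{i,k}), or None if nothing overlaps."""
    best, best_score = None, -1.0
    for j, det in enumerate(detections):
        c = overlap_center(pred_box, det)       # center of o_{i,j}
        if c is None:
            continue
        for k, (px, py) in enumerate(peaks):
            score = np.exp(-((c[0] - px) ** 2 + (c[1] - py) ** 2) / (2 * sigma ** 2))
            if score > best_score:
                best, best_score = (j, k), score
    return best
```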

2.3 Overall structure

The overall structure is to parallelize the three parts of detection, Re-ID and occlusion estimation:
(Figure: overall architecture, with detection, Re-ID, and occlusion estimation as parallel heads)

3. Evaluation

The method of this article is quite novel: self-supervision is used for Re-ID so that inter-frame targets can be matched as much as possible. Putting all targets of the two frames together is a bit complicated, but I cannot think of a better way. The occlusion part is a highlight: lost detections are picked up again by directly predicting occlusion, which is a new idea for solving the occlusion problem.

Origin blog.csdn.net/wjpwjpwjp0831/article/details/125381002