[Target Tracking] 2. FairMOT | Balancing Target Detection and Re-ID Tasks in Multi-Target Tracking | IJCV2021



Paper: FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

Code: https://github.com/ifzhang/FairMOT

Source: IJCV2021

1. Background

What is the Multi-object tracking (MOT) task:

  • Estimate the trajectories of objects of interest in a video

MOT is an important task in computer vision because it:

  • facilitates intelligent video analysis
  • enables human-computer interaction

How methods at the time solved the MOT task:

  • Many methods formulate MOT as a multi-task learning problem with two sub-tasks:
    • Object detection
    • Re-ID (re-identification)

But the authors argue that the two tasks are competing

Previous methods generally treat re-ID as a second task after detection, so its quality is limited by the detection results, and the network is biased toward the first-stage detection task, which is unfair to re-ID. Moreover, two-stage MOT methods struggle to reach real-time inference: when the number of targets is large, the two models do not share features, and the re-ID model must extract features for every detected box in every frame.

Later, one-shot tracking methods appeared, which use a single model to learn detection and re-ID features jointly:

  • Voigtlaender et al. add a re-ID branch to Mask R-CNN so that each proposal learns re-ID features. Although this improves speed, the accuracy falls far short of two-stage methods: detection usually remains good, but tracking becomes worse.

The authors of this paper therefore first analyze the reasons for this problem:

  • Anchors: anchors were designed for object detection and are not well suited to learning re-ID features
    • Anchor-based methods first generate anchors for the targets and then extract re-ID features from the detection results, so during training the model falls into a "detect first, then re-ID" mode and the re-ID features degrade
    • Anchors also introduce ambiguity into re-ID feature learning, especially in crowded scenes: one anchor may cover multiple identities, and multiple anchors may cover one identity
  • Feature sharing: the two tasks need different kinds of features, so a feature map cannot simply be shared
    • Re-ID needs more low-level features to discriminate between different instances of the same category
    • Object detection needs a combination of high-level and low-level information to learn categories and locations
    • Naively sharing features in a one-shot tracker therefore creates a conflict that reduces performance
  • Feature dimension: re-ID conventionally uses high-dimensional features, but MOT works better with low-dimensional ones
    • Re-ID features are typically 512- or 1024-dimensional, far higher than what detection needs (essentially class + location), so lowering the re-ID dimension helps balance the two tasks
    • MOT differs from standalone re-ID: it only needs one-to-one matching between targets in consecutive frames, whereas re-ID must match a query against a large pool of candidates and therefore needs highly discriminative high-dimensional features; MOT does not
    • Low-dimensional re-ID features also improve inference speed


This paper proposes FairMOT, a fair method built on CenterNet:

  • Object detection and re-ID are treated equally, instead of "detect first, then re-ID"
  • It is not a simple combination of CenterNet and re-ID

The structure of FairMOT is shown in Figure 1:

  • It consists of two branches, one for object detection and one for extracting re-ID features
  • The detection branch is anchor-free: it predicts object center points and sizes on the feature map
  • The re-ID branch predicts a re-ID feature at each object center position
  • The two branches are parallel rather than in series, which better balances the two tasks

[Figure 1: FairMOT overview, a shared backbone with parallel detection and re-ID branches]

2. Method

2.1 Backbone

The authors use ResNet-34 as the basic backbone, which balances speed and accuracy well.

A stronger version is obtained by applying DLA (Deep Layer Aggregation) to it.

2.2 Detection branch

The detection branch follows CenterNet: it contains a heatmap head (object-center confidence), a wh head (box size), and an offset head (sub-pixel center offset), as sketched below.
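
A minimal sketch of such CenterNet-style heads on the shared backbone features, with assumed channel counts and module names rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Sketch of CenterNet-style detection heads.

    Assumes the backbone outputs a stride-4 feature map with `in_ch`
    channels; the 3x3 conv -> ReLU -> 1x1 conv head structure follows
    the common CenterNet convention (sizes here are assumptions).
    """

    def __init__(self, in_ch: int = 64, num_classes: int = 1):
        super().__init__()

        def head(out_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_ch, kernel_size=1),
            )

        self.heatmap = head(num_classes)  # object-center confidence map
        self.wh = head(2)                 # box width and height per center
        self.offset = head(2)             # sub-pixel offset of each center

    def forward(self, feat: torch.Tensor) -> dict:
        return {
            "hm": torch.sigmoid(self.heatmap(feat)),  # scores in [0, 1]
            "wh": self.wh(feat),
            "off": self.offset(feat),
        }
```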

2.3 Re-ID branch

On top of the backbone's output features, the authors build the re-ID branch:

  • The re-ID features should be far apart for different identities and close together for the same identity
  • A convolution layer with 128 kernels extracts a re-ID feature at every position on the feature map, yielding a 128×H×W feature map

Re-ID loss:

Re-ID feature learning is formulated as a classification task, in which all instances of the same identity are treated as one class.

For every GT box in an image, the center-point position is obtained, the re-ID feature is extracted there, and a fully connected layer plus softmax maps it to a class distribution.

Let $\mathbf{L}^i(k)$ denote the one-hot GT class vector of the $i$-th object and $\mathbf{p}(k)$ the predicted class distribution; the re-ID loss is then

$$L_{identity} = -\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbf{L}^i(k)\,\log\big(\mathbf{p}(k)\big)$$

  • $K$ is the number of identities in the training data, and $N$ is the number of objects in the image
  • During training, only the re-ID features located at object centers participate in the loss (see the sketch after this list)
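
A minimal sketch of this classification loss, assuming a 128-D feature map and a linear classifier over $K$ identities (function and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reid_classification_loss(
    id_map: torch.Tensor,    # (B, 128, H, W) re-ID feature map
    classifier: nn.Linear,   # maps 128-D embeddings to K identity logits
    centers: torch.Tensor,   # (N, 3) long tensor: (batch_idx, cy, cx) GT centers
    ids: torch.Tensor,       # (N,) long tensor of identity labels in [0, K)
) -> torch.Tensor:
    """Only embeddings at ground-truth object centers contribute;
    each identity in the training data is one class."""
    b, y, x = centers[:, 0], centers[:, 1], centers[:, 2]
    emb = id_map[b, :, y, x]            # (N, 128) embeddings at object centers
    logits = classifier(emb)            # (N, K) identity scores
    # Cross-entropy equals -sum_k L(k) log p(k) for one-hot targets L
    return F.cross_entropy(logits, ids)
```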

2.4 Training FairMOT

The authors train the detection and re-ID branches jointly, summing all the losses.

Note: the authors use an uncertainty loss to automatically balance the two tasks:

$$L_{total} = \frac{1}{2}\left(\frac{1}{e^{w_1}}\,L_{detection} + \frac{1}{e^{w_2}}\,L_{identity} + w_1 + w_2\right)$$

  • $w_1$ and $w_2$ are learnable parameters that balance the two tasks (see the sketch below)
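
A minimal sketch of this uncertainty weighting (the initial values of $w_1$ and $w_2$ here are assumptions):

```python
import torch
import torch.nn as nn

class UncertaintyLoss(nn.Module):
    """L_total = 0.5 * (exp(-w1) * L_det + exp(-w2) * L_id + w1 + w2),
    where w1 and w2 are learnable, so the task weights adapt during training."""

    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.zeros(1))  # init values are assumptions
        self.w2 = nn.Parameter(torch.zeros(1))

    def forward(self, l_det: torch.Tensor, l_id: torch.Tensor) -> torch.Tensor:
        return 0.5 * (torch.exp(-self.w1) * l_det
                      + torch.exp(-self.w2) * l_id
                      + self.w1 + self.w2)
```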

In addition, the authors propose a single-image training method so that FairMOT can be trained on image-level detection datasets (e.g., COCO, CrowdHuman):

  • Only one image is input at a time, and every object in it is treated as an independent individual, i.e., each bbox is a separate identity class

2.5 Online Inference

1. Network inference

  • Input size: 1088×608
  • On the predicted heatmap, NMS based on the heatmap scores extracts peak keypoints (the NMS is a 3×3 max pooling), and keypoints above a score threshold are kept, as sketched below
  • Box locations and sizes are then computed from the kept keypoints and the wh and offset heads
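
A minimal sketch of this peak extraction via max-pooling NMS (the threshold value here is an assumption):

```python
import torch
import torch.nn.functional as F

def extract_peaks(hm: torch.Tensor, thresh: float = 0.4):
    """Keep heatmap locations that (a) equal the max of their 3x3
    neighbourhood, i.e. survive the max-pooling NMS, and (b) exceed
    the score threshold. `hm` is a (1, 1, H, W) heatmap."""
    hmax = F.max_pool2d(hm, kernel_size=3, stride=1, padding=1)
    peaks = hm * (hmax == hm).float()    # zero out non-peak positions
    ys, xs = torch.where(peaks[0, 0] > thresh)
    scores = peaks[0, 0, ys, xs]
    # Box sizes and sub-pixel centers are then read from the wh and
    # offset maps at these (y, x) positions.
    return ys, xs, scores
```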

2. Online Association

  • First, the boxes detected in the first frame initialize tracklets (short tracks)
  • In each subsequent frame, a two-stage matching strategy links the detected boxes to the existing tracklets, as sketched after this list
    • First-stage matching: the Kalman filter and re-ID features give the initial tracking result. The Kalman filter predicts each tracklet's position in the next frame, and the Mahalanobis distance $D_m$ between predicted and detected boxes is computed. $D_m$ is then fused with the cosine distance $D_r$ of the re-ID features as $D = \lambda D_r + (1-\lambda) D_m$, where $\lambda = 0.98$ is the weight; whenever $D_m$ exceeds the threshold $\tau_1 = 0.4$, the cost is set to infinity
    • Second-stage matching: unmatched detections and tracklets are matched by box overlap with threshold $\tau_2 = 0.5$, and the matched tracklets' features are updated
  • Finally, unmatched detections initialize new tracklets, and unmatched tracklets are kept for 30 frames in case they reappear
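
A minimal sketch of the first-stage cost fusion and gating (array shapes and names are assumptions; the resulting cost matrix would then be fed to a Hungarian-style assignment solver):

```python
import numpy as np

def fused_cost(d_reid: np.ndarray,   # (T, D) cosine distances: tracklets x detections
               d_maha: np.ndarray,   # (T, D) Mahalanobis distances from the Kalman filter
               lam: float = 0.98,
               tau1: float = 0.4) -> np.ndarray:
    """First-stage matching cost D = lam * D_r + (1 - lam) * D_m.
    Pairs whose motion is implausible (D_m > tau1) are forbidden."""
    cost = lam * d_reid + (1.0 - lam) * d_maha
    cost[d_maha > tau1] = 1e18  # "infinity" in the paper; finite here so that
                                # linear assignment solvers stay well-defined
    return cost
```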

3. Experiments

3.1 Dataset

Training datasets:

  • ETH and CityPersons: only box annotations, so they are used to train the detection branch only
  • CalTech, MOT17, CUHK-SYSU, and PRW: both box and identity annotations, so they can train both branches

Test datasets:

  • 2DMOT15, MOT16, MOT17, and MOT20

Evaluation metrics:

  • Detection: mAP
  • Re-ID features: True Positive Rate at a False Accept Rate of 0.1 (TPR@FAR=0.1)
  • Overall tracking: CLEAR metrics and IDF1

3.2 Implementation Details

  • Backbone: a variant of DLA-34, initialized with weights pre-trained on COCO
  • Optimizer: Adam with an initial learning rate of $10^{-4}$
  • Epochs: 30; at epoch 20 the learning rate is reduced to $10^{-5}$ (sketched below)
  • Batch size: 12
  • Input size: 1088×608 (the feature-map resolution is 272×152)
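
A minimal sketch of this optimization schedule in PyTorch; the model and data below are dummy stand-ins just to make the schedule concrete:

```python
import torch
import torch.nn as nn

# Dummy model/data so the schedule below runs; the real training uses the
# FairMOT network and the mixed detection + re-ID datasets described above.
model = nn.Linear(8, 1)
loader = [(torch.randn(12, 8), torch.randn(12, 1)) for _ in range(4)]  # batch size 12

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial lr 1e-4
# Drop the learning rate by 10x at epoch 20: 1e-4 -> 1e-5
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

for epoch in range(30):  # 30 epochs in total
    for x, y in loader:
        loss = nn.functional.mse_loss(model(x), y)  # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```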

3.3 Ablation experiment

[Ablation tables from the paper omitted]

3.4 Final Results

[Benchmark result tables from the paper omitted]
