Fast and accurate multi-face tracking based on FairMOT training

1. Foreword
I recently studied multi-object tracking algorithms such as DeepSORT, CenterTrack, JDE, and FairMOT. FairMOT is the current single-class multi-object SOTA, and it is a one-shot MOT framework that can be modified for multi-class multi-object tracking as needed, so on a whim I adapted it into a face tracking algorithm. This post records my entire face-tracking development process, including data preparation, model modification, fine-tuning, ONNX conversion, and deployment.

2. Preparation

CenterNet: Paper
FairMOT: Paper
FairMOT Code: https://github.com/ifzhang/FairMOT
Read the related papers and source code to get a general understanding of the tracking principles. Blogs and Zhihu articles can deepen your understanding, but most blogs are copies or only skim the surface; learn to extract the essence from the flood of information and discard the dross, picking up what you need quickly and effectively. Personally, I think it is best to read the source code directly.

3. Run FairMOT

My environment: Ubuntu 18.04, torch==1.6, RTX 2070 GPU (CUDA and cuDNN installed).

DCN in FairMOT needs to be compiled, and this method worked in my own test: git clone -b pytorch_1.6 https://github.com/lbin/DCNv2.git && cd DCNv2 && ./make.sh. If you really cannot get it to compile, you can comment out the DCN-related import code under model.py.
Download the pretrained model and the CUHK-SYSU dataset, then run the demo and training code to make sure the project runs normally. CUHK-SYSU is the smallest dataset, so training on it is enough to learn the format of the data and labels, and how both are processed during training. Once you understand how the source code handles the input data and labels, you can prepare a face tracking dataset.

4. Face tracking data preparation
We know that FairMOT is a one-shot MOT framework, that is, target detection and tracking features are produced in one step. Academic tracking research currently focuses on pedestrians and vehicles, so there is no open-source dataset for face tracking. Do we need to label a dataset ourselves? No; the best way past a difficulty is around it. Open source offers a large number of face recognition and face detection datasets, and combining the two can produce a face tracking dataset of decent quality. I first chose the relatively simple face detection dataset FDDB for detection training, then CASIA-Webface for ReID training. So how do we merge the two datasets? That requires fully understanding FairMOT's data labels from the previous step.
Label format:
[class] [identity] [x_center] [y_center] [width] [height]
We only track a single class (faces), i.e. [class] is always 0.
FDDB does not take part in ReID training, so set [identity] = -1; a mask keeps FDDB boxes from contributing to the ReID loss.
For CASIA-Webface, give images of the same person the same [identity], and compute [x_center] [y_center] [width] [height] for every face with a face detection algorithm such as CenterFace.
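As a concrete sketch of this label format (assuming, as in FairMOT's label-generation scripts, that coordinates are normalized by image size; the helper name here is mine, not from the repo):

```python
def fairmot_label_line(cls, identity, box_xywh, img_w, img_h):
    """Convert an absolute-pixel box (x_center, y_center, w, h) into one
    normalized FairMOT label line: [class] [identity] [xc] [yc] [w] [h].
    identity = -1 marks a detection-only sample (no ReID supervision)."""
    xc, yc, w, h = box_xywh
    return "%d %d %.6f %.6f %.6f %.6f" % (
        cls, identity, xc / img_w, yc / img_h, w / img_w, h / img_h)

# An FDDB face: detection only, so identity is -1
print(fairmot_label_line(0, -1, (320, 240, 100, 120), 640, 480))
# → "0 -1 0.500000 0.500000 0.156250 0.250000"
```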

The face tracking dataset is then ready, in the same format as CUHK-SYSU.
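The mask trick mentioned above, which excludes [identity] = -1 samples from the ReID loss, can be sketched like this (a pure-numpy illustration of the idea, not FairMOT's actual implementation):

```python
import numpy as np

def masked_reid_loss(id_logits, identities):
    """Cross-entropy over ReID logits, computed only for targets whose
    identity is valid (>= 0). FDDB boxes carry identity = -1 and therefore
    contribute nothing, while CASIA-Webface boxes supervise ReID normally."""
    identities = np.asarray(identities)
    mask = identities >= 0
    if not mask.any():
        return 0.0  # a pure-detection batch: no ReID loss at all
    logits = np.asarray(id_logits, dtype=float)[mask]
    targets = identities[mask]
    # numerically stable softmax cross-entropy
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```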

5. Start training
Because DCN is very unfriendly to deployment, I tried HRNet18 instead. At the start nothing needs to change: just replace CUHK-SYSU with the face tracking dataset. Save the first model after 5 epochs of training and run the demo on it to make sure the data production is okay; this prevents continuing to train on broken data and wasting time. When you see faces being tracked (even if the accuracy is not high yet), you can keep training with confidence; if nothing is tracked, check the data. I remade the data two or three times before training succeeded. Thanks to HRNet's strong feature extraction, the model converges by about 30 epochs. If you run into problems along the way, check the repository's Issues first (using issues well is very important), then Google or Baidu.
2020-09-18-15-08: epoch: 37 |loss -0.294533 | hm_loss 0.015708 | wh_loss 0.382619 | off_loss 0.053386 | id_loss 0.565185 | time 6.833333 |
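A quick way to monitor convergence is to parse these log lines into numbers you can plot; a small sketch assuming the `name value` pairs separated by `|` shown above:

```python
import re

def parse_fairmot_log(line):
    """Pull the named loss terms out of one FairMOT training log line.
    Matches 'word number' pairs, so the timestamp and epoch prefix
    are skipped automatically."""
    return {k: float(v) for k, v in
            re.findall(r"(\w+)\s+(-?\d+\.?\d*)", line)}

line = ("2020-09-18-15-08: epoch: 37 |loss -0.294533 | hm_loss 0.015708 | "
        "wh_loss 0.382619 | off_loss 0.053386 | id_loss 0.565185 | time 6.833333 |")
losses = parse_fairmot_log(line)
print(losses["hm_loss"])  # → 0.015708
```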

HRNet is also an excellent paper, and it is very strong in classification, segmentation, detection, alignment, and pose estimation.

6. Fine-tuning
With HRNet18 as the backbone, the model runs at 30 FPS on the 2070, but on other devices the speed falls seriously short of real time. We therefore need a lighter backbone, such as the MobileNet or EfficientNet series. At first I trained with MobileNetV2 as the backbone, but it never converged. Later, imitating HRNet, I added an FPN to MobileNet to fuse high- and low-level features; the loss dropped significantly but still could not match HRNet's tracking quality. After a series of tuning experiments (learning rate, Adam vs. SGD optimizers, step vs. cosine learning-rate decay, warm-up, and the loss function), the model converged to its best effect, but it remained unstable on small faces, its tolerance for target loss was short, and IDs switched easily after a target was lost.
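Of the schedules tried above, warm-up followed by cosine decay can be sketched as a standalone function (the step counts and rates here are illustrative, not my actual training settings):

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=500, min_lr=1e-5):
    """Learning rate at a given step: linear warm-up from ~0 to base_lr,
    then cosine decay from base_lr down to min_lr over the remaining steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```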
Later I was inspired by RetinaNet; having trained RetinaFace before, I added multi-scale training to FaceMOT (my face tracking version of FairMOT) and scaled the input to (608, 480). A fully convolutional model accepts any input size: the original FairMOT uses a fixed 1088x608 input, which I changed to a size better suited to faces. In the end, training with mobilenet_fpn as the backbone converged, and both speed (80 FPS) and accuracy (a lost target can be tracked back within 1 s) meet the requirements; MobileNet is also easy to port.
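The input-size change can be illustrated with a letterbox resize that fits any image into the 608x480 face-friendly input while preserving aspect ratio (a numpy-only sketch with nearest-neighbour resampling; FairMOT's own preprocessing uses OpenCV):

```python
import numpy as np

def letterbox(img, target_w=608, target_h=480, pad_value=127):
    """Scale img to fit (target_h, target_w) without distortion, then pad
    the leftover area with a constant value. Returns the padded image,
    the scale used, and the (left, top) offset of the resized content."""
    h, w = img.shape[:2]
    scale = min(target_w / w, target_h / h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    # nearest-neighbour resize via index arrays (keeps this numpy-only)
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    canvas = np.full((target_h, target_w) + img.shape[2:], pad_value,
                     dtype=img.dtype)
    top = (target_h - new_h) // 2
    left = (target_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, scale, (left, top)

img = np.zeros((480, 640, 3), dtype=np.uint8)
out, scale, (left, top) = letterbox(img)
print(out.shape)  # → (480, 608, 3)
```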
FaceMOT

Summary
Due to limited space, many implementation details are not covered in this article; the focus is on the process of implementing an algorithm. The core of the algorithm still needs a good idea: how to customize the implementation to your actual scenario, requirements, and compute budget.

Origin blog.csdn.net/zengwubbb/article/details/108693096