Cross-camera pedestrian tracking

Cross-camera pedestrian tracking: In the security field, cross-camera pedestrian tracking is an important complement to face recognition. It can continuously track pedestrians whose faces cannot be captured clearly, strengthening the temporal and spatial continuity of the data, and can be applied to suspect tracking, missing-person search, and similar tasks. The algorithm was developed for both outdoor and indoor scenes: six sequential cameras film pedestrians, with the fields of view of adjacent cameras partially overlapping. The algorithm achieves continuous tracking of pedestrians and is applied in smart security and smart retail.

The business logic of the algorithm's alarm: track pedestrians and, after a set of six videos has finished playing, output the successfully tracked pedestrians and their IDs. Whether a tracked person is continuous and correct is confirmed by looking up the ID.

Environmental requirements: day, night, and various weather conditions; adequate lighting must be ensured, especially for indoor backlit scenes.

Tasks: detection, identification, and tracking.

Data characteristics: detection targets vary in size and distance, are diverse in appearance, appear at different angles, and may be occluded or clustered together.

Evaluation indicators:

At the end of the test log, MOTA (Multiple Object Tracking Accuracy) and IDF1 for each camera are printed, and the averages of MOTA and IDF1 across cameras are taken as the final accuracy score. MOTA measures multi-object tracking accuracy, while IDF1 is the F1 score of pedestrian ID recognition over all pedestrian boxes.
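For reference, the two metrics above follow standard definitions from the MOT literature; a minimal sketch (assuming the usual CLEAR-MOT and identity-metric formulas, not the competition's exact evaluation code):

```python
def mota(fn, fp, idsw, num_gt):
    # MOTA = 1 - (misses + false positives + ID switches) / total GT boxes.
    # Can be negative when errors exceed the number of ground-truth boxes.
    return 1.0 - (fn + fp + idsw) / num_gt

def idf1(idtp, idfp, idfn):
    # IDF1 = F1 score over identity-matched detections: how long the
    # tracker keeps the *correct* identity, not just any box.
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

For example, with 100 ground-truth boxes, 10 misses, 5 false positives, and 2 ID switches, MOTA is 0.83.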

Algorithm process and implementation

For the backbone, we ultimately chose YOLOv5.

The YOLOv5 model was publicly released in 2020 as an improvement on YOLOv3. It comes in four sizes: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The model consists of a backbone network, a neck, and a head.

YOLOv5 features:

  1. Built on the PyTorch framework, it is easy to get started with and convenient for training custom datasets;

  2. Environment setup is simple, model training is fast, and batched inference produces real-time results;

  3. PyTorch weight files are easily converted to ONNX and TensorRT;

  4. YOLOv5s achieves a high frame rate while maintaining good accuracy.

Highlights of YOLOv5 include:

  1. Adaptive anchor box calculation: the network outputs prediction boxes from initial anchors and adaptively computes the best anchor values for each training set. The predictions are compared with the ground-truth boxes, the gap between them is computed, and the network parameters are then updated iteratively by backpropagation;

  2. CSP structure: the input is split into two branches, each undergoing convolution with the channel count halved; the branches are then merged so that input and output sizes match, letting the model learn richer features;

  3. Focus structure: slicing operations split a high-resolution image (feature map) into several low-resolution feature maps, i.e., sampling every other row and column and concatenating the results;

  4. Adaptive image scaling: implemented in the letterbox function of datasets.py in the YOLOv5 code, which adaptively adds the minimal black border to the original image.
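The slicing in the Focus structure (highlight 3) can be illustrated in a few lines; this is a plain numpy sketch of the idea, not YOLOv5's actual module:

```python
import numpy as np

def focus_slice(x):
    # x: (C, H, W) feature map. Take every other pixel at each of the four
    # phase offsets and concatenate along the channel axis: spatial detail
    # is moved into channels losslessly, halving H and W.
    return np.concatenate([
        x[:, ::2, ::2],    # even rows, even cols
        x[:, 1::2, ::2],   # odd rows, even cols
        x[:, ::2, 1::2],   # even rows, odd cols
        x[:, 1::2, 1::2],  # odd rows, odd cols
    ], axis=0)

x = np.arange(3 * 4 * 4).reshape(3, 4, 4)
y = focus_slice(x)  # (12, 2, 2): 4x the channels, half the resolution
```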

For the competition data, we applied data augmentation, including left-right flipping, random cropping, image scaling, brightness adjustment, saturation adjustment, contrast adjustment, and noise injection.
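A minimal numpy sketch of such an augmentation pipeline (only flipping, brightness, and noise are shown; the cropping, scaling, saturation, and contrast steps follow the same pattern, and the exact parameter ranges here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    # img: HxWx3 uint8 image.
    out = img.astype(np.float32)
    if rng.random() < 0.5:               # random left-right flip
        out = out[:, ::-1, :]
    out *= rng.uniform(0.8, 1.2)         # brightness adjustment
    out += rng.normal(0, 5, out.shape)   # additive Gaussian noise
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.full((8, 8, 3), 128, dtype=np.uint8)
aug = augment(img)
```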

Before formal training, we addressed weight initialization. Random initialization is the usual default and works adequately across a wide range of tasks, but it cannot be tailored to maximize convergence and performance on a specific task. We therefore initialized the backbone parameters with the official pre-trained weights. On top of that, we applied model warm-up: the model was first trained with a small learning rate for several epochs until optimization stabilized, after which the learning rate was reset and full training began.
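The warm-up idea can be sketched as a simple schedule function (step counts and the base rate here are hypothetical; in practice a decay schedule would follow the plateau):

```python
def warmup_lr(step, warmup_steps, base_lr):
    # Linear warm-up: ramp the learning rate from near zero up to base_lr
    # over warmup_steps, then hold base_lr for the main training phase.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For example, with `warmup_steps=100` and `base_lr=0.01`, the first step trains at 1e-4 and step 100 onward trains at the full 0.01.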

Next, the CBAM attention mechanism is added. CBAM is a dual attention mechanism composed of two parts: channel attention and spatial attention. As shown in the figure below, the red box marks channel attention and the blue box marks spatial attention; channel attention comes first, followed by spatial attention. The input feature map first passes through channel attention: global average pooling (GAP) and global max pooling (GMP) are applied over the spatial dimensions, a shared MLP produces channel attention weights, the Sigmoid function normalizes them, and the weights are finally multiplied back onto the original input feature map, completing the channel-wise recalibration of the original features.
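A numpy sketch of the channel-attention half just described (spatial attention follows the same pool-then-weight pattern over the spatial axes); the weight shapes, reduction ratio, and ReLU activation here are assumptions, not the exact CBAM implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # x: (C, H, W); w1: (C//r, C), w2: (C, C//r) form a shared 2-layer MLP.
    gap = x.mean(axis=(1, 2))                      # GAP descriptor, (C,)
    gmp = x.max(axis=(1, 2))                       # GMP descriptor, (C,)
    mlp = lambda v: w2 @ np.maximum(0.0, w1 @ v)   # shared MLP with ReLU
    attn = sigmoid(mlp(gap) + mlp(gmp))            # (C,) weights in (0, 1)
    return x * attn[:, None, None]                 # recalibrate channels

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
w1 = rng.normal(size=(2, 4))   # reduction ratio r = 2
w2 = rng.normal(size=(4, 2))
y = channel_attention(x, w1, w2)
```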

We also replaced the model's loss function. The original CIoU loss accounts for the aspect ratio of the regression box and the distance between the centers of the ground-truth and predicted boxes. However, it uses only the aspect ratio as the shape factor: a predicted box with the same center and the same aspect ratio, but different width and height values, may still fail to match the regression target according to CIoU loss, as shown in the figure below. We therefore switched to SIoU, which adds an angle loss: an angle threshold alpha is set, and when the angle between the target box and the regression box is smaller than alpha, the loss converges toward minimizing alpha; otherwise it converges toward beta.
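As a hedged sketch, the angle-cost term of SIoU (as we understand the published formulation; this is one term of the full loss, not the whole SIoU) can be written as follows. It is zero when the two box centers are aligned with either axis and maximal at a 45° offset, steering the regression along one axis first:

```python
import math

def siou_angle_cost(cx_gt, cy_gt, cx_pred, cy_pred):
    # Angle cost: 1 - 2*sin^2(arcsin(sin_alpha) - pi/4), where sin_alpha is
    # the sine of the angle between the center-to-center line and the x-axis.
    sigma = math.hypot(cx_gt - cx_pred, cy_gt - cy_pred) + 1e-9  # center dist
    sin_alpha = abs(cy_gt - cy_pred) / sigma
    return 1 - 2 * math.sin(math.asin(min(sin_alpha, 1.0)) - math.pi / 4) ** 2
```

Centers offset purely horizontally or purely vertically give a cost near 0, while a diagonal 45° offset gives a cost near 1.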

Finally, we converted the PyTorch model to TensorRT and quantized it to half-precision floating point (FP16) to maximize throughput.

In the testing phase, ByteTrack was first used for object tracking. Because real tracking scenes are complex, detectors rarely produce perfect results. To balance true and false positives, most current MOT methods choose a score threshold, keep only detections above it for association, and directly discard boxes below it. ByteTrack argues this strategy is unreasonable: low-score detection boxes often indicate real objects (such as heavily occluded ones), and simply discarding them causes irreversible errors in MOT, including many missed detections and broken trajectories, degrading overall tracking performance. ByteTrack therefore introduces a new data-association method that separates high-score boxes from low-score boxes, uses the similarity between low-score detections and existing tracks to mine real objects out of the low-score set, and filters out the background. In short, it is a two-stage matching process.
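As a rough illustration of the two-stage association (not ByteTrack's actual implementation, which uses Kalman-filter motion prediction and Hungarian matching; the thresholds here are hypothetical), the idea can be sketched with greedy IoU matching:

```python
def iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def greedy_match(tracks, dets, iou_thr=0.3):
    # Greedy IoU assignment standing in for Hungarian matching.
    matches, left = [], list(tracks)
    for d in dets:
        best = max(left, key=lambda t: iou(t["box"], d["box"]), default=None)
        if best is not None and iou(best["box"], d["box"]) >= iou_thr:
            matches.append((best["id"], d))
            left.remove(best)
    return matches, left

def byte_associate(tracks, dets, high_thr=0.6, low_thr=0.1):
    # BYTE's core idea: associate high-score boxes first, then try to
    # rescue low-score boxes (often occluded people) using the tracks
    # that remained unmatched, instead of discarding them outright.
    high = [d for d in dets if d["score"] >= high_thr]
    low = [d for d in dets if low_thr <= d["score"] < high_thr]
    matched, unmatched = greedy_match(tracks, high)   # first association
    rescued, _ = greedy_match(unmatched, low)         # second association
    return matched + rescued

tracks = [{"id": 1, "box": (0, 0, 10, 10)}, {"id": 2, "box": (20, 20, 30, 30)}]
dets = [{"box": (1, 1, 11, 11), "score": 0.9},
        {"box": (21, 21, 31, 31), "score": 0.3}]  # occluded, low score
result = byte_associate(tracks, dets)
```

Here the 0.3-score box would be discarded by a single-threshold tracker, but the second pass still associates it with track 2.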

Then a second judgment is made on each tracked target based on its comparison score. Since the target has already been tracked across adjacent frames, we use its most recent comparison score: if the score is above 0.5, the tracking result is kept; otherwise, feature comparison is re-run.

Finally, for each frame, we only need to extract features for the targets that were not tracked and associated. In this step, we did not use a for loop; instead, all unassociated targets are batched and processed in a single pass, further reducing the algorithm's runtime.
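The batched processing can be sketched as follows, with a toy stand-in (global average pooling) for the real feature-extraction network, which is an assumption for illustration:

```python
import numpy as np

def extract_features_batched(crops, model):
    # Stack all unassociated target crops into one (N, C, H, W) tensor and
    # run a single forward pass, rather than calling the model once per
    # crop inside a Python loop.
    batch = np.stack(crops, axis=0)
    return model(batch)  # (N, D) feature vectors

# toy "model": per-channel global average pooling as a feature extractor
toy_model = lambda b: b.mean(axis=(2, 3))
crops = [np.ones((3, 4, 4)) * i for i in range(5)]
feats = extract_features_batched(crops, toy_model)
```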

Summary
  1. We chose PyTorch-based YOLOv5s, which trains quickly, performs well, and offers a TensorRT deployment path to reduce inference time;

  2. We combined data augmentation, pre-trained weights, model warm-up, and similar measures to give the competition model targeted initialization weights and unlock the network's performance potential;

  3. We added the CBAM attention mechanism and changed the loss function to SIoU, further improving the model's detection accuracy and robustness;

  4. We used a cosine-annealing learning rate handed off to MultiStepLR to adjust the learning rate, and disabled ("forgot") data augmentation in the later stages of training to stimulate model performance;

  5. By combining the tracking algorithm with the comparison score when judging targets, we increased the association of targets across adjacent frames while avoiding excessive redundant computation, speeding up the algorithm;

  6. Using batched operations, the targets left unassociated by the previous step are processed in one batch, further reducing the algorithm's runtime.


Origin: blog.csdn.net/qq_29788741/article/details/131733435