Pre-trained model: 1.1 MB
The speed looks decent.
Code address: https://github.com/dog-qiuqiu/FastestDet
Network | COCO mAP(0.5) | Resolution | Run Time (4 cores) | Run Time (1 core) | FLOPs (G) | Params (M)
---|---|---|---|---|---|---
Yolo-FastestV1.1 | 24.40% | 320×320 | 26.60 ms | 75.74 ms | 0.252 | 0.35
Yolo-FastestV2 | 24.10% | 352×352 | 23.80 ms | 68.90 ms | 0.212 | 0.25
FastestDet | 27.80% | 512×512 | 21.51 ms | 34.62 ms | * | 0.25
Multi-platform benchmark
Equipment | Computing backend | System | Framework | Run time (single core) | Run time (multi core)
---|---|---|---|---|---
Radxa Rock3A | RK3568 (ARM CPU) | Linux (aarch64) | ncnn | 34.62 ms | 21.51 ms
AMD | R5-5600 (x86 CPU) | Linux (amd64) | ncnn | 2.16 ms | 1.73 ms
Intel | i7-8700 (x86 CPU) | Linux (amd64) | ncnn | 5.21 ms | 4.73 ms
01 Overview
FastestDet is designed to replace the Yolo-Fastest series of algorithms. It is not in the same weight class as existing industry lightweight object detectors such as YOLOv5n, YOLOX-Nano, NanoDet, and PP-YOLO-Tiny: FastestDet is orders of magnitude smaller in speed and parameter size (comparing an int8 model's size against this fp32 model is not a fair comparison), though its accuracy is naturally lower. FastestDet targets ARM platforms with limited computing resources and emphasizes single-core performance, because in real business scenarios the inference framework will not be given all CPU cores for model inference. If you need to run real-time object detection on a chip like the RK3568, FastestDet is a good choice; likewise, if you do not want to occupy too many CPU resources on a mobile device, you can run FastestDet on a single core with CPU sleep enabled and keep the algorithm running under low power consumption.
02 New Framework Algorithm
Let's talk about several important features of FastestDet:
- Single lightweight detection head
- Anchor-free
- Multi-candidate targets across grids
- Dynamic positive and negative sample allocation
- Simple data augmentation
Let me go through each one in detail:
Single lightweight detection head
This optimizes the model at the network-structure level, mainly to improve inference speed and simplify post-processing. Take a look at this part of the network structure first:
In fact, multiple detection heads exist to handle objects of different scales: the high-resolution head is responsible for detecting small objects and the low-resolution head for large objects, a divide-and-conquer idea.
I personally think the root cause is the receptive field. Objects of different scales require different receptive fields, and each layer of the model has a different receptive field; FPN, too, is essentially a summary and fusion of features from different receptive fields. For this single detection head I also borrowed the idea of YOLOF: the network uses an inception-like structure of parallel 5×5 grouped convolutions, in the hope of fusing features from different receptive fields so that a single head can also detect objects of different scales.
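To see why parallel branches of stacked 5×5 convolutions expose different receptive fields to one head, here is the standard receptive-field arithmetic (a generic sketch of the reasoning above, not FastestDet's actual code):

```python
def receptive_field(kernel_sizes, strides):
    """Effective receptive field of a stack of conv layers:
    rf = 1 + sum over layers of (k_i - 1) * (product of earlier strides)."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# One 5x5 conv vs. three stacked 5x5 convs (all stride 1):
print(receptive_field([5], [1]))              # 5
print(receptive_field([5, 5, 5], [1, 1, 1]))  # 13
```

So a branch with one 5×5 conv sees a 5×5 window while a branch with three sees 13×13; summing the branch outputs gives the single head a mixture of receptive fields, mimicking what multi-scale heads achieve separately.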
Anchor-Free
Anchor-based algorithms need to perform an anchor-bias operation on the dataset before training the model. Anchor-bias can be understood as clustering the widths and heights of the labeled objects in the dataset to obtain a set of prior widths and heights; the network then optimizes the predicted box width and height relative to these priors. FastestDet is anchor-free: the model directly regresses the ground-truth scale relative to the feature map, with no prior widths and heights. This approach simplifies model post-processing. Moreover, in anchor-based algorithms each feature-map point corresponds to N anchor candidate boxes, while in this anchor-free design each feature-map point corresponds to only one candidate box, which is also an advantage in inference speed.
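A minimal sketch of the anchor-free decoding idea described above: each feature-map cell predicts an offset and a size, and the box is recovered by scaling with the stride, with no prior anchor widths/heights involved. The exact parameterization FastestDet uses may differ; function and parameter names here are illustrative assumptions.

```python
import numpy as np

def decode_anchor_free(pred, stride):
    """pred: (H, W, 4) array of (tx, ty, tw, th) per cell.
    Center = (cell index + offset) * stride; width/height are
    regressed directly (in stride units) with no anchor priors.
    Returns (H, W, 4) boxes as (x1, y1, x2, y2) in pixels."""
    H, W, _ = pred.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx = (xs + pred[..., 0]) * stride
    cy = (ys + pred[..., 1]) * stride
    w = pred[..., 2] * stride
    h = pred[..., 3] * stride
    return np.stack([cx - w / 2, cy - h / 2,
                     cx + w / 2, cy + h / 2], axis=-1)

pred = np.zeros((2, 2, 4))
pred[1, 1] = [0.5, 0.5, 2.0, 2.0]       # one cell predicts a box
boxes = decode_anchor_free(pred, stride=32)
# cell (1,1): center (48, 48), size 64x64 -> box [16, 16, 80, 80]
```

Note there is exactly one box per cell, which is why post-processing is simpler than the N-anchors-per-point case.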
Multi-candidate targets across grids
This part is still borrowed from YOLOv5: not only the grid cell containing the GT center is treated as a candidate target, but the nearby cells are counted as well (three candidate cells per GT in total), increasing the number of positive candidate boxes, as shown in the following figure:
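The cross-grid rule described above can be sketched as follows. This mirrors YOLOv5's neighbor selection (the two nearest cells chosen by which half of the cell the GT center falls in); the function name and details are illustrative assumptions, not FastestDet's actual code.

```python
def candidate_cells(cx, cy, stride, grid_w, grid_h):
    """Return the grid cells treated as positive candidates for a GT
    whose center is at pixel (cx, cy): the containing cell plus the
    nearest horizontal and vertical neighbors."""
    gx, gy = cx / stride, cy / stride          # center in grid units
    ix, iy = int(gx), int(gy)
    cells = [(ix, iy)]
    # nearest horizontal neighbor (left if in left half of the cell)
    nx = ix - 1 if (gx - ix) < 0.5 else ix + 1
    if 0 <= nx < grid_w:
        cells.append((nx, iy))
    # nearest vertical neighbor (up if in top half of the cell)
    ny = iy - 1 if (gy - iy) < 0.5 else iy + 1
    if 0 <= ny < grid_h:
        cells.append((ix, ny))
    return cells

# GT center at (100, 180) px on a stride-32, 16x16 grid:
print(candidate_cells(100, 180, 32, 16, 16))   # [(3, 5), (2, 5), (3, 6)]
```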
Dynamic positive and negative sample allocation
So-called dynamic positive and negative sample allocation means assigning positive and negative samples dynamically during training, which differs from the earlier Yolo-Fastest. In the original Yolo-Fastest, once the anchor-bias is set, the width/height scale between each anchor-bias and the GT is computed, and positives and negatives are assigned by a fixed threshold on that scale (following YOLOv5's practice). Since the anchor-bias and GT do not change during training, the positive/negative assignment does not change during training either.
FastestDet's positive and negative sample allocation instead borrows from ATSS: the mean SIoU between the prediction boxes and the GT is used as the assignment threshold. If a prediction box's SIoU with the GT is greater than the mean, it is a positive sample; otherwise it is a negative one. (Why not refer to SimOTA? Because building its cost matrix requires tuning the weights of the different losses as hyperparameters — call it laziness.)
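The ATSS-like rule above reduces to a few lines. As a hedged sketch: the text uses SIoU, but plain IoU values are substituted here purely to keep the example short; the threshold logic (above the mean of the candidates' overlaps with this GT = positive) is the same.

```python
import numpy as np

def dynamic_assign(overlaps):
    """For one GT, mark candidate predictions whose overlap exceeds
    the mean overlap as positives. Because predictions improve during
    training, the threshold (and thus the assignment) changes every
    step -- that is what makes the allocation 'dynamic'."""
    return overlaps > overlaps.mean()

overlaps = np.array([0.1, 0.4, 0.6, 0.7])   # candidate-vs-GT overlaps
mask = dynamic_assign(overlaps)             # mean = 0.45 -> [F, F, T, T]
```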
Simple data augmentation
Be cautious with data augmentation for lightweight models. Their learning capacity is weak to begin with, so piling on hard examples does not help. Therefore only simple augmentations such as random translation and random scaling are used; Mosaic and Mixup are not.
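The "simple" augmentations kept here (random translation and random scaling) can be sketched as follows. Parameter names and ranges are illustrative assumptions; applying the affine matrix to the image pixels is left to a library such as OpenCV's `warpAffine`, so only the matrix and the transformed boxes are returned.

```python
import numpy as np

def random_translate_scale(img, boxes, rng,
                           max_shift=0.1, scale_range=(0.8, 1.2)):
    """Draw a random scale and translation, build the 2x3 affine
    matrix, and apply the same transform to the (x1, y1, x2, y2)
    boxes, clipping them back into the image."""
    h, w = img.shape[:2]
    s = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift) * w
    ty = rng.uniform(-max_shift, max_shift) * h
    M = np.array([[s, 0.0, tx],
                  [0.0, s, ty]], dtype=np.float32)
    out = boxes.astype(np.float32).copy()
    out[:, [0, 2]] = (out[:, [0, 2]] * s + tx).clip(0, w)
    out[:, [1, 3]] = (out[:, [1, 3]] * s + ty).clip(0, h)
    return M, out

rng = np.random.default_rng(0)
img = np.zeros((320, 320, 3), dtype=np.uint8)
boxes = np.array([[40.0, 40.0, 120.0, 120.0]])
M, new_boxes = random_translate_scale(img, boxes, rng)
```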
03 Experimental Results