NanoDet: an ultra-lightweight object detection model

Running the ncnn benchmark after porting to a Huawei P30, NanoDet takes only 10.23 ms per frame, about 3x faster than YOLOv4-tiny, with roughly 6x fewer parameters. It reaches 20.6 COCO mAP (0.5:0.95), and the model weight file is only 1.8 MB, which is quite friendly compared with models weighing dozens of megabytes.

Android Demo

Project address (a one-stop solution from training code to Android deployment):

RangiLyu/nanodet: ⚡Super fast and lightweight anchor-free object detection model. Only 1.8mb and run 97FPS on cellphone. https://github.com/RangiLyu/nanodet

 

Preface

Deep learning object detection has been developing for many years, from two-stage to one-stage, from anchor-based to anchor-free, and this year to Transformer-based detectors. All kinds of methods flourish, yet on mobile devices anchor-based models such as the YOLO series and SSD still dominate. The main purpose of this project is to open-source a real-time anchor-free detection model for mobile that offers performance no worse than the YOLO series while remaining easy to train and to port.

In fact, ever since a wave of anchor-free papers appeared last year, I had wanted to port an anchor-free model to mobile or embedded devices. At the time I tried to slim down FCOS, but the result was not as good as MobileNet + YOLOv3, so the idea was shelved. On analysis, the main reason is that the centerness branch of FCOS is hard to make converge on lightweight models, and the papers that improved on FCOS did not solve this problem.

That changed in the middle of this year, when, while browsing arXiv, I came across the paper by Li Xiang (@李翔), Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. The GFocal Loss proposed in the paper neatly removes the centerness branch of the FCOS family and brings a significant improvement on the COCO dataset. How could one not like such a good paper? GFL not only removes the hard-to-train centerness branch, it also eliminates the many convolutions on that branch, reducing the computational overhead of the detection head, which makes it very suitable for lightweight deployment on mobile.

Here is Li Xiang's own interpretation of GFocal Loss:

Li Xiang: Generalized Focal Loss, explained in plain language (zhuanlan.zhihu.com)

Lightweight detection head

After finding a suitable loss function, the next key question is how to make it work on a lightweight model. The first thing to optimize for mobile is the detection head. The FCOS family uses a shared-weight detection head: the same set of convolutions predicts boxes on the multi-scale feature maps coming from the FPN, and each level then uses a learnable scale value as a coefficient to rescale the predicted boxes.

[Figure: FCOS model architecture]
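As a side note, the per-level learnable scale mentioned above is just a single scalar multiplier per pyramid level. A minimal PyTorch sketch, following the usual FCOS-style convention (my own illustration, not code from the project):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """A single learnable scalar that FCOS-style heads use to rescale the
    raw box-regression output of one pyramid level."""
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return x * self.scale

# one Scale per FPN level, applied to that level's regression output
scales = nn.ModuleList(Scale(1.0) for _ in range(5))
```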

The advantage of sharing is that the parameter count of the detection head drops to 1/5 of the non-shared case. This matters for large models whose heads alone contain convolutions with hundreds of channels, but for lightweight models sharing the head weights makes much less sense. Since mobile inference runs on the CPU, sharing weights does not speed up inference at all, and when the head is already very light, sharing weights further reduces its detection ability. So it is more appropriate to give each pyramid level its own set of convolutions.

At the same time, the FCOS family uses Group Normalization in the detection head. GN has many advantages over BN, but it has one drawback: at inference time BN can fold its normalization parameters directly into the preceding convolution, so that step of computation disappears, whereas GN cannot. To save the cost of the normalization operation, I chose to replace GN with BN.
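As a concrete illustration of why BN can be absorbed (my own sketch, not NanoDet's code), folding a BatchNorm layer into the convolution that precedes it only requires rescaling the conv weights and bias once, before deployment:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    # per-channel factor gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```

GN, by contrast, still has to compute per-sample group statistics at inference time, so there is nothing analogous to fold.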

The FCOS detection head uses four 256-channel convolutions per branch, that is, eight 256-channel convolutions in total across the box-regression and classification branches, which is very computationally heavy. To lighten it, I first replaced the ordinary convolutions with depthwise separable convolutions and reduced the number of stacked convolutions from four to two. For the channel dimension, 256 is compressed to 96; I chose 96 because keeping the channel count a multiple of 8 or 16 lets you benefit from the parallel acceleration of most inference frameworks. Finally, borrowing from the YOLO series, box regression and classification are computed with the same set of convolutions and then split in two. The figure below shows the structure of the final lightweight detection head, which is very small:

[Figure: NanoDet detection head]
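To make the structure concrete, here is a minimal sketch of a head in this style (my own illustration, not the project's actual code; the number of regression channels, taken as 4 sides x 8 GFL distribution bins, and the activation choices are assumptions):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv + 1x1 pointwise conv, each followed by BN and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class LiteHead(nn.Module):
    """Per-level head: two stacked 96-channel depthwise-separable convs shared by
    both tasks, then one 1x1 conv whose output is split into classification and
    box-regression parts."""
    def __init__(self, channels=96, num_classes=80, reg_channels=32):
        super().__init__()
        self.num_classes = num_classes
        self.reg_channels = reg_channels
        self.convs = nn.Sequential(
            DepthwiseSeparableConv(channels),
            DepthwiseSeparableConv(channels),
        )
        self.pred = nn.Conv2d(channels, num_classes + reg_channels, 1)

    def forward(self, feat):
        out = self.pred(self.convs(feat))
        return torch.split(out, [self.num_classes, self.reg_channels], dim=1)

# sanity check on one stride-8 feature map of a 320x320 input
cls_out, reg_out = LiteHead()(torch.randn(1, 96, 40, 40))
print(cls_out.shape, reg_out.shape)  # (1, 80, 40, 40), (1, 32, 40, 40)
```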

FPN layer improvement

There are many improved versions of FPN: EfficientDet uses BiFPN, YOLOv4 and YOLOv5 use PAN, and there is also Balanced FPN, among others. BiFPN is powerful, but its stacked feature-fusion operations inevitably slow down inference, whereas PAN has only a top-down and a bottom-up path, which is very concise and a good choice for feature fusion in lightweight models.

The original PAN and the PAN used in YOLO both use stride-2 convolutions to scale the larger feature maps down to the smaller ones. In keeping with the lightweight principle, I chose to remove all convolutions inside the PAN and keep only the 1x1 convolutions that project the backbone features to align the channel dimensions; both up-sampling and down-sampling are done by interpolation. Unlike the concatenation used by YOLO, I add the multi-scale feature maps together directly, so the computation of the entire feature-fusion module becomes extremely small.

The structure of the final very small PAN is also very simple:

[Figure: Ultra-lightweight PAN]
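A rough sketch of a convolution-free PAN of this kind (my own illustration, assuming ShuffleNetV2-1.0x stage channels of 116/232/464 and bilinear interpolation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPAN(nn.Module):
    """PAN with no fusion convolutions: 1x1 convs only align channels,
    interpolation does all resizing, and fusion is element-wise addition."""
    def __init__(self, in_channels=(116, 232, 464), out_channels=96):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):  # feats ordered [stride 8, stride 16, stride 32]
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]

        # top-down path: upsample the smaller map and add it to the larger one
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[2:],
                mode="bilinear", align_corners=False)

        # bottom-up path: downsample the larger map and add it to the smaller one
        outs = [laterals[0]]
        for i in range(1, len(laterals)):
            outs.append(laterals[i] + F.interpolate(
                outs[-1], size=laterals[i].shape[2:],
                mode="bilinear", align_corners=False))
        return outs

pan = TinyPAN()
feats = [torch.randn(1, 116, 40, 40), torch.randn(1, 232, 20, 20), torch.randn(1, 464, 10, 10)]
print([o.shape for o in pan(feats)])  # three 96-channel maps at strides 8, 16, 32
```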

Choice of backbone

I originally thought about designing a lightweight backbone myself, but after weighing it up the workload seemed too large (the electricity bill for training models at home is too expensive), so I decided to pick from existing lightweight backbones. The initial candidates were the MobileNet series, GhostNet, ShuffleNet, and the more recent EfficientNet. After comparing parameter counts, computation, and model weights, ShuffleNetV2 was chosen as the backbone: it is the smallest of these models at similar accuracy, and it is also friendly to mobile CPU inference.

In the end I chose ShuffleNetV2 1.0x as the backbone, removed its last convolution layer, and extracted the features down-sampled by 8, 16, and 32 times as input to the PAN for multi-scale feature fusion. The backbone uses the implementation provided by Torchvision, so it can directly load the ImageNet pre-trained weights that Torchvision supplies, which helps the model converge much faster. Incidentally, some recent papers point out that initializing a detector from classification pre-trained weights is not necessarily better than random initialization on the detection task, although random initialization needs more training steps. I have not tested this yet; everyone is welcome to try~
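Purely as an illustration of this wiring (assuming the layer names of torchvision's ShuffleNetV2 implementation; not the project's actual code), the stride-8/16/32 features can be taken from shufflenet_v2_x1_0 while skipping its final conv layer:

```python
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

class ShuffleNetV2Backbone(nn.Module):
    """Return the stride-8/16/32 feature maps of ShuffleNetV2 1.0x (final conv5 unused)."""
    def __init__(self, pretrained=True):
        super().__init__()
        # older torchvision API; newer releases use the `weights=` argument instead
        net = shufflenet_v2_x1_0(pretrained=pretrained)
        self.stem = nn.Sequential(net.conv1, net.maxpool)  # stride 4
        self.stage2 = net.stage2                           # stride 8,  116 channels
        self.stage3 = net.stage3                           # stride 16, 232 channels
        self.stage4 = net.stage4                           # stride 32, 464 channels

    def forward(self, x):
        x = self.stem(x)
        c3 = self.stage2(x)
        c4 = self.stage3(c3)
        c5 = self.stage4(c4)
        return c3, c4, c5

backbone = ShuffleNetV2Backbone(pretrained=False)
feats = backbone(torch.randn(1, 3, 320, 320))
print([f.shape for f in feats])  # 40x40, 20x20, 10x10 for a 320x320 input
```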

[Figure: NanoDet overall model structure]

Model performance

After lightening all three major modules of a one-stage detector (head, neck, backbone), we get the currently open-sourced NanoDet-m model. At an input resolution of 320x320, the whole model needs only 0.72B FLOPs, while YOLOv4-tiny needs 6.96B, nearly ten times more! The model has only 0.95M parameters, and after 16-bit storage with ncnn optimize the weight file is only 1.8 MB. This makes it very suitable for mobile deployment, effectively reduces the size of the app, and is also friendlier to lower-end embedded devices.
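As a quick sanity check on model size (illustrative only, applicable to any of the sketched modules above), parameters can be counted directly in PyTorch; FLOPs need a profiler such as the third-party thop package, if installed:

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Number of trainable parameters of a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# FLOPs/MACs can be measured with a profiler, e.g. the third-party `thop` package:
#   from thop import profile
#   macs, params = profile(model, inputs=(torch.randn(1, 3, 320, 320),))
```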

Although the model is very lightweight, its performance is still strong. For small models people often compare with AP50, a more forgiving metric; here I use the stricter COCO mAP (0.5:0.95), which reflects both classification and localization accuracy. Evaluated on the 5000 images of COCO val without test-time augmentation, the 320-resolution model reaches 20.6 mAP, 4 points higher than tiny-yolov3 and only about 1 point lower than YOLOv4-tiny. When the input resolution is kept the same as YOLO, i.e. both use 416 inputs, the scores are on par.
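For reference, this COCO mAP (0.5:0.95) number is the standard pycocotools metric; the evaluation boilerplate looks like this (file paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# placeholder paths: COCO val ground truth and the detector's results in COCO json format
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("nanodet_val_results.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # the first printed line is AP @ IoU=0.50:0.95, i.e. the mAP quoted above
```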

Finally, I deployed the model on a phone with ncnn and ran the benchmark: the forward pass takes only about 10 ms, while both YOLOv3-tiny and YOLOv4-tiny are on the order of 30 ms. In the Android camera demo app, counting image pre-processing, detection-box post-processing, and drawing the boxes, it easily runs at 40+ FPS~ (I will write a separate article on deploying the model from PyTorch to Android and implementing the GFL box post-processing in C++.)

[Figure: ncnn Android benchmark comparison]

Summary

Overall, NanoDet does not introduce many innovations; it is a purely engineering project. Its main job is to apply several excellent academic papers to lightweight models on mobile, and by combining them to obtain a detection model that balances accuracy, speed, and size.

To let everyone get started with NanoDet quickly and to make training and deployment easy, I have open-sourced all of the PyTorch training code, the ncnn-based Linux and Windows C++ deployment code, and the Android camera demo. The README also contains a very detailed tutorial. Everyone is welcome to use it and to raise issues~

By the way, NanoDet training does not use many data-augmentation tricks, and the model structure is very simple, which means there should still be plenty of room to improve the mAP. If anyone is willing to tinker with it and squeeze out a few more points, nothing could be better.
