Deep and Shallow Feature Fusion: CBNet

Foreword

This series of posts, "Deep and Shallow Feature Fusion", briefly introduces several recent deep-and-shallow feature fusion algorithms. Most of the posts are my paper notes, recording the questions that occur to a deep-learning beginner while reading a paper.


Paper title: CBNet: A Novel Composite Backbone Network Architecture for Object Detection

Paper link: https://arxiv.org/pdf/1909.03625.pdf

GitHub: https://github.com/PKUbahuangliuhe/CBNet

Overview

This paper comes from Peking University and was published in September 2019, so it is quite recent.

Research background: the authors argue that in current deep-learning-based object detectors, the backbone network responsible for feature extraction was mostly designed originally for image classification. Directly reusing the features extracted by such networks for object detection on different datasets may not achieve the best results.

Research approach: designing a brand-new backbone and pre-training it from scratch is slow and difficult, so the authors take the fusion of existing backbones as the starting point of their research.

Feature fusion algorithm: multiple backbone networks with the same structure but different parameters are placed in parallel; the features of each stage of adjacent backbones are fused laterally, in one direction, and only the features extracted by the final (lead) backbone are used for the downstream object detection or semantic segmentation task.
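To make the data flow concrete, here is a minimal PyTorch-style sketch of a two-backbone CBNet forward pass under my reading of the paper. All names (`CompositeConnection`, `cbnet_forward`, etc.) are my own, not from the official repo; the 1x1 conv + BN + upsample composite connection follows my understanding of the paper's design:

```python
import torch.nn as nn
import torch.nn.functional as F

class CompositeConnection(nn.Module):
    """Composite connection g(.): 1x1 conv + BN to match channels,
    then upsampling so the assisting backbone's stage-l output matches
    the spatial size of the lead backbone's stage-l input."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, target_hw):
        return F.interpolate(self.bn(self.conv(x)),
                             size=target_hw, mode='nearest')

def cbnet_forward(assist_stages, lead_stages, composites, x):
    """assist_stages / lead_stages: lists of per-stage nn.Modules with
    the same architecture but independent weights; composites: one
    CompositeConnection per fused stage (l = 1 .. L-1)."""
    # Run the assisting backbone, keeping every stage output.
    feats, h = [], x
    for stage in assist_stages:
        h = stage(h)
        feats.append(h)
    # Run the lead backbone; before stage l, fuse in the assisting
    # backbone's stage-l output (higher level, hence the upsampling).
    outs, h = [], x
    for l, stage in enumerate(lead_stages):
        if l >= 1:
            h = h + composites[l - 1](feats[l], h.shape[-2:])
        h = stage(h)
        outs.append(h)
    return outs  # only the lead backbone's features feed the detector
```

The upsampling step is what makes this the "left higher-level + right lower-level" scheme discussed in the experiments below.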

Algorithm introduction

The idea of backbone compositing proposed in this paper is easy to grasp; the following figure makes it clear:

[Figure: illustration of the proposed Composite Backbone Network (CBNet) architecture for object detection]

The advantage of this fusion is that, compared with a traditional single backbone, the features extracted at each stage of the Lead Backbone include components that have already passed through the convolutions of the same stage in the assisting backbones. Roughly speaking, those features have been processed by that stage's convolution kernels several times.
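As I read the paper, this can be written as a recursion: with $F_k^l$ denoting the $l$-th stage of the $k$-th backbone and $g$ the composite connection (1x1 conv + BN + upsampling),

$$x_k^l = F_k^l\big( x_k^{l-1} + g(x_{k-1}^l) \big), \qquad l \ge 2 .$$

Unrolling this over $k$ shows that the lead backbone's stage-$l$ output mixes in terms that have already been processed by the stage-$l$ convolutions of every assisting backbone, which is exactly the "several times" intuition above.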

Comparative Experiments

  • Comparison of different composite methods

The authors compared several composite styles between adjacent backbones and found that the "left higher-level + right lower-level" form (Adjacent Higher-Level Composition, AHLC, which feeds the assisting backbone's higher-level features into the lead backbone's lower-level stages) works best. When the paper compares these composite styles and tries to analyze the reasons, the analysis feels somewhat forced to me. From the article alone I still cannot quite see why "left higher-level + right lower-level" enhances the lead backbone while "left lower-level + right higher-level" harms it. Discussion is welcome.

  • Comparison of the number of backbones

Through experiments, the authors found that overall accuracy keeps improving as the number of backbones increases, but the growing memory cost cannot be ignored. In the end, the authors recommend using two or three backbones.

Using two backbones expands the model size to less than twice the original (model sizes are reported in MB in the paper), because only the backbone parameters are duplicated while the detection head is shared.
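As a back-of-the-envelope check (the sizes below are hypothetical, chosen only to illustrate the point; they are not the paper's numbers):

```python
# Hypothetical checkpoint sizes in MB, for illustration only:
backbone_mb = 170  # one backbone's parameters
head_mb = 80       # neck + detection head, shared, not duplicated

single = backbone_mb + head_mb    # 250 MB with one backbone
dual = 2 * backbone_mb + head_mb  # 420 MB with two backbones
assert dual < 2 * single          # 420 < 500: less than twice
```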

Question log

The following are all questions that popped into my mind while reading the paper; they suit a beginner's routine of patching knowledge gaps one at a time.

Questions about this article:

  • How should the benchmark be understood?

  • What does "identical" in "identical backbones" mean? — Only that the structures are the same; the weights differ.

  • When comparing CBNet with RCNN, why does the detector need to be pre-trained when RCNN is used as its backbone?

  • What are the common upsample operations? Besides changing spatial dimensions, does upsampling have other uses?

  • Why is the backbone not kept the same in the one-stage methods?

Questions beyond this article:

  • ResNet vs. ResNeXt

  • Some newly proposed object detection algorithms worth reading

  • DetNet, FishNet? What distinguishes a backbone designed for object detection from one originally designed for image classification?

  • RCNN?

  • FPN, RPN?

  • Detectron?

  • HRNet? — Its feature resolution remains unchanged

  • Learning-rate warm-up

  • Soft-NMS?

  • How exactly are the common object detection metrics computed?

  • Single-/multi-scale training and inference?
