Foreword
This blog series, "Deep and Shallow Feature Fusion," briefly introduces several recent deep and shallow feature fusion algorithms. Most posts are my paper-reading notes, recording the questions that occur to a deep learning beginner while reading papers.
Paper title: CBNet: A Novel Composite Backbone Network Architecture for Object Detection
Paper link: https://arxiv.org/pdf/1909.03625.pdf
GitHub repository: https://github.com/PKUbahuangliuhe/CBNet
Overall introduction
This paper is from Peking University, published in September 2019, so it is quite recent.
Research background: the author argues that in current deep-learning-based object detection algorithms, the backbone networks responsible for feature extraction were mostly designed for image classification. When the features extracted by these networks are used directly for object detection on different datasets, the results may not be optimal.
Research method: designing and pre-training a new backbone from scratch is slow and difficult, so the author instead takes the fusion of existing backbones as the starting point of the research.
Feature fusion algorithm: multiple backbone networks with the same structure but different parameters are placed in parallel; the features of each stage of adjacent backbones are fused laterally in one direction, and only the features extracted by the final backbone are used for the downstream object detection or semantic segmentation task.
Algorithm introduction
The backbone-fusion idea proposed in this paper is easy to grasp from the following figure:
The advantage of this fusion is that, compared with a traditional single backbone, the features extracted at each stage of the Lead Backbone include features that have already passed through the convolution kernels of that stage in the assisting backbones. Roughly speaking, each stage's convolutions are applied several times.
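The one-direction, stage-by-stage fusion described above can be sketched with toy scalar "features". This is a minimal sketch: the multiply-by-a-weight stages, the 0.5 scaling in `composite`, and all numbers are hypothetical stand-ins — in the actual CBNet, the composite connection is a 1x1 convolution with batch normalization plus upsampling.

```python
def make_backbone(stage_weights):
    # A toy backbone: each "stage" just multiplies its input by a weight.
    return [lambda x, w=w: w * x for w in stage_weights]

def forward_cbnet(backbones, x, composite=lambda f: 0.5 * f):
    # Run K structurally identical backbones left to right. The input of
    # stage l of backbone k is its own previous stage's output plus the
    # composited stage-l output of backbone k-1 (one-direction lateral fusion).
    prev_feats = None
    for backbone in backbones:
        feats, h = [], x
        for l, stage in enumerate(backbone):
            if prev_feats is not None:
                h = h + composite(prev_feats[l])  # fuse assisting-backbone feature
            h = stage(h)
            feats.append(h)
        prev_feats = feats
    # Only the Lead (last) backbone's stage outputs go to the detection head.
    return feats

# Two backbones with the same structure (2 stages) but different weights.
lead = forward_cbnet([make_backbone([2.0, 3.0]),
                      make_backbone([1.0, 1.0])], x=1.0)
```

Note how only the last backbone's `feats` are returned, matching the paper's point that the assisting backbones exist purely to enrich the Lead Backbone's features.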
Comparative Experiment
- Comparison of different composite methods
The author compared the effects of different composite styles between adjacent backbones, and found that the "left higher-level + right lower-level" form works best. When the paper compares the several composite styles and tries to analyze the reasons, the analysis feels a bit far-fetched to me. From the article alone, I still cannot quite see why "left higher-level + right lower-level" enhances the latter backbone while "left lower-level + right higher-level" harms it. Discussion is welcome.
- Comparison of the number of backbones
The author found through experiments that as the number of backbones increases, the overall accuracy keeps improving, but the accompanying memory cost cannot be ignored. In the end, the author recommends using 2 or 3 backbones.
Two backbones expand the model size to less than twice the original (MB in the figure):
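The "less than twice" figure follows from the fact that only the backbone is duplicated while the detection head and neck are shared. A back-of-the-envelope sketch (the 60 MB / 20 MB numbers are purely illustrative assumptions, not values from the paper):

```python
BACKBONE_MB = 60.0  # hypothetical size of one backbone
HEAD_MB = 20.0      # hypothetical size of the shared detection neck/head

def model_size_mb(num_backbones):
    # Composite connections add a small 1x1-conv overhead, ignored here.
    return num_backbones * BACKBONE_MB + HEAD_MB

ratio = model_size_mb(2) / model_size_mb(1)  # 140 / 80 = 1.75, i.e. < 2x
```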
Question record
The following are questions that popped into my head while reading the paper, suitable for a beginner's daily patching of knowledge gaps.
Questions about this article:

| Q | A |
| --- | --- |
| How to understand the benchmark? | |
| What does the "identical" in "identical backbone" mean? | Only that the structure is identical; the weights are different |
| When comparing CBNet and RCNN, why does pre-training need to be done when RCNN is used as the backbone of the detector? | |
| What are the common upsample operations? Besides changing the feature dimensions, does upsample have other uses? | |
| Why isn't the backbone of the one-stage methods kept the same? | |
Questions outside of this article:

| Q | A |
| --- | --- |
| ResNet and ResNeXt | |
| Some newly proposed object detection algorithms worth a look | |
| DetNet, FishNet? What is the difference between a backbone designed for object detection and a classic image classification backbone? | |
| RCNN? | |
| FPN, RPN? | |
| Detectron? | |
| HRNet? Resolution remains unchanged | |
| Learning rate warm-up | |
| Soft-NMS? | |
| How the common object detection metrics are specifically computed | |
| Single/multi-scale training/inference? | |