Target Detection - Intensive Reading of FPN Papers

1. FPN network structure

Based on the feature pyrimid to detect objects of different scales, there are 4 ideas

(a) Use the image pyramid to construct the feature pyramid, which is calculated independently at each image scale

(b) Only use features of a single scale

(c) Reuse the pyramidal feature hierarchy computed by the convolutional neural network as if it were a feature image pyramid.

(d) Our proposed Feature Pyramid Network (FPN) is as fast as (b) and (c), but more accurate. 

FPN-Structure: Based on the inherent pyramid hierarchy of CNN, the top-down path is constructed through skip connection, and only a small amount of cost is required to generate the feature pyramid, and each scale of the feature pyramid has a high-level semantic feature, and finally at each level of the feature pyramid target detection

 FPN consists of two parts: 1. Bottom-up pathway 2. Top-down pathway and lateral connections

bottom-up path:

Divide the backbone into multiple stages, and define each stage as a pyramid level

Output: In each stage, the size of all layer output feature maps is the same, and the output of the last layer is taken as the output of the stage, because the deepest layer in each stage should have the strongest features

Downsampling: The downsampling ratio between adjacent stages is 2

top-down path:

Motivation: high-level semantic information helps identify targets but is not conducive to locating targets, low-level spatial information is harmful to identifying targets but helps to locate targets

Construction: build top-down path through skip connection

Note: Before starting the top-down path, a 1×1 convolution is used on the top layer of the bottom-up path to generate a lower resolution feature map

skip connection:

1. Upsample the coarser-resolution feature map from the top-down path. The upsampling ratio is 2, and the nearest neighbor upsampling is used for simplicity

2. Use 1×1 convolution to reduce the number of channels of the corresponding feature map from the bottom-up path

3. Perform element-wise addition on the 2 feature maps (the size and the number of channels are the same) obtained in the previous 2 steps

Nearest neighbor interpolation:

 2. FPN-ResNet structure

In this paper, the output of the last 4 stages [C2.C3.C4.C5] of ResNet (relative to the input downsampling ratios are 4, 8, 16, 32) is defined as 4 pyramid levels, and the first stage is not The output of is included in FPN because of its relatively large memory footprint.

 3. Application in Faster RCNN

FPN applied to RPN:

FPN output: [P2,P3,P4,P5,P6], where [P6] is just a downsampling with a step size of 2, which is introduced to cover a larger anchor scale [512*512] RPN structure: 1 3×3 convolution + 2 parallel 1×1 convolutions

RPN input: run the same RPN on 5 pyramid levels

Anchor: 5 levels with a total of 5×3=15 types

Anchor scale: After the introduction of FPN, the anchor on each pyrimid level does not need to be multi-scale. The anchors on each pyramid level have only one scale, and the scales of the anchors on [P2, P3, P4, P5, P6] are

Aspect ratio: There are 3 aspect ratio anchors on each level (1:2, 1:1, 2:1)

 FPN for Fast RCNN

A RoI of size (w, h) on the input image should be assigned to the level Pk on the feature pyramid:

Among them, 224 is the pre-training size of ImageNet, and k0 is the target pyramid level to which a 224×224 RoI should be mapped. The Faster RCNN in the original ResNet uses C4 as the input of the RPN, so this article sets k0 to 4.

If the scale of the RoI is less than 224×224 (such as 112×112, which is exactly half of 224), it will be mapped to a layer with a large number of pixels (such as 3).

ResNet uses conv5 as the head at the top of the feature map output by conv4, but this article has used conv5 to build FPN. Therefore, this article uses RoI pooling to generate a 7×7 feature, then uses two 1024-dimensional FC layers + ReLU, and then inputs to the final classification layer and BBox regression layer. Compared to the standard conv5 head, our method has fewer parameters and is faster 

4. Details of Faster R-CNN+FPN

 In fact, FPN is not an independent target detection algorithm, but is equivalent to a Backbone network!

Guess you like

Origin blog.csdn.net/acx0000/article/details/123029274