[Principle] One article to understand FPN (Feature Pyramid Networks)

Paper: Feature Pyramid Networks for Object Detection
Paper link: https://arxiv.org/abs/1612.03144

This paper, from CVPR 2017, applies feature pyramids to object detection. The author proposes a multi-scale object detection algorithm, FPN (Feature Pyramid Networks). FPN introduces the feature pyramid model into Faster R-CNN, achieving state-of-the-art results without sacrificing memory or speed, and also performing well on small-object detection.

1. Introduction

Image pyramid

When the human eye looks at an object, a notable property is that near objects appear large and sharp while far objects appear small and blurred. Yet regardless of the distance, to the human eye, A is still A. How can a computer simulate this effect of distance? In image processing, the answer is the scale space. Scale space represents an image at different degrees of blur: the image is smoothed with Gaussian kernels of different sizes, so that scale-invariant features can be found, as in the hand-crafted SIFT feature descriptor.

In effect, the scale space seems to mimic a 3-dimensional environment, since a single image is only 2-dimensional. As for why this is done, and why such a 3D environment needs to be simulated, I personally have not thought about it much.

An image pyramid is a set of downsampled versions of an image at different resolutions. The Gaussian pyramid first convolves the image with a Gaussian kernel and then downsamples it by a fixed ratio, repeating this to generate images at different scales that together form the pyramid.
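As a minimal sketch of this idea (plain NumPy with a hand-rolled separable Gaussian filter, not any particular library's implementation), a Gaussian pyramid can be built by smoothing and then downsampling by 2 at each level:

```python
import numpy as np

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian smoothing via two 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    # Convolve rows, then columns ('same' keeps the image size unchanged).
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)
    return blurred

def gaussian_pyramid(img, levels=4):
    """Blur first, then drop every other pixel, once per level."""
    pyramid = [img]
    for _ in range(levels - 1):
        img = gaussian_blur(img)[::2, ::2]
        pyramid.append(img)
    return pyramid

img = np.random.rand(64, 64)
pyr = gaussian_pyramid(img, levels=4)
print([p.shape for p in pyr])  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```

Blurring before downsampling is what distinguishes a Gaussian pyramid from naive subsampling: the smoothing suppresses high frequencies that would otherwise alias at the lower resolution.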

2. FPN

In object detection, recognizing objects at different scales is a major challenge. The paper also notes that most earlier detection algorithms use only the top-level features (the last convolutional layer) for prediction. Low-level features carry little semantic information but localize targets accurately; high-level features are semantically rich but localize targets only coarsely.

So if you want to introduce features of different scales, what should you do?

Figure a: use an image pyramid to generate multi-scale features. Many images at different resolutions are generated, and each resolution is passed through its own ConvNet forward pass to produce a feature map. Obviously, this approach consumes a huge amount of computation and memory and is not very practical.

Figure b: the most common CNN structure, where only the feature map after the last convolutional layer is used.

Figure c: the approach of SSD. Starting from conv4, the feature map of each subsequent layer is used for prediction, so features at multiple scales are obtained at little extra cost. However, the FPN authors point out that SSD does not reuse the lower-level feature maps, even though lower-level features are advantageous for recognizing small objects.

Figure d: the network architecture of FPN. As the figure shows, an extra upsampling process is added: each feature map is upsampled, fused with the shallower features of the same size, and then used for an independent prediction.

The figure shows the network structure of FPN, which consists of three parts: the bottom-up pathway, the top-down pathway, and the lateral connections.

Bottom-up pathway

This is simply an ordinary backbone convolutional network, which naturally contains feature levels at different scales. In the paper, consecutive layers whose feature maps have the same size are called a stage. For example, ResNet-50 consists of 50 layers, but the spatial size does not shrink at every layer; grouped by size, there are 5 stages in total. The feature pyramid is built per stage: the output of the last layer of each stage forms one level of the pyramid.
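To make the stage bookkeeping concrete, here is a hand-written summary of ResNet-50's stage layout (the layer counts and strides are the standard ones for this architecture, tabulated by hand rather than pulled from any code):

```python
# Stage layout of ResNet-50:
# (stage name, number of conv layers in the stage, output stride w.r.t. the input).
stages = [
    ("conv1",   1, 2),   # single 7x7 conv, stride 2
    ("conv2_x", 9, 4),   # 3 bottleneck blocks x 3 convs (after a stride-2 max pool)
    ("conv3_x", 12, 8),  # 4 bottleneck blocks
    ("conv4_x", 18, 16), # 6 bottleneck blocks
    ("conv5_x", 9, 32),  # 3 bottleneck blocks
]

# 1 + 9 + 12 + 18 + 9 = 49 conv layers; the final fully-connected layer
# brings the count to 50, hence the name ResNet-50.
total_convs = sum(n for _, n, _ in stages)
print(total_convs)  # 49

# Only the last feature map of conv2_x..conv5_x enters the pyramid (C2..C5);
# conv1 is excluded, matching the paper's memory-footprint remark below.
pyramid_inputs = {name: stride for name, _, stride in stages if name != "conv1"}
print(pyramid_inputs)  # strides 4, 8, 16, 32, matching the paper
```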

Top-down pathway and lateral connections

The right side of the figure is the top-down pathway, which is an upsampling process. The upsampled feature map is merged, through a lateral connection, with the bottom-up feature map of the same size. The paper uses nearest-neighbour interpolation for the upsampling. After the merge, a 3×3 convolution is applied to generate the final feature map; the author says this reduces the aliasing effect introduced by the upsampling and merging. The enlarged part of the figure shows the merge operation in detail. Finally, a prediction is made on every level of the feature pyramid.
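The merge step can be sketched in a few lines of NumPy (a toy illustration with random weights, not the trained network; the shapes assume a 224×224 input and the paper's 256-channel pyramid):

```python
import numpy as np

def conv1x1(feat, w):
    """Lateral 1x1 convolution: a per-pixel linear map over channels.
    feat: (C_in, H, W), w: (C_out, C_in) -> returns (C_out, H, W)."""
    return np.tensordot(w, feat, axes=([1], [0]))

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling, as in the paper's top-down pathway."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
d = 256  # pyramid channel width used in the paper
# Toy bottom-up features C4 (stride 16) and C5 (stride 32) for a 224x224 input.
C4 = rng.standard_normal((1024, 14, 14))
C5 = rng.standard_normal((2048, 7, 7))
w4 = rng.standard_normal((d, 1024)) * 0.01
w5 = rng.standard_normal((d, 2048)) * 0.01

P5 = conv1x1(C5, w5)                    # top of the pyramid
P4 = conv1x1(C4, w4) + upsample2x(P5)   # lateral connection + upsampled merge
# (in the real network a 3x3 conv follows here to reduce aliasing)
print(P5.shape, P4.shape)  # (256, 7, 7) (256, 14, 14)
```

The 1×1 lateral convolution exists precisely so that the element-wise addition is well defined: it brings every C-level, whatever its channel count, down to the same d channels as the upsampled map.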

3. FPN application

FPN is a general-purpose structure that can be applied to different backbone detection networks with only minor modifications. The paper demonstrates this by combining FPN with RPN.

For example, consider a Faster R-CNN network that uses ResNet as the backbone and RPN as the region proposal network.

For example, in this ResNet bottom-up network, each row represents one stage.

we denote the output of these last residual blocks as { C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of { 4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.

The function of FPN is:

This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.

The combination of FPN and RPN is:

Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. As in the original RPN, we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid. (P6 is an extra stride-64 level obtained by subsampling P5, introduced only for the RPN.)
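This anchor scheme is easy to reproduce: one scale per pyramid level, three aspect ratios per scale. The snippet below is a hypothetical sketch that keeps each anchor's area fixed at scale² while varying the ratio:

```python
import numpy as np

# One anchor scale per pyramid level, as quoted above.
scales = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
# Aspect ratios 1:2, 1:1, 2:1, expressed here as height/width.
ratios = [0.5, 1.0, 2.0]

anchors = []
for level, s in scales.items():
    for r in ratios:
        # Keep the area s*s fixed while varying the aspect ratio.
        w = s / np.sqrt(r)
        h = s * np.sqrt(r)
        anchors.append((level, w, h))

print(len(anchors))  # 15 anchors over the pyramid, matching the paper
```

Because the scale is tied to the level rather than repeated at every level, the pyramid itself supplies the scale variation, and each level only needs the three ratio variants.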

Origin blog.csdn.net/Eyesleft_being/article/details/120989953