Neural Network Study Notes 8 - FPN Theory and Code Understanding

Series Article Directory

Bilibili explanation of RPN, related to target segmentation


Foreword

The basic idea behind Feature Pyramid Networks (FPN) is to construct a series of images or feature maps at different scales for training and testing, so as to improve the robustness of a detection algorithm to targets of different sizes. However, computing a pyramid directly according to the original definition (rescaling the image and extracting features at every scale) brings a large computational overhead. To reduce this cost, FPN instead fuses features across scales, which significantly improves the scale robustness of the feature representation without greatly increasing the amount of computation.

1. Pyramid structure

[Figure: the four pyramid structures (a)-(d) compared in the FPN paper]

Figure (a)

First, the original image is rescaled to different sizes to build an image pyramid, and features are then extracted from each scale of the pyramid; a prediction has to be made once for every scale.
Advantages: features are computed directly from the image at each scale, so accuracy is good.
Disadvantages: the more scales there are, the larger the amount of computation; there are many repeated operations, efficiency is low, and memory usage is large.
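As a rough illustration of figure (a), the sketch below (a PyTorch snippet with a hypothetical `detector` callable, not part of the original post) rescales one image to several sizes and runs the same detector on every scale, which is exactly where the repeated computation comes from.

```python
# Minimal sketch of the image-pyramid approach in figure (a); `detector` is assumed.
import torch
import torch.nn.functional as F

def detect_on_image_pyramid(image, detector, scales=(1.0, 0.5, 0.25)):
    """image: tensor of shape (1, 3, H, W); detector: any callable returning predictions."""
    results = []
    for s in scales:
        # Each scale is processed independently, so cost grows with the number of scales.
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        results.append(detector(scaled))
    return results
```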

Figure (b)

The original image passes through the backbone's convolution and pooling layers, each stage producing a new, smaller feature map from the previous output, so a pyramid is built in the feature space of the image. The assumption is that shallow layers capture detailed information while deep layers capture semantic information, which is more helpful for accurate detection. Therefore, after several downscalings, prediction and classification are performed only on the final feature map.

Advantages: fast, with low memory usage.
Disadvantages: small-target features are easily lost, because only the features of the last, deepest layer are used and the features of all other layers are ignored.
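A minimal sketch of the figure (b) setup, assuming a toy PyTorch backbone that is not the original post's code: only the deepest feature map is used for prediction.

```python
# Figure (b): prediction only from the last, coarsest backbone output (illustrative).
import torch
import torch.nn as nn

backbone = nn.Sequential(                       # toy backbone: three stride-2 stages
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
)
head = nn.Conv2d(256, 5, 1)                     # e.g. 4 box coordinates + 1 objectness score

x = torch.randn(1, 3, 224, 224)
final_feature = backbone(x)                     # only the last feature map is kept
pred = head(final_feature)                      # small objects may already be lost here
```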

Figure (c)

The original image again passes through the backbone's convolution and pooling layers, producing feature maps of different sizes and thus a pyramid in feature space. Here, however, low-level and high-level features are used to make predictions separately: different predictions come from feature maps of different scales, which reduces the loss of small targets.

Advantages: each layer outputs the targets appropriate to its own scale, and a prediction does not have to pass through all layers, so it is faster and detection performance improves.

Disadvantages: the obtained features are not robust; the shallower layers provide only semantically weak features, so the predictions made from them are easily affected.
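The figure (c) idea can be sketched as follows (an illustrative, SSD-style PyTorch snippet, not taken from the post): each intermediate feature map gets its own prediction head and there is no fusion between levels.

```python
# Figure (c): independent prediction heads on intermediate feature maps, no fusion.
import torch
import torch.nn as nn

class MultiLevelPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, 2, 1), nn.ReLU())
        # one head per level; the shallow levels remain semantically weak
        self.heads = nn.ModuleList(nn.Conv2d(c, 5, 1) for c in (64, 128, 256))

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        return [head(c) for head, c in zip(self.heads, (c1, c2, c3))]

preds = MultiLevelPredictor()(torch.randn(1, 3, 224, 224))
```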

Figure (d)

FPN also passes the original image through the backbone's convolution and pooling layers, obtains feature maps of different sizes from the successive outputs, and builds a pyramid in feature space. However, instead of simply predicting from each scale as in Figures (b) and (c), it fuses feature maps of different scales before making predictions.

A simple summary is: a bottom-up pathway, a top-down pathway, lateral connections, and convolution after fusion.

2. FPN structure

1. Local

Feature maps at different levels have different expressive power: shallow features mainly reflect details such as brightness and edges, while deep features reflect richer overall structure and semantics. Shallow features alone cannot convey the overall structure, which weakens the expressiveness of the representation. Deep features, on the other hand, are built from the shallow features and therefore already contain some of their information. If the deep features are fused back into the shallow ones, the result takes both details and overall semantics into account, so the fused features have richer expressive power.
FPN adopts exactly this idea. Several feature maps are taken from the backbone's feature hierarchy; they naturally form a shallow-to-deep hierarchy, and the deep features are merged into the shallow layers step by step to form a new feature pyramid. Each layer of this new pyramid combines shallow and deep information, and every layer is used for detection, so that targets of different scales can be detected. This way of constructing features exploits the hierarchical structure of the network itself, can be trained end to end from the original image, and achieves multi-scale object detection without significantly increasing computational or memory overhead.
[Figure: lateral 1x1 convolutions and 2x upsampling in the top-down pathway]

  1. The feature maps used in FPN must be chosen so that adjacent levels differ in size by a factor of 2. Suppose the bottom feature map is 28×28, the next layer up is 14×14, and the topmost layer is 7×7; these form the original pyramid on the left.
  2. Each feature map of the left pyramid goes through a 1×1 conv that adjusts the channel count of the different backbone feature maps, so that all levels have the same number of channels and can be fused.
  3. The 2x up operation on a higher-level feature map is a 2-times upsampling. For example, upsampling the topmost 7×7 feature map by a factor of 2 gives a 14×14 feature map, matching the size of the middle-layer feature map.
  4. After this processing, the topmost and middle feature maps have exactly the same shape and can be added element-wise. For example, the topmost 7×7 feature map is upsampled by 2, the middle 14×14 feature map has its channels adjusted by the 1×1 convolution, and the two are added to produce a new pyramid level. This new feature map can be used for prediction, and it also serves as the input for the upsampling of the next level down (see the sketch after this list).
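The following sketch walks through one fusion step from the list above, using the 7×7 and 14×14 sizes from the example; the channel numbers (512 and 256) are assumptions for illustration, not values from the post.

```python
# One lateral-connection fusion step: 1x1 conv to align channels, 2x upsample, add.
import torch
import torch.nn as nn
import torch.nn.functional as F

c_top = torch.randn(1, 512, 7, 7)      # deeper (topmost) feature map
c_mid = torch.randn(1, 256, 14, 14)    # next shallower feature map

p_top = nn.Conv2d(512, 256, kernel_size=1)(c_top)    # topmost level: 1x1 conv only
lateral = nn.Conv2d(256, 256, kernel_size=1)         # 1x1 conv aligns the channel count

# 2x upsample the top level so its spatial size matches the middle level, then add
p_mid = lateral(c_mid) + F.interpolate(p_top, scale_factor=2, mode="nearest")
print(p_mid.shape)   # torch.Size([1, 256, 14, 14]); used for prediction and the next step
```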

2. Overall

Combining FPN with a ResNet model to show the overall structure; the diagram comes from a Bilibili video. A code sketch of this structure follows the list below.
[Figure: overall ResNet-50 + FPN structure, C2-C5 to P2-P6]

  1. With ResNet-50 as the backbone, the original pyramid is produced: the input image goes from 640×640×3 down to 20×20×2048, giving the four feature maps C2, C3, C4, and C5.
  2. C2, C3, C4, and C5 each go through a 1×1×256 convolution. This convolution ensures that the two feature maps being fused have the same number of channels; in the original paper the channel count is set to 256.
  3. The Upsample operation performs 2-times upsampling on the higher-level feature map before it is fused with the feature map one level below. For example, after C5 passes through its 1×1×256 convolution, it is upsampled by 2 and then fused with the C4 feature map that has gone through its own 1×1×256 convolution.
  4. Each fused output then goes through a 3×3×256 convolution, finally yielding P2, P3, P4, and P5. P5 is obtained from C5 by applying the 1×1×256 convolution followed by the 3×3×256 convolution; no fusion is needed for it.
  5. On top of P5, a 1×1×256 downsampling with stride 2 produces P6. P6 is used only by the RPN, which means the RPN can predict from P2-P6, while the Fast R-CNN head uses only P2-P5 for prediction.
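Below is an illustrative re-implementation of the overall structure described above, assuming C2-C5 come from ResNet-50 with 256, 512, 1024, and 2048 channels; it is a sketch for understanding, not the code linked in the next section.

```python
# Sketch of the C2-C5 -> P2-P6 structure: 1x1 laterals, top-down 2x upsampling + add,
# 3x3 smoothing convs, and a stride-2 1x1 downsampling of P5 for P6.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs that bring C2-C5 to a common channel count (256 in the paper)
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs applied after fusion to produce P2-P5
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)
        # stride-2 1x1 downsampling of P5 that produces P6 (used only by the RPN)
        self.downsample = nn.Conv2d(out_channels, out_channels, 1, stride=2)

    def forward(self, c2, c3, c4, c5):
        # top-down pathway: start from C5, upsample 2x, and add the lateral output
        m5 = self.lateral[3](c5)
        m4 = self.lateral[2](c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral[1](c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        m2 = self.lateral[0](c2) + F.interpolate(m3, scale_factor=2, mode="nearest")
        p2, p3, p4, p5 = (self.smooth[i](m) for i, m in enumerate((m2, m3, m4, m5)))
        return p2, p3, p4, p5, self.downsample(p5)

# spatial sizes for a 640x640 input: C2..C5 are 160, 80, 40, 20 on each side
feats = [torch.randn(1, c, s, s) for c, s in
         ((256, 160), (512, 80), (1024, 40), (2048, 20))]
p2, p3, p4, p5, p6 = SimpleFPN()(*feats)
```

With these shapes, P2-P5 come out at 160, 80, 40, and 20 on each spatial side with 256 channels, and P6 at 10×10, matching step 5 of the list above.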

3. Code

ResNet+FPN implementation, with free reference code

Original post: blog.csdn.net/qq_45848817/article/details/128417039