FPN and PAN: principles and differences


  Both FPN and PAN were proposed to address the difficulty of multi-scale detection in object detection; PAN builds on FPN and remedies some of its shortcomings. The following is a detailed introduction to their principles and differences.

FPN

  The full name of FPN is Feature Pyramid Network, a method for handling multi-scale problems proposed by FAIR in 2017. The main idea of FPN is to extract target features at different scales by constructing pyramidal feature maps, thereby improving detection accuracy.
  FPN takes the backbone's feature maps at several resolutions, upsamples the low-resolution (deep) maps, and merges each with the corresponding high-resolution (shallow) map through a lateral connection, forming a pyramid. In this process, the information at each level is fused with that of the adjacent levels, so the semantic information in the high-level feature maps is preserved while the detail in the low-level feature maps is enriched by the high-level ones. After such processing, FPN improves accuracy on multi-scale detection tasks with little extra cost in detection speed.

  The main idea of FPN is to build feature pyramids at different levels of the image so that objects of different scales can be captured.

  The core of FPN is feature fusion, and its basic steps are as follows:

  1. The input image is passed through a convolutional neural network to obtain a series of feature maps, each corresponding to a stage of the network.
  2. The deeper (lower-resolution) feature maps are upsampled to the same size as the shallower feature maps. The upsampling can be performed using methods such as interpolation.
  3. The upsampled deeper feature map is fused with the shallower feature map; element-wise addition is used here (after a 1×1 convolution brings the two maps to the same channel count).
  4. The fused feature map is convolved to further blend the information.
  5. Repeat steps 2-4 until all feature maps have been fused; a minimal code sketch is given below. The resulting feature pyramid contains feature maps at multiple scales, which can be used for tasks such as object detection and segmentation.
    [Figure omitted: comparison of multi-scale feature designs from the FPN paper; FPN corresponds to design (d).]
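  Below is a minimal sketch of a single fusion step (steps 2-4) in PyTorch. The channel counts, feature sizes, and variable names (c3, c4, and so on) are illustrative assumptions in the style of a ResNet backbone, not code from the original post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed inputs: c4 is a deeper feature map (1/32 scale, 2048 channels),
# c3 a shallower one (1/16 scale, 1024 channels), as in a ResNet backbone.
c4 = torch.randn(1, 2048, 8, 8)
c3 = torch.randn(1, 1024, 16, 16)

lateral_c4 = nn.Conv2d(2048, 256, kernel_size=1)  # 1x1 convs unify channels
lateral_c3 = nn.Conv2d(1024, 256, kernel_size=1)
smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # step 4: post-fusion conv

p4 = lateral_c4(c4)
# Step 2: upsample the deeper map to the shallower map's size by interpolation.
p4_up = F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
# Step 3: element-wise addition fuses the two levels; step 4 smooths the result.
p3 = smooth(lateral_c3(c3) + p4_up)
print(p3.shape)  # torch.Size([1, 256, 16, 16])
```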

Fusion process of FPN

  In FPN, the fusion of shallow and deep feature maps is done through a bottom-up backbone pass followed by a top-down pass. Specifically, FPN takes the deep feature maps, which have lower resolution but stronger semantics, upsamples them step by step, and sums each with the corresponding shallow feature map, finally obtaining multi-scale feature maps. The specific process of fusion is as follows:

  1. Bottom-up pyramid generation: FPN first uses a network such as ResNet as the backbone and generates a series of feature maps from the bottom up. Each feature map has lower resolution than the previous layer but stronger semantics.

  2. Top-down feature fusion: FPN then starts from the top of the bottom-up sequence (i.e., the lowest-resolution but most semantic feature map), doubles its resolution by upsampling, and sums the result with the next feature map down in the sequence, which has higher resolution but weaker semantics, to obtain a new feature map. This is FPN's top-down pathway.

  3. Lateral connections for feature fusion: before the addition in step 2, each shallow feature map passes through a 1×1 convolution so that its channel count matches that of the upsampled map; the two maps, now of equal size, are added to generate a new feature map. This "lateral connection" effectively transfers semantic information from the low-resolution feature maps into the high-resolution ones.

  4. Repeat steps 2 and 3: FPN applies the same operations at every level, resulting in a multi-scale feature pyramid. Each feature map in this pyramid corresponds to a different resolution of the input image, which enables FPN to detect objects of different scales simultaneously.
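  The whole top-down pass can be written compactly as a loop. The following sketch assumes backbone outputs c2-c5 with the usual ResNet channel counts (256, 512, 1024, 2048); the class name and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # Step 3: 1x1 lateral convs; step 4: 3x3 smoothing convs.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooths = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats: [c2, c3, c4, c5], fine to coarse
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Step 2: start from the coarsest map and repeatedly upsample + add.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [sm(lat) for sm, lat in zip(self.smooths, laterals)]  # [p2..p5]

feats = [torch.randn(1, c, s, s)
         for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
pyramid = FPN()(feats)
print([p.shape[-1] for p in pyramid])  # [64, 32, 16, 8]
```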

  Overall, FPN achieves the fusion of shallow and deep feature maps through upsampling and lateral-connection operations, thereby improving the detector's ability to handle objects of different scales. The top-down upsampling pathway gives every output level both high resolution and strong semantics, so the detailed information of targets is better preserved.
  The advantage of FPN is that it naturally fuses feature maps of different scales, improving the accuracy of object detection and segmentation. Its disadvantage is the extra computation, which lengthens training and inference.

PAN

  The full name of PAN is Path Aggregation Network, a method for handling multi-scale problems proposed in the PANet paper in 2018.

  PAN (Path Aggregation Network) is a deep neural network architecture originally proposed for instance segmentation. The main idea of PAN is to improve accuracy by aggregating feature maps from different levels so that the information in each feature map is fully utilized. Like FPN, PAN is a pyramidal feature-extraction network, but it adds an extra bottom-up feature propagation path.
  PAN keeps FPN's top-down path and appends a bottom-up path: starting from the highest-resolution fused map, it downsamples step by step and fuses each result with the feature map at the next level by element-wise addition rather than concatenation. Addition keeps the channel count fixed, and the extra path lets the fine detail of the low-level feature maps reach the high levels over a much shorter route, thereby improving detection accuracy.
  In PAN, the backbone usually adopts a common convolutional structure such as ResNet. After the top-down stage, PAN introduces a bottom-up side branch that passes high-resolution, low-level feature maps up into the low-resolution, high-level layers. This branch runs parallel to the top-down path and consists of a series of convolutions with stride-2 downsampling, bringing each high-resolution map to the same resolution as the next coarser feature map.

  When fusing feature maps of different resolutions, PAN uses a method similar to FPN's, but in the opposite direction. Specifically, in PAN the higher-resolution feature map is first downsampled and then combined with the lower-resolution feature map at the next level to obtain a richer feature map. Next, a convolution is applied to this fused map to obtain the final feature representation.
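  A minimal sketch of this bottom-up branch, applied on top of FPN outputs p2-p5, might look as follows (the class name, channel count, and the use of stride-2 3×3 convolutions for downsampling are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BottomUpPath(nn.Module):
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # Stride-2 convs downsample the finer level; 3x3 convs smooth the fusion.
        self.downsamples = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1))
        self.smooths = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, pyramid):  # pyramid: [p2, p3, p4, p5], fine to coarse
        outs = [pyramid[0]]  # n2 = p2
        for i, (down, smooth) in enumerate(zip(self.downsamples, self.smooths)):
            # Downsample the finer output and add it to the next coarser level.
            outs.append(smooth(pyramid[i + 1] + down(outs[-1])))
        return outs  # [n2, n3, n4, n5]

pyramid = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
n_levels = BottomUpPath()(pyramid)
print([n.shape[-1] for n in n_levels])  # [64, 32, 16, 8]
```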

  Compared with using FPN alone, the bottom-up feature propagation in PAN shortens the information path from low-level features to the top of the pyramid, and can achieve better detection and segmentation results at a modest extra cost. Its feature fusion likewise preserves the detail carried in the high-resolution feature maps, thereby improving accuracy.
[Figure omitted: PAN architecture; region (b) in the original PANet figure is the extra bottom-up path.]

The difference


  The main difference between FPN and PAN is the direction of feature propagation: PAN has one more bottom-up path than FPN. With FPN alone, low-level detail must pass through many backbone layers before it can influence the top of the pyramid, so part of that detail is lost; for scenes that require high-precision detection, FPN may therefore perform worse than PAN. PAN's extra bottom-up path shortens this route and retains more detail, but it also increases the amount of computation. Implementations also differ in how two levels are fused: element-wise summation keeps the channel count fixed, while concatenation preserves both inputs at the cost of more channels and computation.
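  The two fusion styles mentioned above can be contrasted in a few lines (shapes are assumed for illustration):

```python
import torch
import torch.nn as nn

a = torch.randn(1, 256, 32, 32)
b = torch.randn(1, 256, 32, 32)

summed = a + b                             # 256 channels, no extra parameters
concat = torch.cat([a, b], dim=1)          # 512 channels
reduced = nn.Conv2d(512, 256, 1)(concat)   # 1x1 conv restores 256 channels
print(summed.shape, concat.shape, reduced.shape)
```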

Reference and pictures from

CSDN blogger "Sisyphus", CC 4.0 BY-SA: https://blog.csdn.net/flyfish1986/article/details/110520667
