PAN (Pyramid Attention Network for Semantic Segmentation) paper explained

The paper Pyramid Attention Network for Semantic Segmentation proposes PAN for semantic segmentation; the network follows an encoder-decoder (U-shaped) structure.

Background

In an encoder-decoder structure such as FCN, encoding into high-level features costs spatial resolution, so the original texture information is lost.
PSPNet and DeepLab tackle this with spatial pyramid pooling and atrous (dilated) convolution (ASPP).
However, ASPP tends to produce grid artifacts, and spatial pyramid pooling loses pixel-level localization information.
Drawing inspiration from SENet and ParseNet, the authors instead extract pixel-level attention information from the high-level features.

PAN consists of two modules: FPA (Feature Pyramid Attention) and GAU (Global Attention Upsample).
FPA sits at the junction between the encoder and the decoder; its role is to enlarge the receptive field and to help distinguish small objects.
GAU plays the role of the decoder upsampling that follows an FCN-style encoder; it also extracts attention information from the high-level features, and its computational cost is modest.

Related work

PAN borrows from encoder-decoder and attention designs, and it also draws on the spatial pyramid structure of PSPNet,
so the related work covers encoder-decoder networks, global context attention, and spatial pyramids.

Encoder-decoder: the key idea is to combine the features of adjacent stages, but global feature information is not taken into account.
Global Context Attention: originating from ParseNet, a global branch is added to enlarge the receptive field and strengthen the consistency of pixel-wise classification.
DFN places a global pooling branch at the top of its U-shape, turning the U-shape into a V-shape. The authors of PAN likewise add global average pooling to the decoder branch to select discriminative features.
Spatial Pyramid: used to extract multi-scale information; spatial pyramid pooling suits objects of different scales. The PSPNet and DeepLab series extend global pooling into spatial pyramid pooling and ASPP. The results are good, but the computation is heavy.

PAN

PAN consists of FPA and GAU; the modules are shown in the figure below, and the backbone is ResNet-101.
FPA sits at the turning point between the encoder and the decoder.
[Figure: overall PAN architecture]

FPA

The purpose of FPA is to provide pixel-wise attention for the high-level CNN features.
In recent semantic segmentation work, pyramid structures can extract features at different scales and enlarge the receptive field, but they lack global information (there is no channel selection mechanism).
Conversely, if only a channel-attention vector is used, multi-scale features cannot be extracted and pixel-wise information is missing.

The authors therefore combine pixel-wise attention with multi-scale features.
The module fuses features at three different scales in a U-shaped structure. To extract features at different scales, the pyramid uses 3x3, 5x5, and 7x7 convolutional layers. Because the module operates on high-level feature maps, which are usually small, the larger kernels do not add much computation.

The input feature map produced by the CNN is then passed through a 1x1 convolution and multiplied pixel-wise with the pyramid output. This acts as pixel-wise attention while incorporating the multi-scale context.

The aforementioned global branch, implemented with global average pooling, is also added to the output feature.
The final structure is as follows.
[Figure: FPA module structure]

The authors note that channel reduction is performed before the multiplication, so FPA does not incur heavy computation the way PSPNet's pyramid pooling and ASPP do.
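
To make the data flow concrete, below is a minimal PyTorch sketch of an FPA-style module. The channel counts (a 2048-channel res5 input reduced to 256) and the use of stride-2 convolutions for the pyramid downsampling are assumptions for illustration; the paper's exact layer arrangement may differ slightly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPA(nn.Module):
    """Sketch of Feature Pyramid Attention: a 1x1 main branch, a U-shaped
    7x7/5x5/3x3 pyramid used as pixel-wise attention, and a global pooling branch."""
    def __init__(self, in_ch=2048, out_ch=256):
        super().__init__()
        self.main = nn.Conv2d(in_ch, out_ch, 1)       # main branch (channel reduction)
        self.gap_conv = nn.Conv2d(in_ch, out_ch, 1)   # global context branch
        # pyramid encoder: large kernels with stride-2 downsampling (assumed)
        self.down1 = nn.Conv2d(in_ch, out_ch, 7, stride=2, padding=3)
        self.down2 = nn.Conv2d(out_ch, out_ch, 5, stride=2, padding=2)
        self.down3 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
        # pyramid decoder: one conv per level before upsample-and-add
        self.lat1 = nn.Conv2d(out_ch, out_ch, 7, padding=3)
        self.lat2 = nn.Conv2d(out_ch, out_ch, 5, padding=2)
        self.lat3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        main = self.main(x)
        # U-shaped pyramid over the (small) high-level feature map
        d1 = F.relu(self.down1(x))
        d2 = F.relu(self.down2(d1))
        d3 = F.relu(self.down3(d2))
        p2 = self.lat2(d2) + F.interpolate(self.lat3(d3), size=d2.shape[2:],
                                           mode='bilinear', align_corners=False)
        p1 = self.lat1(d1) + F.interpolate(p2, size=d1.shape[2:],
                                           mode='bilinear', align_corners=False)
        attn = F.interpolate(p1, size=(h, w), mode='bilinear', align_corners=False)
        # pixel-wise attention on the main branch, plus the global pooling branch
        out = main * attn
        return out + self.gap_conv(F.adaptive_avg_pool2d(x, 1))  # broadcast over H x W
```

A quick shape check: FPA()(torch.randn(2, 2048, 32, 32)) returns a 2 x 256 x 32 x 32 tensor, i.e. the spatial size of the high-level feature map is preserved while the channels are reduced.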

GAU

This module belongs to the decoder. PSPNet and DeepLab use bilinear interpolation for upsampling, which can be regarded as a very simple decoder.
A typical encoder-decoder network instead combines features of different scales and gradually recovers object boundaries in the decoder, but such networks are usually complex and computationally expensive.

Recent studies have shown that combining a CNN with a pyramid structure improves results and strengthens category information.
The authors therefore use the high-level features, which carry this category information, to provide weights for the low-level features so that accurate details are selected.

GAU uses global average pooling to provide global context and to weight the low-level features so that category localization details are selected.
In detail, a 3x3 convolution is applied to the low-level features to reduce their channels (and thus the computation).
The high-level features are passed through a global average pooling layer and then a 1x1 convolution + BN + ReLU to obtain a weight vector. This weight vector is multiplied with the low-level output, and the product is added to the upsampled high-level features.
[Figure: GAU module structure]
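
The same computation can be written as a short PyTorch sketch. The channel counts and the 1x1 projection applied to the high-level features before the final addition are assumptions for illustration (in PAN the high-level input to each GAU already has the decoder's channel width).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Sketch of Global Attention Upsample: a channel weight vector from the
    high-level features re-weights the low-level features before fusion."""
    def __init__(self, low_ch, high_ch, out_ch=256):
        super().__init__()
        # 3x3 conv on the low-level features to reduce channels
        self.low_conv = nn.Sequential(
            nn.Conv2d(low_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # global average pooling + 1x1 conv + BN + ReLU -> weight vector
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # assumed 1x1 projection so the high-level features match out_ch
        self.high_proj = nn.Conv2d(high_ch, out_ch, 1)

    def forward(self, low, high):
        low = self.low_conv(low)                      # reduced low-level features
        w = self.weight(high)                         # (N, out_ch, 1, 1) weight vector
        high_up = F.interpolate(self.high_proj(high), size=low.shape[2:],
                                mode='bilinear', align_corners=False)
        return low * w + high_up                      # weighted low-level + upsampled high-level
```

In the full decoder, a GAU is applied at each stage: the high-level input comes from the FPA output (or the previous GAU), and the low-level input comes from the corresponding encoder stage.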

Network structure

The network structure was already shown in the PAN section; here it is again.
[Figure: overall PAN architecture]
Details:
Backbone: ResNet-101, pre-trained on ImageNet.
Feature maps are extracted with rate=2 dilated convolution in the res5b block, so their size is 1/16 of the input image (similar to DeepLabv3+).
The 7x7 convolution in the ResNet-101 stem is replaced with three 3x3 convolutions (similar to PSPNet).
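
These backbone changes can be approximated with torchvision as sketched below. The replace_stride_with_dilation argument requires a reasonably recent torchvision; the three-conv stem is a hand-written substitute (kept at 64 output channels so it still matches the pretrained bn1 and layer1) and is randomly initialized, so this is illustrative rather than the authors' exact setup.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Dilate the last stage (res5) so the output stride is 16 instead of 32,
# similar to DeepLabv3+.
backbone = resnet101(weights="IMAGENET1K_V1",
                     replace_stride_with_dilation=[False, False, True])

# Replace the 7x7 stem convolution with three 3x3 convolutions (PSPNet-style stem).
backbone.conv1 = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
)

# Sanity check: a 512x512 input should yield a 1/16-resolution feature map.
stem_to_res5 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                             backbone.maxpool, backbone.layer1, backbone.layer2,
                             backbone.layer3, backbone.layer4)
with torch.no_grad():
    feat = stem_to_res5(torch.randn(1, 3, 512, 512))
print(feat.shape)  # torch.Size([1, 2048, 32, 32])
```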

Training details:
[Figure: training details]

The augmented PASCAL VOC dataset is used for training; for how the dataset is used, refer to these GitHub implementations:
PyTorch object segmentation
PyTorch semantic segmentation

Original post: blog.csdn.net/level_code/article/details/130821292