Figure: structure diagram of the original non-local block
1. Abstract
Problems with the standard non-local block:
- Excessive computation
- Very high GPU memory consumption
The authors propose an asymmetric non-local neural network for semantic segmentation with two prominent components: an asymmetric pyramid non-local block (APNB), which greatly reduces computation and memory consumption, and an asymmetric fusion non-local block (AFNB).
2. Introduction
Previous research has shown that fully exploiting long-range dependencies can improve performance.
For a standard non-local block, as long as the outputs of the key branch and the value branch keep the same size as each other, the output size of the non-local block stays unchanged. With this in mind, if we sample only a few representative points from the key and value branches, the time complexity can be greatly reduced without sacrificing performance; that is, the N in the figure is replaced by S (S ≪ N).
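A quick back-of-envelope calculation shows how much the similarity matrix shrinks. The 96×96 feature-map size here is a hypothetical example (not from the paper); S = 110 is the anchor count used later in Section 3.2.

```python
# Compare similarity-matrix sizes for a hypothetical 96x96 feature map,
# with S = 110 sampled anchor points.
H, W = 96, 96
N = H * W              # total spatial positions
S = 110                # sampled anchors (S << N)

full = N * N           # entries in the full N x N similarity matrix
asym = N * S           # entries in the asymmetric N x S matrix
print(full, asym, full / asym)   # ~84x fewer entries in this example
```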
3. Asymmetric Non-local Neural Network
3.1 Revisiting Non-local Block
- Given an input feature X ∈ R^(C×H×W), three 1×1 convolutions W_φ, W_θ and W_γ transform X into φ ∈ R^(Ĉ×H×W), θ ∈ R^(Ĉ×H×W) and γ ∈ R^(Ĉ×H×W).
- Flatten each to size Ĉ × N, where N = H · W is the total number of spatial positions. Compute the similarity matrix V = φ^T θ ∈ R^(N×N).
- Normalize V with a function f, which can take the form of softmax, rescaling, or none; denote the normalized matrix V̄.
- For each position i in γ, the output of the attention layer is o_i = Σ_j v̄_ij γ_j, i.e. O = V̄ γ^T ∈ R^(N×Ĉ).
- The final output is Y = W_o(O^T) + X, where W_o, also implemented as a 1×1 convolution, acts as a weighting parameter and restores the channel size from Ĉ to C before the residual addition of the original input X.
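The steps above can be sketched in NumPy. The 1×1 convolutions W_φ, W_θ, W_γ, W_o are modeled as plain matrix multiplies on the channel dimension; the shapes and names follow the notation above, and the random weights are placeholders, not trained parameters.

```python
import numpy as np

# Minimal sketch of a standard non-local block on a flattened feature map.
rng = np.random.default_rng(0)
C, C_hat, H, W = 8, 4, 6, 6
N = H * W                                   # N = H * W spatial positions

X = rng.standard_normal((C, N))             # input, flattened to C x N
W_phi, W_theta, W_gamma = (rng.standard_normal((C_hat, C)) for _ in range(3))
W_o = rng.standard_normal((C, C_hat))

phi, theta, gamma = W_phi @ X, W_theta @ X, W_gamma @ X   # each C_hat x N

V = phi.T @ theta                           # N x N similarity matrix
V = V - V.max(axis=1, keepdims=True)        # softmax as the normalization f
V_bar = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)

O = V_bar @ gamma.T                         # N x C_hat attention output
Y = W_o @ O.T + X                           # restore channels, residual add
print(Y.shape)                              # (8, 36): same size as X
```

The N × N matrix `V` is the bottleneck the next section removes.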
3.2. Asymmetric Pyramid Non-local Block
Non-local networks effectively capture the long-range dependencies that are critical for semantic segmentation, but the standard non-local operation is very time-consuming and memory-hungry. Clearly, the large matrix multiplications are the main source of this inefficiency.
If we change one N in the matrix multiplication to a different number S (S ≪ N), the output keeps the same size. Changing N to S amounts to sampling S representative points from θ and γ instead of using all spatial points, as shown in Figure 1. The computational complexity can therefore be greatly reduced.
Concretely:
- We add sampling modules P_θ and P_γ after θ and γ to sample several sparse anchor points, denoted θ_P ∈ R^(Ĉ×S) and γ_P ∈ R^(Ĉ×S) respectively, where S is the number of sampled anchor points.
- Compute the similarity matrix V_P between φ and the anchor points θ_P: V_P = φ^T θ_P. Note that V_P is an asymmetric matrix of size N × S. V_P is then normalized with the same function f as in the standard non-local block, giving the unified similarity matrix V̄_P.
- The attention output is O_P = V̄_P γ_P^T ∈ R^(N×Ĉ). This asymmetric matrix multiplication reduces the time complexity. However, when S is small it is hard to guarantee that performance does not drop too much.
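The asymmetric attention step can be sketched as follows. Here θ and γ are reduced to S anchor columns by simple random sampling, a deliberately crude stand-in for the pyramid pooling sampler the paper actually uses; what matters is that the similarity matrix becomes N × S while the output keeps its original size.

```python
import numpy as np

# Sketch of the asymmetric attention step with S sampled anchor points.
rng = np.random.default_rng(0)
C_hat, N, S = 4, 36, 10

phi = rng.standard_normal((C_hat, N))
theta = rng.standard_normal((C_hat, N))
gamma = rng.standard_normal((C_hat, N))

idx = rng.choice(N, size=S, replace=False)        # placeholder for P_theta / P_gamma
theta_P, gamma_P = theta[:, idx], gamma[:, idx]   # C_hat x S anchors

V_P = phi.T @ theta_P                       # asymmetric N x S similarity
V_P = np.exp(V_P - V_P.max(axis=1, keepdims=True))
V_P_bar = V_P / V_P.sum(axis=1, keepdims=True)

O_P = V_P_bar @ gamma_P.T                   # N x C_hat, same size as before
print(O_P.shape)                            # (36, 4)
```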
To solve this problem, we embed pyramid pooling in the non-local block to enhance the global representations while reducing the computational overhead.
By doing so, we arrive at the final form of the asymmetric pyramid non-local block (APNB), as shown in Figure 3. The important change is adding a spatial pyramid pooling module after θ and γ to sample the anchors. The sampling process is illustrated in Figure 4: several pooling layers are applied after θ (or γ), and the four pooled results are flattened and concatenated to serve as the input of the next layer.
We denote the spatial pyramid pooling modules as P_θ^n and P_γ^n, where the superscript n is the width (and height) of the pooling layer's output (empirically the width equals the height). In our model we set n ∈ {1, 3, 6, 8}, so the total number of anchor points is S = 1² + 3² + 6² + 8² = 110. The spatial pyramid pooling provides sufficient feature statistics about the global scene semantics to compensate for the potential performance drop caused by the reduced computation.
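The anchor sampling can be sketched as follows: average-pool the Ĉ × H × W map to n × n for each n ∈ {1, 3, 6, 8}, flatten, and concatenate. The `adaptive_avg_pool` helper is an illustrative stand-in for the pooling layers in Figure 4 (and the 24×24 input is an arbitrary example size).

```python
import numpy as np

def adaptive_avg_pool(x, n):
    """Average-pool a (C, H, W) array to (C, n, n) over near-equal bins."""
    C, H, W = x.shape
    hs = np.linspace(0, H, n + 1).astype(int)
    ws = np.linspace(0, W, n + 1).astype(int)
    out = np.empty((C, n, n))
    for i in range(n):
        for j in range(n):
            out[:, i, j] = x[:, hs[i]:hs[i+1], ws[j]:ws[j+1]].mean(axis=(1, 2))
    return out

theta = np.random.default_rng(0).standard_normal((4, 24, 24))
# Flatten and concatenate the four pooled results along the spatial axis.
anchors = np.concatenate(
    [adaptive_avg_pool(theta, n).reshape(4, -1) for n in (1, 3, 6, 8)], axis=1)
print(anchors.shape)   # (4, 110): S = 1 + 9 + 36 + 64 = 110 anchor points
```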
3.3. Asymmetric Fusion Non-local Block
The standard non-local block has only one input source, while the fusion non-local block (FNB) has two: a high-level feature map X_h ∈ R^(C_h×N_h) and a low-level feature map X_l ∈ R^(C_l×N_l).
Similarly, 1×1 convolutions transform X_h into a query φ_h ∈ R^(Ĉ×N_h), and X_l into a key θ_l ∈ R^(Ĉ×N_l) and a value γ_l ∈ R^(Ĉ×N_l).
Then the similarity matrix between φ_h and θ_l is computed by matrix multiplication, V_F = φ_h^T θ_l ∈ R^(N_h×N_l), and
V_F is normalized to obtain the unified similarity matrix V̄_F.
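The fusion step can be sketched in the same NumPy style: the query comes from the high-level features X_h, while key and value come from the low-level features X_l, so the output matches the size of X_h. The channel sizes and random projections below are illustrative assumptions.

```python
import numpy as np

# Sketch of the fusion non-local block (FNB) with two input sources.
rng = np.random.default_rng(0)
C_h, C_l, C_hat, N_h, N_l = 8, 6, 4, 16, 64

X_h = rng.standard_normal((C_h, N_h))               # high-level features
X_l = rng.standard_normal((C_l, N_l))               # low-level features

phi_h = rng.standard_normal((C_hat, C_h)) @ X_h     # query from X_h
theta_l = rng.standard_normal((C_hat, C_l)) @ X_l   # key from X_l
gamma_l = rng.standard_normal((C_hat, C_l)) @ X_l   # value from X_l

V_F = phi_h.T @ theta_l                     # N_h x N_l similarity
V_F = np.exp(V_F - V_F.max(axis=1, keepdims=True))
V_F_bar = V_F / V_F.sum(axis=1, keepdims=True)

O_F = V_F_bar @ gamma_l.T                   # N_h x C_hat attention output
Y_F = rng.standard_normal((C_h, C_hat)) @ O_F.T + X_h   # W_o + residual
print(Y_F.shape)                            # (8, 16): same size as X_h
```

In AFNB the same pyramid sampling as in APNB can be applied to θ_l and γ_l to keep the similarity matrix small.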
3.4. Network Architecture
Our backbone is ResNet-101 with the last two downsampling operations removed and dilated convolutions used instead, so that the feature maps of the last two stages keep a larger fraction of the input resolution. We use AFNB to fuse the features of Stage4 and Stage5. The fused features are then concatenated with the feature map after Stage5, to guard against the case where AFNB fails to produce accurately enhanced features.