[Self-attention neural network] Mask Transfiner: An Interpretation of the Paper

This article interprets a CVPR 2022 paper. Following the usual practice, the original paper and source code are linked first:

Original paper: https://arxiv.org/pdf/2111.13673.pdf
Source code: https://github.com/SysCV/transfiner

1. Overview

        Although traditional two-stage networks such as Mask R-CNN achieve good results in instance segmentation, their masks are still relatively coarse. Mask Transfiner decomposes image regions into a quadtree, and the network processes only the detected error-prone tree nodes and self-corrects their errors. This allows Mask Transfiner to predict high-precision instance masks at low computational cost.

 2. Related concepts

        In instance segmentation, most pixel classification errors can be attributed to the loss of spatial resolution caused by downsampling, which leads to low-resolution masks at object edges. To address this problem, the paper introduces two concepts: incoherent regions and the quadtree.

        1. Incoherent regions (information loss regions)

                To identify these regions, the mask itself is downsampled and then upsampled to simulate the information loss caused by downsampling in the network. In the paper's illustration, the original mask is downsampled by a factor of 2 and then upsampled by a factor of 2; the orange pixels (inside the red box on the original image) are the misclassified points. Experiments show that most classification errors fall inside these incoherent regions.
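                As a rough illustration, a ground-truth incoherent region can be derived from a binary mask by exactly this down/up-sampling round trip. The sketch below assumes PyTorch tensors; the function name and the choice of bilinear interpolation are mine, not taken from the official repository.

```python
import torch
import torch.nn.functional as F

def incoherent_region_gt(mask: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Simulate information loss: downsample a binary mask by `factor`,
    upsample it back, and mark pixels whose label changed.

    mask: (H, W) binary tensor. Returns an (H, W) boolean tensor of
    incoherent (error-prone) pixels. Hypothetical helper, not the
    official implementation."""
    m = mask.float()[None, None]  # (1, 1, H, W)
    down = F.interpolate(m, scale_factor=1.0 / factor, mode="bilinear", align_corners=False)
    up = F.interpolate(down, size=m.shape[-2:], mode="bilinear", align_corners=False)
    # Pixels whose binarized label flips after the round trip are "incoherent".
    return ((up > 0.5) != (m > 0.5)).squeeze(0).squeeze(0)
```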

                Detection of incoherent regions: the lightweight detection module proposed in the paper (shown in the figure below) can effectively detect incoherent regions on the multi-scale feature pyramid.

                        The smallest (coarsest) RoI features and the predicted coarse object mask are concatenated (concat operation) as input.

                        ① A fully convolutional network (FCN, consisting of four 3x3 convolutions) followed by a binary classifier predicts the coarsest incoherent-region mask.

                        ② The detected low-resolution mask is upsampled (via a 1x1 convolution) and fused with the higher-resolution features of the adjacent level.
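                        A minimal PyTorch sketch of such a detection module is given below. The class name, channel widths, and fusion details are assumptions for illustration; only the overall structure (four 3x3 convolutions, a binary classifier, a 1x1 convolution and upsampling before fusion) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IncoherentRegionDetector(nn.Module):
    """Sketch of the lightweight detection module: four 3x3 convs (FCN)
    plus a binary classifier on the coarsest level, then upsampling and
    fusion with the next, higher-resolution level. Sizes are illustrative,
    not the official configuration."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Input: coarsest RoI features concatenated with the coarse mask (1 channel).
        self.fcn = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(feat_dim, 1, 1)   # binary incoherence logits
        self.fuse = nn.Conv2d(1, feat_dim, 1)         # 1x1 conv before fusing upward

    def forward(self, coarse_feat, coarse_mask, finer_feat):
        x = self.fcn(torch.cat([coarse_feat, coarse_mask], dim=1))
        low_res_logits = self.classifier(x)           # coarsest incoherent-region mask
        up = F.interpolate(low_res_logits, size=finer_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        fused = finer_feat + self.fuse(up)            # fuse with adjacent higher-res features
        return low_res_logits, fused
```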

        2. Quadtree

                In this paper, a quadtree is used to refine the incoherent regions of the image, linking the predictions on different levels of the feature pyramid, as shown below. Based on the detected incoherent points, a multi-level quadtree is constructed: the points detected on the highest (coarsest) feature map serve as root nodes, and each root node is mapped to four subdivided quadrants on the lower-level feature map, which has higher resolution and more local detail.
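                A toy helper makes the parent-to-children mapping concrete; the function is hypothetical and only illustrates how one coarse incoherent point expands into four quadrants on the next, 2x finer level.

```python
def expand_quadtree_nodes(coarse_points):
    """Map each detected incoherent point (y, x) on a coarse level to its
    four child quadrants on the next, 2x higher-resolution level.
    Hypothetical helper illustrating the quadtree construction."""
    children = []
    for y, x in coarse_points:
        children.extend([(2 * y, 2 * x), (2 * y, 2 * x + 1),
                         (2 * y + 1, 2 * x), (2 * y + 1, 2 * x + 1)])
    return children

# Example: one root node at (3, 5) expands to its four quadrants:
# expand_quadtree_nodes([(3, 5)]) -> [(6, 10), (6, 11), (7, 10), (7, 11)]
```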


 3. Network structure

        The network structure of Mask Transfiner is shown in the figure below (the components belonging to the overall network framework are marked with red boxes):

         The network is built on a hierarchical FPN (Feature Pyramid Network). Mask Transfiner does not operate on a single-level FPN feature map; instead, it takes the sparse feature points detected in the incoherent regions of the RoI feature pyramid as its input sequence and outputs the corresponding segmentation labels.

        1. RoI pyramid

                This paper uses feature maps P2 to P5 of the hierarchical features extracted by the backbone network. Based on the instance proposals given by the object detector, RoI features are extracted from three different pyramid levels \{P_i, P_{i-1}, P_{i-2}\} to construct the RoI pyramid.

                The initial level i is computed as i = \left\lfloor i_0 + \log_2(\sqrt{WH}/224) \right\rfloor, where i_0 = 4, and W and H are the width and height of the RoI.
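                A small worked example of this level assignment (the clamping to P2..P5 is my assumption, following common FPN practice):

```python
import math

def initial_roi_level(w: float, h: float, i0: int = 4,
                      k_min: int = 2, k_max: int = 5) -> int:
    """Compute i = floor(i0 + log2(sqrt(W*H)/224)) for an RoI of width W and
    height H, clamped to the FPN levels P2..P5 (clamping is an assumption)."""
    i = math.floor(i0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, i))

# Example: a 224x224 RoI gives i = floor(4 + log2(224/224)) = 4,
# so the RoI pyramid is built from levels {P4, P3, P2}.
```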

                Higher pyramid levels carry more context and semantic information, while lower (higher-resolution) levels carry more local detail.

        2. Input node sequence

                The sequence consists of incoherent nodes from three different levels of the quadtree. Its size is C×N, where C is the feature channel dimension and N is the total number of nodes. The sequence is compressed and encoded by a Node Encoder.

        3. Node Encoder

                The node encoder encodes each node of the quadtree using the following four kinds of information:

                        ① Fine-grained features extracted from the node's level of the FPN

                        ② Semantic information provided by the initial coarse mask prediction

                        ③ Relationship and distance information between nodes (encoded by relative positions within the RoI)

                        ④ The contextual information surrounding each node, together with the node's own features

                Specifically, context features are extracted from the 3x3 neighborhood of each node and compressed with a fully connected layer. As shown in the figure below, the fine-grained features, coarse segmentation cues, and context features are first fused through a fully connected layer, and the positional embedding is then added to the result.
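                A compact sketch of such a node encoder is shown below. The dimensions, the linear positional embedding, and the way the coarse cue is represented are assumptions; only the fuse-then-add-position structure follows the description above.

```python
import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Sketch of the node encoder: fuse fine-grained features, coarse
    segmentation cues and 3x3-neighborhood context with a fully connected
    layer, then add a relative positional embedding. Hypothetical sizes."""

    def __init__(self, feat_dim: int = 256, ctx_dim: int = 256 * 9, coarse_dim: int = 1):
        super().__init__()
        self.ctx_compress = nn.Linear(ctx_dim, feat_dim)             # compress 3x3 context
        self.fuse = nn.Linear(feat_dim + feat_dim + coarse_dim, feat_dim)
        self.pos_embed = nn.Linear(2, feat_dim)                      # relative (y, x) within the RoI

    def forward(self, fine_feat, coarse_cue, ctx_feat, rel_pos):
        # fine_feat: (N, feat_dim), coarse_cue: (N, 1),
        # ctx_feat: (N, ctx_dim), rel_pos: (N, 2) normalized coordinates.
        ctx = self.ctx_compress(ctx_feat)
        fused = self.fuse(torch.cat([fine_feat, ctx, coarse_cue], dim=-1))
        return fused + self.pos_embed(rel_pos)                       # (N, feat_dim) node queries
```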

        4. Sequence Encoder and Pixel Decoder

                 Each sequence encoder layer contains a multi-head attention module and a fully connected feed-forward network.

                 The Pixel Decoder is a small two-layer MLP (Multilayer Perceptron) that decodes each node's output query and predicts the final mask label.
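                 Putting the two together, a minimal PyTorch sketch of the refinement head could look as follows; the layer counts, widths, and the use of torch.nn's standard transformer encoder are illustrative choices, not the official configuration.

```python
import torch.nn as nn

class TransfinerHead(nn.Module):
    """Sketch of the refinement head: a stack of standard transformer encoder
    layers (multi-head attention + feed-forward) over the node sequence,
    followed by a two-layer MLP pixel decoder predicting a label per node."""

    def __init__(self, feat_dim: int = 256, num_layers: int = 3, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.sequence_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.pixel_decoder = nn.Sequential(       # small two-layer MLP
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 1),               # per-node mask logit
        )

    def forward(self, node_queries):
        # node_queries: (B, N, feat_dim) encoded incoherent nodes.
        x = self.sequence_encoder(node_queries)
        return self.pixel_decoder(x).squeeze(-1)  # (B, N) refined mask logits
```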

4. Loss function

        Based on the quadtree, the loss function used in this paper is:

                L=\lambda_1L_{Detect}+\lambda_2L_{Coarse}+\lambda_3L_{Refine}+\lambda_4L_{Inc}

                        Here, L_{Refine} denotes the L1 loss between the predicted labels of the incoherent points and the ground truth; L_{Inc} is the cross-entropy loss for detecting incoherent regions; L_{Detect} includes the localization and classification losses of the detector; and L_{Coarse} is the loss of the initial coarse segmentation prediction.
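                        A schematic of how these terms could be combined (the dictionary keys, the choice of binary cross-entropy for the coarse mask, and the weight values are assumptions; only the weighted sum follows the formula above):

```python
import torch.nn.functional as F

def transfiner_loss(pred, target, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of the combined objective. `pred`/`target` are dicts holding
    the intermediate outputs; keys and weights are hypothetical."""
    l_detect = pred["detect_loss"]                                     # detector cls + box losses
    l_coarse = F.binary_cross_entropy_with_logits(pred["coarse_mask"],
                                                  target["coarse_mask"])
    l_refine = F.l1_loss(pred["node_labels"], target["node_labels"])   # incoherent-point labels
    l_inc = F.binary_cross_entropy_with_logits(pred["inc_logits"],
                                               target["inc_mask"])     # incoherent-region detection
    l1, l2, l3, l4 = lambdas
    return l1 * l_detect + l2 * l_coarse + l3 * l_refine + l4 * l_inc
```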


Original post: blog.csdn.net/weixin_37878740/article/details/130329105