PSANet: Point-wise Spatial Attention Network for Scene Parsing

0. Summary

        Due to the physical design of convolutional filters, information in convolutional neural networks flows only within local neighborhoods, which limits the overall understanding of complex scenes. In this paper, we propose the Point-wise Spatial Attention Network (PSANet) to relax this local-neighborhood constraint. Each position on the feature map is connected to all other positions through an adaptively learned attention mask. Moreover, we enable bidirectional information propagation for scene parsing: information at other positions can help the prediction at the current position and, conversely, information at the current position can be distributed to other positions to help their predictions. Our proposed approach achieves top performance on multiple competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012, and Cityscapes, demonstrating its effectiveness and generality.

Keywords: point-wise spatial attention, bidirectional information flow, adaptive context aggregation, scene parsing, semantic segmentation.

1. Introduction

        Scene parsing, also known as semantic segmentation, is a fundamental and challenging problem in computer vision that assigns a category label to each pixel. It is a key step toward visual scene understanding and plays an important role in applications such as autonomous driving and robot navigation. The development of powerful deep convolutional neural networks (CNNs) has brought significant progress in scene parsing [26, 1, 29, 4, 5, 45]. However, due to the design of CNN structures, their receptive fields are limited to local regions [47, 27]. This limited receptive field seriously harms fully convolutional network (FCN) based scene parsing systems because of an insufficient understanding of surrounding contextual information.

        To address this problem, especially to exploit long-range dependencies, several extensions have been developed. [4, 42] proposed aggregating contextual information through dilated convolution, which introduces dilation into the classic compact convolution kernel to enlarge the receptive field. Contextual information can also be aggregated through pooling operations: the global pooling module in ParseNet [24], the atrous spatial pyramid pooling (ASPP) module with different dilation rates in DeepLab [5], and the pyramid pooling module (PPM) over different regions in PSPNet [45] each extract contextual information in their own fixed manner. Apart from these extensions, conditional random fields (CRFs) [4, 46, 2, 3] and Markov random fields (MRFs) [25] have also been utilized, and ReSeg [38] introduces recurrent neural networks (RNNs) for their ability to capture long-range dependencies. However, the dilated-convolution-based [4, 42] and pooling-based [24, 5, 45] extensions exploit homogeneous contextual dependencies for all image regions in a non-adaptive manner, ignoring the differences in local representation and contextual dependencies across categories. The CRF/MRF-based [4, 46, 2, 3, 25] and RNN-based [38] extensions are less efficient than the CNN-based framework.

        In this paper, we propose the Point-wise Spatial Attention Network (PSANet) to aggregate long-range contextual information in a flexible and adaptive manner. Each position in the feature map is connected to all other positions through adaptively predicted attention maps, thereby collecting information both near and far. Furthermore, we design a bidirectional information propagation path for a full understanding of complex scenes. Each position collects information from all other positions to aid its own prediction and, conversely, information at each position can be distributed globally to aid the predictions of all other positions. Finally, the bidirectionally aggregated contextual information is fused with local features to form the final representation of the complex scene. Our PSANet achieves top performance on three competitive semantic segmentation datasets, namely ADE20K [48], PASCAL VOC 2012 [9], and Cityscapes [8]. We believe that both the proposed point-wise spatial attention module and the bidirectional information propagation paradigm can benefit other dense prediction tasks. We provide all implementation details and make the code and trained models publicly available to the community. Our main contributions are three-fold:

- Long-range context aggregation for scene parsing is achieved through learned point-wise, position-sensitive context dependencies together with a bidirectional information propagation paradigm.
- We propose the Point-wise Spatial Attention Network (PSANet) to aggregate contextual information from all positions in the feature map, with each position connected to all others via an adaptively learned attention map.
- PSANet achieves top performance on various competitive scene parsing datasets, demonstrating its effectiveness and generality.

2. Related work

        Recently, CNN-based methods [26, 4, 5, 42, 45, 6] have achieved significant success in scene parsing and semantic segmentation. FCN [26] was the first to replace the fully connected layers of classification networks with convolutional layers for semantic segmentation. DeconvNet [29] and SegNet [1] adopt encoder-decoder structures that utilize low-level information to help refine segmentation masks. Dilated convolution [4, 42] performs convolution with holes on feature maps to expand the network's receptive field. UNet [33] concatenates low-level outputs with high-level ones for information fusion. DeepLab [4] and CRF-RNN [46] utilize CRFs for structured prediction in scene parsing, while DPN [25] utilizes MRFs for semantic segmentation. LRR [11] and RefineNet [21] adopt step-wise reconstruction and refinement to obtain parsing results. PSPNet [45] achieves high performance through its pyramid pooling strategy. There are also efficient frameworks such as ENet [30] and ICNet [44] for real-time applications like autonomous driving.

        Contextual information aggregation. Contextual information plays a key role in image understanding. Dilated convolution [4, 42] inserts dilation into the classic convolution kernel to expand the receptive field of CNNs. Global pooling is widely adopted in various classification backbones [19, 35, 36, 13, 14] to obtain a global contextual representation. Liu et al. proposed ParseNet [24], which uses global pooling to aggregate contextual information for scene parsing. Chen et al. developed the ASPP [5] module and Zhao et al. the PPM [45] module to obtain contextual information from different regions. Visin et al. proposed ReSeg [38], which uses RNNs to capture long-range context dependencies.

        Attention mechanism. The attention mechanism is widely used in neural networks. Mnih et al. [28] learned an attention model that adaptively selects a sequence of regions or locations for processing. Chen et al. [7] learned several attention masks to fuse feature maps or predictions from different branches. Vaswani et al. [37] learned a self-attention model for machine translation. Wang et al. [40] obtained attention masks by computing the correlation matrix between every pair of spatial points in the feature map. Our point-wise attention masks differ from the above studies: the masks learned by our PSA module are adaptive and sensitive to both location and category information, and the PSA module learns to adaptively aggregate contextual information for each point.

3. Framework

        To capture contextual information, especially over long ranges, information aggregation is essential for scene parsing [24, 5, 45, 38]. In this paper, we formalize the information aggregation step as a flow of information and propose to adaptively learn point-wise global attention maps for each position from two perspectives, aggregating contextual information over the entire feature map.

3.1. Formulation

We model general feature learning or information aggregation as

z_i = (1/N) ∑_{∀j∈Ω(i)} F(x_i, x_j, Δ_ij) · x_j        (1)

where z_i is the newly aggregated feature at position i, and x_i is the feature at position i in the input feature map X. ∀j∈Ω(i) enumerates all positions in the region of interest associated with i, and Δ_ij denotes the relative position of i and j. F(x_i, x_j, Δ_ij) can be any function or learned operation and represents the flow of information from j to i; because it takes the relative position Δ_ij into account, it is sensitive to different relative positions. N is a normalization factor. To simplify the formulation, we design a separate function for each relative position, and Equation (1) is updated to

z_i = (1/N) ∑_{∀j∈Ω(i)} F_Δij(x_i, x_j) · x_j        (2)

where {F_Δij} is a set of position-specific functions modeling the information flow from position j to position i. Note that F_Δij(·,·) takes both the source and the target information as input; since a feature map contains many positions, the number of (x_i, x_j) combinations is very large. We therefore simplify the formulation with an approximation. First, we simplify F_Δij(·,·) to

F_Δij(x_i, x_j) ≈ F_Δij(x_i)        (3)

In this approximation, the information flow from j to i depends only on the semantic feature at the target position i and the relative position of i and j. Based on Equation (3), we rewrite Equation (2) as

z_i = (1/N) ∑_{∀j∈Ω(i)} F_Δij(x_i) · x_j        (4)

Symmetrically, the function can instead be simplified to

F_Δij(x_i, x_j) ≈ F_Δij(x_j)        (5)

in which the information flow from j to i depends only on the semantic feature at the source position j and the relative position of i and j, giving

z_i = (1/N) ∑_{∀j∈Ω(i)} F_Δij(x_j) · x_j        (6)

Finally, we decompose the function into a bidirectional information propagation path. Combining Equations (3) and (5), we obtain

z_i = (1/N) ∑_{∀j∈Ω(i)} F_Δij(x_i) · x_j + (1/N) ∑_{∀j∈Ω(i)} F_Δij(x_j) · x_j        (7)

In the first term, F_Δij(x_i) encodes how features at other positions help the prediction at position i: each position "collects" information from the others. In the second term, F_Δij(x_j) encodes how important the feature at one position is to the predictions at other positions: each position "distributes" its information. Figure 1 illustrates this bidirectional information propagation path, which enables the network to learn a more comprehensive representation, as supported by the evidence in our experimental section.

Figure 1. Schematic diagram of the bidirectional information propagation model. Each position both "collects" and "distributes" information for more comprehensive information propagation.

Specifically, our PSA module adaptively predicts the information flow over the entire feature map: it takes all positions of the feature map as Ω(i) and implements F_Δij(x_i) and F_Δij(x_j) with convolutional layers. Both F_Δij(x_i) and F_Δij(x_j) can be viewed as predicted attention values used to aggregate the features x_j. We further rewrite Equation (7) as

z_i = (1/N) ∑_{∀j} a^c_{i,j} · x_j + (1/N) ∑_{∀j} a^d_{i,j} · x_j        (8)

where a^c_{i,j} and a^d_{i,j} denote the predicted attention values in the point-wise attention maps A^c and A^d from the "collect" and "distribute" branches, respectively.
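        To make Equation (8) concrete, below is a minimal NumPy sketch of the bidirectional aggregation, assuming the two attention maps are already given (in the network they are predicted by the convolutional branches described in Section 3.3); all names and shapes are illustrative, not from the original implementation.

```python
import numpy as np

def psa_aggregate(x, attn_c, attn_d):
    """Bidirectional point-wise aggregation, Eq. (8).

    x:      (N, C) features, one row per spatial position (N = H*W).
    attn_c: (N, N) "collect" attention; attn_c[i, j] is a^c_{i,j}, the
            weight with which target position i gathers feature x_j.
    attn_d: (N, N) "distribute" attention; attn_d[i, j] is a^d_{i,j},
            the weight predicted at source j for pushing x_j toward i.
    Returns z of shape (N, C).
    """
    n = x.shape[0]
    z_collect = attn_c @ x / n      # first term of Eq. (8)
    z_distribute = attn_d @ x / n   # second term of Eq. (8)
    return z_collect + z_distribute

# Toy usage on a 4x4 feature map with 8 channels.
h, w, c = 4, 4, 8
x = np.random.randn(h * w, c)
attn_c = np.random.rand(h * w, h * w)
attn_d = np.random.rand(h * w, h * w)
z = psa_aggregate(x, attn_c, attn_d)   # (16, 8)
```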

3.2. Overview

        We show the framework of the PSA module in Figure 2. The PSA module takes a spatial feature map X as input, whose spatial size we denote as H×W. Through two branches, each consisting of several convolutional layers, a point-wise global attention map is generated for every position in the feature map. Following Equation (8), we aggregate the input feature map with these attention maps to produce new feature representations that incorporate long-range contextual information, namely Z^c from the "collect" branch and Z^d from the "distribute" branch. We concatenate Z^c and Z^d and apply a convolutional layer with batch normalization and activation for dimensionality reduction and feature fusion. We then concatenate the resulting global contextual features with the local representation X, followed by one or more convolutional layers with batch normalization and activation, producing the final feature map for the subsequent subnetworks. Note that all operations in the PSA module are differentiable, so it can be trained end-to-end jointly with the rest of the network, and it can be flexibly attached to any feature map in a network. By predicting contextual dependencies for each position, it adaptively aggregates suitable contextual information. In the following subsections, we detail how the two attention maps A^c and A^d are generated.

Figure 2. Architecture of the proposed PSA module
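        As a rough illustration of this data flow, here is a simplified PyTorch sketch of the PSA module. It is a sketch under several assumptions, not the authors' released Caffe implementation: each branch directly predicts an H·W-channel attention map per position (the over-completed (2H−1)×(2W−1) map and its cropping, described in Section 3.3, are omitted), and all channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class PSAModuleSketch(nn.Module):
    """Simplified sketch of the PSA data flow in Figure 2 (illustrative only)."""

    def __init__(self, c_in, c_mid, h, w):
        super().__init__()
        self.h, self.w = h, w

        def make_branch():
            return nn.Sequential(
                nn.Conv2d(c_in, c_mid, 1, bias=False),   # 1x1 conv: channel reduction
                nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, c_mid, 1, bias=False),  # 1x1 conv: feature adaption
                nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, h * w, 1),              # attention map per position
            )

        self.collect = make_branch()      # predicts A^c
        self.distribute = make_branch()   # predicts A^d
        self.fuse = nn.Sequential(        # dimensionality reduction + fusion
            nn.Conv2d(2 * c_in, c_in, 1, bias=False),
            nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        )

    def _aggregate(self, x, attn):
        # attn: (n, H*W, H, W); the channel dim indexes source positions j,
        # the flattened spatial dims index target positions i.
        n, c, h, w = x.shape
        a = attn.view(n, h * w, h * w)      # (n, j, i)
        feats = x.view(n, c, h * w)         # (n, C, j)
        z = torch.bmm(feats, a) / (h * w)   # z[:, :, i] = sum_j a[j, i] * x_j / N
        return z.view(n, c, h, w)

    def forward(self, x):
        zc = self._aggregate(x, self.collect(x))     # "collect" branch, Z^c
        zd = self._aggregate(x, self.distribute(x))  # "distribute" branch, Z^d
        z = self.fuse(torch.cat([zc, zd], dim=1))    # fuse global context
        # Concatenate with local features X; the full design applies further
        # conv + BN + activation layers afterwards, omitted here.
        return torch.cat([x, z], dim=1)

# Usage: a 2048-channel stage-5 feature map of spatial size 16x16.
psa = PSAModuleSketch(c_in=2048, c_mid=512, h=16, w=16)
out = psa(torch.randn(1, 2048, 16, 16))              # (1, 4096, 16, 16)
```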

3.3. Point-wise spatial attention

        Network structure. The PSA module first generates the two point-wise spatial attention maps A^c and A^d through two parallel branches. Although they represent different directions of information propagation, their network structures are identical. As shown in Figure 2, each branch first applies a 1×1 convolutional layer to reduce the number of channels of the input feature map X and thus the computational overhead (i.e., C2 < C1 in Figure 2), followed by another 1×1 convolutional layer for feature adaption. Both layers are followed by batch normalization and activation layers. Finally, one more convolutional layer generates the global attention map for every position.

        Rather than directly predicting a map of size H×W for each position i, we predict an over-completed map of size (2H−1)×(2W−1) that covers the whole input feature map. As shown in Figure 3, for each position i, the predicted map h_i can be reshaped into a spatial map centered at position i with 2H−1 rows and 2W−1 columns, of which only H×W values are useful for feature aggregation. In Figure 3, the valid region is highlighted with a dashed bounding box. With this instantiation, the filters used to predict attention values at different relative positions are different, which makes the network sensitive to relative position through its weights. An alternative instantiation that achieves the same goal is a fully connected layer mapping the input feature map to the pixel-wise attention maps, but this requires a huge number of parameters.

        Attention map generation. Based on the predicted over-completed maps H^c from the "collect" branch and H^d from the "distribute" branch, we further generate the attention maps A^c and A^d. In the "collect" branch, for each position i at the k-th row and l-th column, we predict how the current position is related to the other positions based on the feature at position i. Thus a^c_i corresponds to the region of H rows and W columns in h^c_i starting from the (H−k)-th row and the (W−l)-th column. Element-wise, the value at the s-th row and t-th column of the attention mask a^c_i is

a^c_{[k,l]}[s, t] = h^c_{[k,l]}[H − k + s, W − l + t],   ∀ s ∈ [0, H), ∀ t ∈ [0, W)

(see the code sketch below). This attention map helps collect informative content from other positions to benefit the prediction at the current position. On the other hand, we also distribute information at the current position to the others: at each position, we predict how important the current position is to the predictions at other positions. a^d_i is generated from h^d_i in the same way as a^c_i. This attention map helps distribute information for better predictions elsewhere. These two attention maps encode the contextual dependencies between different position pairs in a complementary way, enabling improved information propagation and better utilization of long-range context. The benefit of utilizing these two kinds of attention is demonstrated experimentally.

Figure 3. Illustration of point-wise spatial attention.

Figure 4. Network structure combining the PSA module with a ResNet-FCN backbone. Deep supervision is employed for better performance.
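        The cropping rule above can be sketched in a few lines of NumPy. Here positions are 0-indexed, whereas the formula above counts rows and columns from 1, and each over-completed map is assumed to be already reshaped to (2H−1)×(2W−1); the function name is illustrative.

```python
import numpy as np

def crop_valid_mask(h_map, k, l):
    """Crop the valid H x W attention mask for the position at row k,
    column l (0-indexed) from its over-completed (2H-1) x (2W-1) map,
    which is conceptually centered on that position."""
    H = (h_map.shape[0] + 1) // 2
    W = (h_map.shape[1] + 1) // 2
    return h_map[H - 1 - k : 2 * H - 1 - k,
                 W - 1 - l : 2 * W - 1 - l]

# The center value of the over-completed map always lands at (k, l) in
# the cropped mask, i.e. on the position itself.
H, W = 4, 5
h_map = np.random.randn(2 * H - 1, 2 * W - 1)
mask = crop_valid_mask(h_map, k=2, l=3)        # shape (4, 5)
assert mask[2, 3] == h_map[H - 1, W - 1]
```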

3.4. Combination of PSA module and FCN

        Our PSA module is flexible and can be attached to any stage of an FCN structure. We show our instantiation in Figure 4. Given an input image I, we obtain its local representation through the FCN as feature map X, which is the input of the PSA module. As in [45], we use ResNet [13] as the FCN backbone. The PSA module aggregates long-range contextual information on top of this local representation. It is placed after stage 5 of ResNet, the final stage of the FCN backbone: stage-5 features are semantically stronger, so aggregating them yields a more comprehensive representation of long-range context, and their smaller spatial size reduces computational overhead and memory consumption. Following [45], we also adopt the same deep supervision technique: as shown in Figure 4, an auxiliary loss branch is applied in addition to the main loss.
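        As a minimal sketch of this deep-supervision setup, the total loss can be written as below. The 0.4 auxiliary weight follows PSPNet's setting [45] and is an assumption here, as is the use of 255 as the ignored-pixel label.

```python
import torch.nn.functional as F

def segmentation_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main loss on the final prediction plus a weighted auxiliary loss on
    an intermediate-stage prediction (Figure 4). The 0.4 weight follows
    PSPNet [45] and 255 marks ignored pixels -- both are assumptions."""
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main_loss + aux_weight * aux_loss
```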

3.5. Discussion

        There have been many studies on exploiting contextual information for scene parsing. However, the widely used atrous convolution [4, 42] operates on feature maps with a fixed sparse grid and thus cannot utilize information from the whole image. Pooling strategies [24, 5, 45] capture global context with the same weights at every position, so they cannot adapt to the input data and are less flexible. The recently proposed non-local method [40] encodes global context by computing the correlation between the semantic features of every pair of positions on the input feature map, ignoring the relative position between the two. Different from these solutions, our PSA module adaptively predicts a global attention map for each position on the input feature map through convolutional layers, taking relative positions into account. Furthermore, attention maps are predicted from two different perspectives to capture two different types of information flow between positions. As shown in Figure 1, the two attention maps together construct a bidirectional information propagation path, collecting and distributing information across the feature map. In this regard, global pooling is just a special case of our PSA module. The PSA module is thus able to capture long-range contextual information effectively, adapt to the input data, and utilize diverse attention information, leading to more accurate predictions.

4. Experimental evaluation

        The proposed PSANet is effective for scene parsing and semantic segmentation. We evaluate it on three challenging datasets: the complex scene understanding dataset ADE20K [48], the object segmentation dataset PASCAL VOC 2012 [9], and the urban scene understanding dataset Cityscapes [8]. In the following, we first describe the implementation details of the training strategy and hyperparameters, then report results on each dataset and visualize the masks learned by the PSA module.

4.1. Implementation details

        We conduct experiments based on Caffe [15]. During training, we set the mini-batch size to 16, use synchronized batch normalization, and set the base learning rate to 0.01. Following prior work [5, 45], we adopt the "poly" learning rate policy with the power set to 0.9. We set the maximum number of iterations to 150K for ADE20K, 30K for VOC 2012, and 90K for Cityscapes. Momentum and weight decay are set to 0.9 and 0.0001, respectively. For data augmentation, we adopt random mirroring and random resizing with the scale ranging between 0.5 and 2 for all datasets, and additionally apply random rotation between −10 and 10 degrees and random Gaussian blur for ADE20K and VOC 2012.
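        For reference, the "poly" policy decays the learning rate as base_lr · (1 − iter/max_iter)^power; a one-line sketch:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning rate policy used in the training setup above."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# Halfway through the 150K-iteration ADE20K schedule:
print(poly_lr(0.01, 75000, 150000))   # ~0.0054
```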

4.2. ADE20K

        The ADE20K scene parsing dataset [48] contains up to 150 categories and diverse complex scenes with 1,038 image-level labels. It is divided into 20K/2K/3K images for training, validation, and testing. Both objects and scenes need to be parsed for this dataset. For evaluation metrics, both the mean of class-wise intersection over union (Mean IoU) and pixel-wise accuracy (Pixel Acc.) are used.

        Comparison of information aggregation methods. We compare several information aggregation methods with two network backbones (ResNet-50 and ResNet-101) on the ADE20K validation set. The experimental results are listed in Table 1. Our baseline is a ResNet-based FCN with atrous convolution applied in stages 4 and 5, i.e., the dilation rates of these two stages are set to 2 and 4, respectively.

        On top of the feature maps extracted by the FCN, DenseCRF [18] brings only a slight improvement. Global pooling [24] is a simple and intuitive way to obtain long-range contextual information, but it treats every position on the feature map identically. Pyramid structures [5, 45] capture contextual information at different scales through multiple branches. Another option is an attention mask for each position in the feature map: the non-local method [40] generates the attention mask of each position by computing the feature correlation between every pair of positions. In our PSA module, besides generating a distinct attention mask for each point, the point-wise masks are adaptively learned through convolution operations rather than computed by matrix multiplication as in the non-local method [40]. Compared with these information aggregation methods, ours performs better, indicating that the PSA module is a better choice for capturing long-range contextual information.
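        The contrast with the non-local mask can be made concrete with a small schematic NumPy sketch; a plain linear map stands in for the PSA convolutional branch, and the relative-position handling is omitted, so this is illustrative only.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, c = 16, 8                                  # 4x4 feature map, 8 channels
x = rng.standard_normal((n, c))

# Non-local [40]: the mask comes from pairwise feature correlations
# (embedded dot products between every pair of positions).
w_theta = rng.standard_normal((c, c))
w_phi = rng.standard_normal((c, c))
attn_nonlocal = softmax((x @ w_theta) @ (x @ w_phi).T)   # (n, n)

# PSA (schematic): row i of the mask is predicted from x_i alone by a
# learned mapping -- no dot product between position pairs is involved.
w_psa = rng.standard_normal((c, n))
attn_psa = x @ w_psa                                      # (n, n)
```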

        We further examine the two branches of the PSA module. Taking ResNet-50 as an example, in Table 1, the "collect" information flow alone (denoted "+COLLECT") reaches 41.27/79.74% in Mean IoU and pixel accuracy under single-scale testing, exceeding the baseline by 4.04/1.73. This significant improvement demonstrates the effectiveness of the PSA module even in its simplified form with only unidirectional information flow. With the bidirectional information flow (denoted "+COLLECT +DISTRIBUTE"), performance further improves to 41.92/80.17%, exceeding the baseline by 4.69/2.16 in absolute terms and 12.60%/2.77% in relative terms; these gains come from the PSA module alone, on top of the same backbone. This shows that the two information propagation paths are both effective and complementary. Additionally, note that our position-sensitive mask generation plays a key role in the high performance. The variant denoted "(compact)" uses a compact mask of size H×W rather than the over-completed mask of roughly double size, thus ignoring relative position information; performance is higher when relative position is taken into account. Even so, the "compact" variant still outperforms the "non-local" method, which shows that long-range dependencies learned adaptively from the feature map are better than those computed from feature correlations.

        Method comparison. We compare our method with others in Table 2. With the same network backbone, PSANet outperforms RefineNet [21] and PSPNet [45]. PSANet50 even outperforms RefineNet built on the much deeper ResNet-152 backbone, and PSANet slightly outperforms WiderNet [41], which uses the powerful Wider ResNet backbone. Visual improvements. We show a visual comparison of segmentation results in Figure 5. PSANet clearly improves segmentation quality, producing more accurate and detailed predictions than the counterpart without the PSA module. More visual comparisons between PSANet and other methods are included in the supplementary material.

Table 1. Contextual information aggregation with different approaches. Results are reported on the ADE20K validation set. "SS" stands for single-scale testing and "MS" for the multi-scale testing strategy.

Table 2. Comparison of methods on the ADE20K validation set.

Table 3. Comparison of methods reported on the VOC 2012 test set.

4.3. PASCAL VOC 2012

        The PASCAL VOC 2012 semantic segmentation dataset [9] targets object segmentation and contains 20 object categories plus one background category. We adopt the augmented annotations from [12], resulting in 10,582, 1,449, and 1,456 images for training, validation, and testing. As shown in Table 4, the PSA module is also highly effective for object segmentation, improving performance significantly over the baseline. Similar to [4, 5, 45, 6], we also pre-train on the MS-COCO [23] dataset and then fine-tune on the VOC dataset. Table 3 lists the performance of different frameworks on the VOC 2012 test set, where PSANet achieves top performance. Visual improvements are evident, as shown in the supplementary material: introducing the PSA module again leads to better predictions.

Figure 5. Visual improvements on the ADE20K validation set. The proposed PSANet produces more accurate and detailed parsing results. "PSA-COL" denotes PSANet with the "COLLECT" branch, and "PSA-COL-DIS" denotes the bidirectional information flow mode, which further enhances the predictions.

Table 4. Improvement brought by the PSA module. Models are trained on the augmented training set of VOC 2012 and evaluated on the validation set.

Table 5. Improvement brought by the PSA module. Models are trained on the training set and evaluated on the validation set of Cityscapes.

Table 6. Comparison of methods on the Cityscapes test set. Methods trained with both fine and coarse data are marked with †.

4.4. Cityscapes

        The Cityscapes dataset [8] is collected for urban scene understanding. It contains 5,000 finely annotated images divided into 2,975, 500, and 1,525 images for training, validation, and testing. The images are annotated with 30 common classes such as road, person, and car, of which 19 are used for semantic segmentation evaluation. An additional 20,000 coarsely annotated images are also provided. We first show the improvement brought by the PSA module over the baseline in Table 5, then compare different methods on the test set in Table 6 under two settings: training with only fine data and training with coarse + fine data. PSANet achieves top performance under both settings. Several visual predictions are included in the supplementary material.

4.5. Mask visualization

        To gain a deeper understanding of our PSA module, we visualize the learned attention masks in Figure 6. The images come from the validation set of ADE20K. For each input image, we show the masks for two points, one drawn in red and one in blue; for each point, we show the masks generated by the "COLLECT" and "DISTRIBUTE" branches. We observe that the attention masks assign low attention to the current position itself. This is reasonable, because the aggregated feature representation is concatenated with the original local feature, which already contains the local information. We also find that the attention masks effectively focus on related regions, which benefits performance. For example, in the first row, the mask of the red point, which lies on the beach, assigns larger weights to the ocean and sand, helping the prediction at that point, while the attention mask of the blue point in the sky assigns higher weights to other sky regions. Similar trends are observed in other images. The visualized masks confirm the design intuition of our module: each position collects informative contextual information from both nearby and distant regions for better prediction.

Figure 6. Visualization of masks learned by PSANet. The masks are sensitive to location and category information and collect different contextual information accordingly.

5. Summary and conclusion

        We propose the PSA module for scene parsing. Through convolutional layers, it adaptively predicts two global attention maps for each position in the feature map, achieving position-specific bidirectional information propagation for better performance. By aggregating information with these global attention maps, long-range contextual information is effectively captured. Extensive experiments on three challenging datasets show that our method ranks among the best in scene parsing performance, demonstrating its effectiveness and generality. We believe the proposed module can advance related techniques in the community.
