Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation — Paper Intensive Reading

Abstract

Deep neural networks employ spatial pyramid pooling modules or encoder-decoder structures for semantic segmentation. The former networks are able to encode multi-scale contextual information by probing the incoming features at multiple rates and multiple effective fields of view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the model on the PASCAL VOC 2012 semantic image segmentation dataset, attaining a performance of 89.0% on the test set without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed model in Tensorflow.

1. Introduction

The goal of semantic segmentation is to assign a semantic label to each pixel in an image [17, 52, 13, 83, 5]; it is one of the fundamental topics in computer vision. Deep convolutional neural networks [41, 38, 64, 68, 70] based on fully convolutional networks [64, 49] show striking improvement over systems relying on hand-crafted features [28, 65, 36, 39, 22, 79] on benchmark tasks. In this work, we consider two types of neural networks for semantic segmentation, those using spatial pyramid pooling modules [23, 40, 26] and those using encoder-decoder structures [61, 3], where the former captures rich contextual information by pooling features at different resolutions while the latter is able to obtain sharp object boundaries.

To capture contextual information at multiple scales, DeepLabv3 [10] applies several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP for short), while PSPNet [81] performs pooling operations at different grid scales. Even though rich semantic information is encoded in the last feature map, detailed information related to object boundaries is lost due to the pooling and strided convolution operations within the network backbone. This can be mitigated by applying atrous convolution to extract denser feature maps. However, given the design of state-of-the-art neural networks [38, 68, 70, 27, 12] and limited GPU memory, it is computationally prohibitive to extract output feature maps that are only 8× or even 4× smaller than the input resolution. Taking ResNet-101 [27] as an example, when applying atrous convolution to extract output features that are 16 times smaller than the input resolution, the features within the last 3 residual blocks (9 layers) have to be dilated. Even worse, 26 residual blocks (78 layers!) are affected if output features that are 8 times smaller than the input are desired. Extracting denser output features for this type of model is therefore computationally heavy. On the other hand, encoder-decoder models [61, 3] lend themselves to faster computation in the encoder path (since no features are dilated) and gradually recover sharp object boundaries in the decoder path. To combine the advantages of both methods, we propose to enrich the encoder module in the encoder-decoder network by incorporating multi-scale contextual information.

In particular, our proposed model, named DeepLabv3+, extends DeepLabv3 [10] by adding a simple yet effective decoder module to recover object boundaries, as shown in Figure 1. Rich semantic information is encoded in the output of DeepLabv3, and atrous convolution allows the density of the encoder features to be controlled according to the budget of computational resources. Furthermore, the decoder module allows detailed object boundary recovery.

Figure 1. We propose to improve DeepLabv3, which employs the spatial pyramid pooling module (a), with the encoder-decoder structure (b). The proposed model, DeepLabv3+, contains rich semantic information from the encoder module, while detailed object boundaries are recovered by a simple yet effective decoder module. The encoder module allows us to extract features at arbitrary resolutions by applying atrous convolution.

Inspired by the recent success of depthwise separable convolution [67, 71, 12, 31, 80], we also explore this operation and show improvements in terms of both speed and accuracy on the task of semantic segmentation by adapting the Xception model [12], similar to [60], and applying atrous separable convolution to both the ASPP and decoder modules. Finally, we validate the effectiveness of the model on the PASCAL VOC 2012 semantic segmentation benchmark, attaining a performance of 89.0% on the test set without any post-processing and setting a new state-of-the-art.

In summary, our contributions are:

• We propose a novel encoder-decoder structure that employs DeepLabv3 as a powerful encoder module and a simple yet effective decoder module.
• In our proposed encoder-decoder structure, the resolution of the extracted encoder features can be arbitrarily controlled by atrous convolution to trade off precision and runtime, which is not possible with existing encoder-decoder models.
• We adapt the Xception model for the segmentation task and apply depthwise separable convolution to both the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.
• Our proposed model achieves state-of-the-art performance on the PASCAL VOC 2012 dataset. We also provide a detailed analysis of design choices and model variants.
• We make our Tensorflow-based implementation of the proposed model publicly available.

2. Related work

Fully convolutional network (FCN) based models [64, 49] have demonstrated significant improvements on several segmentation benchmarks [17, 52, 13, 83, 5]. Several model variants have been proposed to leverage contextual information for segmentation [28, 65, 36, 39, 22, 79, 51, 14], including those that take multi-scale inputs (i.e., image pyramids) [18, 16, 58, 44, 11, 9] and those that adopt probabilistic graphical models (such as DenseCRF [37] with efficient inference algorithms [2]) [8, 4, 82, 44, 48, 55, 63, 34, 72, 6, 7, 9]. In this work, we mainly discuss models using spatial pyramid pooling and encoder-decoder structures.

Spatial pyramid pooling:

Models such as PSPNet [81] or DeepLab [9, 10] perform spatial pyramid pooling [23, 40] (including image-level pooling [47]) at several grid scales, or apply several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP). By exploiting multi-scale information, these models show promising results on several segmentation benchmarks.

Encoder-Decoder:

Encoder-decoder networks have been successfully applied to many computer vision tasks, including human pose estimation [53], object detection [45, 66, 19] and semantic segmentation [49, 54, 61, 3, 43, 59, 57, 33, 76, 20]. Typically, an encoder-decoder network contains (1) an encoder module that gradually reduces the feature maps and captures higher-level semantic information, and (2) a decoder module that gradually recovers the spatial information. Building on this idea, we propose to use DeepLabv3 [10] as the encoder module and add a simple yet effective decoder module to obtain sharper segmentations.

Depthwise separable convolution:

Depthwise separable convolution [67, 71] and group convolution [38, 78] are powerful operations that reduce the computational cost and the number of parameters while maintaining similar (or slightly better) performance. These operations have been adopted in many recent neural network designs. In particular, we explore the Xception model [12], similar to [60] in their COCO 2017 detection challenge submission, and show improvements in terms of both accuracy and speed on the semantic segmentation task.

3. Method

In this section, we briefly introduce atrous convolution [30, 21, 64, 56, 8] and depthwise separable convolution [67, 71, 74, 12, 31]. We then review DeepLabv3 [10], which is used as our encoder module, before discussing the proposed decoder module appended to the encoder output. We also present a modified Xception model [12, 60], which further improves the performance with faster computation.

3.1. Encoder-Decoder with Atrous Convolution

Atrous convolution:

Atrous convolution is a powerful tool that allows us to explicitly control the resolution of features computed by deep convolutional neural networks and to adjust the filter's field of view in order to capture multi-scale information; it generalizes the standard convolution operation. Specifically, in the case of two-dimensional signals, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as follows:

y[i] = Σ_k x[i + r·k] w[k]

where the atrous rate r determines the stride with which we sample the input signal. We refer interested readers to [9] for more details. Note that standard convolution is a special case with rate r = 1. The filter's field of view is adaptively modified by changing the rate value.
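To make the formula concrete, here is a minimal NumPy sketch of the 1-D case (with "valid" boundary handling); the function name and the toy filter are illustrative, not part of the paper:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous convolution: y[i] = sum_k x[i + rate*k] * w[k], valid positions only."""
    k = len(w)
    span = rate * (k - 1) + 1           # effective filter size after dilation
    out_len = len(x) - span + 1
    y = np.zeros(out_len)
    for i in range(out_len):
        y[i] = sum(x[i + rate * j] * w[j] for j in range(k))
    return y

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(atrous_conv1d(x, w, rate=1))  # standard convolution (rate = 1)
print(atrous_conv1d(x, w, rate=2))  # same filter, larger field of view
```

Note how rate = 1 recovers standard convolution, while a larger rate enlarges the field of view without adding parameters.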

Depthwise separable convolution:

Depthwise separable convolution factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., a 1×1 convolution), which drastically reduces the computational complexity. Specifically, the depthwise convolution performs a spatial convolution independently for each input channel, while the pointwise convolution combines the outputs of the depthwise convolution. In the TensorFlow [1] implementation of depthwise separable convolution, atrous convolution is supported in the depthwise convolution (i.e., the spatial convolution). In this work, we refer to the resulting convolution as atrous separable convolution, and find that atrous separable convolution significantly reduces the computational complexity of the proposed model while maintaining similar (or better) performance.
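As a rough sketch of how this operation composes in TensorFlow-style code, the helper below chains a dilated 3×3 depthwise convolution with a 1×1 pointwise convolution using standard Keras layers; the name atrous_separable_conv and the BN/ReLU placement are illustrative assumptions, not the paper's released code:

```python
import tensorflow as tf

def atrous_separable_conv(filters, rate):
    """A 3x3 atrous separable convolution: dilated depthwise conv, then 1x1 pointwise conv."""
    return tf.keras.Sequential([
        tf.keras.layers.DepthwiseConv2D(kernel_size=3, dilation_rate=rate,
                                        padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(filters, kernel_size=1, use_bias=False),  # pointwise
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])

x = tf.random.normal([1, 65, 65, 256])
y = atrous_separable_conv(filters=256, rate=6)(x)
print(y.shape)  # (1, 65, 65, 256) -- spatial size preserved with 'same' padding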

DeepLabv3 as encoder:

DeepLabv3 [10] employs atrous convolution [30, 21, 64, 56] to extract features computed by deep convolutional neural networks at an arbitrary resolution. Here, we denote the output stride as the ratio of the input image spatial resolution to the final output resolution (before global pooling or fully connected layers). For image classification tasks, the spatial resolution of the final feature map is usually 32 times smaller than the input image resolution, and thus output stride = 32. For semantic segmentation tasks, one can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) blocks and applying atrous convolution correspondingly (e.g., we apply rate = 2 and rate = 4 to the last two blocks respectively for output stride = 8). Furthermore, DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution with different rates, with image-level features [47]. In our proposed encoder-decoder structure, we use the last feature map before the logits in the original DeepLabv3 as the encoder output. Note that the encoder output feature map contains 256 channels and rich semantic information. Besides, depending on the computational budget, features can be extracted at arbitrary resolutions by applying atrous convolution.
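For orientation, below is a simplified sketch of an ASPP-style encoder head, assuming output stride = 16 and rates (6, 12, 18) as in DeepLabv3; batch normalization and other details of the released implementation are omitted, and the function name is illustrative:

```python
import tensorflow as tf

def aspp(x, out_channels=256, rates=(6, 12, 18)):
    """Sketch of Atrous Spatial Pyramid Pooling for output stride = 16."""
    h, w = x.shape[1], x.shape[2]
    branches = [tf.keras.layers.Conv2D(out_channels, 1, padding='same')(x)]  # 1x1 branch
    for r in rates:  # parallel atrous branches with different fields of view
        branches.append(
            tf.keras.layers.Conv2D(out_channels, 3, dilation_rate=r, padding='same')(x))
    pooled = tf.reduce_mean(x, axis=[1, 2], keepdims=True)    # image-level features
    pooled = tf.keras.layers.Conv2D(out_channels, 1)(pooled)
    branches.append(tf.image.resize(pooled, (h, w)))          # upsample back
    y = tf.keras.layers.Concatenate()(branches)
    return tf.keras.layers.Conv2D(out_channels, 1)(y)         # fuse to 256-channel output

features = tf.random.normal([1, 33, 33, 2048])  # backbone output at output stride = 16
print(aspp(features).shape)                      # (1, 33, 33, 256)
```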

Proposed decoder:

The encoder features from DeepLabv3 are usually computed with output stride = 16. In the work of [10], the features are bilinearly upsampled by a factor of 16, which can be considered a naive decoder module. However, such a simple decoder module may not succeed in recovering object segmentation details. We therefore propose a simple but effective decoder module, as shown in Fig. 2, and a code sketch follows below. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features [25] from the network backbone that have the same spatial resolution (e.g., Conv2 before striding in ResNet-101 [27]). We apply another 1×1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make training harder. After the concatenation, we apply a few 3×3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4. We show in Sec. 4 that using output stride = 16 for the encoder module strikes the best balance between speed and accuracy. The performance improves slightly when the encoder module uses output stride = 8, but at the cost of additional computational complexity.
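A minimal sketch of this decoder path is given below; reducing the low-level features to 48 channels and using two 3×3 refinement convolutions are illustrative assumptions, not necessarily the exact released architecture:

```python
import tensorflow as tf

def decoder(encoder_features, low_level_features, num_classes=21):
    """Sketch of the proposed decoder for an output stride = 16 encoder."""
    # reduce low-level channels so they do not outweigh the encoder features
    low = tf.keras.layers.Conv2D(48, 1, use_bias=False)(low_level_features)
    h, w = low.shape[1], low.shape[2]
    up = tf.image.resize(encoder_features, (h, w))            # bilinear upsample x4
    x = tf.keras.layers.Concatenate()([up, low])
    for _ in range(2):                                        # refine with 3x3 convs
        x = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(num_classes, 1)(x)             # per-pixel logits
    return tf.image.resize(x, (h * 4, w * 4))                 # final bilinear upsample x4

enc = tf.random.normal([1, 33, 33, 256])    # DeepLabv3 encoder output (output stride = 16)
low = tf.random.normal([1, 129, 129, 256])  # low-level backbone features (output stride = 4)
print(decoder(enc, low).shape)              # (1, 516, 516, 21)
```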

Figure 2. Our proposed DeepLabv3+ extends DeepLabv3 by employing an encoder-decoder structure. The encoder module encodes multi-scale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries.

3.2. Modified Aligned Xception

We adapt the Xception model [12], similar to [60], with the following changes: (1) the same deeper network structure as in [60], except that we do not modify the entry flow network structure, for fast computation and memory efficiency; (2) all max-pooling operations are replaced by depthwise separable convolutions with striding, which enables us to apply atrous separable convolution to extract feature maps at an arbitrary resolution (another option would be to extend the atrous algorithm to max-pooling operations); and (3) extra batch normalization [32] and ReLU activation are added after each 3×3 depthwise convolution, similar to the MobileNet design [31]. The modified Xception structure is shown in Figure 3.

Figure 3. We modify the Xception model as follows: (1) more layers (the same modification as MSRA's, except for the changes in the entry flow); (2) all max-pooling operations are replaced by depthwise separable convolutions with striding; and (3) extra batch normalization and ReLU are added after each 3×3 depthwise convolution, similar to MobileNet.
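As a sketch of how modifications (2) and (3) could look for a single downsampling block, assuming standard Keras layers (the block layout here is an illustration of the pattern, not the complete modified Xception):

```python
import tensorflow as tf

def sep_conv_bn(x, filters, stride=1):
    """3x3 depthwise conv with BN + ReLU (modification (3)), then 1x1 pointwise conv."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding='same',
                                        use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def xception_downsample_block(x, filters):
    """Downsampling block: the original max-pooling is replaced by a strided
    depthwise separable convolution (modification (2))."""
    residual = tf.keras.layers.Conv2D(filters, 1, strides=2, use_bias=False)(x)
    x = sep_conv_bn(x, filters)
    x = sep_conv_bn(x, filters)
    x = sep_conv_bn(x, filters, stride=2)  # strided sep conv instead of max pool
    return tf.keras.layers.Add()([x, residual])
```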
