Image Segmentation - DeepLabv3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (ECCV 2018)

Disclaimer: This translation is only a personal study record


Summary

  Spatial pyramid pooling modules and encoder-decoder structures are both used for semantic segmentation in deep neural networks. The former networks encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields of view, while the latter networks can capture sharper object boundaries by gradually recovering spatial information. In this work, we propose to combine the advantages of both approaches. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply depthwise separable convolution to both the atrous spatial pyramid pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab.

Keywords : Semantic Image Segmentation, Spatial Pyramid Pooling, Encoder-Decoder, Depthwise Separable Convolution.

1 Introduction

Semantic segmentation, with the goal of assigning a semantic label to each pixel in an image [1, 2, 3, 4, 5], is one of the fundamental topics in computer vision. Deep convolutional neural networks [6, 7, 8, 9, 10] based on fully convolutional networks [8, 11] have significantly improved over systems relying on hand-crafted features [12, 13, 14, 15, 16, 17] on benchmark tasks. In this work, we consider two types of neural networks that use spatial pyramid pooling modules [18, 19, 20] or encoder-decoder structures [21, 22] for semantic segmentation, where the former captures rich contextual information by pooling features at different resolutions, while the latter is able to obtain sharp object boundaries.


Figure 1. We improve DeepLabv3, which employs a spatial pyramid pooling module (a), with an encoder-decoder structure (b). The proposed model, DeepLabv3+, contains rich semantic information from the encoder module, while detailed object boundaries are recovered by a simple yet effective decoder module. The encoder module allows us to extract features at arbitrary resolutions by applying atrous convolution.

  To capture contextual information at multiple scales, DeepLabv3 [23] applies several parallel atrous convolutions at different rates (called Atrous Spatial Pyramid Pooling, or ASPP), while PSPNet [24] performs pooling operations at different grid scales. Even though rich semantic information is encoded in the last feature map, detailed information related to object boundaries is lost due to the pooling or convolutions with striding within the network backbone. This can be alleviated by applying atrous convolution to extract denser feature maps. However, given the design of state-of-the-art neural networks [7, 9, 10, 25, 26] and limited GPU memory, it is computationally prohibitive to extract output feature maps that are 8, or even 4, times smaller than the input resolution. Taking ResNet-101 [25] as an example, when applying atrous convolution to extract output features that are 16 times smaller than the input resolution, features within the last 3 residual blocks (9 layers) have to be dilated. Worse, if output features that are 8 times smaller than the input are desired, 26 residual blocks (78 layers!) will be affected. Thus, it is computationally intensive to extract denser output features for this type of model. On the other hand, encoder-decoder models [21, 22] lend themselves to faster computation in the encoder path (since no features are dilated) and gradually recover sharp object boundaries in the decoder path. Attempting to combine the advantages of both approaches, we propose to enrich the encoder module in the encoder-decoder network by incorporating multi-scale contextual information.

  In particular, our proposed model, called DeepLabv3+, extends DeepLabv3 [23] by adding a simple yet effective decoder module to recover object boundaries, as shown in Figure 1. Rich semantic information is encoded in the output of DeepLabv3, with atrous convolution allowing one to control the density of the encoder features depending on the budget of computational resources. Furthermore, the decoder module allows detailed object boundary recovery.

  Motivated by the recent success of depthwise separable convolution [27, 28, 26, 29, 30], we also explore this operation and show improvements in both speed and accuracy by adapting the Xception model [26], similar to [31], for the task of semantic segmentation and by applying atrous separable convolution to both the ASPP and decoder modules. Finally, we demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, attaining test set performance of 89.0% and 82.1% without any post-processing, setting a new state of the art.

  In summary, our contributions are:

  • We propose a novel encoder-decoder structure that employs DeepLabv3 as a powerful encoder module and a simple and effective decoder module.

  • In our structure, the resolution of the extracted encoder features can be arbitrarily controlled by dilated convolutions to trade off accuracy and running time, which is not possible in existing encoder-decoder models.

  • We use the Xception model for the segmentation task and apply depthwise separable convolutions to the ASPP module and the decoder module, resulting in a faster and stronger encoder-decoder network.

  • Our proposed model achieves state-of-the-art performance on PASCAL VOC 2012 and Cityscapes datasets. We also provide a detailed analysis of design choices and model variants.

  • We make the Tensorflow-based implementation of the proposed model publicly available at https://github.com/tensorflow/models/tree/master/research/deeplab.

2. Related work

Models based on Fully Convolutional Networks (FCNs) [8, 11] have achieved significant improvements on several segmentation benchmarks [1, 2, 3, 4, 5]. Several model variants have been proposed to exploit contextual information for segmentation [12, 13, 14, 15, 16, 17, 32, 33], including those that take multi-scale inputs (i.e., image pyramids) [34, 35, 36, 37, 38, 39] and those that employ probabilistic graphical models (such as DenseCRF [40] with efficient inference algorithms [41]) [42, 43, 44, 37, 45, 46, 47, 48, 49, 50, 51, 39]. In this work, we mainly discuss models that use spatial pyramid pooling and encoder-decoder structures.

  Spatial Pyramid Pooling: Models such as PSPNet [24] or DeepLab [39, 23] perform spatial pyramid pooling [18, 19] (including image-level pooling [52]) at several grid scales, or apply several parallel atrous convolutions at different rates (called Atrous Spatial Pyramid Pooling, or ASPP). By exploiting multi-scale information, these models show promising results on several segmentation benchmarks.

  Encoder-Decoder: Encoder-decoder networks have been successfully applied to many computer vision tasks, including human pose estimation [53], object detection [54, 55, 56], and semantic segmentation [11, 57, 21, 22, 58, 59, 60, 61, 62, 63, 64]. Typically, an encoder-decoder network contains (1) an encoder module that gradually reduces the feature maps and captures higher-level semantic information, and (2) a decoder module that gradually recovers the spatial information. Building on this idea, we propose to use DeepLabv3 [23] as the encoder module and add a simple yet effective decoder module to obtain sharper segmentations.


Figure 2. Our proposed DeepLabv3+ extends DeepLabv3 by employing an encoder-decoder structure. An encoder module encodes multi-scale contextual information by applying atrous convolutions at multiple scales, while a simple yet effective decoder module refines the segmentation results along object boundaries.

  Depthwise Separable Convolution: Depthwise separable convolution [27, 28] or group convolution [7, 65] is a powerful operation that reduces the computational cost and number of parameters while maintaining similar (or slightly better) performance. This operation has been adopted in many recent neural network designs [66, 67, 26, 29, 30, 31, 68]. In particular, we explore the Xception model [26], similar to [31]'s COCO 2017 detection challenge submission, and show improvements in both accuracy and speed for the task of semantic segmentation.

3. Method

In this section, we briefly introduce atrous convolution [69, 70, 8, 71, 42] and depthwise separable convolution [27, 28, 67, 26, 29]. We then review DeepLabv3 [23], which is used as our encoder module, before discussing the decoder module appended to the encoder output. We also present a modified Xception model [26, 31] that further improves performance with faster computation.

3.1 Encoder-Decoder with Atrous Convolution

Atrous Convolution: Atrous convolution is a powerful tool that allows us to explicitly control the resolution of features computed by deep convolutional neural networks and to adjust the filter's field of view in order to capture multi-scale information; it generalizes the standard convolution operation. In the case of two-dimensional signals, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as follows:

y[i] = \sum_{k} x[i + r \cdot k] \, w[k]

where the atrous rate r determines the stride with which we sample the input signal. We refer interested readers to [39] for more details. Note that standard convolution is the special case with rate r = 1. The filter's field of view is adaptively modified by changing the rate value.
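To make the sampling pattern concrete, here is a minimal NumPy sketch (not the paper's implementation) of 1-D atrous convolution following the equation above; the function name and toy inputs are illustrative.

```python
import numpy as np

def atrous_conv1d(x, w, rate=1):
    """1-D atrous convolution: y[i] = sum_k x[i + rate*k] * w[k].
    rate = 1 recovers the standard convolution as a special case."""
    out_len = len(x) - rate * (len(w) - 1)
    y = np.zeros(out_len)
    for i in range(out_len):
        for k in range(len(w)):
            y[i] += x[i + rate * k] * w[k]
    return y

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(atrous_conv1d(x, w, rate=1))  # field of view spans 3 samples
print(atrous_conv1d(x, w, rate=2))  # same filter, field of view spans 5 samples
```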


Figure 3. A 3×3 depthwise separable convolution decomposes a standard convolution into (a) a depthwise convolution (applying a single filter to each input channel) and (b) a pointwise convolution (combining the outputs of the depthwise convolution across channels). In this work, we explore atrous separable convolution, where atrous convolution is adopted in the depthwise convolution, as shown in (c) with rate = 2.

  Depthwise Separable Convolution: Depthwise separable convolution factorizes a standard convolution into a depthwise convolution followed by a pointwise convolution (i.e., a 1×1 convolution), which greatly reduces the computational complexity. Specifically, the depthwise convolution performs a spatial convolution independently over each input channel, while the pointwise convolution is used to combine the outputs of the depthwise convolution. In the TensorFlow [72] implementation of depthwise separable convolution, atrous convolution is supported in the depthwise convolution (i.e., the spatial convolution), as shown in Figure 3. In this work, we refer to the resulting convolution as atrous separable convolution, and find that atrous separable convolution significantly reduces the computational complexity of the proposed model while maintaining similar (or better) performance.
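As a concrete illustration, the following is a minimal TensorFlow/Keras sketch of an atrous separable convolution (a dilated 3×3 depthwise convolution followed by a 1×1 pointwise convolution); the function name, tensor shapes, and the batch normalization/ReLU placement are illustrative assumptions rather than the released implementation.

```python
import tensorflow as tf

def atrous_separable_conv(x, filters, rate):
    """Dilated 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same',
                                        dilation_rate=rate, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, kernel_size=1, use_bias=False)(x)  # pointwise
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

# Example: a 256-channel feature map processed with rate = 2.
inputs = tf.keras.Input(shape=(65, 65, 256))
outputs = atrous_separable_conv(inputs, filters=256, rate=2)
model = tf.keras.Model(inputs, outputs)
```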

  DeepLabv3 as encoder: DeepLabv3 [23] employs atrous convolution [69, 70, 8, 71] to extract features computed by deep convolutional neural networks at an arbitrary resolution. Here, we denote by output stride the ratio of the input image spatial resolution to the final output resolution (before global pooling or the fully connected layer). For the task of image classification, the spatial resolution of the final feature maps is usually 32 times smaller than the input image resolution, and thus output stride = 32. For the task of semantic segmentation, one can adopt output stride = 16 (or 8) for denser feature extraction by removing the striding in the last one (or two) block(s) and applying atrous convolution correspondingly (e.g., rate = 2 and rate = 4 applied to the last two blocks for output stride = 8). Additionally, DeepLabv3 augments the Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales by applying atrous convolution at different rates, with image-level features [52]. In our proposed encoder-decoder structure, we use the last feature map before the logits in the original DeepLabv3 as the encoder output. Note that the encoder output feature map contains 256 channels and rich semantic information. Besides, one can extract features at an arbitrary resolution by applying atrous convolution, depending on the computational budget.
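The sketch below illustrates an ASPP-style encoder head in the spirit of this description, assuming TensorFlow/Keras and a statically known spatial size: a 1×1 convolution, parallel 3×3 atrous convolutions, and image-level pooling, concatenated and projected to 256 channels. The atrous rates and the omission of batch normalization are simplifications, not the released DeepLabv3 code.

```python
import tensorflow as tf

def aspp(x, filters=256, rates=(6, 12, 18)):
    """ASPP-style head: 1x1 conv, parallel atrous 3x3 convs, image-level pooling."""
    h, w = x.shape[1], x.shape[2]  # assumes static spatial dimensions
    branches = [tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)]
    for r in rates:
        branches.append(tf.keras.layers.Conv2D(filters, 3, padding='same',
                                               dilation_rate=r, use_bias=False)(x))
    # Image-level features: global pooling, 1x1 conv, bilinear upsampling back.
    pooled = tf.keras.layers.GlobalAveragePooling2D(keepdims=True)(x)
    pooled = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(pooled)
    pooled = tf.keras.layers.UpSampling2D(size=(h, w), interpolation='bilinear')(pooled)
    branches.append(pooled)
    x = tf.keras.layers.Concatenate()(branches)
    return tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)  # 256-channel encoder output
```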

  Proposed Decoder: The encoder features of DeepLabv3 are usually computed with output stride = 16. In the work of [23], the features are bilinearly upsampled by a factor of 16, which can be considered a naive decoder module. However, this naive decoder module may not successfully recover object segmentation details. We therefore propose a simple yet effective decoder module, as shown in Figure 2. The encoder features are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features [73] from the network backbone that have the same spatial resolution (e.g., Conv2 before striding in ResNet-101 [25]). We apply another 1×1 convolution on the low-level features to reduce the number of channels, since the corresponding low-level features usually contain a large number of channels (e.g., 256 or 512), which may outweigh the importance of the rich encoder features (only 256 channels in our model) and make training harder. After the concatenation, we apply a few 3×3 convolutions to refine the features, followed by another simple bilinear upsampling by a factor of 4. We show in Section 4 that using output stride = 16 for the encoder module strikes the best trade-off between speed and accuracy. Performance is marginally improved when using output stride = 8 for the encoder module, at the cost of extra computational complexity.
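Below is a minimal Keras sketch of the decoder module just described, under stated assumptions (48 channels after the 1×1 reduction, two [3×3, 256] refinement convolutions, and a final ×4 bilinear upsampling); the batch normalization placement and the logits layer are illustrative.

```python
import tensorflow as tf

def deeplabv3plus_decoder(encoder_features, low_level_features, num_classes):
    """Upsample encoder output x4, concat with channel-reduced low-level features,
    refine with two 3x3 convs, predict logits, and upsample x4 again."""
    low = tf.keras.layers.Conv2D(48, 1, use_bias=False)(low_level_features)  # [1x1, 48]
    low = tf.keras.layers.BatchNormalization()(low)
    low = tf.keras.layers.ReLU()(low)

    x = tf.keras.layers.UpSampling2D(4, interpolation='bilinear')(encoder_features)
    x = tf.keras.layers.Concatenate()([x, low])
    for _ in range(2):                                   # two [3x3, 256] refinements
        x = tf.keras.layers.Conv2D(256, 3, padding='same', use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(num_classes, 1)(x)        # per-pixel logits
    return tf.keras.layers.UpSampling2D(4, interpolation='bilinear')(x)
```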

3.2 Modified Aligned Xception

The Xception model [26] has shown promising image classification results on ImageNet [74] with fast computation. More recently, the MSRA team [31] modified the Xception model (called Aligned Xception) and further pushed the performance on the task of object detection. Motivated by these findings, we work in the same direction to adapt the Xception model to the task of semantic image segmentation. In particular, we make a few changes on top of MSRA's modifications, namely (1) the same deeper Xception as in [31], except that we do not modify the entry flow network structure, for fast computation and memory efficiency, (2) all max-pooling operations are replaced by depthwise separable convolutions with striding, which enables us to apply atrous separable convolution to extract feature maps at an arbitrary resolution (another option is to extend the atrous algorithm to max-pooling operations), and (3) extra batch normalization [75] and ReLU activation are added after each 3×3 depthwise convolution, similar to the MobileNet design [29]. See Figure 4 for details.
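To illustrate changes (2) and (3), here is a hypothetical Keras sketch of a single modified-Xception block: three depthwise separable convolutions with batch normalization and ReLU after each depthwise step, a strided separable convolution in place of max pooling, and a strided 1×1 shortcut. The layer counts and residual arrangement are illustrative, not the exact architecture in Figure 4.

```python
import tensorflow as tf

def modified_xception_block(x, filters, stride=1):
    """One residual block of separable convs; downsampling uses a strided
    separable conv (instead of max pooling) and a strided 1x1 shortcut."""
    shortcut = x
    if stride != 1 or x.shape[-1] != filters:
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
        shortcut = tf.keras.layers.BatchNormalization()(shortcut)

    for i in range(3):
        s = stride if i == 2 else 1  # stride applied by the last separable conv
        x = tf.keras.layers.DepthwiseConv2D(3, strides=s, padding='same', use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)  # extra BN/ReLU after depthwise conv
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    return tf.keras.layers.Add()([x, shortcut])
```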

4. Experimental evaluation

We use ImageNet-1k [74] pre-trained ResNet-101 [25] or modified aligned Xception [26, 31] to extract dense feature maps via dilated convolutions. Our implementation builds on TensorFlow [72] and is publicly available.


Figure 4. We modify Xception as follows: (1) more layers (the same as MSRA's modification except for the changes in the entry flow), (2) all max-pooling operations are replaced by depthwise separable convolutions with striding, and (3) extra batch normalization and ReLU are added after each 3×3 depthwise convolution, similar to MobileNet.

  The proposed models are evaluated on the PASCAL VOC 2012 semantic segmentation benchmark [1], which contains 20 foreground object classes and one background class. The original dataset contains 1464 (train), 1449 (val), and 1456 (test) pixel-level annotated images. We augment the dataset with the extra annotations provided by [76], resulting in 10582 (train_aug) training images. Performance is measured in terms of pixel intersection-over-union averaged across the 21 classes (mIOU).

  We follow the same training protocol as in [23] and refer interested readers to [23] for details. In short, we employ the same learning rate schedule (i.e., the "poly" policy [52] with the same initial learning rate of 0.007), a crop size of 513×513, fine-tuning of the batch normalization parameters [75] when output stride = 16, and random scale data augmentation during training. Note that we also include batch normalization parameters in the proposed decoder module. Our proposed model is trained end-to-end without piecewise pretraining of each component.
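For reference, a minimal sketch of the "poly" learning rate policy mentioned above; the power of 0.9 and the step count are common DeepLab settings assumed here for illustration, not values stated in this text.

```python
def poly_learning_rate(base_lr, step, max_steps, power=0.9):
    """'Poly' policy: the learning rate decays as (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

# With the initial learning rate of 0.007 used for training above:
for step in (0, 15_000, 29_999):
    print(step, round(poly_learning_rate(0.007, step, max_steps=30_000), 6))
```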

4.1 Decoder Design Options

We define "DeepLabv3 feature map" as the last feature map computed by DeepLabv3 (i.e., the one containing the ASPP features and image-level features), and [k×k, f] as a convolution operation with kernel size k×k and f filters.

  The ResNet-101 based DeepLabv3 [23] bilinearly upsamples the logits by 16 during both training and evaluation when output stride = 16 is adopted. This simple bilinear upsampling can be considered a naive decoder design; it attains a performance of 77.21% [23] on the PASCAL VOC 2012 validation set and is 1.2% better than not using this naive decoder during training (i.e., downsampling the ground truth during training). To improve over this naive baseline, our proposed model "DeepLabv3+" adds a decoder module on top of the encoder output, as shown in Figure 2. In the decoder module, we consider three places where different design choices are made, namely (1) the 1×1 convolution used to reduce the channels of the low-level feature map from the encoder module, (2) the 3×3 convolution used to obtain sharper segmentation results, and (3) which encoder low-level features should be used.

  To evaluate the effect of the 1×1 convolution in the decoder module, we employ [3×3, 256] together with the Conv2 features from the ResNet-101 network backbone, i.e., the last feature map in the res2x residual block (to be concrete, we use the feature map before striding). As shown in Table 1, reducing the channels of the low-level feature map from the encoder module to 48 or 32 leads to better performance. We therefore adopt [1×1, 48] for the channel reduction.

  We then design the 3×3 convolution structure for the decoder module and report the findings in Table 2. We find that after concatenating the Conv2 feature map (before striding) with the DeepLabv3 feature map, it is more effective to use two 3×3 convolutions with 256 filters than to use simply one or three convolutions. Changing the number of filters from 256 to 128 or the kernel size from 3×3 to 1×1 degrades performance. We also experiment with utilizing both the Conv2 and Conv3 feature maps in the decoder module. In this case, the decoder feature map is gradually upsampled by a factor of 2, concatenated with Conv3 first and then with Conv2, and each is refined by a [3×3, 256] operation. The whole decoding procedure is then similar to the U-Net/SegNet design [21, 22]. However, we do not observe a significant improvement. Thus, in the end, we adopt a very simple yet effective decoder module: the concatenation of the DeepLabv3 feature map and the channel-reduced Conv2 feature map, refined by two [3×3, 256] operations. Note that our proposed DeepLabv3+ model has output stride = 4. Given limited GPU resources, we do not pursue a denser output feature map (i.e., output stride < 4).

4.2 ResNet-101 as network backbone

To compare the model variants in terms of accuracy and speed, we report mIOU and Multiply-Adds in Table 3 when using ResNet-101 [25] as the network backbone in the proposed DeepLabv3+ model. Thanks to dilated convolutions, we are able to use a single model to obtain features at different resolutions during training and evaluation.


Table 1. PASCAL VOC 2012 validation set. Effect of the decoder 1×1 convolution used to reduce the channels of the low-level feature map from the encoder module. We fix the other components in the decoder structure to use [3×3, 256] and Conv2.


Table 2. Effect of the decoder structure when fixing [1×1, 48] to reduce the encoder feature channels. We find that using the Conv2 (before striding) feature map and two extra [3×3, 256] operations is most effective. Performance on the VOC 2012 validation set.


Table 3. Inference strategies on the PASCAL VOC 2012 validation set using ResNet-101. train OS: the output stride used during training. eval OS: the output stride used during evaluation. Decoder: employing the proposed decoder structure. MS: multi-scale inputs during evaluation. Flip: adding left-right flipped inputs.


Table 4. Individual model error rates on the ImageNet-1K validation set.

  Baseline: The first row block in Table 3 contains the results from [23], showing that extracting denser feature maps during evaluation (i.e., eval output stride = 8) and adopting multi-scale inputs improves performance. Besides, adding left-right flipped inputs doubles the computational complexity with only marginal performance improvement.

  Adding a decoder: The second row block in Table 3 contains the results when adopting the proposed decoder structure. The performance improves from 77.21% to 78.85% or from 78.51% to 79.35% when using eval output stride = 16 or 8, respectively, at the cost of about 20B extra Multiply-Adds of computation overhead. The performance is further improved when using multi-scale and left-right flipped inputs.

  Coarser feature maps: We also experiment with the case of train output stride = 32 (i.e., no atrous convolution at all during training) for fast computation. As shown in the third row block in Table 3, adding the decoder brings about a 2% improvement while only 74.20B Multiply-Adds are required. However, the performance is always about 1% to 1.5% lower than that of using train output stride = 16 with different eval output stride values. We therefore prefer using output stride = 16 or 8 during training or evaluation, depending on the complexity budget.

4.3 Xception as network backbone

We further use the more powerful Xception [26] as the network backbone. Following [31], we made some more changes, as described in Section 3.2.

  ImageNet pre-training: The proposed Xception network is pre-trained on the ImageNet-1k dataset [74] with a training protocol similar to [26]. Specifically, we adopt the Nesterov momentum optimizer with momentum = 0.9, an initial learning rate of 0.05, learning rate decay of 0.94 every 2 epochs, and weight decay of 4e−5. We use asynchronous training with 50 GPUs, each with a batch size of 32 and an image size of 299×299. We did not tune the hyperparameters very hard, since the goal is to pre-train the model on ImageNet for semantic segmentation. We report the single-model error rates on the validation set in Table 4, along with the baseline ResNet-101 [25] reproduced under the same training protocol. We observe that for the modified Xception, when the extra batch normalization and ReLU are not added after each 3×3 depthwise convolution, Top-1 and Top-5 accuracy drop by 0.75% and 0.29%, respectively.
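A minimal Keras sketch of the quoted pre-training optimizer settings (Nesterov momentum 0.9, initial learning rate 0.05, decayed by 0.94 every 2 epochs); `steps_per_epoch` is a placeholder, and the 4e−5 weight decay would typically be applied through kernel regularizers, which is omitted here.

```python
import tensorflow as tf

steps_per_epoch = 1000  # placeholder; depends on dataset size and batch configuration
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.05,
    decay_steps=2 * steps_per_epoch,  # decay by 0.94 every 2 epochs
    decay_rate=0.94,
    staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule,
                                    momentum=0.9, nesterov=True)
```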

  Table 5 reports the results of using the proposed Xception as the network backbone for semantic segmentation.

  Baseline : We first report the results without the proposed decoder in the first row block of Table 5. This shows that when train output stride=eval output stride=16, using Xception as the network backbone improves the performance by about 2% compared to the case of using ResNet-101. Further improvements can also be obtained by using eval output stride=8, multi-scale input during inference, and adding left-right flipped input. Note that we did not use the multi-grid approach [77, 78, 23], which we found did not improve performance.

  Adding a decoder : As shown in the second row block in Table 5, adding a decoder brings an improvement of 0.8% when using eval output stride=16 for all different inference strategies. The improvement is smaller when using eval output stride=8.

  Using Depthwise Separable Convolution : Inspired by the computational efficiency of depthwise separable convolution, we further adopt it in the ASPP and decoder modules. As shown in the third row of Table 5, the computational complexity of Multiply-Adds is significantly reduced by 33% to 41%, while achieving similar mIOU performance.

  Pre-training on COCO: For comparison with other state-of-the-art models, we further pre-train our proposed DeepLabv3+ model on the MS-COCO dataset [79], which yields about 2% additional improvement.

  Pre-training on JFT: Similar to [23], we also employ the proposed Xception model that has been pre-trained on both the ImageNet-1k [74] and JFT-300M datasets [80, 26, 81], which brings an extra 0.8% to 1% improvement.

  Test set results : Since computational complexity was not considered in the benchmark evaluation, we selected the best performing model and trained it with output stride=8 and frozen batch normalization parameters. Finally, our “DeepLabv3+” achieves 87.8% and 89.0% performance without and with JFT dataset pre-training, respectively.

  Qualitative Results : We provide a visualization of the best model in Figure 6. As shown, our model is able to segment objects well without any post-processing.

  Failure Modes : As shown in the last row of Figure 6, our model has difficulty segmenting (a) sofas from chairs, (b) heavily occluded objects, and (c) objects with rare views.

4.4 Improvements along object boundaries

In this subsection, we evaluate the segmentation accuracy with the trimap experiment [14, 40, 39] to quantify the accuracy of the proposed decoder module near object boundaries. Specifically, we apply morphological dilation to the "void" label annotations on the validation set, which typically occur around object boundaries. We then compute the mean IOU for those pixels that are within the dilated band (called the trimap) of the "void" labels. As shown in Fig. 5(a), employing the proposed decoder for both the ResNet-101 [25] and Xception [26] network backbones improves the performance compared to naive bilinear upsampling. The improvement is more significant when the dilated band is narrow. As shown in the figure, we observe mIOU improvements of 4.8% and 5.4% for ResNet-101 and Xception, respectively, at the smallest trimap width. We also visualize the effect of employing the proposed decoder in Fig. 5(b).
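A minimal NumPy/SciPy sketch of this trimap evaluation, assuming 2-D label maps where 255 marks the "void" label; the band width, label convention, and the exclusion of void pixels themselves from scoring are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def trimap_miou(pred, gt, void_label=255, band=3, num_classes=21):
    """Mean IOU restricted to pixels inside a dilated band around 'void' labels."""
    band_mask = binary_dilation(gt == void_label, iterations=band)
    band_mask &= gt != void_label          # score only labeled pixels inside the band
    p, g = pred[band_mask], gt[band_mask]
    ious = []
    for c in range(num_classes):
        inter = np.sum((p == c) & (g == c))
        union = np.sum((p == c) | (g == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float('nan')
```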


Table 5. Inference strategies on the PASCAL VOC 2012 validation set when using the modified Xception. train OS: the output stride used during training. eval OS: the output stride used during evaluation. Decoder: employing the proposed decoder structure. MS: multi-scale inputs during evaluation. Flip: adding left-right flipped inputs. SC: adopting depthwise separable convolution in both the ASPP and decoder modules. COCO: models pre-trained on MS-COCO. JFT: models pre-trained on JFT.

4.5 Experimental results on Cityscapes

In this section, we experiment with DeepLabv3+ on the Cityscapes dataset [3], a large-scale dataset containing 5000 pixel-level annotated images (2975, 500, and 1525 for the training, validation, and test sets, respectively) and about 20000 coarsely annotated images.


Table 6. Results obtained with the best performing model on the PASCAL VOC 2012 test set.


Figure 5. (a) mIOU as a function of trimap band width around object boundaries when employing train output stride = eval output stride = 16. BU: bilinear upsampling. (b) Qualitative effect of employing the proposed decoder module compared to naive bilinear upsampling (denoted BU). In the examples, we adopt Xception as the feature extractor with train output stride = eval output stride = 16.

  As shown in Table 7(a), employing the proposed Xception model as the network backbone (denoted X-65) on top of DeepLabv3 [23], which includes the ASPP module and image-level features [52], attains a performance of 77.33% on the validation set. Adding the proposed decoder module significantly improves the performance to 78.79% (a 1.46% improvement). We notice that removing the augmented image-level features improves the performance to 79.14%, showing that in the DeepLab model, image-level features are more effective on the PASCAL VOC 2012 dataset. We also discover that, on the Cityscapes dataset, it is effective to increase the number of layers in the entry flow of Xception [26], the same as what [31] did for the object detection task. The resulting model built on top of the deeper network backbone (denoted X-71 in the table) attains the best performance of 79.55% on the validation set.


Figure 6. Visualization results on the validation set. The last row shows failure modes.


Table 7. (a) DeepLabv3+ on the Cityscapes validation set when trained with the train_fine set. (b) DeepLabv3+ on the Cityscapes test set. Coarse: the extra coarsely annotated training set is also used. Only a few of the top models are listed in this table.

  After finding the best model variant on the validation set, we further fine-tune the model on the coarse annotations to compete with other state-of-the-art models. As shown in Table 7(b), our proposed DeepLabv3+ achieves a performance of 82.1% on the test set, setting a new performance level on Cityscapes.

5 Conclusion

Our proposed model "DeepLabv3+" employs an encoder-decoder structure, where DeepLabv3 is used to encode the rich contextual information and a simple yet effective decoder module is adopted to recover object boundaries. Atrous convolution can also be applied to extract encoder features at an arbitrary resolution, depending on the available computational resources. We also explore the Xception model and atrous separable convolution to make the proposed model faster and stronger. Finally, our experimental results show that the proposed model sets a new state of the art on the PASCAL VOC 2012 and Cityscapes datasets.

Acknowledgments We would like to thank Haozhi Qi and Jifeng Dai for their valuable discussions on Aligned Xception, Chen Sun for their feedback, and the Google Mobile Vision team for their support.

References

  1. Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge a retrospective. IJCV (2014)
  2. Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: CVPR. (2014)
  3. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
  4. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: CVPR. (2017)
  5. Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In: CVPR. (2018)
  6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proc. IEEE. (1998)
  7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. (2012)
  8. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR. (2014)
  9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
  10. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
  11. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
  12. He, X., Zemel, R.S., Carreira-Perpindn, M.: Multiscale conditional random fields for image labeling. In: CVPR. (2004)
  13. Shotton, J., Winn, J., Rother, C., Criminisi, A.: Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV (2009)
  14. Kohli, P., Torr, P.H., et al.: Robust higher order potentials for enforcing label consistency. IJCV 82(3) (2009) 302–324
  15. Ladicky, L., Russell, C., Kohli, P., Torr, P.H.: Associative hierarchical crfs for object class image segmentation. In: ICCV. (2009)
  16. Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: ICCV. (2009)
  17. Yao, J., Fidler, S., Urtasun, R.: Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: CVPR. (2012)
  18. Grauman, K., Darrell, T.: The pyramid match kernel: Discriminative classification with sets of image features. In: ICCV. (2005)
  19. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. (2006)
  20. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: ECCV. (2014)
  21. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. (2015)
  22. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. PAMI (2017)
  23. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 (2017)
  24. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. (2017)
  25. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
  26. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: CVPR. (2017)
  27. Sifre, L.: Rigid-motion scattering for image classification. PhD thesis (2014)
  28. Vanhoucke, V.: Learning visual representations at scale. ICLR invited talk (2014)
  29. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
  30. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR. (2018)
  31. Qi, H., Zhang, Z., Xiao, B., Hu, H., Cheng, B., Wei, Y., Dai, J.: Deformable convolutional networks – coco detection and segmentation challenge 2017 entry. ICCV COCO Challenge Workshop (2017)
  32. Mostajabi, M., Yadollahpour, P., Shakhnarovich, G.: Feedforward semantic segmentation with zoom-out features. In: CVPR. (2015)
  33. Dai, J., He, K., Sun, J.: Convolutional feature masking for joint object and stuff segmentation. In: CVPR. (2015)
  34. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. PAMI (2013)
  35. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV. (2015)
  36. Pinheiro, P., Collobert, R.: Recurrent convolutional neural networks for scene labeling. In: ICML. (2014)
  37. Lin, G., Shen, C., van den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR. (2016)
  38. Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. In: CVPR. (2016)
  39. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI (2017)
  40. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: NIPS. (2011)
  41. Adams, A., Baek, J., Davis, M.A.: Fast high-dimensional filtering using the permutohedral lattice. In: Eurographics. (2010)
  42. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR. (2015)
  43. Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: CVPR. (2015)
  44. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: ICCV. (2015)
  45. Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: ICCV. (2015)
  46. Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L.: Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In: ICCV. (2015)
  47. Schwing, A.G., Urtasun, R.: Fully connected deep structured networks. arXiv:1503.02351 (2015)
  48. Jampani, V., Kiefel, M., Gehler, P.V.: Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In: CVPR. (2016)
  49. Vemulapalli, R., Tuzel, O., Liu, M.Y., Chellappa, R.: Gaussian conditional random field network for semantic segmentation. In: CVPR. (2016)
  50. Chandra, S., Kokkinos, I.: Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In: ECCV. (2016)
  51. Chandra, S., Usunier, N., Kokkinos, I.: Dense and low-rank gaussian crfs using deep embeddings. In: ICCV. (2017)
  52. Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv:1506.04579 (2015)
  53. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV. (2016)
  54. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. (2017)
  55. Shrivastava, A., Sukthankar, R., Malik, J., Gupta, A.: Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851 (2016)
  56. Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: Dssd: Deconvolutional single shot detector. arXiv:1701.06659 (2017)
  57. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV. (2015)
  58. Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In: CVPR. (2017)
  59. Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. In: CVPR. (2017)
  60. Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve semantic segmentation by global convolutional network. In: CVPR. (2017)
  61. Islam, M.A., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: CVPR. (2017)
  62. Wojna, Z., Ferrari, V., Guadarrama, S., Silberman, N., Chen, L.C., Fathi, A., Uijlings, J.: The devil is in the decoder. In: BMVC. (2017)
  63. Fu, J., Liu, J., Wang, Y., Lu, H.: Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943 (2017)
  64. Zhang, Z., Zhang, X., Peng, C., Cheng, D., Sun, J.: Exfuse: Enhancing feature fusion for semantic segmentation. arXiv:1804.03821 (2018)
  65. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. (2017)
  66. Jin, J., Dundar, A., Culurciello, E.: Flattened convolutional neural networks for feedforward acceleration. arXiv:1412.5474 (2014)
  67. Wang, M., Liu, B., Foroosh, H.: Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial ”bottleneck” structure. arXiv:1608.04337 (2016)
  68. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR. (2018)
  69. Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P.: A real-time algorithm for signal analysis with the help of the wavelet transform. In: Wavelets: Time-Frequency Methods and Phase Space. (1989) 289–297
  70. Giusti, A., Ciresan, D., Masci, J., Gambardella, L., Schmidhuber, J.: Fast image scanning with deep max-pooling convolutional neural networks. In: ICIP. (2013)
  71. Papandreou, G., Kokkinos, I., Savalle, P.A.: Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In: CVPR. (2015)
  72. Abadi, M., Agarwal, A., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 (2016)
  73. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR. (2015)
  74. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: The ImageNet large scale visual recognition challenge. IJCV (2015)
  75. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML. (2015)
  76. Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV. (2011)
  77. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. arXiv:1702.08502 (2017)
  78. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. (2017)
  79. Lin, T.Y., et al.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
  80. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS. (2014)
  81. Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV. (2017)
  82. Li, X., Liu, Z., Luo, P., Loy, C.C., Tang, X.: Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In: CVPR. (2017)
  83. Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: Revisiting the resnet model for visual recognition. arXiv:1611.10080 (2016)
  84. Wang, G., Luo, P., Lin, L., Wang, X.: Learning object interactions and descriptions for semantic image segmentation. In: CVPR. (2017)
  85. Luo, P., Wang, G., Lin, L., Wang, X.: Deep dual learning for semantic image segmentation. In: ICCV. (2017)
  86. Bulò, S.R., Porzi, L., Kontschieder, P.: In-place activated batchnorm for memory-optimized training of dnns. In: CVPR. (2018)
