Paper notes: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

On 2018-03-13, Google open-sourced the semantic image segmentation model DeepLab-v3+.

GitHub repository: https://github.com/tensorflow/models/tree/master/research/deeplab

Paper link: https://arxiv.org/abs/1802.02611

===============================================================

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation


Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam
Google Inc.
{lcchen, yukun, gpapan, fschroff, hadam}@google.com

Abstract. Spatial pyramid pooling modules or encoder-decoder structures are used in deep neural networks for the semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab.

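The abstract's key building block, the atrous separable convolution, factors a dilated spatial filter into a per-channel (depthwise) step followed by a 1x1 (pointwise) step that mixes channels. The following is a minimal NumPy sketch of the idea, not the paper's TensorFlow implementation; the function and argument names are my own:

```python
import numpy as np

def atrous_depthwise_separable_conv(x, depth_w, point_w, rate=2):
    """Sketch of an atrous (dilated) depthwise separable convolution.

    x:        input feature map, shape (H, W, C)
    depth_w:  per-channel spatial filter, shape (k, k, C), k odd
    point_w:  1x1 pointwise filter, shape (C, C_out)
    rate:     atrous rate; rate=1 is an ordinary depthwise conv
    """
    H, W, C = x.shape
    k = depth_w.shape[0]
    pad = rate * (k - 1) // 2  # "same" padding for the dilated kernel
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, C))
    # Depthwise step: each channel is filtered independently, with the
    # kernel taps spaced `rate` pixels apart (atrous sampling).
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + rate * k:rate, j:j + rate * k:rate, :]
            out[i, j, :] = np.sum(patch * depth_w, axis=(0, 1))
    # Pointwise step: a 1x1 convolution mixes the channels.
    return out @ point_w
```

Compared with a standard k x k convolution over C input and C_out output channels, this factorization needs roughly k*k*C + C*C_out multiplies per position instead of k*k*C*C_out, which is why the paper's ASPP and decoder modules become faster after the swap.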

1    Introduction

Semantic segmentation with the goal to assign semantic labels to every pixel in an image [1,2,3,4,5] is one of the fundamental topics in computer vision. Deep convolutional neural networks [6,7,8,9,10] based on the Fully Convolutional Neural Network [8,11] show striking improvement over systems relying on hand-crafted features [12,13,14,15,16,17] on benchmark tasks. In this work, we consider two types of neural networks that use spatial pyramid pooling module [18,19,20] or encoder-decoder structure [21,22] for semantic segmentation, where the former one captures rich contextual information by pooling features at different resolution while the latter one is able to obtain sharp object boundaries.


In order to capture the contextual information at multiple scales, DeepLabv3 [23] applies several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP), while PSPNet [24] performs pooling operations at different grid scales. Even though rich semantic information is encoded in the last feature map, detailed information related to object boundaries is missing due to the pooling or convolutions with striding operations within the network backbone. This could be alleviated by applying the atrous convolution to extract denser feature maps. However, given the design of state-of-the-art neural networks [7,9,10,25,26] and limited GPU memory, it is computationally prohibitive to extract output feature maps that are 8, or even 4 times smaller than the input resolution. Taking ResNet-101 [25] for example, when applying atrous convolution to extract output features that are 16 times smaller than input resolution, features within the last 3 residual blocks (9 layers) have to be dilated. Even worse, 26 residual blocks (78 layers!) will be affected if output features that are 8 times smaller than input are desired. Thus, it is computationally intensive if denser output features are extracted for this type of models. On the other hand, encoder-decoder models [21,22] lend themselves to faster computation (since no features are dilated) in the encoder path and gradually recover sharp object boundaries in the decoder path. Attempting to combine the advantages from both methods, we propose to enrich the encoder module in the encoder-decoder networks by incorporating the multi-scale contextual information.

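The stride-to-dilation trick described above (holding the output stride at 16 or 8 by replacing striding with atrous rates in the later backbone units) can be sketched as a small planning function. This is an illustrative helper, not code from the paper; the stride list below stands in for a ResNet-style backbone (stem stride 4, then four blocks of strides 1, 2, 2, 2 for a nominal output stride of 32):

```python
def plan_atrous_strides(unit_strides, target_os):
    """For each backbone unit, once the desired output stride `target_os`
    is reached, convert its stride into extra dilation instead.
    Returns a list of (effective stride, atrous rate) per unit."""
    current_os, rate = 1, 1
    plan = []
    for s in unit_strides:
        if current_os < target_os:
            current_os *= s          # keep striding until target is hit
            plan.append((s, rate))
        else:
            rate *= s                # fold the stride into the atrous rate
            plan.append((1, rate))
    return plan
```

With `target_os=8`, the last two stride-2 blocks become stride-1 with rates 2 and 4, which is exactly why so many residual blocks (and their feature maps) must be dilated, and hence why dense outputs are memory-hungry for this family of models.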


Reposted from blog.csdn.net/bcfd_yundou/article/details/82834854