Abstract

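Advantages of combining U-Net with residual units:
(1) Residual units ease the training of deep networks.
(2) The rich skip connections facilitate information propagation, which makes it possible to design networks with fewer parameters but better performance.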

1.INTRODUCTION

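Road extraction covers two tasks: road area extraction and road centerline extraction.
Previous work on road area extraction:
(1) Shape index features + SVM [11]
(2) A multistage framework built on probabilistic SVM [12]
(3) An unsupervised road extraction method based on hierarchical graph-based segmentation [6]
(4) RBMs (restricted Boltzmann machines) to detect road areas in high-resolution aerial images (the first use of deep learning for this task, with pre-processing and post-processing) [2]
(5) Convolutional neural networks for road extraction, which outperform RBMs (roads and buildings are extracted directly from the raw images) [5]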

Deep ResUnet differs from U-Net in two ways:
(1) It uses residual units instead of plain neural units as the basic building block.
(2) It removes the cropping operation.

2.METHODOLOGY

A.Deep ResUnet

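In semantic segmentation, good results depend on using low-level detail while retaining high-level semantic information. Training such a deep neural network is hard, especially when training samples are limited. Two common remedies are:
(1) Start from a pre-trained network and fine-tune it on the target dataset.
(2) Data augmentation.
The authors believe (as an intuition) that the U-Net architecture itself also helps ease training: copying low-level features to the corresponding high levels creates a path for information propagation, letting signals travel between low and high levels more easily. This not only helps backpropagation during training but also compensates the high-level semantic features with finer low-level detail.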

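Deeper networks can improve performance, but they may hamper training, and a degradation problem may appear. To overcome these problems, we use residual units. Each residual unit can be written in a general form:

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$

where x_l and x_{l+1} are the input and output of the l-th residual unit, F(·) is the residual function, f(·) is the activation function, and h(·) is the identity mapping, typically h(x_l) = x_l.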

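Batch normalization, ReLU activation, and convolution layers can be combined in several ways inside a residual unit. Reference [22] discusses the impact of the different combinations in detail and proposes the full pre-activation design (see Fig. 1).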
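To make the full pre-activation ordering concrete, here is a minimal NumPy sketch of one residual unit of the form x_{l+1} = x_l + F(x_l), where F is BN -> ReLU -> conv applied twice. It is an illustration under simplifying assumptions rather than the authors' implementation: the helper names (`batch_norm`, `conv3x3`, `residual_unit`) are hypothetical, batch normalization has no learned scale or shift, and the input and output channel counts are equal so the identity shortcut needs no projection.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each channel over the spatial dimensions (no learned scale/shift).
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def conv3x3(x, w):
    # x: (H, W, C_in), w: (3, 3, C_in, C_out); zero padding keeps H and W unchanged.
    h, wd, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def residual_unit(x, w1, w2):
    # Full pre-activation: BN -> ReLU -> conv, twice, then add the identity shortcut.
    r = conv3x3(relu(batch_norm(x)), w1)
    r = conv3x3(relu(batch_norm(r)), w2)
    return x + r  # h(x) = x, and no activation after the addition

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
w1 = rng.normal(size=(3, 3, 4, 4)) * 0.1
w2 = rng.normal(size=(3, 3, 4, 4)) * 0.1
y = residual_unit(x, w1, w2)
print(y.shape)
```

Because the shortcut is a plain addition, a unit whose residual branch outputs zero passes its input through unchanged, which is the property that keeps very deep stacks trainable.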

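Advantages of Deep ResUnet:
(1) Residual units ease the training of the network.
(2) Skip connections (between low and high levels, and within each residual unit) facilitate information propagation without degradation, giving better performance with fewer parameters.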

Encoding: encodes the input image into compact representations
Bridge: connects the encoding and decoding paths
Decoding: recovers the representations to a pixel-wise classification

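The encoding path has three residual units. Instead of using pooling to reduce the feature map size, each unit applies a stride of 2 to its first convolution block, which halves the feature map.
The decoding path also consists of three residual units. Before each unit, the feature maps from the lower level are up-sampled and concatenated with the feature maps from the corresponding encoding path.
After the last level of the decoding path, a 1x1 convolution and a sigmoid activation project the multi-channel feature maps onto the desired segmentation.
In total we use 15 convolutional layers, compared with 23 in U-Net, and no cropping operation is needed.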
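The layer count and feature-map sizes can be checked with a short shape trace. The sketch below is a hedged reconstruction, not the authors' code: the filter counts (64 to 512) and the choice that the first encoding unit keeps the input resolution (stride 1) while the second and third encoding units and the bridge use stride 2 are assumptions. The second assumption is needed so that three up-sampling steps bring a 224x224 input back to 224x224. Each residual unit is counted as two convolutions.

```python
import math

# Assumed layout: (name, filters, stride of the first convolution in the unit).
# Filter counts and the stride-1 first unit are assumptions, not taken from the notes.
ENCODER = [("encoding 1", 64, 1), ("encoding 2", 128, 2), ("encoding 3", 256, 2)]
BRIDGE = ("bridge", 512, 2)
DECODER = [("decoding 1", 256), ("decoding 2", 128), ("decoding 3", 64)]

def trace_shapes(size=224):
    """Return (rows, n_conv): per-level (name, spatial size, filters) and conv count."""
    rows, skips, n_conv = [], [], 0
    for name, filters, stride in ENCODER:
        size = math.ceil(size / stride)  # zero padding: output = ceil(input / stride)
        rows.append((name, size, filters))
        skips.append(size)               # kept for concatenation in the decoder
        n_conv += 2
    name, filters, stride = BRIDGE
    size = math.ceil(size / stride)
    rows.append((name, size, filters))
    n_conv += 2
    for (name, filters), skip in zip(DECODER, reversed(skips)):
        size *= 2                        # up-sampling by a factor of 2
        if size != skip:
            raise ValueError("up-sampled map does not match the encoder map")
        rows.append((name, size, filters))
        n_conv += 2
    rows.append(("output (1x1 conv + sigmoid)", size, 1))
    n_conv += 1
    return rows, n_conv

rows, n_conv = trace_shapes(224)
for name, size, filters in rows:
    print(f"{name:28s} {size}x{size}x{filters}")
print("convolutional layers:", n_conv)
```

Since zero padding keeps the up-sampled maps the same size as the encoder maps, the concatenation works without any cropping.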


B.Loss function (MSE)

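$$L(W) = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathrm{Net}(I_i; W) - s_i \rVert^2$$

Our goal is to estimate the network parameters W so that the network produces accurate and robust road areas.
N: the number of training samples; I_i and s_i: the i-th training image and its ground-truth segmentation
We train the network with SGD. Note that U-Net, by contrast, optimizes the model with pixel-wise cross entropy as its loss function.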
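As a quick illustration of the MSE loss, the sketch below sums the squared differences between each predicted map and its ground truth, then averages over the N samples. The function name `mse_loss` is hypothetical. Note that many frameworks additionally average over pixels, which only rescales the loss by a constant factor.

```python
import numpy as np

def mse_loss(preds, targets):
    # preds, targets: arrays of shape (N, H, W) with values in [0, 1].
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = preds.shape[0]
    # Squared L2 distance per sample, then the mean over the N samples.
    per_sample = ((preds - targets) ** 2).reshape(n, -1).sum(axis=1)
    return float(per_sample.mean())

preds = np.zeros((2, 2, 2))
targets = np.ones((2, 2, 2))
print(mse_loss(preds, targets))
```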

C. Result refinement

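The input and output of the network have the same size, 224x224. Because the convolutional layers use zero padding, pixels near the output boundary are predicted less accurately than pixels in the center. To get better results, we use an overlap strategy to produce the segmentation of a large image: the input sub-images are cropped from the original image with an overlap of o (o = 14 in our experiments), the final result is obtained by stitching all sub-segmentations together, and values in the overlapping regions are averaged.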
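The overlap strategy can be sketched as a sliding window whose step is the tile size minus the overlap, with an accumulator and a counter so that overlapping pixels are averaged. This is a minimal sketch under assumptions: `predict_fn` stands in for the trained network, the helper names are hypothetical, the image is at least one tile in each dimension, and the last tile in each direction is shifted back to end at the image border (the notes do not say how borders are handled).

```python
import numpy as np

def tile_starts(length, tile, overlap):
    # Offsets such that consecutive tiles share `overlap` pixels;
    # a final tile is added flush with the border if needed.
    step = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, step))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def predict_large(image, predict_fn, tile=224, overlap=14):
    h, w = image.shape[:2]
    acc = np.zeros((h, w), dtype=float)  # sum of predictions per pixel
    cnt = np.zeros((h, w), dtype=float)  # number of tiles covering each pixel
    for r in tile_starts(h, tile, overlap):
        for c in tile_starts(w, tile, overlap):
            acc[r:r + tile, c:c + tile] += predict_fn(image[r:r + tile, c:c + tile])
            cnt[r:r + tile, c:c + tile] += 1
    return acc / cnt  # average where tiles overlap

rng = np.random.default_rng(0)
image = rng.random((500, 500))
stitched = predict_large(image, lambda t: t)  # identity "network" for illustration
```

With an identity `predict_fn` the stitched result reproduces the input, which is a convenient sanity check before plugging in a real model.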

3.EXPERIMENTS

Dataset: Massachusetts roads dataset
We compare Deep ResUnet with three state-of-the-art methods: Mnih's method [2], Saito's method [5], and U-Net [24].

A.Dataset

1171 images (training: 1108, validation: 14, testing: 49)
Image size: 1500 x 1500 pixels; resolution: 1.2 m/pixel

B.Implementation details

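Framework: Keras
Optimizer: SGD
There are 1108 training images of size 1500×1500. In theory the network can take images of arbitrary size as input, but storing the feature maps would then require a large amount of GPU memory. In this letter, we train the model on fixed-size training images (224×224, as shown in Table 1), randomly sampled from the original images. In total, 30,000 samples are generated and fed into the network to learn the parameters. Note that no data augmentation is used during training.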
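Random patch sampling of this kind might look like the sketch below. The generator name `random_crops` and the uniform choice of the top-left corner are assumptions; the notes only say the patches are sampled randomly. Generating 30,000 samples from 1108 images works out to roughly 27 crops per image.

```python
import numpy as np

def random_crops(image, mask, n, size=224, seed=0):
    # Yield n aligned (image, mask) patches with uniformly random top-left corners.
    rng = np.random.default_rng(seed)
    h, w = mask.shape[:2]
    for _ in range(n):
        r = int(rng.integers(0, h - size + 1))  # upper bound is exclusive
        c = int(rng.integers(0, w - size + 1))
        yield image[r:r + size, c:c + size], mask[r:r + size, c:c + size]

image = np.zeros((1500, 1500, 3), dtype=np.uint8)
mask = np.zeros((1500, 1500), dtype=np.uint8)
crops = list(random_crops(image, mask, n=5))
```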
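The model is trained on an NVIDIA Titan 1080 GPU with a mini-batch size of 8. The learning rate is initialized to 0.001 and multiplied by 0.1 every 20 epochs. The network converges within 50 epochs.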
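The learning rate schedule described here is a step decay. A minimal sketch, assuming epochs are counted from 0 and using a hypothetical function name `step_decay`:

```python
def step_decay(epoch, base_lr=0.001, drop=0.1, every=20):
    # Multiply the learning rate by `drop` once every `every` epochs.
    return base_lr * drop ** (epoch // every)

schedule = [step_decay(e) for e in range(50)]
print(schedule[0], schedule[20], schedule[40])
```

In Keras, a function like this can be passed to the `LearningRateScheduler` callback, although the notes do not say how the authors implemented the schedule.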

C.Evaluation metrics

The most common metrics for evaluating binary classification methods are precision and recall. In remote sensing, these metrics are also called correctness and completeness. Precision is the fraction of predicted road pixels that are labeled as roads, and recall is the fraction of all labeled road pixels that are correctly predicted.

Because it is difficult to label every road pixel correctly, Mnih et al. [2] introduced the relaxed precision and recall scores [26] to road extraction. The relaxed precision is defined as the fraction of pixels predicted as road that lie within ρ pixels of a pixel labeled as road. The relaxed recall is the fraction of pixels labeled as road that lie within ρ pixels of a pixel predicted as road.
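One way to compute the relaxed scores is to dilate one mask by ρ pixels and intersect it with the other. The sketch below is an illustration under an assumption: "within ρ pixels" is interpreted as a square (Chebyshev) neighborhood, which the notes do not specify. The helper names `dilate` and `relaxed_precision_recall` are hypothetical.

```python
import numpy as np

def dilate(mask, rho):
    # Mark every pixel that has a True pixel within rho pixels (square window).
    h, w = mask.shape
    padded = np.pad(mask, rho)
    out = np.zeros_like(mask)
    for dr in range(2 * rho + 1):
        for dc in range(2 * rho + 1):
            out |= padded[dr:dr + h, dc:dc + w]
    return out

def relaxed_precision_recall(pred, label, rho=3):
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    # Predicted road pixels that lie near a labeled road pixel.
    precision = (pred & dilate(label, rho)).sum() / max(pred.sum(), 1)
    # Labeled road pixels that lie near a predicted road pixel.
    recall = (label & dilate(pred, rho)).sum() / max(label.sum(), 1)
    return float(precision), float(recall)

label = np.zeros((11, 11), dtype=bool)
label[5, :] = True   # labeled road along row 5
pred = np.zeros((11, 11), dtype=bool)
pred[7, :] = True    # prediction shifted by two pixels
print(relaxed_precision_recall(pred, label, rho=3))
print(relaxed_precision_recall(pred, label, rho=1))
```

With ρ = 0 the functions reduce to ordinary precision and recall, so a prediction that is off by two pixels scores zero there but is fully accepted at ρ = 3.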
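In this experiment, the slack parameter ρ is set to 3, consistent with previous studies. We also report the break-even points of the different methods. The break-even point is defined as the point on the relaxed precision-recall curve where precision equals recall; in other words, it is the intersection of the precision-recall curve and the line y = x.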
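Given precision and recall values measured at a sequence of thresholds, the break-even point can be approximated by finding where precision minus recall changes sign and interpolating linearly between the two neighboring points. This is a sketch of one reasonable procedure, not necessarily the one used by the authors; the function name and the sample values are made up for illustration.

```python
def break_even_point(precisions, recalls):
    # Return the value where precision == recall, or None if the curves never cross.
    pts = list(zip(precisions, recalls))
    for (p0, r0), (p1, r1) in zip(pts, pts[1:]):
        d0, d1 = p0 - r0, p1 - r1
        if d0 == 0:
            return p0
        if d0 * d1 < 0:
            t = d0 / (d0 - d1)          # fraction of the way to the crossing
            return p0 + t * (p1 - p0)   # linear interpolation of precision
    if pts and pts[-1][0] == pts[-1][1]:
        return pts[-1][0]
    return None

# Illustrative values only: precision rises and recall falls as the threshold grows.
bep = break_even_point([0.5, 0.7, 0.9], [0.9, 0.8, 0.6])
print(bep)
```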

D.Comparisons

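The proposed method is compared with three deep-learning-based road extraction methods on the test set of the Massachusetts roads dataset. Table II lists the break-even points of the proposed and compared methods. Fig. 3 shows the relaxed precision-recall curves of U-Net and our network together with their break-even points, as well as the break-even points of the other compared methods. Our method outperforms the other three methods in terms of both relaxed precision and relaxed recall. Although our network has only about 1/4 of the parameters of U-Net (7.8M versus 30.6M), it achieves a promising improvement on the road extraction task.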

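Fig. 4 shows four example results from Saito et al., U-Net, and the proposed ResUnet. Compared with the other two methods, ours produces cleaner results with less noise. In particular, for two-lane roads our method segments each lane with high confidence, whereas the other methods may confuse the lanes, as shown in the third row of Fig. 4. Likewise, our method produces better results in intersection regions.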

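Context information is very important when analyzing objects with complex structures. Our network takes the context of roads into account and can therefore distinguish roads from similar objects such as building roofs and airfield runways. The first row of Fig. 4 shows that, even though the runway has features very similar to a highway, our method successfully separates the side road from the runway. Context information also makes the method robust to occlusion. For example, part of the road inside the rectangle in the second row is covered by trees; Saito's method and U-Net cannot detect the road under the trees, but our method labels it successfully. A failure case is shown in the yellow rectangle in the last row: our method misses the roads in the parking lot. This is mainly because most parking-lot roads are not labeled in the dataset, so although they share the features of ordinary roads, the network treats them as background.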

REFERENCES

[1] X. Huang and L. Zhang, “Road centreline extraction from high-resolution imagery based on multiscale structural features and support vector machines,” IJRS, vol. 30, no. 8, pp. 1977–1987, 2009.
[2] V. Mnih and G. Hinton, “Learning to detect roads in high-resolution aerial images,” in ECCV, 2010, pp. 210–223.
[3] C. Unsalan and B. Sirmacek, “Road network detection using probabilistic and graph theoretical methods,” TGRS, vol. 50, no. 11, pp. 4441–4453, 2012.
[4] G. Cheng, Y. Wang, Y. Gong, F. Zhu, and C. Pan, “Urban road extraction via graph cuts based probability propagation,” in ICIP, 2015, pp. 5072–5076.
[5] S. Saito, T. Yamashita, and Y. Aoki, “Multiple object extraction from aerial imagery with convolutional neural networks,” J. Electron. Imaging, vol. 2016, no. 10, pp. 1–9, 2016.
[6] R. Alshehhi and P. R. Marpu, “Hierarchical graph-based segmentation for extracting road networks from high-resolution satellite images,” P&RS, vol. 126, pp. 245–260, 2017.
[7] B. Liu, H. Wu, Y. Wang, and W. Liu, “Main road extraction from ZY-3 grayscale imagery based on directional mathematical morphology and VGI prior knowledge in urban areas,” PLOS ONE, vol. 10, no. 9, p. e0138071, 2015.
[8] C. Sujatha and D. Selvathi, “Connected component-based technique for automatic extraction of road centerline in high resolution satellite images,” J. Image Video Process., vol. 2015, no. 1, p. 8, 2015.
[9] G. Cheng, Y. Wang, S. Xu, H. Wang, S. Xiang, and C. Pan, “Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network,” TGRS, vol. 55, no. 6, pp. 3322–3337, 2017.
[10] G. Cheng, F. Zhu, S. Xiang, and C. Pan, “Road centerline extraction via semisupervised segmentation and multidirection nonmaximum suppression,” GRSL, vol. 13, no. 4, pp. 545–549, 2016.
[11] M. Song and D. Civco, “Road extraction using SVM and image segmentation,” PE&RS, vol. 70, no. 12, pp. 1365–1371, 2004.
[12] S. Das, T. T. Mirnalinee, and K. Varghese, “Use of salient features for the design of a multistage framework to extract roads from high-resolution multispectral satellite images,” TGRS, vol. 49, no. 10, pp. 3906–3931, 2011.
[13] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014, pp. 487–495.
[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” TPAMI, vol. 39, no. 6, p. 1137, 2017.
[15] V. Mnih and G. E. Hinton, “Learning to label aerial images from noisy data,” in ICML, 2012, pp. 567–574.
[16] Q. Zhang, Y. Wang, Q. Liu, X. Liu, and W. Wang, “CNN based suburban building detection using monocular high resolution Google Earth images,” in IGARSS, 2016, pp. 661–664.
[17] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” Geosci. Remote Sens. Mag., vol. 4, no. 2, pp. 22–40, 2016.
[18] Z. Zhang, Y. Wang, Q. Liu, L. Li, and P. Wang, “A CNN based functional zone classification method for aerial images,” in IGARSS, 2016, pp. 5449–5452.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015, pp. 1–9.
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV, 2016, pp. 630–645.
[23] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
[24] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.
[25] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
[26] M. Ehrig and J. Euzenat, “Relaxed precision and recall for ontology matching,” in Workshop on Integrating ontology, 2005, pp. 25–32.