[Neural Network] 2016-ICLR-Multi-scale context aggregation through dilated convolution

Multi-Scale Context Aggregation by Dilated Convolutions

Paper address
code address

Abstract

State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that were originally designed for image classification. However, dense prediction problems such as semantic segmentation are structurally different from image classification. In this work, we develop a new convolutional network module designed specifically for dense prediction. The proposed module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network increases accuracy.

1 Introduction

Many natural problems in computer vision are instances of dense prediction. The goal is to compute a discrete or continuous label for each pixel in the image. A prominent example is semantic segmentation, which requires classifying each pixel into a given set of categories (He et al, 2004; Shotton et al, 2009; Kohli et al, 2009; Krähenbühl & Koltun, 2011). Semantic segmentation is challenging because it requires combining pixel-level accuracy with multi-scale contextual reasoning (He et al, 2004; Galleguillos & Belongie, 2010).

Recently, significant accuracy improvements were obtained in semantic segmentation by using convolutional networks (LeCun et al, 1989) trained by backpropagation (Rumelhart et al, 1986). Specifically, Long et al. (2015) showed that convolutional network architectures that were originally developed for image classification can be successfully repurposed for dense prediction. These repurposed networks substantially outperform prior techniques on challenging semantic segmentation benchmarks. This raises new questions motivated by the structural differences between image classification and dense prediction. Which aspects of the repurposed networks are truly necessary, and which reduce accuracy when the network is operated densely? Could dedicated modules designed specifically for dense prediction improve accuracy further?

Modern image classification networks integrate multi-scale contextual information via successive pooling and downsampling layers that reduce resolution until a global prediction is obtained (Krizhevsky et al, 2012; Simonyan & Zisserman, 2015). In contrast, dense prediction calls for multi-scale contextual reasoning in combination with full-resolution output. Recent work has studied two approaches to dealing with the conflicting demands of multi-scale inference and full-resolution dense prediction. One approach involves repeated up-convolutions that aim to recover lost resolution while carrying over the global perspective from the downsampled layers (Noh et al, 2015; Fischer et al, 2015). This leaves open the question of whether severe intermediate downsampling is truly necessary. Another approach involves providing multiple rescaled versions of the image as input to the network and combining the predictions obtained for these multiple inputs (Farabet et al, 2013; Lin et al, 2015; Chen et al, 2015b). Likewise, it is not clear whether separate analysis of rescaled input images is truly necessary.

In this work, we develop a convolutional network module that aggregates multi-scale contextual information without losing resolution or analyzing rescaled images. The module can be plugged into existing architectures at any resolution. Unlike pyramid-shaped architectures carried over from image classification, the presented context module is designed specifically for dense prediction. It is a rectangular prism of convolutional layers, with no pooling or subsampling. The module is based on dilated convolutions, which support exponential expansion of the receptive field without loss of resolution or coverage.

As part of this work, we also re-examine the performance of repurposed image classification networks on semantic segmentation. The performance of the core prediction modules can be unintentionally obscured by increasingly elaborate systems that involve structured prediction, multi-column architectures, multiple training datasets, and other augmentations. We therefore study the adaptation of deep image classification networks in a controlled setting and remove vestigial components that hinder dense prediction performance. The result is an initial prediction module that is both simpler and more accurate than prior adaptations.

Using the simplified prediction module, we evaluate the presented context network through controlled experiments on the Pascal VOC 2012 dataset (Everingham et al, 2010). The experiments demonstrate that plugging the context module into existing semantic segmentation architectures reliably increases their accuracy.

2. Dilated convolution

Let $F : \mathbb{Z}^2 \rightarrow \mathbb{R}$ be a discrete function. Let $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$ and let $k : \Omega_r \rightarrow \mathbb{R}$ be a discrete filter of size $(2r+1)^2$. The discrete convolution operator $\ast$ can be defined as

$$(F \ast k)(\mathbf{p}) = \sum_{\mathbf{s}+\mathbf{t}=\mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}) \tag{1}$$

We now generalize this operator. Let $l$ be a dilation factor and let $\ast_l$ be defined as

$$(F \ast_l k)(\mathbf{p}) = \sum_{\mathbf{s}+l\mathbf{t}=\mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}) \tag{2}$$

We will refer to $\ast_l$ as a dilated convolution or an $l$-dilated convolution. The familiar discrete convolution $\ast$ is simply the 1-dilated convolution.
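To make the definition concrete, here is a minimal NumPy sketch of the $l$-dilated convolution in Equation 2. It is not the paper's Caffe implementation; the function name, the treatment of values outside the support of $F$ as zero, and the example sizes are illustrative assumptions.

```python
# Naive l-dilated convolution per Eq. 2: (F *_l k)(p) = sum_{s + l*t = p} F(s) k(t).
# k is a square filter on Omega_r = [-r, r]^2; values of F outside its support
# are treated as zero (an assumption of this sketch).
import numpy as np

def dilated_conv2d(F, k, l=1):
    """Compute (F *_l k) on the same grid as F."""
    H, W = F.shape
    r = k.shape[0] // 2                      # k has size (2r+1) x (2r+1)
    out = np.zeros_like(F, dtype=float)
    for py in range(H):
        for px in range(W):
            acc = 0.0
            for ty in range(-r, r + 1):
                for tx in range(-r, r + 1):
                    sy, sx = py - l * ty, px - l * tx   # s = p - l*t
                    if 0 <= sy < H and 0 <= sx < W:
                        acc += F[sy, sx] * k[ty + r, tx + r]
            out[py, px] = acc
    return out

# With l = 1 this reduces to the familiar discrete convolution.
F = np.random.rand(16, 16)
k = np.ones((3, 3)) / 9.0
print(dilated_conv2d(F, k, l=2).shape)   # (16, 16): resolution is preserved
```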

The dilated convolution operator has been referred to in the past as "convolution with a dilated filter". It plays a key role in the algorithme à trous, an algorithm for wavelet decomposition (Holschneider et al, 1987; Shensa, 1992). We use the term "dilated convolution" instead of "convolution with a dilated filter" to clarify that no "dilated filter" is constructed or represented. The convolution operator itself is modified to use the filter parameters in a different way. The dilated convolution operator can apply the same filter at different ranges using different dilation factors. Our definition reflects the proper implementation of the dilated convolution operator, which does not involve construction of dilated filters.

 In recent work on convolutional networks for semantic segmentation, Long et al. (2015) analyzed filter dilation but chose not to use it. Chen et al. (2015a) use dilation to simplify the architecture of Long et al. (2015). In contrast, we develop a new convolutional network architecture that systematically uses dilated convolutions for multi-scale context aggregation.

Our architecture is motivated by the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. Let $F_0, F_1, \ldots, F_{n-1} : \mathbb{Z}^2 \rightarrow \mathbb{R}$ be discrete functions and let $k_0, k_1, \ldots, k_{n-2} : \Omega_1 \rightarrow \mathbb{R}$ be discrete 3×3 filters. Consider applying the filters with exponentially increasing dilation:

$$F_{i+1} = F_i \ast_{2^i} k_i \qquad \text{for } i = 0, 1, \ldots, n-2 \tag{3}$$

Define the receptive field of an element $\mathbf{p}$ in $F_{i+1}$ as the set of elements in $F_0$ that modify the value of $F_{i+1}(\mathbf{p})$. Let the size of the receptive field of $\mathbf{p}$ in $F_{i+1}$ be the number of these elements. It is easy to see that the size of the receptive field of each element in $F_{i+1}$ is $(2^{i+2}-1)\times(2^{i+2}-1)$. The receptive field is a square of exponentially increasing size, as shown in Figure 1.

Figure 1

Figure 1: Systematic dilation supports exponential expansion of the receptive field without loss of resolution or coverage. (a) $F_1$ is produced from $F_0$ by a 1-dilated convolution; each element in $F_1$ has a 3×3 receptive field. (b) $F_2$ is produced from $F_1$ by a 2-dilated convolution; each element in $F_2$ has a 7×7 receptive field. (c) $F_3$ is produced from $F_2$ by a 4-dilated convolution; each element in $F_3$ has a 15×15 receptive field. The number of parameters associated with each layer is identical. The receptive field grows exponentially while the number of parameters grows linearly.
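As a quick check of the figure above, the short sketch below (an illustrative helper, not from the paper) computes the receptive field of a stack of 3×3 convolutions with dilations 1, 2, 4, reproducing the 3×3, 7×7, and 15×15 receptive fields of Figure 1, while each layer keeps the same number of parameters.

```python
# Receptive-field growth of stacked 3x3 convolutions with dilations 1, 2, 4, ...
# Each 3x3 layer with dilation d extends the receptive field by 2*d pixels.
def receptive_field(dilations, kernel=3):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

for i in range(1, 4):
    dils = [2 ** j for j in range(i)]      # [1], [1, 2], [1, 2, 4]
    print(dils, receptive_field(dils))     # 3, 7, 15 -- matching Figure 1
```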

3. Multi-scale context aggregation

The context module is designed to increase the performance of dense prediction architectures by aggregating multi-scale contextual information. The module takes $C$ feature maps as input and produces $C$ feature maps as output. The input and output have the same form, thus the module can be plugged into existing dense prediction architectures.

We begin by describing a basic form of the context module. In this basic form, each layer has $C$ channels. The representation in each layer is the same and could be used directly to obtain dense per-class predictions, although the feature maps are not normalized and no loss is defined inside the module. Intuitively, the module can increase the accuracy of the feature maps by passing them through multiple layers that expose contextual information.

The basic context module has 7 layers that apply 3×3 convolutions with different dilation factors. The dilations are 1, 1, 2, 4, 8, 16, and 1. Each convolution operates on all layers: strictly speaking, these are $3\times3\times C$ convolutions with dilation in the first two dimensions. Each of these convolutions is followed by a point-wise truncation $\max(\cdot, 0)$. A final layer performs $1\times1\times C$ convolutions and produces the output of the module. The architecture is summarized in Table 1. Note that the front-end module that provides the input to the context network in our experiments produces feature maps at 64×64 resolution. We therefore stop the exponential expansion of the receptive field after layer 6.
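The layer structure just described can be sketched in PyTorch-style code as follows. The paper's implementation is in Caffe, so this is an assumed re-expression; the per-layer padding used here to keep the 64×64 resolution is a simplification of the paper's input-buffer padding.

```python
# Sketch of the basic context module: seven 3x3 layers with dilations
# 1, 1, 2, 4, 8, 16, 1, each followed by max(., 0) (ReLU), and a final 1x1xC layer.
import torch
import torch.nn as nn

def basic_context_module(C):
    dilations = [1, 1, 2, 4, 8, 16, 1]
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(C, C, kernel_size=3, dilation=d, padding=d),
                   nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(C, C, kernel_size=1)]   # final 1x1xC convolution
    return nn.Sequential(*layers)

module = basic_context_module(C=21)
x = torch.randn(1, 21, 64, 64)        # front-end feature maps at 64x64
print(module(x).shape)                # torch.Size([1, 21, 64, 64])
```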

Our initial attempts to train the context module failed to yield an improvement in prediction accuracy. Experiments revealed that standard initialization procedures do not readily support the training of the module. Convolutional networks are commonly initialized using samples from random distributions (Glorot & Bengio, 2010; Krizhevsky et al, 2012; Simonyan & Zisserman, 2015). However, we found that random initialization schemes were not effective for the context module. We found an alternative initialization with clear semantics to be much more effective:

$$k^b(\mathbf{t}, a) = 1_{[\mathbf{t}=0]}\, 1_{[a=b]} \tag{4}$$

where $a$ is the index of the input feature map and $b$ is the index of the output feature map. This is a form of identity initialization, which has recently been advocated for recurrent networks (Le et al, 2015). This initialization sets all filters such that each layer simply passes the input directly to the next. A natural concern is that this initialization could put the network in a mode where backpropagation cannot significantly improve the default behavior of simply passing information through. However, experiments indicate that this is not the case. Backpropagation reliably harvests the contextual information provided by the network to increase the accuracy of the processed feature maps.
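A sketch of this identity initialization (Equation 4) for a convolution weight tensor in PyTorch layout is shown below; the helper name and the check at the end are illustrative assumptions.

```python
# Eq. 4: k^b(t, a) = 1 when t = 0 and a = b, else 0 -- zero everywhere except
# the center tap connecting each input map to the matching output map.
import torch

def identity_init_(weight):
    c_out, c_in, kh, kw = weight.shape
    assert c_out == c_in, "basic module: equal numbers of input and output maps"
    with torch.no_grad():
        weight.zero_()
        idx = torch.arange(c_out)
        weight[idx, idx, kh // 2, kw // 2] = 1.0
    return weight

conv = torch.nn.Conv2d(21, 21, kernel_size=3, padding=1, bias=False)
identity_init_(conv.weight)
x = torch.randn(1, 21, 8, 8)
print(torch.allclose(conv(x), x))   # True: the layer initially passes its input through
```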

Table 1

Table 1: Context network architecture. The network processes C feature maps by aggregating contextual information at gradually increasing scales without losing resolution.

This completes the presentation of the basic context network. Our experiments show that even this basic module can increase dense prediction accuracy both quantitatively and qualitatively. This is particularly notable given the small number of parameters in the network: approximately $64C^2$ parameters in total.

We also train a larger context network that uses more feature maps in the deeper layers. The number of feature maps in the large network is summarized in Table 1. We generalize the initialization scheme to account for the difference in the number of feature maps in different layers. Let $c_i$ and $c_{i+1}$ be the number of feature maps in two consecutive layers. Assume that $C$ divides both $c_i$ and $c_{i+1}$. The initialization is

$$k^b(\mathbf{t}, a) = \begin{cases} \dfrac{C}{c_{i+1}} & \mathbf{t} = 0 \ \text{ and }\ \left\lfloor \dfrac{aC}{c_i} \right\rfloor = \left\lfloor \dfrac{bC}{c_{i+1}} \right\rfloor \\[2mm] \varepsilon & \text{otherwise} \end{cases} \tag{5}$$

where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and $\sigma \ll C/c_{i+1}$. The use of random noise breaks ties among feature maps with a common predecessor.
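For the large module, the generalized initialization of Equation 5 can be sketched as follows (again a PyTorch-style illustration; the concrete $\sigma$ and the example channel counts are assumptions, chosen only so that $C$ divides both channel counts and $\sigma \ll C/c_{i+1}$).

```python
# Eq. 5: the center tap is C / c_{i+1} when floor(aC/c_i) == floor(bC/c_{i+1});
# every other entry is epsilon ~ N(0, sigma^2) with sigma << C / c_{i+1}.
import torch

def generalized_identity_init_(weight, C, sigma=None):
    c_out, c_in, kh, kw = weight.shape            # c_{i+1}, c_i
    if sigma is None:
        sigma = 0.01 * C / c_out                  # assumed value with sigma << C/c_{i+1}
    with torch.no_grad():
        weight.normal_(0.0, sigma)                # noise everywhere first
        for b in range(c_out):
            for a in range(c_in):
                if (a * C) // c_in == (b * C) // c_out:
                    weight[b, a, kh // 2, kw // 2] = C / c_out
    return weight

# Example: a layer going from 2C to 4C feature maps (both divisible by C = 21).
conv = torch.nn.Conv2d(in_channels=42, out_channels=84, kernel_size=3, padding=1)
generalized_identity_init_(conv.weight, C=21)
```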

4. Front-end

We implemented and trained a front-end prediction module that takes a color image as input and produces $C = 21$ feature maps as output. The front-end module follows the work of Long et al. (2015) and Chen et al. (2015a), but was implemented separately. We adapted the VGG-16 network (Simonyan & Zisserman, 2015) for dense prediction and removed the last two pooling and striding layers. Specifically, each of these pooling and striding layers was removed, and convolutions in all subsequent layers were dilated by a factor of 2 for each removed pooling layer. Thus convolutions in the final layers, which follow both removed pooling layers, are dilated by a factor of 4. This enables initialization with the parameters of the original classification network, but produces higher-resolution output. The front-end module takes padded images as input and produces feature maps at resolution 64×64. We use reflection padding: the buffer zone is filled by reflecting the image about each edge.

Our front-end module was obtained by removing vestigial components of the classification network that are not useful for dense prediction. Most significantly, we remove the last two pooling and striding layers entirely, whereas Long et al. (2015) kept them and Chen et al. (2015a) replaced striding by dilation but kept the pooling layers. We found that simplifying the network by removing these pooling layers made it more accurate. We also remove the padding of intermediate feature maps. Intermediate padding was used in the original classification network, but is neither necessary nor justified in dense prediction.
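The adaptation described in the last two paragraphs can be sketched roughly as follows using torchvision's VGG-16 (an assumption for illustration; the paper's front end was built in Caffe). The sketch removes the last two pooling/striding layers and doubles the dilation of every convolution downstream of each removed pooling layer. Note that the fc6-derived convolution, which in the paper ends up with dilation 4, is outside torchvision's `features` and is therefore not shown here.

```python
# Rough sketch: drop the last two pooling layers of VGG-16 and dilate downstream
# convolutions instead, so pretrained weights remain usable at higher resolution.
import torch.nn as nn
from torchvision.models import vgg16

def make_front_end():
    features = list(vgg16().features.children())   # weights would normally be loaded
    pool_indices = [i for i, m in enumerate(features) if isinstance(m, nn.MaxPool2d)]
    drop = set(pool_indices[-2:])                   # the last two pooling layers
    out, dilation = [], 1
    for i, m in enumerate(features):
        if i in drop:
            dilation *= 2                           # dilate everything downstream instead
            continue
        if isinstance(m, nn.Conv2d) and dilation > 1:
            m.dilation = (dilation, dilation)
            m.padding = (dilation, dilation)        # assumption: keep spatial size here
        out.append(m)
    return nn.Sequential(*out)

front_end = make_front_end()
```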

This simplified prediction module was trained on the Pascal VOC 2012 training set, augmented with the annotations created by Hariharan et al. (2011). We did not use images from the VOC-2012 validation set for training and therefore only used a subset of the annotations of Hariharan et al. (2011). Training was performed by stochastic gradient descent (SGD) with mini-batch size 14, learning rate $10^{-3}$, and momentum 0.9. The network was trained for 60K iterations.

 We now compare the accuracy of our front-end module with the FCN-8s design of Long et al. (2015) and the DeepLab network of Chen et al. (2015a). For FCN-8s and DeepLab, we evaluate public models trained on VOC-2012 by the original authors. Figure 2 shows the segmentations produced by different models on images from the VOC-2012 dataset. Table 2 reports the accuracy of the model on the VOC-2012 test set.

 Our front-end prediction module is simpler and more accurate than previous models. Specifically, our simplified model outperforms FCN-8s and DeepLab networks by more than 5 percentage points on the test set. Interestingly, without using CRF, our simplified front-end module outperforms DeepLab + CRF in leaderboard accuracy on the test set by more than one percentage point (67.6% vs. 66.4%).

Figure 2

Figure 2: Semantic segmentations produced by different adaptations of the VGG-16 classification network. From left to right: (a) input image, (b) prediction by FCN-8s (Long et al, 2015), (c) prediction by DeepLab (Chen et al, 2015a), (d) prediction by our simplified front-end module, (e) ground truth.

Table 2

Table 2: Our front-end prediction module is simpler and more accurate than previous models. This table reports the accuracy on the VOC-2012 test set.

5. Experiments

 Our implementation is based on the Caffe library (Jia et al, 2014). Our implementation of dilated convolutions is now part of the standard Caffe distribution.

For a fair comparison with recent high-performing systems, we trained a front-end module that has the same structure as described in Section 4, but is trained on additional images from the Microsoft COCO dataset (Lin et al., 2014). We used all images in Microsoft COCO with at least one object from the VOC-2012 categories. Annotated objects from other categories were treated as background.

Training is performed in two stages. In the first stage, we train on VOC-2012 images and Microsoft COCO images together. Training is performed by SGD with mini-batch size 14 and momentum 0.9. 100K iterations are performed with a learning rate of $10^{-3}$ and 40K subsequent iterations with a learning rate of $10^{-4}$. In the second stage, we fine-tune the network on VOC-2012 images only. Fine-tuning is performed for 50K iterations with a learning rate of $10^{-5}$. Images from the VOC-2012 validation set were not used for training.

The front-end module trained through this procedure achieves a mean IoU of 69.8% on the VOC-2012 validation set and a mean IoU of 71.3% on the test set. Note that this level of accuracy is achieved by the front end alone, without the context module or structured prediction. We again attribute this high accuracy in part to the removal of vestigial components originally developed for image classification rather than dense prediction.

Controlled evaluation of context aggregation. We now perform controlled experiments to evaluate the utility of the context network presented in Section 3. We begin by plugging each of the two context modules (Basic and Large) into the front end. Since the receptive field of the context network is 67×67, we pad the input feature maps with a buffer of width 33. Zero padding and reflection padding yielded similar results in our experiments. The context module accepts the feature maps produced by the front end as input and is given this input during training. Joint training of the context module and the front-end module did not yield a significant improvement in our experiments. The learning rate was set to $10^{-3}$. Training was initialized as described in Section 3.

Table 3 shows the effect of adding the context module to three different semantic segmentation architectures. The first architecture (top) is the front end described in Section 4. It performs semantic segmentation without structured prediction, akin to the original work of Long et al. (2015). The second architecture (Table 3, middle) uses the dense CRF to perform structured prediction, akin to the system of Chen et al. (2015a). We use the implementation of Krähenbühl & Koltun (2011) and train the CRF parameters by grid search on the validation set. The third architecture (Table 3, bottom) uses the CRF-RNN for structured prediction (Zheng et al., 2015). We use the implementation of Zheng et al. (2015) and train the CRF-RNN in each condition.

The experimental results demonstrate that the context module improves accuracy in each of the three configurations. The basic context module increases accuracy in each configuration. The large context module increases accuracy by a larger margin. The experiments indicate that the context module and structured prediction are synergistic: the context module increases accuracy with or without subsequent structured prediction. Qualitative results are shown in Figure 3.

Figure 3

Figure 3: Semantic segmentations produced by different models. From left to right: (a) input image, (b) prediction by the front-end module, (c) prediction by the large context network plugged into the front end, (d) prediction by the front end + context module + CRF-RNN, (e) ground truth.

Table 3

Table 3: Controlled evaluation of the effect of the context module on the accuracy of three different semantic segmentation architectures. Experiments were performed on the VOC-2012 validation set. Validation images were not used for training. Top: adding the context module to a semantic segmentation front end with no structured prediction (Long et al, 2015). The basic context module increases accuracy, and the large module increases it by a larger margin. Middle: the context module increases accuracy when plugged into the front-end + dense CRF configuration (Chen et al, 2015a). Bottom: the context module increases accuracy when plugged into the front-end + CRF-RNN configuration (Zheng et al, 2015).

Evaluation on the test set. We now evaluate on the test set by submitting our results to the Pascal VOC 2012 evaluation server. The results are reported in Table 4. We used the large context module for these experiments. As the results demonstrate, the context module yields a significant improvement in accuracy over the front end. The context module alone, without subsequent structured prediction, outperforms DeepLab-CRF-COCO-LargeFOV (Chen et al, 2015a). The context module with the dense CRF, using the original implementation of Krähenbühl & Koltun (2011), performs on par with the very recent CRF-RNN (Zheng et al, 2015). The context module in combination with the CRF-RNN further improves on the accuracy of the CRF-RNN system.

6 Conclusion

We have examined convolutional network architectures for dense prediction. Since the model must produce high-resolution output, we believe that operating at high resolution throughout the network is both feasible and desirable. Our work shows that the dilated convolution operator is particularly suited to dense prediction due to its ability to expand the receptive field without losing resolution or coverage. We have utilized dilated convolutions to design a new network structure that reliably increases accuracy when plugged into existing semantic segmentation systems. As part of this work, we have also shown that the accuracy of existing convolutional networks for semantic segmentation can be increased by removing vestigial components developed for image classification.

We believe that the presented work is a step towards dedicated architectures for dense prediction that are not constrained by image classification precursors. As new sources of data become available, future architectures may be trained densely end-to-end, removing the need for pre-training on image classification datasets. This may enable architectural simplification and unification. Specifically, end-to-end dense training could enable a fully dense architecture akin to the presented context network to operate at full resolution throughout, accepting raw images as input and producing dense label assignments at full resolution as output.

State-of-the-art systems for semantic segmentation leave significant room for future progress. Figure 4 shows failure cases of our most accurate configuration. We will release our code and trained models to support progress in this area.

Figure 4

Figure 4: Failure cases from the VOC-2012 validation set. The most accurate architecture we trained (Context + CRF-RNN) performed poorly on these images.

Table 4

Table 4: Evaluation on the VOC-2012 test set. "DeepLab++" stands for DeepLab-CRF-COCO-LargeFOV and "DeepLab-MSc++" stands for DeepLab-MSc-CRF-LargeFOV-COCO-CrossJoint (Chen et al, 2015a). "CRF-RNN" is the system of Zheng et al. (2015). "Context" refers to the large context module plugged into our front end. The context network yields very high accuracy, exceeding the DeepLab++ architecture without performing structured prediction. Combining the context network with the CRF-RNN structured prediction module improves the accuracy of the CRF-RNN system.

References

Badrinarayanan, Vijay, Handa, Ankur, and Cipolla, Roberto. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv:1505.07293, 2015.
Brostow, Gabriel J., Fauqueur, Julien, and Cipolla, Roberto. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2), 2009.
Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015a.
Chen, Liang-Chieh, Yang, Yi, Wang, Jiang, Xu, Wei, and Yuille, Alan L. Attention to scale: Scale-aware semantic image segmentation. arXiv:1511.03339, 2015b.
Cordts, Marius, Omran, Mohamed, Ramos, Sebastian, Rehfeld, Timo, Enzweiler, Markus, Benenson, Rodrigo, Franke, Uwe, Roth, Stefan, and Schiele, Bernt. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
Everingham, Mark, Gool, Luc J. Van, Williams, Christopher K. I., Winn, John M., and Zisserman, Andrew. The Pascal visual object classes (VOC) challenge. IJCV, 88(2), 2010.
Farabet, Clément, Couprie, Camille, Najman, Laurent, and LeCun, Yann. Learning hierarchical features for scene labeling. PAMI, 35(8), 2013.
Fischer, Philipp, Dosovitskiy, Alexey, Ilg, Eddy, Hausser, Philip, Hazrba, Caner, Golkov, Vladimir, van der Smagt, Patrick, Cremers, Daniel, and Brox, Thomas. Learning optical flow with convolutional neural networks. In ICCV, 2015.
Galleguillos, Carolina and Belongie, Serge J. Context based object categorization: A critical survey. Computer Vision and Image Understanding, 114(6), 2010.
Geiger, Andreas, Lenz, Philip, Stiller, Christoph, and Urtasun, Raquel. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11), 2013.
Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
Hariharan, Bharath, Arbelaez, Pablo, Bourdev, Lubomir D., Maji, Subhransu, and Malik, Jitendra. Semantic contours from inverse detectors. In ICCV, 2011.
He, Xuming, Zemel, Richard S., and Carreira-Perpiñán, Miguel Á. Multiscale conditional random fields for image labeling. In CVPR, 2004.
Holschneider, M., Kronland-Martinet, R., Morlet, J., and Tchamitchian, Ph. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space. Proceedings of the International Conference, 1987.
Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross B., Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM Multimedia, 2014.
Kohli, Pushmeet, Ladicky, Lubor, and Torr, Philip H. S. Robust higher order potentials for enforcing label consistency. IJCV, 82(3), 2009.
Krähenbühl, Philipp and Koltun, Vladlen. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Kundu, Abhijit, Vineet, Vibhav, and Koltun, Vladlen. Feature space optimization for semantic video segmentation. In CVPR, 2016.
Ladicky, Lubor, Russell, Christopher, Kohli, Pushmeet, and Torr, Philip H. S. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
Le, Quoc V., Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941, 2015.
LeCun, Yann, Boser, Bernhard, Denker, John S., Henderson, Donnie, Howard, Richard E., Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 1989.
Lin, Guosheng, Shen, Chunhua, Reid, Ian, and van den Hengel, Anton. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013, 2015.
Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. In ECCV, 2014.
Liu, Buyu and He, Xuming. Multiclass semantic video segmentation with object-level active inference. In CVPR, 2015.
Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
Noh, Hyeonwoo, Hong, Seunghoon, and Han, Bohyung. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
Ros, Germán, Ramos, Sebastian, Granados, Manuel, Bakhtiary, Amir, Vázquez, David, and López, Antonio Manuel. Vision-based offline-online perception paradigm for autonomous driving. In WACV, 2015.
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. Learning representations by backpropagating errors. Nature, 323, 1986.
Shensa, Mark J. The discrete wavelet transform: wedding the à trous and Mallat algorithms. IEEE Transactions on Signal Processing, 40(10), 1992.
Shotton, Jamie, Winn, John M., Rother, Carsten, and Criminisi, Antonio. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81 (1), 2009.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Sturgess, Paul, Alahari, Karteek, Ladicky, Lubor, and Torr, Philip H. S. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009.
Tighe, Joseph and Lazebnik, Svetlana. Superparsing – scalable nonparametric image parsing with superpixels. IJCV, 101(2), 2013.
Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip. Conditional random fields as recurrent neural networks. In ICCV, 2015.

Appendix A Urban scene understanding

 In this appendix, we report experiments on three datasets for urban scene understanding: the CamVid dataset (Brostow et al, 2009), the KITTI dataset (Geiger et al, 2013), and the new Cityscapes dataset (Cordts et al., 2016). As an accuracy measure, we use average IoU (Everingham et al, 2010). We only train our model on the training set, even if the validation set is available. The results reported in this section do not use conditional random fields or other forms of structured prediction. They are obtained by a convolutional network combining front-end and context modules, similar to the “Front + Basic” network evaluated in Table 3. The trained model can be found at https://github.com/fyu/dilation.

We now summarize the training procedure used to train the front-end module. This procedure applies to all the datasets. Training was performed by stochastic gradient descent. Each mini-batch contains 8 crops from randomly sampled images. Each crop is 628×628 and is sampled randomly from a padded image. Images are padded using reflection padding. No padding is used in the intermediate layers. The learning rate is $10^{-4}$ and the momentum is set to 0.99. The number of iterations depends on the number of images in the dataset and is reported below for each dataset.
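A minimal NumPy sketch of this crop sampling is shown below; the padding width `pad` is an illustrative assumption (the text does not specify it), while the 628×628 crop size is the value quoted above.

```python
# Sample a training crop from a reflection-padded image, so crops near the
# border are filled with mirrored content rather than zeros.
import numpy as np

def sample_crop(image, crop=628, pad=186, rng=np.random):
    """image: H x W x 3 array. Returns a crop x crop x 3 patch."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    H, W, _ = padded.shape
    y = rng.randint(0, H - crop + 1)
    x = rng.randint(0, W - crop + 1)
    return padded[y:y + crop, x:x + crop]

img = np.zeros((480, 640, 3), dtype=np.float32)
print(sample_crop(img).shape)   # (628, 628, 3)
```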

The context modules used for these datasets all have the "Basic" structure, using the terminology of Table 1. The number of channels in each layer is the number of predicted classes $C$ (for example, $C = 19$ for the Cityscapes dataset). Each layer in the context module is padded such that the input and response maps have the same size. The number of layers in the context module depends on the resolution of the images in the dataset. Joint training of the complete model, composed of the front-end and context modules, is summarized below for each dataset.

A.1 CAMVID

We use the split of Sturgess et al. (2009), which partitions the dataset into 367 training images, 100 validation images, and 233 test images. 11 semantic classes are used. The images are downsampled to 640×480.

The context module has 8 layers, akin to the model used for the Pascal VOC dataset in the main body of the paper. The overall training procedure is as follows. First, the front-end module is trained for 20K iterations. Then the complete model (front end + context) is jointly trained on crops of size 852×852 with batch size 1. The learning rate for joint training is set to $10^{-5}$ and the momentum is set to 0.9.

 Table 5 reports the results on the CamVid test set. We call the complete convolutional network (frontend + context) Dilation8 because the context module has 8 layers. Our model outperforms previous work. This model was used as a unary classifier in the recent work of Kundu et al. (2016).

Table 5

Table 5: Semantic segmentation results on the CamVid dataset. Our model (Dilation8) is compared with ALE (Ladicky et al, 2009), SuperParsing (Tighe & Lazebnik, 2013), Liu and He (Liu & He, 2015), SegNet (Badrinarayanan et al, 2015), and the DeepLab-LargeFOV model (Chen et al., 2015a). Our model outperforms the prior work.

A.2 KITTI

We use the training and validation split of Ros et al. (2015): 100 training images and 46 test images. These images were collected from the KITTI visual odometry/SLAM dataset. The image resolution is 1226×370. We removed layer 6 of Table 1 due to the smaller vertical resolution compared to the other datasets. The resulting context module has 7 layers. The complete network (front end + context) is referred to as Dilation7.

 The frontend is trained for 10K iterations. Next, the front-end and context modules are jointly trained. For joint training, the crop size is 900 × 900, the momentum is set to 0.99, and other parameters are the same as those used in the CamVid dataset. Joint training was performed for 20K iterations.

 The results are shown in Table 6. As shown in the table, our model outperforms previous work.

Table 6

Table 6: Semantic segmentation results on KITTI dataset. We compare our results with Ros et al. (2015) and the DeepLab-LargeFOV model (Chen et al., 2015a). Our network (Dilation7) produces higher accuracy than previous work.

A.3 CITYSCAPES

The Cityscapes dataset contains 2975 training images, 500 validation images, and 1525 test images (Cordts et al, 2016). Due to the high image resolution (2048×1024), we add two layers to the context network after layer 6 of Table 1. The dilations of these two layers are 32 and 64, respectively. The total number of layers in the context module is 10, and we call the complete model (front end + context) Dilation10.
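Following the basic-module sketch from Section 3, the resulting Cityscapes context module would look roughly as follows (an assumed re-expression of the description above, not code from the paper; the per-layer padding mirrors the statement above that each layer is padded so input and response maps have the same size).

```python
# Ten-layer Cityscapes context module: the 3x3 dilations of Table 1's layers 1-6
# (1, 1, 2, 4, 8, 16), the two added layers (32, 64), the final 3x3 layer with
# dilation 1, and a closing 1x1xC layer.
import torch.nn as nn

def cityscapes_context_module(C=19):
    dilations = [1, 1, 2, 4, 8, 16, 32, 64, 1]
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(C, C, kernel_size=3, dilation=d, padding=d),
                   nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(C, C, kernel_size=1)]   # 10 layers in total
    return nn.Sequential(*layers)
```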

The Dilation10 network was trained in three stages. First, the front-end prediction module was trained for 40K iterations. Second, the context module was trained for 24K iterations on whole (uncropped) images, with a learning rate of $10^{-4}$, momentum 0.99, and batch size 100. Third, the complete model (front end + context) was jointly trained for 60K iterations on halves of images (input size 1396×1396, including padding), with a learning rate of $10^{-5}$, momentum 0.99, and batch size 1.
Figure 5 visualizes the effect of the training stages on the model's performance. Tables 7 and 8 present quantitative results.

 Cordts et al. (2016) compared the performance of Dilation10 with previous work on the Cityscapes dataset. In their evaluation, Dilation10 outperformed all previous models (Cordts et al, 2016). Dilation10 was also used as a unary classifier in the recent work of Kundu et al. (2016), which uses structured predictions to further improve accuracy.

Figure 5

Figure 5: Results produced by the Dilation10 model after different training stages. (a) Input image. (b) Ground truth segmentation. (c) Segments produced by the model after the first stage of training (front-end only). (d) Segments produced after the second stage, training the context module. (e) The resulting segmentation after the third stage, in which the two modules are jointly trained.

Table 7

Table 7: Per-class and average class-level IoU achieved by our model (Dilation10) on the Cityscapes dataset.

Table 8

Table 8: Per-class and average class-level IoU on Cityscapes dataset.
