DeepLab v1

SEMANTIC IMAGE SEGMENTATION WITH DEEP CONVOLUTIONAL NETS AND FULLY CONNECTED CRFS

Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected crfs[J]. arXiv preprint arXiv:1412.7062, 2014.


ABSTRACT

Deep convolutional neural networks (DCNNs) have recently demonstrated state-of-the-art performance on high-level vision tasks such as image classification and object detection. This work combines methods from DCNNs and probabilistic graphical models to address the task of pixel-level classification (also known as "semantic image segmentation"). We show that the responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good at high-level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected conditional random field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy beyond previous methods. Quantitatively, our method sets the new state of the art on the PASCAL VOC 2012 semantic image segmentation task, reaching 71.6% IOU accuracy on the test set. We show how these results can be obtained efficiently: careful network re-purposing and a novel application of the "hole" algorithm from the wavelet community allow dense computation of neural network responses at 8 frames per second on a modern GPU.

INTRODUCTION

Deep convolutional neural networks (DCNNs) have been the method of choice for document recognition since LeCun et al. (1998), but have only recently become mainstream in high-level vision research. Over the past two years, DCNNs have pushed the performance of computer vision systems to new heights on a broad array of high-level problems, including image classification (Krizhevsky et al., 2013; Sermanet et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014; Papandreou et al., 2014), object detection (Girshick et al., 2014), fine-grained classification (Zhang et al., 2014), and others. A common theme in these works is that DCNNs trained in an end-to-end manner deliver strikingly better results than systems relying on carefully engineered representations such as SIFT or HOG features. This success can be partially attributed to the built-in invariance of DCNNs to local image transformations, which underpins their ability to learn hierarchical abstractions of the data (Zeiler & Fergus, 2014). While this invariance is clearly desirable for high-level vision tasks, it can hamper low-level tasks such as pose estimation (Chen & Yuille, 2014; Tompson et al., 2014) and semantic segmentation, where we want precise localization rather than abstraction of spatial detail.

There are two technical hurdles in applying DCNNs to image labeling tasks: signal downsampling and spatial "insensitivity" (invariance). The first problem relates to the reduction of signal resolution incurred by the repeated combination of max-pooling and downsampling ("striding") performed at every layer of standard DCNNs (Krizhevsky et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014). Instead, following Papandreou et al. (2014), we employ the "atrous" (with holes) algorithm originally developed for the efficient computation of the undecimated discrete wavelet transform. This allows efficient dense computation of DCNN responses in a scheme substantially simpler than earlier solutions to this problem (Giusti et al., 2013; Sermanet et al., 2013).

The second problem relates to the fact that obtaining object-centric decisions from a classifier requires invariance to spatial transformations, which inherently limits the spatial accuracy of a DCNN model. We boost our model's ability to capture fine details by employing a fully connected conditional random field (CRF). Conditional random fields have been widely used in semantic segmentation to combine the class scores computed by multi-way classifiers with low-level information captured by local interactions between pixels and edges (Rother et al., 2004; Shotton et al., 2009) or superpixels (Lucchi et al., 2011). Even though works of increased sophistication have been proposed to model hierarchical dependencies (He et al., 2004; Ladicky et al., 2009; Lempitsky et al., 2011) and/or high-order dependencies of segments (Delong et al., 2012; Gonfaus et al., 2010; Kohli et al., 2009; Chen et al., 2013; Wang et al., 2015), we use the fully connected pairwise CRF proposed by Krähenbühl & Koltun (2011), which is known for its efficient computation and its ability to capture fine edge details while also satisfying long-range dependencies. In Krähenbühl & Koltun (2011), this model was shown to substantially improve the performance of a boosting-based pixel-level classifier; in our work, we demonstrate that it delivers state-of-the-art results when coupled with a DCNN-based pixel-level classifier.

The three main advantages of our "DeepLab" system are: (i) speed: thanks to the "atrous" algorithm, our dense DCNN operates at 8 fps, while mean field inference for the fully connected CRF requires 0.5 seconds; (ii) accuracy: we obtain state-of-the-art results on the PASCAL semantic segmentation challenge, improving by 7.2% over the second-best method of Mostajabi et al. (2014); (iii) simplicity: our system is composed of a cascade of two fairly well-established modules, namely DCNNs and CRFs.

2 RELATED WORK

Our system works directly on the pixel representation, similar to the method of Long et al. (2014). This differs from the two-stage approach that is now most common for DCNN-based semantic segmentation: such techniques typically use a cascade of bottom-up image segmentation and DCNN-based region classification, which makes the system potentially vulnerable to errors committed by the front-end segmentation module. For example, the bounding box proposals and masked regions delivered by (Arbelaez et al., 2014; Uijlings et al., 2013) are used by Girshick et al. (2014) and (Hariharan et al., 2014b) as inputs to a DCNN in order to introduce shape information into the classification process. Similarly, the authors of Mostajabi et al. (2014) rely on a superpixel representation. The best-known non-DCNN predecessor of these works is the second-order pooling method of (Carreira et al., 2012), which also assigns labels to the region proposals delivered by (Carreira & Sminchisescu, 2012). Aware of the risk of committing to a single segmentation, Cogswell et al. (2014) build on (Yadollahpour et al., 2013) to explore a diverse set of CRF-based segmentation proposals, also computed by (Carreira & Sminchisescu, 2012). These segmentation proposals are then re-ranked by a DCNN trained specifically for this re-ranking task. Even though this approach explicitly tries to handle the temperamental nature of front-end segmentation algorithms, it still does not exploit the strength of DCNN scores within the CRF-based segmentation itself: the DCNN is only applied post hoc, while it would make sense to use its results directly during segmentation.

Closer to our approach, several other researchers have considered using convolutionally computed DCNN features for dense image labeling. Among the first have been Farabet et al. (2013), who apply a DCNN at multiple image resolutions and then employ a segmentation tree to smooth the prediction results; more recently, Hariharan et al. (2014a) propose concatenating intermediate feature maps computed within the DCNN for pixel classification, and Dai et al. (2014) propose pooling the intermediate feature maps by region proposals. Even though these works still employ segmentation algorithms that are decoupled from the DCNN classifier, we believe it is advantageous that segmentation is only used at a later stage, avoiding the commitment to premature decisions.

Recently, the segmentation-free techniques of Long et al. (2014) and Eigen & Fergus (2014) directly apply DCNNs to the whole image in a sliding-window fashion, replacing the last fully connected layers of the DCNN with convolutional layers. To deal with the spatial localization issue outlined in the introduction, Long et al. (2014) upsample and concatenate the scores from intermediate feature maps, while Eigen & Fergus (2014) refine the prediction results from coarse to fine by propagating the coarse results through another DCNN.

The main difference between our model and other state-of-the-art models is the combination of pixel-level CRFs and DCNN-based "unary terms". The closest work in this direction is Cogswell et al. (2014), who use a CRF as a proposal mechanism for a DCNN-based re-ranking system, while Farabet et al. (2013) treat superpixels as nodes of a local pairwise CRF and use graph cuts for discrete inference; their results can therefore be limited by errors in the superpixel computation, and long-range dependencies between superpixels are ignored. In contrast, our approach treats every pixel as a CRF node, exploits long-range dependencies, and uses CRF inference to directly optimize a DCNN-driven cost function. We note that mean field inference has been extensively studied for traditional image segmentation/edge detection tasks, e.g., (Geiger & Girosi, 1991; Geiger & Yuille, 1991; Kokkinos et al., 2008), but recently Krähenbühl & Koltun (2011) showed that the inference can be very efficient for fully connected CRFs and particularly effective in the context of semantic segmentation.

After the first version of our manuscript was made publicly available, we noticed that two other groups independently and concurrently pursued a very similar direction, combining DCNNs with densely connected CRFs (Bell et al., 2014; Zheng et al., 2015). There are several technical differences between the respective models. Bell et al. (2014) focus on the material classification problem, while Zheng et al. (2015) unroll the CRF mean field inference steps to convert the whole system into an end-to-end trainable feed-forward network.

We have since updated our proposed "DeepLab" system with much improved methods and results; the latest work is published in Chen et al. (2016).

3 CONVOLUTIONAL NEURAL NETWORKS FOR DENSE IMAGE LABELING

Here we describe how we re-purpose and fine-tune the publicly available, Imagenet-pretrained 16-layer classification network of Simonyan & Zisserman (2014) (VGG-16), turning it into an efficient and effective dense feature extractor for our dense semantic image segmentation system.

3.1 EFFICIENT DENSE SLIDING WINDOW FEATURE EXTRACTION WITH THE HOLE ALGORITHM

Dense spatial score computation is critical to the success of our dense convolutional neural network feature extractor. To achieve this, we first convert the fully connected layers of VGG-16 into convolutional layers and run the network convolutionally on the image at its original resolution. However, this is not enough, as it yields very sparsely computed detection scores (with a stride of 32 pixels). To compute scores more densely at our target stride of 8 pixels, we develop a variation of the method previously employed by Giusti et al. (2013) and Sermanet et al. (2013). We skip subsampling after the last two max-pooling layers in the network of Simonyan & Zisserman (2014) and modify the convolutional filters in the layers that follow them by introducing zeros to increase their length (2× in the last three convolutional layers and 4× in the first fully connected layer). We can implement this more efficiently by keeping the filters intact and instead sparsely sampling the feature maps on which they are applied, using an input stride of 2 or 4 pixels, respectively. This approach, illustrated in Figure 1, is known as the "hole algorithm" ("atrous algorithm") and has been used before for the efficient computation of the undecimated wavelet transform (Mallat, 1999). We have implemented it within the Caffe framework (Jia et al., 2014) by adding to the im2col function the option to sparsely sample the underlying feature maps. This approach is generally applicable and allows us to efficiently compute dense CNN feature maps at any target subsampling rate without introducing any approximations.
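The same effect, filtering a feature map "with holes" so that the receptive field grows without extra subsampling, is exposed in modern frameworks as the dilation parameter of a convolution. The following is a minimal illustrative sketch in PyTorch rather than the paper's Caffe im2col modification; the tensor sizes and channel counts are assumptions made for the example.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 64, 64)  # placeholder VGG-style feature map

# Ordinary 3x3 convolution vs. the same filter applied "with holes" (input stride 2):
# dilation=2 inserts one zero between the filter taps, so the 3x3 kernel covers a 5x5
# window while the feature map resolution is preserved.
dense_conv  = nn.Conv2d(512, 512, kernel_size=3, padding=1)
atrous_conv = nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)

print(dense_conv(x).shape, atrous_conv(x).shape)  # both (1, 512, 64, 64)
```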

Following Long et al. (2014), we fine-tune the weights of the Imagenet-pretrained VGG-16 model in a straightforward fashion to adapt it to our task. We replace the 1000-way Imagenet classifier in the last layer of VGG-16 with a 21-way one. Our loss function is the sum of cross-entropy terms for each spatial position in the CNN output map (subsampled by 8 compared to the original image); all positions and labels are weighted equally in the overall loss, and our targets are the ground-truth labels (also subsampled by 8). We optimize the objective function with respect to the weights of all network layers by the standard SGD procedure of Krizhevsky et al. (2013).
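As a concrete illustration of this objective, the sketch below sums per-position cross-entropy terms over the 8×-subsampled output map; PyTorch, the batch size, and the spatial size are assumptions made only for the example.

```python
import torch
import torch.nn.functional as F

num_classes = 21
scores  = torch.randn(4, num_classes, 40, 40, requires_grad=True)  # DCNN output at 1/8 resolution
targets = torch.randint(0, num_classes, (4, 40, 40))                # ground truth, subsampled by 8

# Sum of cross-entropy terms over all spatial positions, with equal weight per position/label.
# (In practice, 'void' pixels would typically be excluded, e.g. via ignore_index.)
loss = F.cross_entropy(scores, targets, reduction='sum')
loss.backward()
```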

During testing, we need class score maps at the original image resolution. As shown in Figure 2, and further elaborated in Section 4.1, the class score maps (corresponding to log probabilities) are quite smooth, which allows us to use simple bilinear interpolation to increase their resolution by a factor of 8 at negligible computational cost. Note that the method of Long et al. (2014) does not use the hole algorithm and produces very coarse scores (subsampled by a factor of 32) at the CNN output. This forced them to use learned upsampling layers, significantly increasing the complexity and training time of their system: fine-tuning our network on the PASCAL VOC 2012 dataset takes about 10 hours, whereas they report a training time of several days (both timings on a modern GPU).
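A one-line sketch of this test-time upsampling, again assuming PyTorch as a stand-in for the actual implementation:

```python
import torch
import torch.nn.functional as F

coarse_scores = torch.randn(1, 21, 40, 40)  # class score map (log probabilities) at 1/8 resolution
full_res = F.interpolate(coarse_scores, scale_factor=8, mode='bilinear', align_corners=False)
print(full_res.shape)  # (1, 21, 320, 320)
```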


Figure 1: Illustration of the hole (atrous) algorithm in 1-D, for kernel size 3, input stride 2, and output stride 1.

3.2 CONTROLLING THE RECEPTIVE FIELD SIZE AND ACCELERATING DENSE COMPUTATION WITH CONVOLUTIONAL NETS

Another key ingredient in re-purposing our network for dense score computation is explicitly controlling the network's receptive field size. Most recent DCNN-based image recognition methods rely on networks pre-trained on the Imagenet large-scale classification task. These networks typically have large receptive field sizes: in the case of the VGG-16 network we consider, the receptive field is 224 × 224 (with zero padding), or 404 × 404 pixels if the network is applied convolutionally. After converting the network to a fully convolutional one, the first fully connected layer has 4,096 filters of large 7 × 7 spatial size and becomes the computational bottleneck in our dense score map computation.

We address this practical problem by spatially subsampling (by simple decimation) the first fully connected layer to a 4 × 4 (or 3 × 3) spatial size. This reduces the receptive field of the network to 128 × 128 (with zero padding) or 308 × 308 (in convolutional mode) and cuts the computation time of the first fully connected layer by a factor of 2 to 3. Using our Caffe-based implementation and a Titan GPU, the resulting VGG-derived network is very efficient: given a 306 × 306 input image, it produces 39 × 39 dense raw feature scores at the top of the network at a rate of about 8 frames per second during testing. The speed during training is 3 frames per second. We have also successfully experimented with reducing the number of channels of the fully connected layers from 4,096 to 1,024, which significantly decreases computation time and memory footprint without sacrificing performance; see Section 5 for details. Using smaller networks such as Krizhevsky et al. (2013) could even allow video-rate, test-time dense feature computation on light-weight GPUs.
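The decimation itself amounts to keeping a regular subset of the 7 × 7 spatial taps of the first fully connected (now convolutional) layer. The sketch below shows one plausible way to do this in PyTorch; the exact taps kept by the authors are not specified here, so the sampling pattern is an assumption.

```python
import torch

# fc6 reshaped as a convolutional kernel: (4096 output channels, 512 input channels, 7, 7)
fc6_weights = torch.randn(4096, 512, 7, 7)

# Subsample the 7x7 spatial grid down to 4x4 by keeping every other tap (assumed pattern).
idx = torch.tensor([0, 2, 4, 6])
fc6_4x4 = fc6_weights[:, :, idx][:, :, :, idx]  # (4096, 512, 4, 4)

# Roughly (7*7)/(4*4) ~ 3x fewer multiply-adds in this layer, and a smaller receptive field.
print(fc6_weights.shape, fc6_4x4.shape)
```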

4 DETAILED BOUNDARY RECOVERY: FULLY CONNECTED CONDITIONAL RANDOM FIELDS AND MULTI-SCALE PREDICTION

4.1 DEEP CONVOLUTIONAL NETWORKS AND THE LOCALIZATION CHALLENGE

As illustrated in Figure 2, DCNN score maps can reliably predict the presence and rough position of objects in an image, but are less well suited for pinpointing their exact outlines. There is a natural trade-off between classification accuracy and localization accuracy with convolutional networks: deeper models with multiple max-pooling layers have proven most successful in classification tasks, yet their increased invariance and large receptive fields make inferring position from the scores at their top output level more challenging.

Recent work has pursued two directions to address this localization challenge. The first approach is to harness information from multiple layers of the convolutional network in order to better estimate object boundaries (Long et al., 2014; Eigen & Fergus, 2014). The second is to employ a superpixel representation, essentially delegating the localization task to a low-level segmentation method. This is the route taken by the highly successful recent method of Mostajabi et al. (2014).

In Section 4.2, we pursue a novel alternative direction based on coupling the recognition capacity of DCNNs with the fine-grained localization accuracy of fully connected CRFs, and show that it is remarkably successful in addressing the localization challenge, producing accurate semantic segmentation results and recovering object boundaries at a level of detail that is well beyond the reach of existing methods.

4.2 FULLY CONNECTED CONDITIONAL RANDOM FIELDS FOR ACCURATE LOCALIZATION

Figure 2: Score map (input before the softmax function) and belief map (output of the softmax function) for the aeroplane class. We show the score (first row) and belief (second row) maps after each mean field iteration. The output of the last DCNN layer is used as input to the mean field inference. Best viewed in color.

Figure 3: Model illustration. A coarse score map from a deep convolutional neural network with fully convolutional layers is upsampled by bilinear interpolation. A fully connected CRF is then applied to refine the segmentation result. Best viewed in color.

Traditionally, conditional random fields (CRF) have been used to smooth noisy segmentation maps (Rother et al., 2004; Kohli et al., 2009). Typically these models include energy terms that couple adjacent nodes, tending to assign the same label to spatially adjacent pixels. Qualitatively, the main function of these short-range CRFs is to clean up spurious predictions from weak classifiers built based on local hand-designed features.

Compared to these weaker classifiers, the modern DCNN architecture we use in this work produces score maps and semantic label predictions that are qualitatively different. As illustrated in Figure 2, the score maps are typically quite smooth and produce homogeneous classification results. In this regime, using short-range CRFs can be detrimental, since our goal should be to recover detailed local structure rather than smooth it further. Using contrast-sensitive potentials (Rother et al., 2004) in conjunction with local-range CRFs can potentially improve localization, but still misses thin structures and typically requires solving an expensive discrete optimization problem.

To overcome these limitations of short-range CRFs, we integrate the fully connected CRF model of Krähenbühl & Koltun (2011) into our system. The model employs the energy function

$$E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j) \tag{1}$$

where $\mathbf{x}$ is the label assignment for the pixels. We use the unary potential $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the label assignment probability at pixel $i$ computed by the DCNN. The pairwise potential is $\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\sum_{m=1}^{K} w_m \, k^m(f_i, f_j)$, where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$ and zero otherwise (i.e., the Potts model). There is one pairwise term for each pair of pixels $i$ and $j$ in the image, no matter how far apart they lie, i.e., the factor graph of the model is fully connected. Each $k^m$ is a Gaussian kernel that depends on features (denoted $f$) extracted at pixels $i$ and $j$ and is weighted by the parameter $w_m$. We adopt bilateral position and color terms; specifically, the kernels are

$$w_1 \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2}\right) + w_2 \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2}\right) \tag{2}$$

where the first kernel depends on both pixel positions (denoted $p$) and pixel color intensities (denoted $I$), while the second kernel depends only on pixel positions. The hyperparameters $\sigma_\alpha$, $\sigma_\beta$ and $\sigma_\gamma$ control the "scale" of the Gaussian kernels.

Crucially, this model is amenable to efficient approximate probabilistic inference (Krähenbühl & Koltun, 2011). Under a fully decomposable mean field approximation $b(\mathbf{x}) = \prod_i b_i(x_i)$, the message passing updates can be expressed as convolutions with Gaussian kernels in feature space. High-dimensional filtering algorithms (Adams et al., 2010) significantly speed up this computation, making the algorithm very fast in practice: with the publicly available implementation of Krähenbühl & Koltun (2011), it takes on average less than 0.5 seconds for a PASCAL VOC image.
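For readers who want to reproduce this post-processing step, the sketch below uses pydensecrf, a publicly available Python wrapper around Krähenbühl & Koltun's implementation, rather than the exact pipeline used in the paper; the image size, the softmax input, and the kernel parameters shown here are placeholders (the paper cross-validates $w_1$, $\sigma_\alpha$ and $\sigma_\beta$ as described in Section 5).

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

H, W, n_labels = 375, 500, 21
# Placeholder inputs: DCNN softmax output (upsampled to full resolution) and the RGB image.
probs = np.random.rand(n_labels, H, W).astype(np.float32)
probs /= probs.sum(axis=0, keepdims=True)
image = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)

d = dcrf.DenseCRF2D(W, H, n_labels)
d.setUnaryEnergy(np.ascontiguousarray(unary_from_softmax(probs)))  # -log P(x_i) unaries

# Pairwise terms: a position-only Gaussian kernel and a bilateral (position + color) kernel.
# The parameter values below are illustrative, not the cross-validated ones from the paper.
d.addPairwiseGaussian(sxy=3, compat=3)
d.addPairwiseBilateral(sxy=60, srgb=5, rgbim=np.ascontiguousarray(image), compat=5)

Q = d.inference(10)  # 10 mean field iterations, as in the experiments
labels = np.argmax(Q, axis=0).reshape(H, W)
```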

4.3 MULTI-SCALE PREDICTION

Given the promising results recently obtained by (Hariharan et al., 2014a; Long et al., 2014), we also explore a multi-scale prediction method to increase boundary localization accuracy. Specifically, we attach to the input image and to the output of each of the first four max-pooling layers a two-layer MLP (first layer: 128 3×3 convolutional filters, second layer: 128 1×1 convolutional filters), whose feature map is concatenated with the main network's last-layer feature map. The aggregate feature map fed into the softmax layer is therefore enhanced by 5 × 128 = 640 channels. We only adjust the newly added weights, keeping the other network parameters at the values learned by the method of Section 3. As discussed in the experimental section, introducing these extra direct connections from fine-resolution layers improves localization performance, yet the effect is not as dramatic as the one obtained with the fully connected CRF.
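A hedged sketch of how these multi-scale branches could be wired up (PyTorch assumed; the channel counts, the nonlinearity, and the resolution alignment via interpolation are assumptions of this example, and only the newly added branch weights would be trained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp_branch(in_channels):
    # Two-layer "MLP": 128 filters of 3x3 followed by 128 filters of 1x1.
    # (The nonlinearity is an assumption; the paper only specifies the filter sizes.)
    return nn.Sequential(
        nn.Conv2d(in_channels, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(128, 128, kernel_size=1), nn.ReLU(inplace=True),
    )

# One branch for the input image and one for each of the first four max-pooling outputs
# (assumed VGG-16 channel counts: 3, 64, 128, 256, 512).
branches = nn.ModuleList([mlp_branch(c) for c in [3, 64, 128, 256, 512]])

def fuse(main_features, branch_inputs):
    """Concatenate branch outputs with the main network's last feature map (5 * 128 = 640 extra channels)."""
    h, w = main_features.shape[-2:]
    outputs = [main_features]
    for branch, x in zip(branches, branch_inputs):
        y = F.interpolate(branch(x), size=(h, w), mode='bilinear', align_corners=False)
        outputs.append(y)
    return torch.cat(outputs, dim=1)  # fed into the softmax (classifier) layer
```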

Table 1: (a) Performance of our proposed model on the PASCAL VOC 2012 “val” set (trained using the augmented “train” set). The best performance is achieved by exploiting multi-scale features and a large field of view. (b) Performance of our proposed model (trained on the augmented “trainval” set) on the PASCAL VOC 2012 “test” set, compared with other state-of-the-art methods.

5 EXPERIMENTAL EVALUATION

Dataset: We test our DeepLab model on the PASCAL VOC 2012 segmentation benchmark (Everingham et al., 2014), consisting of 20 foreground object classes and one background class. The original dataset contains 1,464, 1,449 and 1,456 images for training, validation and testing, respectively. The dataset is augmented by the extra annotations provided by Hariharan et al. (2011), resulting in 10,582 training images. Performance is measured in terms of pixel intersection-over-union (IOU) averaged across the 21 classes.

Training: We adopt the simplest form of piecewise training, decoupling the DCNN and CRF training stages and assuming that the unary terms provided by the DCNN are fixed during CRF training. For DCNN training, we use the VGG-16 network pre-trained on ImageNet. We fine-tune it for the VOC 21-way pixel classification task by stochastic gradient descent on the cross-entropy loss function described in Section 3.1. We use a mini-batch of 20 images and an initial learning rate of 0.001 (0.01 for the final classifier layer), multiplying the learning rate by 0.1 every 2,000 iterations. We use momentum of 0.9 and weight decay of 0.0005. After DCNN fine-tuning, we cross-validate the parameters of the fully connected CRF model in Equation (2), following Krähenbühl & Koltun (2011). We use the default values w2 = 3 and σγ = 3, and search for the best values of w1, σα and σβ by cross-validation on a small subset of the validation set (100 images). We employ a coarse-to-fine search scheme: the initial search ranges of the parameters are w1 ∈ [5 : 10], σα ∈ [50 : 10 : 100] and σβ ∈ [3 : 1 : 10] (MATLAB notation), after which we refine the search step around the best values of the first round. We fix the number of mean field iterations to 10 for all reported experiments.
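To make the MATLAB-style ranges concrete, here is a minimal sketch of the coarse round of that grid search; the scoring function is a placeholder, and interpreting w1 ∈ [5 : 10] as the integer range 5–10 is an assumption of this example:

```python
import itertools

def mean_iou_on_subset(w1, sigma_alpha, sigma_beta):
    # Placeholder: run fully connected CRF inference with these parameters on the
    # 100-image validation subset and return the resulting mean IOU.
    return 0.0

w1_range          = range(5, 11)        # [5:10]      -> 5, 6, ..., 10
sigma_alpha_range = range(50, 101, 10)  # [50:10:100] -> 50, 60, ..., 100
sigma_beta_range  = range(3, 11)        # [3:1:10]    -> 3, 4, ..., 10

# Coarse round: exhaustive search; a finer grid is then placed around the best triple.
best = max(itertools.product(w1_range, sigma_alpha_range, sigma_beta_range),
           key=lambda params: mean_iou_on_subset(*params))
print(best)
```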

Evaluation on the validation set: We conduct the majority of our evaluation on the PASCAL "val" set, training our model on the augmented PASCAL "train" set. As shown in Table 1(a), incorporating the fully connected CRF into our model (denoted DeepLab-CRF) yields a substantial performance boost, about 4% improvement over DeepLab. We note that the work of Krähenbühl & Koltun (2011) improved the 27.6% result of TextonBoost (Shotton et al., 2009) to 29.1%, which makes the improvement we report here (from 59.8% to 63.7%) all the more impressive.

In terms of qualitative results, we provide a visual comparison between DeepLab and DeepLab-CRF in Figure 7. Adopting a fully connected CRF significantly improves the results, allowing the model to accurately capture complex object boundaries.

Table 2: Impact of receptive fields. We show the performance (after CRF) and training speed on the PASCAL VOC 2012 “val” set as a function of the kernel size of the first fully connected layer and the input stride value employed in the atrous algorithm.

Multi-scale features: We also exploit intermediate-level features, similarly to Hariharan et al. (2014a) and Long et al. (2014). As shown in Table 1(a), adding multi-scale features to our DeepLab model (denoted DeepLab-MSc) improves performance by about 1.5%, and further incorporating the fully connected CRF (denoted DeepLab-MSc-CRF) yields about 4% additional improvement.
A qualitative comparison between DeepLab and DeepLab-MSc is shown in Figure 4; using multi-scale features slightly refines the object boundaries.

Field of view control: The "atrous algorithm" we employ allows us to arbitrarily control the field of view (FOV) of the model by adjusting the input stride, as illustrated in Figure 1. In Table 2, we experiment with several kernel sizes and input strides for the first fully connected layer. The DeepLab-CRF-7x7 variant is the direct modification of VGG-16, with kernel size 7×7 and input stride 4. This model yields 67.64% on the "val" set, but is relatively slow (1.44 images per second during training). Reducing the kernel size to 4×4 increases the training speed to 2.9 images per second. We experiment with two such network variants with different FOV sizes, DeepLab-CRF and DeepLab-CRF-4x4; the latter has a larger FOV (i.e., larger input stride) and attains better performance. Finally, we employ kernel size 3×3 with input stride 12, and further reduce the number of filters of the last two layers from 4,096 to 1,024. Interestingly, the resulting model, DeepLab-CRF-LargeFOV, matches the performance of the expensive DeepLab-CRF-7x7 while running 3.36 times faster with significantly fewer parameters (20.5M instead of 134.3M).

Table 1 summarizes the performance of several model variants, showing the benefits of exploiting multi-scale features and large FOV.


Figure 4: Incorporating multi-scale features improves boundary segmentation. The first and second rows show the results of DeepLab and DeepLab-MSc, respectively. Best viewed in color.

Mean pixel IOU along object boundaries: To quantify the accuracy of the proposed model near object boundaries, we evaluate segmentation accuracy with an experiment similar to Kohli et al. (2009) and Krähenbühl & Koltun (2011). Specifically, we use the "void" label annotated in the val set, which usually occurs around object boundaries, and compute the mean IOU for the pixels located within a narrow band (called a trimap) around the "void" labels. As shown in Figure 5, exploiting multi-scale features from the intermediate layers and refining the segmentation with the fully connected CRF significantly improve the results near object boundaries.
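A simplified, per-image sketch of this trimap evaluation (the actual evaluation aggregates statistics over the whole val set; the SciPy-based band computation and the per-image averaging are assumptions of this example):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def trimap_band(void_mask, band_width):
    """Pixels lying within `band_width` pixels of any 'void'-labeled pixel (void_mask is boolean)."""
    dist_to_void = distance_transform_edt(~void_mask)  # distance to the nearest void pixel
    return dist_to_void <= band_width

def mean_iou_in_band(pred, gt, void_mask, band_width, num_classes=21):
    band = trimap_band(void_mask, band_width)
    p, g = pred[band], gt[band]
    ious = []
    for c in range(num_classes):
        union = np.logical_or(p == c, g == c).sum()
        if union > 0:
            ious.append(np.logical_and(p == c, g == c).sum() / union)
    return float(np.mean(ious))
```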

Comparison with state-of-the-art methods: In Figure 6, we qualitatively compare our proposed model, DeepLab-CRF, with two state-of-the-art models, FCN-8s (Long et al., 2014) and TTI-Zoomout-16 (Mostajabi et al., 2014), on the "val" set (results extracted from their papers). Our model is able to capture intricate object boundaries.

Figure 5: (a) Some trimap examples (top-left: image; top-right: ground truth; bottom-left: 2-pixel trimap; bottom-right: 10-pixel trimap). Segmentation quality of the proposed methods within a band around object boundaries: (b) pixel-wise accuracy and (c) pixel mean IOU.

Figure 6: Comparison with state-of-the-art models on the val set. First row: image. Second row: Ground truth. Third row: other recent models (left: FCN-8s, right: TTI-Zoomout-16). Fourth row: our DeepLab-CRF. Best viewed in color.

Reproducibility We implement the proposed method by extending the excellent Caffe framework (Jia et al., 2014). We have shared the source code, configuration files, and trained model to reproduce the results of this article on the companion website https://bitbucket.org/deeplab/deeplab-public.

Test set results: Having performed model selection on the validation set, we evaluate our model variants on the official PASCAL VOC 2012 "test" set. As shown in Table 3, our DeepLab-CRF and DeepLab-MSc-CRF models achieve mean IOU of 66.4% and 67.1%, respectively. Our models outperform all the other state-of-the-art models (specifically, TTI-Zoomout-16 (Mostajabi et al., 2014), FCN-8s (Long et al., 2014) and MSRA-CFM (Dai et al., 2014)). When we increase the FOV of the model, DeepLab-CRF-LargeFOV reaches 70.3%, matching DeepLab-CRF-7x7 while training faster. Finally, our best model, DeepLab-MSc-CRF-LargeFOV, attains the best performance of 71.6% by employing both multi-scale features and a large FOV.

Figure 7: Visualization results on the VOC 2012 val set. For each row, we show the input image, the segmentation result delivered by the DCNN (DeepLab), and the refined segmentation result of the fully connected CRF (DeepLab-CRF). We show failure modes in the last three rows. Best viewed in color.

Table 3: Label IOU (%) on PASCAL VOC 2012 test set using trainval set for training.

6 DISCUSSION

This study combines the ideas of deep convolutional neural networks and fully connected conditional random fields to propose a new method that can produce semantically accurate predictions and detailed segmentation maps while being computationally efficient. Our experimental results demonstrate that the proposed method significantly improves the state-of-the-art on the challenging PASCAL VOC 2012 semantic image segmentation task.

We plan to further improve several aspects of the model, such as fully integrating its two main components (the CNN and the CRF) and training the whole system in an end-to-end fashion, as in Krähenbühl & Koltun (2013); Chen et al. (2014); Zheng et al. (2015). We also plan to experiment with more datasets and to apply our method to other sources of data such as depth maps or videos. More recently, we have pursued model training with weakly supervised annotations such as bounding boxes or image-level labels (Papandreou et al., 2015).

At a high level, our work lies at the intersection of convolutional neural networks and probabilistic graphical models. We plan to further investigate the interplay between these two powerful method classes and explore their synergistic potential in solving challenging computer vision tasks.

Acknowledgments

This study was partially supported by ARO 62250-CS, NIH Grant 5R01EY022247-03, EU project RECONFIG FP7-ICT-600825 and EU project MOBOT FP7-ICT-2011-600796. We also thank NVIDIA Corporation for donating GPUs used for this research.

We would like to thank the anonymous reviewers for their detailed comments and constructive feedback.

Paper revision

For the convenience of readers, we present here a list of the main revisions of the paper.

v1: submitted to ICLR 2015. Introduces the DeepLab-CRF model, which attains 66.4% on the PASCAL VOC 2012 test set.

v2: rebuttal for ICLR 2015. Adds the DeepLab-MSc-CRF model, which incorporates multi-scale features from the intermediate layers. DeepLab-MSc-CRF attains 67.1% on the PASCAL VOC 2012 test set.

v3: camera-ready for ICLR 2015. Experiments with a large field of view. On the PASCAL VOC 2012 test set, DeepLab-CRF-LargeFOV attains 70.3%; when both multi-scale features and a large FOV are exploited, DeepLab-MSc-CRF-LargeFOV attains 71.6%.
v4: adds a reference to our updated "DeepLab" system (Chen et al., 2016), whose results are substantially improved.

REFERENCES

Adams, A., Baek, J., and Davis, M. A. Fast high-dimensional filtering using the permutohedral
lattice. In Computer Graphics Forum, 2010.
Arbeláez, P., Pont-Tuset, J., Barron, J. T., Marques, F., and Malik, J. Multiscale combinatorial
grouping. In CVPR, 2014.
Bell, S., Upchurch, P., Snavely, N., and Bala, K. Material recognition in the wild with the materials
in context database. arXiv:1412.0623, 2014.
Carreira, J. and Sminchisescu, C. Cpmc: Automatic object segmentation using constrained parametric min-cuts. PAMI, 2012.
Carreira, J., Caseiro, R., Batista, J., and Sminchisescu, C. Semantic segmentation with second-order
pooling. In ECCV, 2012.
Chen, L.-C., Papandreou, G., and Yuille, A. Learning a dictionary of shape epitomes with applications to image labeling. In ICCV, 2013.
Chen, L.-C., Schwing, A., Yuille, A., and Urtasun, R. Learning deep structured models.
arXiv:1407.2538, 2014.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic
image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
arXiv:1606.00915, 2016.
Chen, X. and Yuille, A. L. Articulated pose estimation by a graphical model with image dependent
pairwise relations. In NIPS, 2014.
Cogswell, M., Lin, X., Purushwalkam, S., and Batra, D. Combining the best of graphical models
and convnets for semantic segmentation. arXiv:1412.4313, 2014.
Dai, J., He, K., and Sun, J. Convolutional feature masking for joint object and stuff segmentation.
arXiv:1412.1283, 2014.
Delong, A., Osokin, A., Isack, H. N., and Boykov, Y. Fast approximate energy minimization with
label costs. IJCV, 2012.
Eigen, D. and Fergus, R. Predicting depth, surface normals and semantic labels with a common
multi-scale convolutional architecture. arXiv:1411.4734, 2014.
Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J., and Zisserma, A. The
pascal visual object classes challenge a retrospective. IJCV, 2014.
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. Learning hierarchical features for scene labeling. PAMI, 2013.
Geiger, D. and Girosi, F. Parallel and deterministic algorithms from mrfs: Surface reconstruction.
PAMI, 13(5):401–412, 1991.
Geiger, D. and Yuille, A. A common framework for image segmentation. IJCV, 6(3):227–243,
1991.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object
detection and semantic segmentation. In CVPR, 2014.
Giusti, A., Ciresan, D., Masci, J., Gambardella, L., and Schmidhuber, J. Fast image scanning with
deep max-pooling convolutional neural networks. In ICIP, 2013.
Gonfaus, J. M., Boix, X., Van de Weijer, J., Bagdanov, A. D., Serrat, J., and Gonzalez, J. Harmony
potentials for joint classification and segmentation. In CVPR, 2010.
Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., and Malik, J. Semantic contours from inverse
detectors. In ICCV, 2011.
Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J. Hypercolumns for object segmentation and
fine-grained localization. arXiv:1411.5752, 2014a.
Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J. Simultaneous detection and segmentation.
In ECCV, 2014b.
He, X., Zemel, R. S., and Carreira-Perpiñán, M. Multiscale conditional random fields for image
labeling. In CVPR, 2004.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell,
T. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
Kohli, P., Ladicky, L., and Torr, P. H. Robust higher order potentials for enforcing label consistency.
IJCV, 2009.
Kokkinos, I., Deriche, R., Faugeras, O., and Maragos, P. Computational analysis and learning for a
biologically motivated model of boundary detection. Neurocomputing, 71(10):1798–1812, 2008.
Krähenbühl, P. and Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
Krähenbühl, P. and Koltun, V. Parameter learning and convergent inference for dense random fields.
In ICML, 2013.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional
neural networks. In NIPS, 2013.
Ladicky, L., Russell, C., Kohli, P., and Torr, P. H. Associative hierarchical crfs for object class image
segmentation. In ICCV, 2009.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document
recognition. In Proc. IEEE, 1998.
Lempitsky, V., Vedaldi, A., and Zisserman, A. Pylon model for semantic segmentation. In NIPS,
2011.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation.
arXiv:1411.4038, 2014.
Lucchi, A., Li, Y., Boix, X., Smith, K., and Fua, P. Are spatial and global constraints really necessary
for segmentation? In ICCV, 2011.
Mallat, S. A Wavelet Tour of Signal Processing. Acad. Press, 2 edition, 1999.
Mostajabi, M., Yadollahpour, P., and Shakhnarovich, G. Feedforward semantic segmentation with
zoom-out features. arXiv:1412.0774, 2014.
Papandreou, G., Kokkinos, I., and Savalle, P.-A. Untangling local and global deformations in deep
convolutional networks for image classification and sliding window detection. arXiv:1412.0296,
2014.
Papandreou, G., Chen, L.-C., Murphy, K., and Yuille, A. L. Weakly- and semi-supervised learning
of a DCNN for semantic image segmentation. arXiv:1502.02734, 2015.
Rother, C., Kolmogorov, V., and Blake, A. Grabcut: Interactive foreground extraction using iterated
graph cuts. In SIGGRAPH, 2004.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. Overfeat: Integrated
recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
Shotton, J., Winn, J., Rother, C., and Criminisi, A. Textonboost for image understanding: Multiclass object recognition and segmentation by jointly modeling texture, layout, and context. IJCV,
2009.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and
Rabinovich, A. Going deeper with convolutions. arXiv:1409.4842, 2014.
Tompson, J., Jain, A., LeCun, Y., and Bregler, C. Joint Training of a Convolutional Network and a
Graphical Model for Human Pose Estimation. In NIPS, 2014.
Uijlings, J., van de Sande, K., Gevers, T., and Smeulders, A. Selective search for object recognition.
IJCV, 2013.
Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., and Yuille, A. Towards unified depth and semantic
prediction from a single image. In CVPR, 2015.
Yadollahpour, P., Batra, D., and Shakhnarovich, G. Discriminative re-ranking of diverse segmentations. In CVPR, 2013.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.
Zhang, N., Donahue, J., Girshick, R., and Darrell, T. Part-based r-cnns for fine-grained category
detection. In ECCV, 2014.
Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P.
Conditional random fields as recurrent neural networks. arXiv:1502.03240, 2015
