SEMANTIC IMAGE SEGMENTATION WITH DEEP CONVOLUTIONAL NETS AND FULLY CONNECTED CRFs: A Close Reading

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Personal summary

1. There are two main modules in this paper: the first replaces ordinary convolutions with atrous ("hole") convolutions, enlarging the receptive field without increasing the amount of computation; the second applies a fully connected CRF to refine the semantic segmentation.

2. Refer to Figure 3 for an overview of how the whole paper fits together.

3. Conditional random fields (CRFs): this algorithm gets its own chapter in Statistical Learning Methods; a recommended reference is the Zhihu post "Conditional Random Field (CRF) Explained in Detail". I worked through it carefully a long time ago and have since forgotten much of the math; an in-depth study of it is not required to follow this paper.

Abstract

Deep convolutional neural networks (DCNNs) have recently demonstrated state-of-the-art performance in high-level vision tasks such as image classification and object detection. This work combines methods from DCNNs and probabilistic graphical models to address the task of pixel-level classification (also known as "semantic image segmentation"). We show that the responses at the final layer of a DCNN are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs well suited to high-level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected conditional random field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries with an accuracy beyond previous methods. Quantitatively, our method sets a new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy on the test set. We show how these results can be obtained efficiently: careful network re-purposing and a novel application of the "hole" algorithm from the wavelet community allow dense computation of neural network responses at 8 frames per second on a modern GPU.

1 Introduction

Deep convolutional neural networks (DCNNs) have been the method of choice for document recognition since LeCun et al. (1998), but have only recently become mainstream in high-level vision research. Over the past two years, DCNNs have pushed computer vision systems to leaps-and-bounds improvements on a range of high-level problems, including image classification (Krizhevsky et al., 2013; Sermanet et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014; Papandreou et al., 2014), object detection (Girshick et al., 2014), fine-grained classification (Zhang et al., 2014), and others. A common theme in these works is that DCNNs trained in an end-to-end manner deliver far better results than systems relying on carefully engineered representations such as SIFT or HOG features. This success can be partially attributed to the built-in invariance of DCNNs to local image transformations, which underpins their ability to learn hierarchical abstractions from data (Zeiler & Fergus, 2014). While this invariance is clearly desirable for high-level vision tasks, it can hamper low-level tasks such as pose estimation (Chen & Yuille, 2014; Tompson et al., 2014) and semantic segmentation, where we want precise localization rather than abstraction of spatial details.

There are two technical hurdles in applying DCNNs to image labeling tasks: signal downsampling and spatial "insensitivity" (invariance). The first problem relates to the reduction of signal resolution caused by the repeated combination of max-pooling and downsampling ("striding") performed at every layer of a standard DCNN; to address it we employ the "atrous" (with holes) algorithm, originally developed for the efficient computation of the undecimated discrete wavelet transform. This allows efficient dense computation of DCNN responses with a scheme much simpler than earlier solutions to this problem.

The second problem relates to obtaining object-centric decisions from a classifier, which requires invariance to spatial transformations and inherently limits the spatial accuracy of a DCNN model. We boost our model's ability to capture fine details by employing a fully connected conditional random field (CRF). Conditional random fields have been widely used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges (Rother et al., 2004; Shotton et al., 2009) or superpixels (Lucchi et al., 2011).

Even though works of increased sophistication have been proposed to model hierarchical dependencies (He et al., 2004; Ladicky et al., 2009; Lempitsky et al., 2011) and/or high-order dependencies of segments (Delong et al., 2012; Gonfaus et al., 2010; Kohli et al., 2009; Chen et al., 2013; Wang et al., 2015), we use the fully connected pairwise CRF proposed by Krähenbühl & Koltun (2011) for its efficient computation and its ability to capture fine edge details while also catering for long-range dependencies. That model was shown in Krähenbühl & Koltun (2011) to greatly improve the performance of a boosting-based pixel-level classifier, and in our work we show that it produces state-of-the-art results when coupled with a DCNN-based pixel-level classifier.

The three main advantages of our "DeepLab" system are (i) speed: by virtue of the "atrous" algorithm, our dense DCNN runs at 8 fps, while mean-field inference for the fully connected CRF takes 0.5 seconds; (ii) accuracy: we achieve state-of-the-art results on the PASCAL semantic segmentation challenge, outperforming the second-best approach of Mostajabi et al. (2014) by a 7.2% margin; and (iii) simplicity: our system is composed of a cascade of two fairly well-established modules, DCNNs and CRFs.

2. Related work

Our system works directly on pixel representations, similarly to Long et al. (2014). This is in stark contrast to the two-stage approach that is now most common in DCNN-based semantic segmentation: such techniques typically use a cascade of bottom-up image segmentation and DCNN-based region classification, which commits the system to potential errors of the front-end segmentation module. For example, the bounding box proposals and masked regions delivered by (Arbeláez et al., 2014; Uijlings et al., 2013) are used in Girshick et al. (2014) and (Hariharan et al., 2014b) as inputs to a DCNN to introduce shape information into the classification process. Similarly, the authors of Mostajabi et al. (2014) rely on a superpixel representation. A well-known non-DCNN precursor to these works is the second-order pooling method of (Carreira et al., 2012), which also assigns labels to the region proposals delivered by (Carreira & Sminchisescu, 2012). Understanding the perils of committing to a single segmentation, the authors of Cogswell et al. (2014) build on (Yadollahpour et al., 2013) and explore a diverse set of CRF-based segmentation proposals, also computed along the lines of (Carreira & Sminchisescu, 2012). These segmentation proposals are then re-ranked by a DCNN trained specifically for this re-ranking task. Although this approach explicitly tries to handle the temperamental nature of a front-end segmentation algorithm, it still does not explicitly exploit the DCNN scores within the CRF-based segmentation algorithm: the DCNN is only applied after the fact, whereas it would make sense to use its results directly during segmentation.

Turning to works that are closer to our approach, several other researchers have considered using convolutionally computed DCNN features for dense image labeling. The first among these, Farabet et al. (2013), applied a DCNN at multiple image resolutions and then employed a segmentation tree to smooth the prediction results; more recently, Hariharan et al. (2014a) proposed concatenating intermediate feature maps computed within the DCNN for pixel classification, and Dai et al. (2014) propose to pool intermediate feature maps by region proposals. Although these works still employ segmentation algorithms that are decoupled from the DCNN classifier's results, we argue that it is advantageous to use segmentation only at a later stage, avoiding the commitment to premature decisions.

More recently, the segmentation-free techniques of (Long et al., 2014; Eigen & Fergus, 2014) directly apply a DCNN to the whole image in a sliding-window fashion, replacing the last fully connected layers of the DCNN with convolutional layers. To deal with the spatial localization problem outlined at the beginning of the introduction, Long et al. (2014) upsample and concatenate scores from intermediate feature maps, while Eigen & Fergus (2014) refine predictions from coarse to fine by propagating the coarse result to another DCNN.

The main difference between our model and other state-of-the-art models is the combination of pixel-level CRFs and DCNN-based "unary terms". The closest works in this direction are Cogswell et al. (2014), who used a CRF as a proposal mechanism for a DCNN-based re-ranking system, and Farabet et al. (2013), who treat superpixels as nodes of a local pairwise CRF and use graph cuts for discrete inference; their results can therefore be limited by errors in the superpixel computation, and long-range dependencies between superpixels are ignored. In contrast, our approach treats every pixel as a CRF node, exploits long-range dependencies, and uses CRF inference to directly optimize a DCNN-driven cost function. We note that mean-field inference had been extensively studied for traditional image segmentation/edge detection tasks, e.g. (Geiger & Girosi, 1991; Geiger & Yuille, 1991; Kokkinos et al., 2008), but more recently Krähenbühl & Koltun (2011) showed that such inference can be very efficient for fully connected CRFs and particularly effective in the context of semantic segmentation.

After the first version of our manuscript was made public, it came to our attention that two other groups had independently and concurrently pursued a very similar direction, combining DCNNs with densely connected CRFs (Bell et al., 2014; Zheng et al., 2015). There are several technical differences between the models. Bell et al. (2014) focus on the material classification problem, while Zheng et al. (2015) unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network.

We have updated our proposed "DeepLab" system with much improved methods and results in our latest work (Chen et al., 2016). We refer interested readers to that paper for details.

3. Convolutional Neural Networks for Dense Image Labeling

Here we describe how we re-purposed and fine-tuned the publicly available Imagenet-pretrained 16-layer classification network (VGG-16) into an efficient dense feature extractor for our semantic image segmentation system.

3.1. Efficient Dense Sliding Window Feature Extraction with the Hole Algorithm

Dense spatial score evaluation is instrumental in the success of our dense CNN feature extraction procedure. As a first step toward this goal, we convert the fully connected layers of VGG-16 into convolutional ones and run the network convolutionally on the image at its original resolution. However, this is not enough, as it yields very sparsely computed detection scores (with a stride of 32 pixels). To compute scores more densely at our target stride of 8 pixels, we develop a variation of the method previously employed by Giusti et al. (2013) and Sermanet et al. (2013): we skip subsampling after the last two max pooling layers in the network of Simonyan & Zisserman (2014) and increase the length of the convolutional filters in the layers that follow them (2× in the last three convolutional layers and 4× in the first fully connected layer) by introducing zeros between the filter taps. We can implement this more efficiently by keeping the filters intact and instead sparsely sampling the feature maps on which they are applied, using an input stride of 2 or 4 pixels, respectively. This approach, illustrated in Fig. 1, is known as the "hole algorithm" ("atrous algorithm") and was developed before for the efficient computation of the undecimated wavelet transform. We implemented it within the Caffe framework by adding to the im2col function (which converts multi-channel feature maps to vectorized patches) the option to sparsely sample the underlying feature maps. This approach is generally applicable and allows us to efficiently compute dense CNN feature maps at any target subsampling rate without introducing any approximations.

Figure 1: Illustration of the 1-D hole ("atrous") algorithm with kernel size 3, input stride 2, and output stride 1.
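To make the 1-D picture above concrete, here is a minimal sketch (not the authors' Caffe implementation) showing, in PyTorch, that the two views of the hole algorithm coincide: reading the input with an effective input stride of 2 under an intact 3-tap filter gives the same responses as an ordinary convolution with a zero-upsampled ("holed") version of that filter. Names and shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
signal = torch.randn(1, 1, 16)   # (batch, channels, length) 1-D input
kernel = torch.randn(1, 1, 3)    # ordinary 3-tap filter

# View A: keep the filter intact and sample the input sparsely
# (dilation=2 places an input stride of 2 between the filter taps).
dense_a = F.conv1d(signal, kernel, dilation=2, padding=2)

# View B: explicitly insert zeros between the filter taps (length 3 -> 5)
# and run an ordinary dense convolution.
holed = torch.zeros(1, 1, 5)
holed[..., ::2] = kernel         # taps land at positions 0, 2, 4
dense_b = F.conv1d(signal, holed, padding=2)

print(torch.allclose(dense_a, dense_b))  # True: both formulations agree
```

The first formulation is the efficient one hinted at in the text: the filter stays small and only the sampling of the underlying feature map changes.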

We fine-tune the model weights of the Imagenet-pretrained VGG-16 network to adapt it to the segmentation task in a straightforward fashion, following the procedure of Long et al. (2014). We replace the 1000-way Imagenet classifier in the last layer of VGG-16 with a 21-way one. Our loss function is the sum of the cross-entropy terms at each spatial location in the CNN output map (subsampled by 8 compared to the original image). All locations and labels are weighted equally in the overall loss function. Our targets are the ground truth labels (subsampled by 8). We optimize the objective function with respect to the weights of all network layers using the standard SGD procedure of Krizhevsky et al. (2013).
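A minimal sketch of the training objective just described: a sum of per-location cross-entropy terms over the 8×-subsampled score map, with all locations and labels weighted equally. This is illustrative PyTorch, not the authors' Caffe setup; the batch size and spatial size are assumptions, while the 21-class count follows the text.

```python
import torch
import torch.nn.functional as F

num_classes = 21                                                    # 20 PASCAL VOC classes + background
scores = torch.randn(2, num_classes, 42, 42, requires_grad=True)   # DCNN output map, subsampled by 8
targets = torch.randint(0, num_classes, (2, 42, 42))                # ground truth labels, also subsampled by 8

# Cross-entropy summed over every spatial location; every location and label
# contributes with equal weight to the overall loss.
loss = F.cross_entropy(scores, targets, reduction='sum')
loss.backward()                                                     # gradients for standard SGD fine-tuning
```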

Figure 2: Score maps (input before the softmax function) and confidence maps (output of the softmax function). We show the score (first row) and confidence (second row) maps after each mean field iteration. The output of the last DCNN layer is used as input to the mean field inference. Best viewed in color.

At test time, we need class score maps at the original image resolution. As shown in Figure 2 and further elaborated in Section 4.1, the class score maps (corresponding to log probabilities) are quite smooth, which allows us to increase their resolution by a factor of 8 using simple bilinear interpolation at negligible computational cost. Note that the method of Long et al. (2014) does not use the hole algorithm and produces very coarse scores (subsampled by a factor of 32) at the CNN output. This forces them to use learned upsampling layers, which significantly increases the complexity and training time of the system: fine-tuning our network on PASCAL VOC 2012 takes about 10 hours, whereas they report a training time of several days (both timings on a modern GPU).
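Because the score maps are smooth, the 8× resolution increase at test time really is just plain bilinear interpolation, as in this illustrative snippet (the shapes are assumptions, matching the 8×-subsampled map used above):

```python
import torch
import torch.nn.functional as F

coarse_scores = torch.randn(1, 21, 42, 42)    # 8x-subsampled class score map from the DCNN
full_scores = F.interpolate(coarse_scores, scale_factor=8,
                            mode='bilinear', align_corners=False)
print(full_scores.shape)                      # torch.Size([1, 21, 336, 336])
```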

3.2. Controlling the Receptive Field Size and Accelerating Dense Computation with Convolutional Nets

Another key factor in re-purposing our network for dense score computation is explicitly controlling the network's receptive field size. Most recent DCNN-based image recognition methods rely on networks pre-trained on the Imagenet large-scale classification task. These networks typically have large receptive field sizes: in the case of the VGG-16 network we consider, the receptive field is 224×224 (with zero-padding), or 404×404 pixels if the network is applied convolutionally. After converting the network to a fully convolutional one, the first fully connected layer has 4096 filters with a large 7×7 spatial size, and this becomes the computational bottleneck in dense score map computation.

We address this practical problem by spatially subsampling (by simple decimation) the first FC layer to a 4×4 (or 3×3) spatial size. This reduces the receptive field of the network down to 128×128 (with zero-padding) or 308×308 (in convolutional mode) and cuts the computation time of the first FC layer by a factor of 2 to 3. With our Caffe-based implementation and a Titan GPU, the resulting VGG-derived network is very efficient: given a 306×306 input image, it produces 39×39 dense raw feature scores at the top of the network at a rate of about 8 frames per second at test time. The speed during training is 3 frames per second. We also successfully reduced the number of channels in the fully connected layers from 4096 to 1024, significantly decreasing computation time and memory footprint without sacrificing performance, as detailed in Section 5. Using smaller networks such as Krizhevsky et al. (2013) could even enable video-rate dense feature computation at test time on light-weight GPUs.
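A hedged sketch of the two tricks in this subsection: converting VGG-16's first fully connected layer into a convolution, then spatially decimating its 7×7 kernels down to 4×4 to shrink the receptive field and the cost of this layer. The layer names, weight layout, and indexing here are illustrative assumptions, not the exact Caffe surgery used by the authors.

```python
import torch
import torch.nn as nn

# Stand-in for fc6 of VGG-16: 4096 outputs computed over a 512x7x7 input window.
fc6 = nn.Linear(512 * 7 * 7, 4096)

# 1) "Convolutionalize" fc6: reinterpret its weight matrix as 4096 conv filters of size 512x7x7.
conv_fc6 = nn.Conv2d(512, 4096, kernel_size=7)
conv_fc6.weight.data = fc6.weight.data.view(4096, 512, 7, 7)
conv_fc6.bias.data = fc6.bias.data

# 2) Spatially subsample (decimate) each 7x7 kernel to 4x4, reducing the receptive
#    field and cutting the cost of this layer by roughly 2-3x.
decimated = nn.Conv2d(512, 4096, kernel_size=4)
decimated.weight.data = conv_fc6.weight.data[:, :, ::2, ::2]   # keep rows/cols 0, 2, 4, 6
decimated.bias.data = conv_fc6.bias.data

# The paper additionally reduces the number of filters from 4096 to 1024 (i.e. a
# smaller out_channels followed by fine-tuning) with little loss in accuracy.
```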

4. Detailed Boundary Recovery: Fully Connected Conditional Random Fields and Multi-Scale Prediction

4.1. Deep Convolutional Networks and Localization Challenges

As shown in Figure 2, DCNN score maps can reliably predict the presence and rough position of objects in an image, but are less well suited to pinpointing their exact outlines. There is a natural trade-off between classification accuracy and localization accuracy in convolutional networks: deeper models with multiple max pooling layers have proven most successful in classification tasks; however, their increased invariance and large receptive fields make the problem of inferring position from the scores at their top output level more challenging.

Recent work has addressed this localization challenge in two directions. The first approach exploits information from multiple layers in a convolutional network in order to better estimate object boundaries (Long et al., 2014; Eigen & Fergus, 2014). The second approach employs superpixel representations, essentially delegating the localization task to low-level segmentation methods. Following this line is the very recent and very successful approach of Mostajabi et al. (2014).

In Section 4.2 we pursue a novel, alternative direction based on coupling the recognition capacity of DCNNs with the fine-grained localization accuracy of fully connected CRFs, and show that it is remarkably successful at addressing the localization challenge, producing accurate semantic segmentation results and recovering object boundaries at a level of detail that is well beyond the reach of existing methods.

4.2. Fully Connected Conditional Random Fields for Accurate Localization

Traditionally, conditional random fields (CRFs) have been employed to smooth noisy segmentation maps (Rother et al., 2004; Kohli et al., 2009). Typically these models contain energy terms that couple neighboring nodes, favoring same-label assignments for spatially proximal pixels. Qualitatively, the primary function of these short-range CRFs is to clean up the spurious predictions of weak classifiers built on top of local hand-engineered features.

Figure 3: Model illustration. The coarse score maps of a deep convolutional neural network (with fully convolutional layers) are upsampled by bilinear interpolation. A fully connected CRF is then applied to refine the segmentation result. Best viewed in color.

Compared to these weaker classifiers, modern DCNN architectures such as the one we use in this work produce score maps and semantic label predictions that are qualitatively different. As shown in Figure 2, the score maps are typically quite smooth and produce homogeneous classification results. In this regime, using a short-range CRF can be detrimental, since our goal should be to recover the detailed local structure rather than smooth it further. Combining contrast-sensitive potentials (Rother et al., 2004) with a local-range CRF can potentially improve localization, but thin structures are still missed and an expensive discrete optimization problem typically has to be solved.

To overcome these limitations of short-range CRFs, we integrate into our system the fully connected CRF model of Krähenbühl & Koltun (2011). The model employs the energy function

$$E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$$

where $\mathbf{x}$ is the label assignment for the pixels. We use the unary potential $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the label assignment probability at pixel $i$ as computed by the DCNN. The pairwise potential is $\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} w_m\, k^m(f_i, f_j)$, where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$ and zero otherwise (i.e., the Potts model). There is one pairwise term for every pair of pixels $i$ and $j$ in the image, no matter how far apart they are; in other words, the factor graph of the model is fully connected. Each $k^m$ is a Gaussian kernel that depends on features (denoted $f$) extracted for pixels $i$ and $j$ and is weighted by the parameter $w_m$. We adopt bilateral position and color terms; specifically, the kernels are

$$w_1 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\!\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right)$$

where the first kernel depends on both pixel positions (denoted $p$) and pixel color intensities (denoted $I$), and the second kernel depends only on pixel positions. The hyperparameters $\sigma_\alpha$, $\sigma_\beta$, and $\sigma_\gamma$ control the "scale" of the Gaussian kernels.

Crucially, this model is amenable to efficient approximate probabilistic inference (Krähenbühl & Koltun, 2011). The message-passing updates under a fully decomposable mean-field approximation $b(\mathbf{x}) = \prod_i b_i(x_i)$ can be expressed as convolutions with a Gaussian kernel in feature space. High-dimensional filtering algorithms (Adams et al., 2010) significantly speed up this computation, resulting in an algorithm that is very fast in practice, taking less than 0.5 seconds on average for Pascal VOC images.
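As a usage illustration (not the authors' original code), the publicly available pydensecrf package implements Krähenbühl & Koltun's efficient mean-field inference for exactly this energy. The sketch below assumes a softmax probability map `probs` from the DCNN, already bilinearly upsampled to the image size, and uses placeholder kernel widths in place of the cross-validated $\sigma$ values; the function name and defaults are hypothetical.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=10):
    """image: (H, W, 3) uint8 RGB; probs: (n_labels, H, W) softmax output of the DCNN."""
    n_labels, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, n_labels)

    # Unary potentials: theta_i(x_i) = -log P(x_i), taken directly from the DCNN.
    d.setUnaryEnergy(unary_from_softmax(probs))

    # Pairwise potentials: a position-only Gaussian kernel plus a bilateral
    # (position + color) kernel; sxy and srgb play the roles of sigma_gamma,
    # sigma_alpha, sigma_beta (the values here are placeholders, not the paper's).
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image), compat=10)

    Q = d.inference(n_iters)                 # mean-field iterations
    return np.argmax(Q, axis=0).reshape(H, W)  # refined per-pixel labels
```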

4.3. Multi-scale prediction

Following the recent work of Hariharan et al. (2014a) and Long et al. (2014), we also explore a multi-scale prediction method to improve boundary localization accuracy. Specifically, we attach to the input image and to the output of each of the first four max pooling layers a two-layer MLP (first layer: 128 3×3 convolutional filters; second layer: 128 1×1 convolutional filters) whose feature map is concatenated to the last-layer feature map of the main network. The aggregated feature map fed into the softmax layer is therefore augmented by 5×128 = 640 channels. We only adjust the newly added weights, keeping the other network parameters at the values learned by the method of Section 3. As discussed in the experimental section, introducing these additional direct connections from fine-resolution layers improves localization performance, yet the effect is not as dramatic as the one obtained with the fully connected CRF. A rough sketch of one such branch is given below.
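To make the multi-scale branch concrete, here is a rough PyTorch sketch of one such two-layer MLP (implemented, as the text describes, with 3×3 and 1×1 convolutions) attached to a feature map. The class name, the ReLU nonlinearities, and the input sizes are illustrative assumptions; wiring five of these branches into VGG-16 and concatenating the 5×128 = 640 extra channels before the 21-way softmax is left implicit.

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Two-layer 'MLP' attached to the input image or an early pooling output:
    128 filters of 3x3 followed by 128 filters of 1x1, per the paper's description."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

# Example: one branch on the raw image (3 channels), one on a pool1 output (64 channels).
image_branch = MultiScaleBranch(in_channels=3)
pool1_branch = MultiScaleBranch(in_channels=64)

img = torch.randn(1, 3, 306, 306)
pool1 = torch.randn(1, 64, 153, 153)
f1, f2 = image_branch(img), pool1_branch(pool1)
# In the full model, five such 128-channel maps would be brought to the main
# network's final feature-map resolution and concatenated (640 extra channels)
# before the softmax; only these newly added weights are trained.
print(f1.shape, f2.shape)
```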

Origin blog.csdn.net/XDH19910113/article/details/123190373