Image Segmentation - U-Net: Convolutional Networks for Biomedical Image Segmentation (MICCAI 2015)

Disclaimer: This translation is only a personal study record

Abstract

  It is generally accepted that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path that captures context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC), we won the ISBI Cell Tracking Challenge 2015 in these categories by a large margin. Moreover, the network is fast: segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.

1 Introduction

  In the past two years, deep convolutional networks have outperformed the state of the art in many visual recognition tasks, e.g. [7, 3]. Although convolutional networks have been around for a long time [8], their success was limited due to the size of the available training sets and the size of the networks under consideration. The breakthrough of Krizhevsky et al. [7] was due to the supervised training of a large network with 8 layers and millions of parameters on the ImageNet dataset with 1 million training images. Since then, even larger and deeper networks have been trained [12].

  The typical use of convolutional networks is in classification tasks, where the output for an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label should be assigned to each pixel. Moreover, thousands of training images are usually out of reach in biomedical tasks. Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel as input. First, this network can localize. Second, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.


Figure 1. U-net architecture (example for 32x32 pixels at the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

  Clearly, the strategy of Ciresan et al. [1] has two drawbacks. First, it is quite slow because the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches. Second, there is a trade-off between localization accuracy and the use of context. Larger patches require more max-pooling layers, which reduce localization accuracy, while small patches allow the network to see only little context. More recent approaches [11, 4] proposed a classifier output that takes into account the features from multiple layers. Good localization and the use of context are possible at the same time.

  In this paper, we build on a more elegant architecture, the so-called "fully convolutional network" [9]. We modify and extend this architecture such that it works with very few training images and yields more precise segmentations; see Figure 1. The main idea in [9] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high-resolution features from the contracting path are combined with the upsampled output. A successive convolutional layer can then learn to assemble a more precise output based on this information.


Figure 2. Overlap-tile strategy for seamless segmentation of arbitrarily large images (here, segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area requires image data within the blue area as input. Missing input data is extrapolated by mirroring.

  One important modification in our architecture is that in the upsampling part we also have a large number of feature channels, which allow the network to propagate context information to higher-resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path and yields a U-shaped architecture. The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains pixels for which the full context is available in the input image. This strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Figure 2). To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory.
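A minimal NumPy sketch of this overlap-tile strategy (illustrative only; `predict_tile` is a hypothetical stand-in for the trained network, the 572/388 tile sizes follow Figure 1, and the image size is assumed to be a multiple of the output tile size):

```python
import numpy as np

def predict_tiled(image, predict_tile, in_size=572, out_size=388):
    """Overlap-tile segmentation of an arbitrarily large image (sketch).

    predict_tile maps an (in_size x in_size) patch to the (out_size x
    out_size) segmentation of its centre; the surrounding border of
    (in_size - out_size) / 2 pixels is the context consumed by the
    unpadded convolutions.
    """
    border = (in_size - out_size) // 2
    # Missing context beyond the image border is extrapolated by mirroring.
    padded = np.pad(image, border, mode="reflect")
    h, w = image.shape
    assert h % out_size == 0 and w % out_size == 0, "simplifying assumption"
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, out_size):
        for x in range(0, w, out_size):
            out[y:y + out_size, x:x + out_size] = predict_tile(
                padded[y:y + in_size, x:x + in_size])
    return out
```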

  As for our tasks, there is very little training data available, and we use excessive data augmentation by applying elastic deformations to the available training images. This allows the network to learn invariance to such deformations without the need to see these transformations in the annotated image corpus. This is particularly important in biomedical segmentation, since deformation is the most common variation in tissue and realistic deformations can be simulated efficiently. Dosovitskiy et al. [2] showed the value of data augmentation for learning invariance in the context of unsupervised feature learning.

  Another challenge in many cell segmentation tasks is the separation of touching objects of the same class; see Figure 3. To this end, we propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.

  The resulting network is applicable to various biomedical segmentation problems. In this paper, we show results on the segmentation of neuronal structures in EM stacks (an ongoing competition started at ISBI 2012), where we outperform the network of Ciresan et al. [1]. Furthermore, we show results for cell segmentation in light microscopy images from the ISBI Cell Tracking Challenge 2015. Here we won by a large margin on the two most challenging 2D transmitted light datasets.

2 Network Architecture

  The network architecture is shown in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.
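For illustration, here is a compact PyTorch sketch of the architecture just described (the authors' implementation is in Caffe, so all names below are ours): valid 3x3 convolutions, channel doubling at each of the four downsampling steps, 2x2 up-convolutions, cropped skip connections, and a final 1x1 convolution, giving 23 convolutional layers in total.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 unpadded ("valid") convolutions, each followed by a ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
    )

def center_crop(feat, target):
    # Crop a contracting-path feature map to the size of the upsampled map;
    # needed because border pixels are lost in every unpadded convolution.
    _, _, h, w = target.shape
    dy = (feat.shape[2] - h) // 2
    dx = (feat.shape[3] - w) // 2
    return feat[:, :, dy:dy + h, dx:dx + w]

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        chs = [64, 128, 256, 512, 1024]
        self.downs = nn.ModuleList()
        prev = in_channels
        for c in chs:                      # contracting path + bottom
            self.downs.append(double_conv(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2, stride=2)
        self.ups, self.up_convs = nn.ModuleList(), nn.ModuleList()
        for c in reversed(chs[:-1]):       # expansive path
            # 2x2 "up-convolution" that halves the number of channels.
            self.ups.append(nn.ConvTranspose2d(prev, c, kernel_size=2, stride=2))
            self.up_convs.append(double_conv(2 * c, c))
            prev = c
        self.head = nn.Conv2d(prev, num_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.downs):
            x = block(x)
            if i < len(self.downs) - 1:
                skips.append(x)
                x = self.pool(x)
        for up, conv in zip(self.ups, self.up_convs):
            x = up(x)
            x = torch.cat([center_crop(skips.pop(), x), x], dim=1)
            x = conv(x)
        return self.head(x)

# A 572x572 input tile yields a 388x388 output map, as in Figure 1:
# UNet()(torch.randn(1, 1, 572, 572)).shape  # -> (1, 2, 388, 388)
```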

  To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2x2 max pooling operations are applied to a layer with an even x and y size.
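A tiny helper (ours, not from the paper) makes this constraint concrete for the four pooling steps of Figure 1:

```python
def valid_input_size(size, depth=4):
    """True if every 2x2 max pooling in a U-Net of the given depth sees an
    even x/y size; each level first loses 2 + 2 pixels to the two unpadded
    3x3 convolutions, then the result is halved."""
    for _ in range(depth):
        size -= 4                 # two 3x3 valid convolutions
        if size <= 0 or size % 2 != 0:
            return False
        size //= 2                # 2x2 max pooling with stride 2
    return True

print([s for s in range(560, 580) if valid_input_size(s)])  # -> [572]
```

Valid sizes repeat with a period of 2^depth = 16 pixels, which is why 572, the input tile size of Figure 1, satisfies the constraint.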

3 Training

  The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6]. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. To minimize the overhead and make maximum use of the GPU memory, we favor large input tiles over a large batch size and hence reduce the batch to a single image. Accordingly, we use a high momentum (0.99) such that a large number of previously seen training samples determine the update in the current optimization step.
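In PyTorch terms this setup might look as follows (the paper uses Caffe's SGD solver and does not state a learning rate, so the value below is a placeholder assumption; `UNet` refers to the sketch in Section 2):

```python
import torch

net = UNet()
# Batch size 1 (a single large input tile) combined with a high momentum,
# so that many previously seen samples determine each update.
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.99)
```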

  The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross-entropy loss function. The soft-max is defined as $p_k(x) = \exp(a_k(x)) / \sum_{k'=1}^{K} \exp(a_{k'}(x))$, where $a_k(x)$ denotes the activation in feature channel $k$ at the pixel position $x \in \Omega$, with $\Omega \subset \mathbb{Z}^2$. $K$ is the number of classes and $p_k(x)$ is the approximated maximum function, i.e. $p_k(x) \approx 1$ for the $k$ that has the maximum activation $a_k(x)$, and $p_k(x) \approx 0$ for all other $k$. The cross-entropy then penalizes at each position the deviation of $p_{\ell(x)}(x)$ from 1, using

$$E = \sum_{x \in \Omega} w(x) \log\bigl(p_{\ell(x)}(x)\bigr)$$

where $\ell : \Omega \to \{1, \dots, K\}$ is the true label of each pixel and $w : \Omega \to \mathbb{R}$ is a weight map that we introduced to give some pixels more importance in the training.
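A direct PyTorch transcription of this weighted pixel-wise cross-entropy (a sketch with our own naming; the returned value is the negative of $E$ above, i.e. the quantity one minimizes):

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, weight_map):
    """logits: (B, K, H, W) final feature map a_k(x);
    labels: (B, H, W) ground truth l(x); weight_map: (B, H, W) w(x)."""
    log_p = F.log_softmax(logits, dim=1)               # log p_k(x), stable
    nll = F.nll_loss(log_p, labels, reduction="none")  # -log p_{l(x)}(x)
    return (weight_map * nll).sum()
```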


Figure 3. HeLa cells on glass recorded with DIC (differential interference contrast) microscopy. (a) Raw image. (b) Overlay with ground-truth segmentation. Different colors indicate different instances of HeLa cells. (c) Generated segmentation mask (white: foreground, black: background). (d) Map with the pixel-wise loss weight to force the network to learn the border pixels.

  We precompute the weight map for each ground-truth segmentation to compensate for the different frequency of pixels from a certain class in the training dataset, and to force the network to learn the small separation borders that we introduce between touching cells (see Figure 3c and d).

  The separation border is computed using morphological operations. The weight map is then computed as

$$w(x) = w_c(x) + w_0 \cdot \exp\!\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)$$

where $w_c : \Omega \to \mathbb{R}$ is the weight map to balance the class frequencies, $d_1 : \Omega \to \mathbb{R}$ denotes the distance to the border of the nearest cell and $d_2 : \Omega \to \mathbb{R}$ the distance to the border of the second nearest cell. In our experiments we set $w_0 = 10$ and $\sigma \approx 5$ pixels.
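The paper leaves the morphological details and the exact form of $w_c$ open, so the following NumPy/SciPy sketch makes two explicit assumptions: $w_c$ is taken as the inverse foreground/background frequency, and $d_1$, $d_2$ come from Euclidean distance transforms computed per cell instance:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(labels, w0=10.0, sigma=5.0):
    """Pixel-wise loss weights: class balancing plus an extra penalty on
    the narrow background strips between touching cells (small d1 + d2).

    labels: int array, 0 = background, 1..n = individual cell instances
    (the image is assumed to contain at least one cell)."""
    # Assumed w_c: inverse foreground/background frequency.
    freq = np.bincount((labels > 0).astype(int).ravel(), minlength=2) / labels.size
    wc = np.where(labels > 0, 1.0 / freq[1], 1.0 / freq[0])
    ids = [i for i in np.unique(labels) if i != 0]
    if len(ids) < 2:
        return wc                        # no touching cells possible
    # Distance from every pixel to each cell; after sorting, the two
    # smallest values give d1 (nearest cell) and d2 (second nearest cell).
    dists = np.stack([distance_transform_edt(labels != i) for i in ids])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]
    border = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    # The border term is only added on background pixels.
    return wc + np.where(labels == 0, border, 0.0)
```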

  In deep networks with many convolutional layers and different paths through the network, a good initialization of the weights is extremely important. Otherwise, parts of the network might give excessive activations, while other parts never contribute. Ideally, the initial weights should be adapted such that each feature map in the network has approximately unit variance. For a network with our architecture (alternating convolution and ReLU layers), this can be achieved by drawing the initial weights from a Gaussian distribution with a standard deviation of $\sqrt{2/N}$, where $N$ denotes the number of incoming nodes of one neuron [5]. E.g. for a 3x3 convolution and 64 feature channels in the previous layer, $N = 9 \cdot 64 = 576$.
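In PyTorch this corresponds to He initialization in fan-in mode (a sketch restricted to Conv2d layers, where the fan-in $N$ equals k·k·C_in as in the example above):

```python
import torch.nn as nn

def init_weights(m):
    # Gaussian weights with std sqrt(2/N), N = fan-in (e.g. 9 * 64 = 576).
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: net.apply(init_weights)
```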

3.1 Data Augmentation

  Data augmentation is essential to teach the network the desired invariance and robustness properties when only few training samples are available. In the case of microscopy images, we primarily need shift and rotation invariance as well as robustness to deformations and gray value variations. Especially random elastic deformations of the training samples seem to be the key concept for training a segmentation network with very few annotated images. We generate smooth deformations using random displacement vectors on a coarse 3 by 3 grid. The displacements are sampled from a Gaussian distribution with 10 pixels standard deviation. Per-pixel displacements are then computed using bicubic interpolation. Drop-out layers at the end of the contracting path perform further implicit data augmentation.
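One possible SciPy reading of this procedure (a sketch: `zoom` with cubic splines stands in for the bicubic interpolation, and for label maps one would sample with order=0 instead):

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_deform(image, grid=3, sigma=10.0, seed=None):
    """Smooth elastic deformation: Gaussian random displacement vectors
    (std 10 pixels) on a coarse 3x3 grid, interpolated to a dense
    per-pixel displacement field."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))
    dy = zoom(coarse[0], (h / grid, w / grid), order=3)
    dx = zoom(coarse[1], (h / grid, w / grid), order=3)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sample the image at the displaced coordinates (mirrored at borders).
    return map_coordinates(image, np.stack([yy + dy, xx + dx]),
                           order=3, mode="reflect")
```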

4 Experiments

  We demonstrate the application of the u-net to three different segmentation tasks. The first task is the segmentation of neuronal structures in electron microscopic recordings. An example of the dataset and our obtained segmentation is shown in Figure 2. We provide the full result as Supplementary Material. The dataset is provided by the EM Segmentation Challenge [14] that was started at ISBI 2012 and is still open for new contributions. The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding, fully annotated ground-truth segmentation map for cells (white) and membranes (black). The test set is publicly available, but its segmentation maps are kept secret. An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computing the "warping error", the "Rand error", and the "pixel error" [14].

  The u-net (averaged over 7 rotated versions of the input data) achieves, without any further pre- or post-processing, a warping error of 0.0003529 (the new best score, see Table 1) and a Rand error of 0.0382.

  This is significantly better than the sliding-window convolutional network result by Ciresan et al. [1], whose best submission had a warping error of 0.000420 and a Rand error of 0.0504. In terms of Rand error, the only algorithm that performed better on this dataset used a highly dataset-specific post-processing method applied to the probability map of Ciresan et al. [1] (the authors of that algorithm submitted 78 different solutions to achieve this result).

Table 1. Ranking on the EM Segmentation Challenge [14] (March 6, 2015), sorted by warping error.



Figure 4. Results on the ISBI Cell Tracking Challenge. (a) Part of an input image of the "PhC-U373" dataset. (b) Segmentation result (cyan mask) with manual ground truth (yellow border). (c) Input image of the "DIC-HeLa" dataset. (d) Segmentation result (randomly colored masks) with manual ground truth (yellow border).

Table 2. Segmentation results (IoU) on the ISBI Cell Tracking Challenge 2015.


  We also applied the u-net to a cell segmentation task in light microscopy images. This segmentation task is part of the ISBI Cell Tracking Challenge 2014 and 2015 [10, 13]. The first dataset, "PhC-U373" (dataset provided by Dr. Sanjay Kumar, Department of Bioengineering, University of California at Berkeley, Berkeley CA, USA), contains glioblastoma-astrocytoma U373 cells on a polyacrylimide substrate recorded by phase contrast microscopy (see Figure 4a,b and Supplementary Material). It contains 35 partially annotated training images. Here we achieve an average IoU ("intersection over union") of 92%, which is significantly better than the second-best algorithm with 83% (see Table 2). The second dataset, "DIC-HeLa" (dataset provided by Dr. Gert van Cappellen, Erasmus Medical Center, Rotterdam, The Netherlands), consists of HeLa cells on flat glass recorded by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d and Supplementary Material). It contains 20 partially annotated training images. Here we achieve an average IoU of 77.5%, which is significantly better than the second-best algorithm with 46%.

5 Conclusion

  The u-net architecture achieves very good performance on very different biomedical segmentation applications. Thanks to data augmentation with elastic deformations, it needs only very few annotated images and has a very reasonable training time of only 10 hours on an NVidia Titan GPU (6 GB). We provide the full Caffe [6]-based implementation and the trained networks (U-net implementation, trained networks, and supplementary material are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net). We are sure that the u-net architecture can be applied easily to many more tasks.

Acknowledgements

This study was supported by the Excellence Initiative of the German Federal and State governments (EXC 294) and by the BMBF (Fkz 0316185B).

References

  1. Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS. pp. 2852–2860 (2012)
  2. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: NIPS (2014)
  3. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
  4. Hariharan, B., Arbelez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization (2014), arXiv:1411.5752 [cs.CV]
  5. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), arXiv:1502.01852 [cs.CV]
  6. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding (2014), arXiv:1408.5093 [cs.CV]
  7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. pp. 1106–1114 (2012)
  8. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541–551 (1989)
  9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation (2014), arXiv:1411.4038 [cs.CV]
  10. Maska, M., (…), de Solorzano, C.O.: A benchmark for comparison of cell tracking algorithms. Bioinformatics 30, 1609–1617 (2014)
  11. Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks. In: Computer Vision (ICCV), 2013 IEEE International Conference on. pp. 2168–2175 (2013)
  12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014), arXiv:1409.1556 [cs.CV]
  13. WWW: Web page of the cell tracking challenge, http://www.codesolorzano.com/celltrackingchallenge/Cell_Tracking_Challenge/Welcome.html
  14. WWW: Web page of the em segmentation challenge, http://brainiac2.mit.edu/isbi_challenge/
