UNet++: A Nested U-Net Architecture for Medical Image Segmentation


Preface

This is the second medical image segmentation paper I read after U-Net; these notes are a record of my study. Thanks to the excellent blog posts by other authors that I read along the way.


1. Paper translation

Abstract: In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network, in which the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The redesigned skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that when the feature maps of the encoder and decoder networks are semantically similar, the optimizer faces an easier learning task. We compared UNet++ against U-Net and wide U-Net on multiple medical image segmentation tasks: nodule segmentation in low-dose CT scans of the chest, nucleus segmentation in microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments show that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

1 Introduction

  State-of-the-art image segmentation models are variants of the encoder-decoder architecture, such as U-Net [9] and the Fully Convolutional Network (FCN) [8]. These encoder-decoder networks for segmentation share a key similarity: skip connections, which combine the deep, semantic, coarse-grained feature maps from the decoder sub-network with the shallow, low-level, fine-grained feature maps from the encoder sub-network. Skip connections have proven effective in recovering fine-grained details of target objects, generating segmentation masks with fine details even over complex backgrounds. Skip connections are also crucial to the success of instance-level segmentation models such as Mask-RCNN, which can segment occluded objects. Arguably, image segmentation has reached a satisfactory level of performance on natural images, but do these models meet the strict segmentation requirements of medical images?
  Segmenting lesions or abnormalities in medical images demands a higher level of accuracy than is needed for natural images. While a precise segmentation mask may not be critical in natural images, even marginal segmentation errors in medical images can lead to poor user experience in clinical settings. For instance, subtle spiculation around a nodule may indicate malignancy; excluding it from the segmentation mask would therefore lower the credibility of the model in the clinic. In addition, inaccurate segmentation may lead to major changes in the subsequent computer-generated diagnosis. For example, an erroneous measurement of nodule growth in a longitudinal study can result in the assignment of an incorrect Lung-RADS category to a screening patient. It is therefore desirable to devise a more effective image segmentation architecture that can recover the fine details of target objects in medical images.
  To address the need for more accurate segmentation in medical images, we propose UNet++, a new segmentation architecture based on nested and dense skip connections. The underlying hypothesis of our architecture is that the model can more effectively capture the fine-grained details of foreground objects when the high-resolution feature maps from the encoder network are gradually enriched before being fused with the corresponding semantically rich feature maps from the decoder network. We argue that the network handles an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. This stands in sharp contrast to the plain skip connections used in U-Net, which directly feed the high-resolution feature maps from the encoder to the decoder network, resulting in the fusion of semantically dissimilar feature maps. According to our experiments, the proposed architecture is effective, yielding significant performance gains over U-Net and wide U-Net.

2 Related Work

  Long et al. [8] first proposed FCN, and in the same year (2015) Ronneberger et al. [9] proposed U-Net. Both share a key idea: skip connections (realized with different operations). In FCN, the up-sampled feature maps are summed pixel-wise with the feature maps from the encoder, whereas U-Net concatenates them along the channel dimension and adds convolutions and non-linear activations between each up-sampling step. Either kind of skip connection has been shown to help recover the full spatial resolution at the network output, making fully convolutional methods suitable for semantic segmentation. Inspired by the DenseNet architecture [5], Li et al. [7] proposed H-DenseUNet for liver and liver tumor segmentation. With a similar motivation, Drozdzal et al. [2] systematically studied the importance of skip connections and introduced short skip connections within the encoder. Despite slight differences among the above architectures, they all tend to fuse semantically dissimilar feature maps from the encoder and decoder sub-networks, which, according to our experiments, can degrade segmentation performance.
  Two other recent related works are GridNet [3] and Mask-RCNN [4]. GridNet is an encoder-decoder architecture in which the feature maps are wired in a grid fashion, generalizing several classical segmentation architectures. However, GridNet lacks up-sampling layers between skip connections; thus, it does not represent UNet++. Mask-RCNN is perhaps the most important meta-framework for object detection, classification, and segmentation. We believe UNet++ can be readily deployed as the backbone architecture in Mask-RCNN by simply replacing the plain skip connections with the proposed nested dense skip pathways. Due to limited space, we were not able to include results of Mask-RCNN with a UNet++ backbone; interested readers can refer to the supplementary material for further details.
[Fig. 1: (a) UNet++ consists of an encoder and a decoder connected through a series of nested dense convolutional blocks; (b) detailed analysis of the first skip pathway of UNet++; (c) UNet++ can be pruned at inference time if trained with deep supervision.]

3 Proposed neural network structure: UNet++

  Fig. 1a shows a high-level overview of UNet++. As can be seen, UNet++ starts with an encoder sub-network or backbone followed by a decoder sub-network. What distinguishes UNet++ from U-Net (shown in black in Fig. 1a) is the redesigned skip pathways (shown in green and blue), which connect the two sub-networks, and the use of deep supervision (shown in red).

3.1 Re-designed skip pathways

  The redesigned skip pathways change the connectivity of the encoder and decoder sub-networks. In U-Net, the decoder receives the feature maps of the encoder directly; in UNet++, however, they pass through a dense convolution block whose number of convolution layers depends on the pyramid level. For example, the skip pathway between node $X^{0,0}$ and node $X^{1,3}$ consists of a dense convolution block with three convolution layers, where each convolution layer is preceded by a concatenation layer that fuses the outputs of the previous convolution layers of the same dense block with the corresponding up-sampled output of the lower dense block. Essentially, the dense convolution block brings the semantic level of the encoder feature maps closer to that of the feature maps awaiting them in the decoder. The hypothesis is that the optimizer faces an easier optimization problem when the received encoder feature maps and the corresponding decoder feature maps are semantically similar.
  Formally, we formulate the skip pathway as follows: let $x^{i,j}$ denote the output of node $X^{i,j}$, where $i$ indexes the down-sampling layer along the encoder and $j$ indexes the convolution layer of the dense block along the skip pathway. The stack of feature maps represented by $x^{i,j}$ is computed as
$$
x^{i,j} =
\begin{cases}
\mathcal{H}\left(x^{i-1,j}\right), & j = 0 \\[4pt]
\mathcal{H}\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\, \mathcal{U}\left(x^{i+1,j-1}\right)\right]\right), & j > 0
\end{cases}
\tag{1}
$$
  where the function $\mathcal{H}(\cdot)$ is a convolution operation followed by an activation function, $\mathcal{U}(\cdot)$ denotes an up-sampling layer, and $[\,]$ denotes the concatenation layer. Basically, nodes at level $j=0$ receive only one input, from the previous layer of the encoder; nodes at level $j=1$ receive two inputs, both from the encoder sub-network but from two consecutive levels; and nodes at level $j>1$ receive $j+1$ inputs, of which $j$ inputs are the outputs of the previous $j$ nodes on the same skip pathway and the last input is the up-sampled output from the lower skip pathway. All prior feature maps accumulate and arrive at the current node because of the dense convolution block along each skip pathway. Fig. 1b further clarifies Eq. 1 by showing how the feature maps travel through the top skip pathway of UNet++.
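To make Eq. 1 concrete, here is a minimal PyTorch sketch of how one node $x^{i,j}$ (for $j>0$) could be computed. This is only an illustrative reading of the formula, not the authors' official implementation; the names (`ConvBlock`, `node_output`) and the channel sizes in the usage example are my own assumptions:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """A stand-in for H(.): one 3x3 convolution followed by an activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))

def node_output(same_level_outputs, below_output, h, up):
    """Compute x^{i,j} for j > 0 per Eq. 1: concatenate all previous
    outputs x^{i,0}..x^{i,j-1} on the same pathway with the up-sampled
    output U(x^{i+1,j-1}) from the pathway below, then apply H(.)."""
    fused = torch.cat(same_level_outputs + [up(below_output)], dim=1)  # [ ... ]
    return h(fused)                                                    # H(.)

# Example: x^{0,1} = H([x^{0,0}, U(x^{1,0})]), with assumed channel sizes.
up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
x00 = torch.randn(1, 32, 96, 96)   # encoder output at level i=0
x10 = torch.randn(1, 64, 48, 48)   # encoder output at level i=1
x01 = node_output([x00], x10, ConvBlock(32 + 64, 32), up)  # -> (1, 32, 96, 96)
```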

3.2 Deep supervision

  We propose to use deep supervision [6] in UNet++, enabling the model to operate in two modes: 1) accurate mode, in which the outputs from all segmentation branches are averaged; 2) fast mode, in which the final segmentation map is selected from only one of the segmentation branches, a choice that determines the extent of model pruning and the speed gain. Fig. 1c shows how the choice of segmentation branch in fast mode results in architectures of varying complexity. A small code sketch of the two modes follows.
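The two inference modes can be summarized in a few lines. The sketch below is only an illustration of the idea; the function names are mine, and `maps` is assumed to hold the four branch outputs in order:

```python
import torch

def accurate_mode(maps):
    """Accurate mode: average the segmentation maps of all branches."""
    return torch.stack(maps, dim=0).mean(dim=0)

def fast_mode(maps, branch):
    """Fast mode: keep the map of a single branch (1-indexed). Branches
    deeper than `branch` need not be computed at all, which is what
    enables the pruning discussed in Section 4."""
    return maps[branch - 1]
```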

  Thanks to the nested skip pathways, UNet++ generates full-resolution feature maps $\{x^{0,j},\ j \in \{1,2,3,4\}\}$ at multiple semantic levels, which are amenable to deep supervision. We append to each of these four semantic levels a loss function combining binary cross-entropy and the Dice coefficient, described as:
$$
\mathcal{L}\left(Y, \hat{Y}\right) = -\frac{1}{N} \sum_{b=1}^{N} \left( \frac{1}{2} \cdot Y_b \cdot \log \hat{Y}_b + \frac{2 \cdot Y_b \cdot \hat{Y}_b}{Y_b + \hat{Y}_b} \right)
\tag{2}
$$
where $\hat{Y}_b$ and $Y_b$ denote the flattened predicted probabilities and the flattened ground truths of the $b$-th image, respectively, and $N$ indicates the batch size.
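As a concrete reading of this hybrid loss, here is a minimal PyTorch sketch. The per-pixel normalization of the cross-entropy term and the smoothing constant `eps` are my assumptions; the authors' official implementation may differ:

```python
import torch

def bce_dice_loss(y_pred, y_true, eps=1e-7):
    """Hybrid loss in the spirit of Eq. 2: a (halved) cross-entropy term
    plus a soft Dice term, averaged over the batch and negated so that
    minimizing the loss maximizes both terms.
    y_pred: predicted probabilities (after sigmoid), shape (N, ...)
    y_true: binary ground truth, same shape
    """
    n = y_true.shape[0]
    y_pred = y_pred.reshape(n, -1).clamp(eps, 1.0 - eps)  # flatten Y^_b
    y_true = y_true.reshape(n, -1)                        # flatten Y_b
    ce = 0.5 * (y_true * torch.log(y_pred)).mean(dim=1)   # 1/2 * Y_b * log Y^_b
    dice = (2.0 * (y_true * y_pred).sum(dim=1)) / ((y_true + y_pred).sum(dim=1) + eps)
    return -(ce + dice).mean()
```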
  In summary, as depicted in Fig. 1a, UNet++ differs from the original U-Net in three ways: 1) it has convolution layers on the skip pathways (shown in green), which bridge the semantic gap between encoder and decoder feature maps; 2) it has dense skip connections on the skip pathways (shown in blue), which improve gradient flow; and 3) it uses deep supervision (shown in red), which, as shown in Section 4, enables model pruning and improves performance or, in the worst case, achieves performance comparable to using only one loss layer.

4 Experiments

Datasets: As shown in Table 1, we use four medical imaging datasets for model evaluation, covering lesions/organs from different medical imaging modalities. For further details about the datasets and the corresponding data preprocessing, please refer to the supplementary material.
Baseline models: For comparison, we used the original U-Net and a customized wide U-Net. We chose U-Net because it is a common performance baseline for image segmentation. We also designed a wide U-Net with a number of parameters similar to our proposed architecture, to ensure that the performance gain yielded by our architecture is not simply due to an increased number of parameters. Table 2 details the U-Net and wide U-Net architectures.
Implementation details: We monitored the Dice coefficient and Intersection over Union (IoU), and used an early-stopping mechanism on the validation set. We used the Adam optimizer with a learning rate of 3e-4. Architecture details for U-Net and wide U-Net are listed in Table 2. UNet++ is constructed from the original U-Net architecture. All convolutional layers along a skip pathway ($X^{i,j}$) use $k$ kernels of size 3×3 (or 3×3×3 for 3D lung nodule segmentation), where $k = 32 \times 2^i$. To enable deep supervision, a 1×1 convolutional layer followed by a sigmoid activation function is appended to each of the target nodes. As a result, UNet++ generates four segmentation maps for a given input image, which are further averaged to produce the final segmentation map. More details can be found at github.com/Nested-UNet.
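The deep-supervision heads described above are easy to picture in code. The following PyTorch sketch (the class name and the 32-channel assumption for level $i=0$ are mine, not from the paper) shows the 1×1 convolution + sigmoid head and the averaging of the four resulting maps:

```python
import torch
import torch.nn as nn

class SupervisionHead(nn.Module):
    """A 1x1 convolution + sigmoid appended to a full-resolution node
    x^{0,j}; 32 input channels assumed from k = 32 * 2^i with i = 0."""
    def __init__(self, in_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):
        return torch.sigmoid(self.conv(x))

# One head per target node x^{0,1}..x^{0,4}; averaging gives accurate mode.
heads = nn.ModuleList(SupervisionHead(32) for _ in range(4))
xs = [torch.randn(1, 32, 96, 96) for _ in range(4)]   # stand-in node outputs
maps = [head(x) for head, x in zip(heads, xs)]
final = torch.stack(maps, dim=0).mean(dim=0)          # final segmentation map
```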
Results: Table 3 compares U-Net, wide U-Net, and UNet++ in terms of the number of parameters and segmentation accuracy for lung nodule, colon polyp, liver, and cell nucleus segmentation. As can be seen, wide U-Net consistently outperforms U-Net, except for liver segmentation, where the two architectures perform comparably. This improvement is attributed to the larger number of parameters in wide U-Net. UNet++ without deep supervision achieves a significant performance gain over both U-Net and wide U-Net, with average IoU improvements of 2.8 and 3.3 points, respectively. UNet++ with deep supervision exhibits an average improvement of 0.6 points over UNet++ without deep supervision. Specifically, deep supervision leads to marked improvements for liver and lung nodule segmentation, but such improvements vanish for cell nucleus and colon polyp segmentation. This is because polyps and the liver appear at varying scales in video frames and CT slices; hence, a multi-scale approach using all segmentation branches (deep supervision) is essential for accurate segmentation. Fig. 2 shows a qualitative comparison between the results of U-Net, wide U-Net, and UNet++.
[Table 3: number of parameters and segmentation accuracy (IoU) of U-Net, wide U-Net, and UNet++ on the four tasks.]
[Fig. 2: qualitative comparison of U-Net, wide U-Net, and UNet++ results.]
Model pruning: Fig. 3 shows the segmentation performance of UNet++ after pruning to varying degrees. We use UNet++ Li to denote UNet++ pruned at level i (see Fig. 1c for details). As can be seen, UNet++ L3 reduces the average inference time by 32.2% while degrading IoU by only 0.6 points. More aggressive pruning further reduces the inference time, but at the cost of a significant accuracy drop.
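To illustrate what pruning means computationally, here is a hedged sketch of a pruned forward pass: it computes only the nodes needed for the branch output $x^{0,L}$, i.e. the nodes $X^{i,j}$ with $i + j \le L$. The helper name and the dictionaries `nodes` and `convs` are hypothetical, not the official API:

```python
import torch

def pruned_forward(nodes, convs, up, L):
    """UNet++ pruned at level L (fast mode): compute only nodes X^{i,j}
    with i + j <= L, then read the prediction from x^{0,L}.
    nodes: dict mapping (i, 0) -> encoder output tensors for i = 0..L
    convs: dict mapping (i, j) -> the H(.) block for that node
    up:    up-sampling module, e.g. nn.Upsample(scale_factor=2)
    """
    for j in range(1, L + 1):            # walk the skip pathways left to right
        for i in range(L - j + 1):       # only nodes the L-th branch depends on
            prev = [nodes[(i, k)] for k in range(j)]   # x^{i,0}..x^{i,j-1}
            below = up(nodes[(i + 1, j - 1)])          # U(x^{i+1,j-1})
            nodes[(i, j)] = convs[(i, j)](torch.cat(prev + [below], dim=1))
    return nodes[(0, L)]                 # the selected branch's feature map
```

Choosing a smaller L skips every node to the right of the selected branch, which is exactly where the reported inference-time savings come from.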

5 Conclusion

  To address the need for more accurate medical image segmentation, we proposed UNet++. The proposed architecture takes advantage of redesigned skip pathways and deep supervision. The redesigned skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks, resulting in a possibly simpler optimization problem for the optimizer to solve. Deep supervision also enables more accurate segmentation, particularly for lesions that appear at multiple scales, such as polyps in colonoscopy videos. We evaluated UNet++ on four medical imaging datasets covering lung nodule segmentation, colon polyp segmentation, cell nucleus segmentation, and liver segmentation. Our experiments demonstrated that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

2. Supplement

The paper author's own explanation column on Zhihu: UNet++ study
U-Net series video explanation: [link]


Origin: blog.csdn.net/qq_43173239/article/details/112870517