
NAS-Unet: Neural Architecture Search for Medical Image Segmentation


Paper address: https://doi.org/10.1109/ACCESS.2019.2908991

Project address: https://github.com/tianbaochou/NasUnet

IEEE Access 2019

ABSTRACT

Neural Architecture Search (NAS) has made significant progress in improving image classification accuracy. Recently, some studies have attempted to extend NAS to image segmentation and shown preliminary feasibility, but they all focus on searching architectures for semantic segmentation of natural scenes. In this paper, we design three types of primitive operation sets on the search space for semantic image segmentation, especially medical image segmentation, to automatically find two cell architectures, DownSC and UpSC. Inspired by the successful application of the U-net architecture and its variants to various medical image segmentation tasks, we propose NAS-Unet, which stacks the same number of DownSC and UpSC cells on a U-like backbone. During the search stage, the architectures of DownSC and UpSC are updated simultaneously by a differentiable architecture strategy. We demonstrate the good segmentation performance of our method on the Promise12, Chaos, and ultrasound nerve datasets, acquired by magnetic resonance imaging, computed tomography, and ultrasound, respectively. Without any pretraining, the architecture we searched on PASCAL VOC 2012 obtains better performance than U-net and its variants with far fewer parameters (about 0.8M).

I. INTRODUCTION

With the development and popularization of medical imaging equipment, magnetic resonance imaging (MRI), computed tomography (CT), ultrasound, and similar modalities have become indispensable for disease diagnosis, surgical planning, and prognosis assessment in medical institutions. MRI is the most widely used technique in radiology. One of its outstanding features is the diversity of imaging sequences: in MRI, image contrast depends on the pulse sequence used. The most common pulse sequences are T1-weighted (spin-lattice; that is, magnetization along the direction of the static magnetic field) and T2-weighted (spin-spin; magnetization transverse to the static magnetic field). MRI can provide information different from CT, but MRI scans can be risky and uncomfortable: they typically take longer and are louder than CT scans, and they often require the subject to enter a narrow, closed tube. Ultrasound imaging uses high-frequency sound waves to see inside the body; unlike CT and MRI, ultrasound images have relatively low resolution.

Medical image segmentation is a key step in medical image analysis: it helps make images more intuitive and improves diagnostic efficiency. To provide a reliable basis for clinical diagnosis and pathological research and help doctors make more accurate diagnoses, we need to segment the parts of a medical image we care about and extract relevant features. Initially, medical image analysis addressed specific tasks by sequentially applying low-level pixel processing (such as region-based methods [1] or threshold-based methods [2]) and mathematical modeling to construct compound rule-based systems [3]. Segmentation results from this period are generally not semantically labeled. In the deep learning era, image segmentation generally refers to semantic segmentation, that is, recognizing images at the pixel level (labeling the object category to which each pixel belongs) [4], [5]. For example, in Figure 1, the medical image on the left consists of the bladder wall and other tissues, and the image on the right is its semantic segmentation: the bladder wall is marked in yellow, while the other tissues are treated as background and marked in purple. To date, the most successful deep learning model for image analysis is the convolutional neural network (CNN).

[Figure 1: A medical image (left) and its semantic segmentation (right); the bladder wall is marked in yellow and the remaining tissues (background) in purple.]

The mutual promotion of deep learning, big data, and cloud computing has greatly advanced computer vision [6]. The CNN, originally proposed to solve image classification, is the most commonly used neural network in the field. Image segmentation is a common task in both natural and medical image analysis. To solve it, a CNN can simply classify each pixel individually by extracting a patch around that pixel, producing a multi-channel likelihood map of the same size as the input image. However, keeping the feature maps at full resolution throughout consumes a great deal of memory. More commonly, downsampling layers (such as max pooling and average pooling) are inserted after several convolutional layers to reduce the dimensionality of the feature maps and refine high-level context. Unfortunately, this results in an output resolution much lower than the input resolution. The FCN (Fully Convolutional Network) [7] is one of several methods that prevent this resolution degradation: it was the first work to train end-to-end, pixel-wise predictions by replacing fully connected layers with convolutional layers followed by a series of upsampling layers. A classical CNN generally uses a fully connected layer after the last convolutional layer to obtain a fixed-length feature vector, which is fed into a classifier (such as a softmax layer). In contrast, an FCN can accept input images of any size; the upsampling layers after the last convolutional layer restore the output to the same dimensions as the input image, preserving the spatial information of the original input while making a prediction for every pixel, and the upsampled features are finally classified pixel by pixel to produce the desired segmentation. Similar to FCN, U-net [8] consists of convolutional layers, downsampling layers, and upsampling layers. Unlike FCN, U-net has the same number of downsampling and upsampling layers, with matching numbers of convolutional layers between them. In addition, U-net uses skip connections to link each pair of downsampling and upsampling layers, so spatial information is passed directly to deeper layers and the segmentation results are more accurate.

From the earliest LeNet [9] to AlexNet [10], VggNet [11], GoogleNet [12], ResNet [13], and the more recent DenseNet [14], CNN models have grown steadily stronger and more mature. Many works design network structures for specific tasks [15], [16]. These popular architectures were designed by industry experts and academics over months or even years, because designing a network architecture with excellent performance requires a great deal of domain knowledge; most researchers lack this expertise, and the design process is time-consuming and labor-intensive. For this reason, the research focus of convolutional neural networks has shifted to neural architecture search (NAS) [17]. NAS can be seen as a subfield of AutoML (Automated Machine Learning), with significant overlap with hyperparameter optimization and meta-learning. Current NAS research mainly focuses on three aspects: the search space, the search strategy, and the performance evaluation strategy. The search space defines which architectures can in principle be represented; incorporating prior knowledge suited to the task's properties can shrink the search space and simplify the search. For example, in image classification, the search space includes the choice of primitive operations at each search step and the prior backbone architecture that defines the outer network. The search strategy details how to explore the search space. The goal of NAS is usually to find architectures with high evaluation performance on unseen data (e.g., split the training dataset into training and validation subsets, search for architectures on the former, and evaluate on the latter) [17]. Much work has been done on NAS, most of it focused on image classification [18]–[23].

Although NAS has great potential in computer vision, its real promise depends on whether it can be extended to vision tasks beyond image classification, especially core problems such as semantic segmentation and object detection, which rely on high-resolution inputs and multi-scale image representations. Carrying NAS directly from image classification to semantic segmentation is infeasible: first, the search space of the classification task differs significantly from that of semantic segmentation; second, naively transferring an architecture searched at low resolution to high-resolution inputs does not work as expected [24]. The logical way to solve these two problems is to build a search space specific to image segmentation and to search the architecture on high-resolution images. Some works have attempted this with some success; this is exactly the line of thought followed by recent work [24], [25].

This paper proposes a new set of primitive operations for medical image segmentation. Inspired by the success of U-net and its variants in medical image segmentation, we use a U-like architecture as our backbone (i.e., the outer network) and search the architectures of two types of cells (a downsampling cell and an upsampling cell), denoted DownSC and UpSC respectively, in parallel on PASCAL VOC 2012 [26]. Finally, we obtain our architecture, denoted NAS-Unet, which stacks the same number of DownSC and UpSC cells. Our work shows that NAS-Unet is more efficient in parameter usage and achieves better performance than U-Net and FC-DenseNet [27] (a U-Net variant) on all the types of medical image datasets mentioned above. In summary, our contributions are as follows:

  1. This paper is the first attempt to apply NAS to medical image segmentation.

  2. On the U-shaped backbone network, we propose different sets of primitive operations for searching DownSC and UpSC respectively. After the search, we empirically find that in our UpSC architecture the standard skip connection is replaced by a channel-weighting (cweight) operation (see Section V-A).

  3. We show that NAS-Unet outperforms U-Net and its variant FC-DenseNet on all types of medical image segmentation datasets we evaluate, without using any pretrained backbone. The training time of NAS-Unet is close to that of U-Net, but it uses only about 6% as many parameters; FC-DenseNet costs twice as much GPU memory as our model.

II. RELATED WORK

A. Modern CNN-Based Medical Image Segmentation

To our knowledge, Ciresan et al. [28] were the first to apply deep neural networks to medical image segmentation, segmenting stacks of electron microscopy images with a convolutional neural network. The "patch" is the key idea behind this segmentation: to segment the entire stack, a classifier is applied to each pixel of each slice in a sliding-window fashion by extracting the patch around that pixel. A disadvantage of this naive sliding-window approach is that patches from adjacent pixels overlap substantially, causing redundant computation; [28] also points out that segmenting an entire stack this way is time-inefficient, taking at least 10 minutes per stack on four GPUs. Ronneberger et al. [8] recast fully connected layers as convolutions and attacked the same task with better results. As shown in Fig. 2(a), the authors further borrowed the idea of FCN [7] and proposed the U-Net architecture, designed on the encoder-decoder framework: the input image passes through the encoder to extract high-level context, which then flows into the decoder to restore spatial information and produce pixel-level classification results. Although this was not the first use of an encoder-decoder in a convolutional neural network (e.g., Shelhamer et al. [7] used a pretrained modern CNN as the encoder and "up" convolutional layers as the decoder in FCN-32), the authors combined it with horizontal skip connections that directly link the opposing contracting and expanding convolutional layers.

After U-Net was proposed, it performed well in medical image segmentation, and many researchers have made various improvements to it. Çiçek et al. [29] first proposed a 3D U-Net architecture that achieves 3D image segmentation by taking a sequence of continuous 2D slices of a 3D image as input. Milletari et al. [30] proposed V-Net, a 3D variant based on U-Net. V-Net directly minimizes the commonly used segmentation error metric by using a Dice coefficient loss function instead of the cross-entropy loss, and further introduces residual blocks into the original U-shaped design. Both methods extend the U-shaped structure with 3D convolution kernels.

[31] distinguished between long skip connections (i.e., skip connections between two feature maps that are far apart) and short skip connections (often realized as single residual blocks), and found that both benefit the construction of deep architectures for medical image segmentation. Simon et al. [27] combined Densely Connected Convolutional Networks (DenseNets) with the U-shaped architecture, replacing the convolutional layers in the backbone with dense blocks, and extended it to natural image segmentation with good performance.

B. Neural Architecture Search

Designing a good neural network architecture is time-consuming and laborious. To reduce the workload and resource cost of manual design, some researchers have turned to neural architecture search (NAS). Currently, most studies focus on searching CNN architectures for image classification, with less research on RNNs for language tasks. As mentioned earlier, NAS consists of three components: the search space, the search strategy, and performance estimation. Search algorithms mainly include heuristic algorithms [19], [21], [32]–[34], reinforcement learning [35]–[39], Bayesian optimization [40], [41], and gradient-based methods [20], [42], [43]. Performance evaluation can be understood from two aspects: first, evaluating candidate architectures to decide whether to preserve (or expand) them for the next update; second, stacking cells (when using a cell-based search space) or the current candidate architecture into a deeper network and training it on the training dataset to evaluate final performance.

The network search space consists of the topology of nodes and the operations on each connecting edge. Early works try to construct the whole network architecture directly [36], [44]. However, since NASNet [37] successfully stacked cells on ImageNet, recent works [20], [22], [23], [45], [46] prefer to search for a repeatable cell structure while keeping the backbone network fixed, which improves search efficiency. In recent years, a large amount of NAS research has proposed efficient node-topology generation algorithms built on powerful yet tractable architecture search spaces. In fact, strong results can be obtained even with random search, provided the search space is rich but not too bloated [20], [40]. Therefore, this paper mainly studies the construction of a hierarchical, cell-based search space for medical image segmentation. Furthermore, we use current differentiable architecture search methods [20], [22] as our search algorithm to speed up the search process.

C. Application of NAS in Image Segmentation

Since its proposal, NAS has mainly been applied to image classification, but some recent studies apply it to image segmentation. Chen et al. [24] first introduced NAS to the image segmentation problem. The authors show that architecture search outperforms human-invented architectures and achieves better performance on many segmentation datasets, even when random search is used within their recursive search space. However, instead of a one-shot search, this work focuses on searching a small Atrous Spatial Pyramid Pooling (ASPP) module named DPC (similar to a decoder) while fixing a pretrained backbone (modified Xception) as the encoder. Liu et al. [25] propose Auto-DeepLab, a general network-level search space with joint search across two levels (network-level and cell-level architectures). The authors point out that this search space includes various existing designs such as DeepLabv3, Conv-Deconv, and stacked hourglass. However, the search space of Auto-DeepLab does not include the U-Net architecture, the best-known architecture in medical image segmentation.

The work most similar to ours is [27], which uses dense blocks instead of convolutional layers in both the contraction and expansion stages. In contrast, we replace all pre-designed cells with cells searched by the NAS method.

III. CELL-BASED ARCHITECTURE SEARCH SPACE

In this section, we first describe a general representation of the CNN architecture we use. We will show how to represent the cell structure as a DAG. We then introduce search spaces for medical image segmentation. Finally, we will detail the two types of cell structures.

A. CNN Architecture Representation

We use a directed acyclic graph (DAG) to represent the network topology, where each node $h_i$ represents the input image or a feature map, and each edge $e_{ij}$ is associated with an operation between nodes $h_i$ and $h_j$ (such as a convolution, a pooling operation, or a skip connection). If the way the DAG is generated is unconstrained, the resulting architecture space becomes very large, posing a great challenge to existing search algorithms. Therefore, we use a cell-based architecture: once the optimal cell structure is determined, we can stack cells into a deeper network on the backbone. In other words, the cell architecture is shared by the entire network.
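
To make the representation concrete, here is a toy sketch (our illustration, not code from the paper) of a cell encoded as a DAG edge list, where each entry names its source node, target node, and operation:

```python
# A toy edge-list encoding of a cell DAG: nodes are feature maps, and each
# edge carries one operation. Names are illustrative, not the paper's exact set.
cell_edges = [
    ("h0", "h2", "down_conv_3x3"),   # input node h0 -> intermediate node h2
    ("h1", "h2", "down_cweight"),
    ("h0", "h3", "max_pool_2x2"),
    ("h2", "h3", "identity"),        # edge between two intermediate nodes
]

# The same shared cell structure is stacked repeatedly on the backbone.
intermediate_nodes = sorted({dst for _, dst, _ in cell_edges})
print(intermediate_nodes)  # ['h2', 'h3']
```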

B. Search Space for Medical Image Segmentation

In this section, we present the selection of the base set of operations for the DownSC and UpSC architectures. Afterwards, we describe how to construct them.

1) THE SELECTION OF PRIMITIVE OPERATIONS

How do we choose appropriate operators? We study popular CNN architectures and previous NAS work that succeeded in image classification, and summarize the important criteria for selecting primitive operations in our work:

  • No redundancy: each primitive operation should have some unique property and should not be replaceable by other operations. Although some works [25], [34] consider 5 × 5 convolutions during the search, large receptive fields such as 5 × 5 and 7 × 7 convolutions can be obtained by stacking enough 3 × 3 convolutions. Therefore, all convolution operations are limited to a size of 3 × 3, and pooling operations to 2 × 2.
  • Fewer parameters: fewer memory resources consumed during the search. The original U-Net requires about 31 million parameters, which is huge for mobile devices. In our work, a depthwise separable convolution operation is introduced because it significantly reduces network parameters without sacrificing network performance (see the sketch after this list).
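
As a rough illustration of the savings (a sketch under our own choice of channel sizes, not the paper's implementation), a depthwise separable 3 × 3 convolution can be written in PyTorch as a depthwise convolution followed by a 1 × 1 pointwise convolution:

```python
import torch.nn as nn

c_in, c_out = 64, 64

standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

# Depthwise separable: per-channel 3x3 conv, then a 1x1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # depthwise
    nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 36928 vs 4800
```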

When the stride is set to 2, a convolution operation can either halve or double the spatial dimensions of the feature map; the latter is called an "up" convolution. This shows that down and up operations can be derived from the same base operation. In contrast, the "up" versions of some operations from image classification do not make sense (e.g., the identity operation), and "up" versions of the pooling operations (average pooling and max pooling) do not exist. For convenience, we construct three different sets of primitive operations.
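
A quick sketch of deriving "down" and "up" versions from the same base operation with stride 2, using a plain convolution and a transposed convolution (the kernel and padding choices here are ours, for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

down = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)   # halves H and W
up = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2,
                        padding=1, output_padding=1)           # doubles H and W

print(down(x).shape)  # torch.Size([1, 16, 32, 32])
print(up(x).shape)    # torch.Size([1, 16, 128, 128])
```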

[Table 1: The three sets of primitive operations (Down PO, Up PO, and Norm PO).]

As shown in Table 1, "depthwise conv" denotes a depthwise separable convolution; the other operations are commonly used in current NAS methods, except atrous convolution [47] and cweight [48]. The cweight operation is a squeeze-and-excitation operation [48]. In early CNN architectures, the features generated for all channels were combined uniformly; a natural next step is to learn a weight for each channel automatically, which is exactly what squeeze-and-excitation does: it suppresses redundant features and enhances useful ones by assigning weights to feature channels. The down cweight and up cweight operations halve or double the spatial dimensions of the feature maps before channel reweighting. It is worth noting that when previous NAS papers presented their best architectures for image classification, dilated (atrous) convolution hardly ever appeared, even though this operation was originally designed to solve image segmentation. As mentioned before, unlike image classification, architecture search for image segmentation requires high-resolution inputs, so the huge memory consumption is certainly noticeable: for a 512 × 512 image, predicting with the original U-Net architecture allows a batch size of at most 4 on a 12 GB Titan (Pascal) GPU.
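
The following is a minimal PyTorch sketch of a squeeze-and-excitation style channel-reweighting block in the spirit of the cweight operation; the class name and reduction ratio are our assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CWeight(nn.Module):
    """Squeeze-and-excitation style channel reweighting (illustrative)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pooling
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel weights
        return x * w                      # reweight the feature channels

x = torch.randn(2, 32, 64, 64)
print(CWeight(32)(x).shape)  # torch.Size([2, 32, 64, 64])
```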

We use the Conv-ReLU-GN order for all convolution operations. GN stands for Group Normalization [49]; Wu et al. show that this normalization is superior to batch normalization, especially when the batch size is small. Since the batch size in segmentation tasks is much smaller than in image classification, we use group normalization instead of batch normalization.
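
A sketch of a convolution block in the Conv-ReLU-GN order described above; the group count passed to nn.GroupNorm is an assumption:

```python
import torch
import torch.nn as nn

def conv_relu_gn(c_in: int, c_out: int, groups: int = 8) -> nn.Sequential:
    """3x3 convolution followed by ReLU and Group Normalization."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.GroupNorm(groups, c_out),  # independent of batch size, unlike BatchNorm
    )

block = conv_relu_gn(16, 32)
print(block(torch.randn(2, 16, 48, 48)).shape)  # torch.Size([2, 32, 48, 48])
```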

C. Two types of cell structures

[Figure 2: (a) the U-Net architecture; (b) the NAS-Unet backbone stacked from DownSC and UpSC cells.]

As shown in Fig. 2(b), we design two types of cell architecture on the U-shaped backbone, called DownSC and UpSC. In both cells, the input nodes are defined as the cell outputs of the previous two layers [20], [37]. As shown in Figure 3, all operations adjacent to the input nodes are either Down PO or Up PO. Let $H = \{h_i\}$ be the set of $M$ intermediate nodes (feature map layers). As in DARTS [20], the total number of edges between the intermediate nodes and the input nodes is $E = 2M + M(M-1)/2$ (for $M = 4$, this gives $E = 14$).

[Figure 3: The internal structures of DownSC and UpSC.]

In the contraction step, we stack $L_1$ cells to learn semantic context at different levels and generate a smaller probability map, denoted $DC_{out}$. Likewise, in the expansion step, we use the same number of cells to restore the spatial information of each probability value in $DC_{out}$ and expand it back to the size of the input image. The total number of cells in the final network, denoted NAS-Unet, is $L = 2L_1$. Unlike the FC-DenseNet architecture [27], we not only replace the convolutional layers with these cells but also move the upsampling and downsampling operations into the cells; in other words, both normal operations (such as the identity operation) and up/downsampling operations are handled inside the cell. As shown in Fig. 2(b), the transformation between the contracting and expanding paths is likewise a Norm PO operation in UpSC. Our search space covers many popular U-shaped architectures, such as U-Net [8] and FC-DenseNet [27]. It is worth noting that the original U-Net architecture has an extra convolutional layer in the middle of the network; in this paper, however, we do not follow that practice, because our architecture is strictly symmetric, consisting of several pairs of stacked cells.

IV. SEARCH STRATEGY

We first describe how to build an over-parameterized network that includes all candidate paths, drawing on recent research [20], [22], [50]. We then introduce a more efficient architecture-parameter update strategy to save GPU memory [50] (the CPU is far slower for search, so we must use the GPU).

A. Over-parameterized cell structure

Given a cell structure $C(e_1, \cdots, e_E)$, where $e_i$ represents an edge in the DAG, let $O = \{o_i\}$ be one of the three primitive operation sets above, containing $N$ candidate operations. Instead of associating each edge with one definite operation, we give each edge a mixed operation of $N$ parallel paths (as shown in Fig. 4), denoted MixO.

[Figure 4: A mixed operation MixO with N parallel candidate paths.]

Therefore, the over-parameterized cell structure can be expressed as $C(e_1 = \text{MixO}_1, \cdots, e_E = \text{MixO}_E)$. The output of a mixed operation MixO is defined in terms of the outputs of its $N$ paths:

$$\operatorname{MixO}(x) = \sum_{i=1}^{N} w_i\, o_i(x) \tag{1}$$

In Equation 1, $w_i$, the weight of $o_i$, is a constant 1 in One-Shot [50], while in DARTS [20] it is computed by applying a softmax to $N$ real-valued architecture parameters $\{\alpha_i\}$: $w_i = e^{\alpha_i} / \sum_j e^{\alpha_j}$. The initial value of $\alpha_i$ is $1/N$.
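
A condensed sketch of a DARTS-style mixed operation implementing Equation 1, with a truncated candidate set for illustration (initializing all $\alpha_i$ equal makes every weight start at $1/N$):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixO(nn.Module):
    """Mixed operation: softmax-weighted sum over N candidate operations."""
    def __init__(self, channels: int):
        super().__init__()
        # Truncated candidate set, for illustration only.
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # atrous
        ])
        # One real-valued architecture parameter alpha_i per path.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)  # w_i in Eq. (1)
        return sum(w_i * op(x) for w_i, op in zip(w, self.ops))

x = torch.randn(1, 8, 32, 32)
print(MixO(8)(x).shape)  # torch.Size([1, 8, 32, 32])
```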

B. GPU memory saving update strategy

As described above, the output of each edge is a mixture of $N$ candidate primitive operations, which means the output feature maps of all $N$ paths must be computed, requiring all operations to be loaded into GPU memory. However, training a compact model uses only one path per edge, so [20] and [50] require roughly $N$ times the GPU memory of training a compact model. In this paper, we use the binary gates proposed by Cai et al. [22] to learn binarized paths instead of $N$ paths. The difference between DARTS and the binary-gate method (ProxylessNAS) is that the former updates all architecture parameters by gradient descent at each step, while the latter updates only one of them.

[Figure 5: The alternating update of weight parameters and architecture parameters with binary gates.]

As shown in Figure 5, when training the network weight parameters, we first freeze the architecture parameters and randomly sample binary gates for each batch of input data; the weight parameters of the active paths are then updated by standard gradient descent on the training dataset. When training the architecture parameters, the weight parameters are frozen, the binary gates are resampled, and the architecture parameters are updated on the validation set (Fig. 5(a)). These two update steps are performed alternately. Once training of the architecture parameters is complete, we obtain a compact structure by pruning redundant paths; in this work, we simply select the k paths (k = 2 in our work) with the highest path weights as input. In summary, regardless of the value of N, each update of the architecture parameters involves only two paths, reducing the memory requirement to the same level as training a compact model. It is worth noting that since the ProxylessNAS method considers only two paths per update step, operations not on those paths are trained far less than operations on them, so more iterations are needed to update. Moreover, moving feature maps that are not in GPU memory onto the GPU takes extra time.
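
The following toy sketch illustrates the alternating update with a single edge; the `p / p.detach()` factor is a simplified surrogate that routes gradients to $\alpha$ through the sampled path, not ProxylessNAS's exact gradient estimator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryGateEdge(nn.Module):
    """Toy edge with N candidate ops; only one sampled path is active per step."""
    def __init__(self, dim: int, n_ops: int = 3):
        super().__init__()
        self.ops = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_ops))
        self.alpha = nn.Parameter(torch.zeros(n_ops))  # architecture parameters
        self.active = 0

    def sample_gate(self):
        # Sample one active path according to softmax(alpha).
        self.active = int(torch.multinomial(F.softmax(self.alpha, dim=0), 1))

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)[self.active]
        # p / p.detach() equals 1 numerically but lets alpha receive gradients
        # through the single sampled path -- a simplified binary-gate surrogate.
        return (p / p.detach()) * self.ops[self.active](x)

edge = BinaryGateEdge(dim=8)
w_opt = torch.optim.SGD(edge.ops.parameters(), lr=0.025, momentum=0.95)
a_opt = torch.optim.Adam([edge.alpha], lr=3e-4)
x_tr, y_tr = torch.randn(4, 8), torch.randn(4, 8)  # stand-in training batch
x_va, y_va = torch.randn(4, 8), torch.randn(4, 8)  # stand-in validation batch

for step in range(10):
    edge.sample_gate()  # freeze alpha, update weights on training data
    w_opt.zero_grad(); F.mse_loss(edge(x_tr), y_tr).backward(); w_opt.step()
    edge.sample_gate()  # freeze weights, update alpha on validation data
    a_opt.zero_grad(); F.mse_loss(edge(x_va), y_va).backward(); a_opt.step()
```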

V. EXPERIMENTAL RESULTS

Here we show the details of how we implement NAS-Unet. Afterwards, we report medical image segmentation results on benchmark datasets, with our network stacked from the best cells found.

A. NAS-Unet implementation details

We set the number of intermediate nodes to M = 4 for both DownSC and UpSC, and the total number of cells to $L = 2L_1 = 8$. The search space of DownSC has about $6^6 + 5^8 = 437281$ configurations, and that of UpSC about $6^6 + 4^8 = 112192$, so the total size of the search space is on the order of $10^{10}$, much smaller than [25]. Unlike DARTS, we do not follow the practice of doubling the number of filters when the height and width of the feature maps are halved.

We conduct the architecture search for medical image segmentation on the PASCAL VOC 2012 dataset [26]. More specifically, we use 480 × 480 random image crops and randomly sample half of the images in the training set as the validation set. With the DARTS search strategy, the batch size is 2 and the architecture search runs for a total of 120 epochs; with the binary-gate update strategy, the batch size can be 8, but 200 epochs are required (see Section IV-B).

Since the focus of this paper is constructing an efficient cell search space for medical image segmentation, any differentiable search strategy can work [20], [22], [23], [50]. We wish to search for proxy cell architectures on a very complex image dataset (PASCAL VOC 2012 [26]) and transfer them to medical image datasets, so in our experiments we use the DARTS update strategy. ProxylessNAS could also be used, but it would have to search each of these datasets separately.

When learning the network weights w, we use an SGD optimizer with momentum 0.95, a cosine learning rate decaying from 0.025 to 0.01, and weight decay 0.0003 [20]. When learning the architecture parameters $\alpha$, we use the Adam optimizer [51] with a learning rate of 0.0003 and weight decay 0.0001. We empirically found that whether we optimize $\alpha$ from the beginning or, following [25], start optimizing it after a constant number of epochs (e.g., 50), mean intersection over union (mIoU) and pixel accuracy (pixAcc) increase slowly (Fig. 6). Therefore, we optimize $\alpha$ from the beginning. The entire architecture search takes about 1.5 days on a Titan (Pascal) GPU.
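
The hyperparameters above translate roughly into the following setup (a sketch with stand-in parameter tensors; the shapes are illustrative):

```python
import torch
import torch.nn as nn

weights = nn.Linear(8, 8)                 # stand-in for the network weights w
alpha = nn.Parameter(torch.zeros(14, 6))  # stand-in arch parameters, e.g. E=14 edges

# SGD for w: momentum 0.95, cosine LR 0.025 -> 0.01, weight decay 3e-4.
w_optim = torch.optim.SGD(weights.parameters(), lr=0.025,
                          momentum=0.95, weight_decay=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    w_optim, T_max=120, eta_min=0.01)     # 120 search epochs, as above

# Adam for alpha: lr 3e-4, weight decay 1e-4.
a_optim = torch.optim.Adam([alpha], lr=3e-4, weight_decay=1e-4)
```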

[Figure 6: mIoU and pixAcc curves during the architecture search.]

The DownSC and UpSC we searched are shown in Fig. 7(a) and (b).

[Figure 7: The searched (a) DownSC and (b) UpSC architectures.]

As can be seen from Figure 7, the search process in our search space strongly prefers the "cweight" family of operations (the down cweight, up cweight, and cweight operations), which account for a large proportion of the DownSC and UpSC architectures. Notably, the cweight operation replaces the standard skip connection for passing high-resolution information (more precise spatial information together with high-level semantic information) between the downsampling and upsampling paths (exactly the gray transformation indicated by the arrow). This means that high-resolution information is passed not by simple concatenation but by weighted concatenation.

B. Medical Image Segmentation Results

To evaluate the performance of NAS-Unet, we use three types of medical image datasets (magnetic resonance imaging (MRI), computed tomography (CT), and ultrasound): Promise12 [52], Chaos [53], and NERVE [54]. The weights of all models are updated by minimizing the negative Dice similarity coefficient (DSC), i.e., the Dice loss. We use DSC and mean intersection over union (mIoU) to evaluate model performance. The baseline methods are U-Net [8] and FC-DenseNet [27]; to be fair, we reimplement them in PyTorch [55] with the same data augmentation (we also try to improve the quality of some noisy images [56], [57]). In addition, we use the same Adam optimizer with an initial learning rate of 3.0e-4 and a weight decay of 5.0e-5 for 200 training epochs.
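
A minimal sketch of the soft Dice loss (the negative Dice similarity coefficient) used as the training objective; the smoothing constant is a common convention, not a value stated in the paper:

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Negative soft Dice similarity coefficient for binary segmentation.

    pred: probabilities in [0, 1], shape (B, 1, H, W); target: same shape, {0, 1}.
    """
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dsc = (2 * inter + eps) / (union + eps)
    return -dsc.mean()  # minimizing this maximizes the Dice coefficient

pred = torch.sigmoid(torch.randn(2, 1, 64, 64))
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(dice_loss(pred, target))
```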

1) PROMISE12

Promise12 [52] contains 50 training cases consisting of transverse T2-weighted MR images of the prostate. The training set has about 1250 images with corresponding labels (voxel values 0 and 1 only). Each 2D MRI slice is resized to 256 × 256, and histograms are equalized using contrast-limited adaptive histogram equalization (CLAHE). The training data are divided into 40 training cases and 10 validation cases. As shown in Table 2, our model outperforms all baseline methods without any pretraining. Train Time and GM denote the training time (days and hours) and the GPU memory cost, respectively, both measured with a batch size of 2 (see Table 2).
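
A sketch of the preprocessing described above with OpenCV's CLAHE; the clip limit and tile grid size are our assumptions:

```python
import cv2
import numpy as np

def preprocess_slice(img: np.ndarray) -> np.ndarray:
    """Resize a 2D MRI slice to 256x256 and apply CLAHE (8-bit grayscale input)."""
    img = cv2.resize(img, (256, 256), interpolation=cv2.INTER_LINEAR)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)

slice_u8 = (np.random.rand(320, 320) * 255).astype(np.uint8)  # stand-in slice
print(preprocess_slice(slice_u8).shape)  # (256, 256)
```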

[Table 2: Segmentation results on Promise12.]

2) CHAOS

The Chaos [53] challenge was held at the IEEE International Symposium on Biomedical Imaging (ISBI), 8-11 April 2019 in Venice, Italy, starting with the ISBI meeting. Five competitions use two databases (abdominal CT and MRI); we choose two of them: liver segmentation (CT only) and abdominal organ segmentation (MRI only). The first challenge is to segment the liver from computed tomography (CT) data, and the second is to segment four abdominal organs (liver, spleen, right kidney, and left kidney) from magnetic resonance imaging (MRI) data. Each dataset in these two databases corresponds to a series of DICOM images from a single patient. The first database contains CT images of 40 different patients; in total, 2874 slices (512 × 512 each) are provided for training and 1408 for testing. The second database includes 120 DICOM datasets from two different MRI sequences: T1-DUAL in-phase (40 datasets) and opposed-phase (40 datasets), and T2-SPIR (40 datasets). The resolution of these datasets is 256 × 256, and the number of slices ranges from 26 to 50 (36 on average). In total, 1594 slices (532 per sequence) are provided for training and 1537 for testing.

Currently, we use the 2874 CT slices and 940 MR slices to evaluate our model. As shown in Table 3, our model outperforms all baseline methods without any pretraining on both CT and MR images. It is worth mentioning that on the MR dataset we mitigated the class imbalance problem by reweighting the classes in the Dice loss function (the reported class frequency ratio is 1066:40:3.7:4.1). The batch size is also set to 400.
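
One plausible reading of the class reweighting (a sketch, not the paper's exact formula) folds inverse-frequency weights into a multi-class Dice loss:

```python
import torch

def weighted_dice_loss(probs, target_1hot, class_weights, eps=1e-6):
    """Multi-class soft Dice with per-class weights.

    probs: (B, C, H, W) softmax outputs; target_1hot: (B, C, H, W) one-hot labels.
    """
    inter = (probs * target_1hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + target_1hot.sum(dim=(0, 2, 3))
    dsc = (2 * inter + eps) / (union + eps)   # per-class Dice
    w = class_weights / class_weights.sum()   # normalized class weights
    return -(w * dsc).sum()

# Illustrative: weight classes by inverse frequency (ratios as reported above).
freq = torch.tensor([1066.0, 40.0, 3.7, 4.1])
weights = 1.0 / freq
probs = torch.softmax(torch.randn(2, 4, 64, 64), dim=1)
target = torch.nn.functional.one_hot(
    torch.randint(0, 4, (2, 64, 64)), num_classes=4).permute(0, 3, 1, 2).float()
print(weighted_dice_loss(probs, target, weights))
```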

[Table 3: Segmentation results on Chaos (CT and MRI).]

3) ULTRASOUND NERVE

Ultrasound Nerve Segmentation is a 2016 Kaggle challenge whose task is to segment a collection of nerves called the brachial plexus (BP) in ultrasound images. Some images (about 60% of the training set) do not contain the brachial plexus region. The images are 580 × 420 pixels. There are 5635 training images and 5508 test images (20% of which are used for the public leaderboard and 80% for the final ranking). The training set contains many contradictory images, i.e., pairs of very similar images where one has a non-empty mask and the other an empty mask (as shown in Figure 8).

[Figure 8: An example of contradictory training images: two very similar ultrasound images, one with a non-empty mask and the other with an empty mask.]

Therefore, we follow the Juliean method [58] to remove contradictory images by computing a signature for each image; the distance between two images is then the cosine distance between their signature vectors. This yields a distance matrix over all training images, which is thresholded to decide which images to remove. In the end, we keep 4456 of the 5635 training images, and we randomly split off 20% of the training images as the validation set. As shown in Table 4, our model outperforms all baseline methods without any pretraining.
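
A sketch of the duplicate-removal idea: compute a small block-average signature per image, build a pairwise cosine-distance matrix, and threshold it (the signature size and threshold are illustrative):

```python
import numpy as np

def signature(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Tiny image signature: block-average down to size x size and flatten."""
    h, w = img.shape
    img = img[: h - h % size, : w - w % size].astype(np.float64)
    bh, bw = img.shape[0] // size, img.shape[1] // size
    return img.reshape(size, bh, size, bw).mean(axis=(1, 3)).ravel()

def cosine_distance_matrix(sigs: np.ndarray) -> np.ndarray:
    unit = sigs / np.linalg.norm(sigs, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

imgs = [np.random.rand(420, 580) for _ in range(5)]  # stand-in images
sigs = np.stack([signature(im) for im in imgs])
dist = cosine_distance_matrix(sigs)
# Mark near-duplicate pairs below an (illustrative) distance threshold.
dup_pairs = np.argwhere((dist < 0.05) & ~np.eye(len(imgs), dtype=bool))
print(dup_pairs)
```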

[Table 4: Segmentation results on ultrasound nerve segmentation.]

VI. CONCLUSION

In this paper, we extend neural architecture search to medical image segmentation. We design three types of primitive operation sets for our search space and search a cell-based architecture stacked from DownSC and UpSC cells. We choose a U-like backbone (our search space includes U-net and many of its variants) and introduce a memory-efficient search algorithm (binary gates) [22] to speed up the search process. The search result, NAS-Unet, is evaluated by training from scratch on medical image segmentation datasets. On Promise12, NAS-Unet significantly outperforms the baseline methods, and it also outperforms them on Chaos and Ultrasound Nerve.
