SCAN: Structure Correcting Adversarial Network for Organ Segmentation in Chest X-rays (Translation)

PS: This is a machine translation of the paper.

Abstract:

  Chest X-ray (CXR) is one of the most commonly used medical imaging procedures, with a scan volume typically 2 to 10 times higher than other imaging modalities such as MRI, CT, and PET. This large volume of CXR scans imposes a heavy workload on radiologists and medical practitioners. Organ segmentation is a key step toward effective computer-aided detection on CXR. In this work, we propose a Structure Correcting Adversarial Network (SCAN) to segment the lung fields and the heart in CXR images. SCAN incorporates a critic network that imposes the structural regularities of human physiology on the convolutional segmentation network. During training, the critic network learns to discriminate between ground truth organ annotations and masks synthesized by the segmentation network. Through this adversarial process, the critic network learns higher-order structures and guides the segmentation model toward realistic segmentation outcomes. Extensive experiments show that our method produces highly accurate and natural segmentations. Using only very limited training data, our model reaches human-level performance without relying on any existing pre-trained models or datasets. Our method also generalizes well to CXR images from different patient populations and disease profiles, surpassing the current state of the art.

1 Introduction

  Chest X-ray (CXR) is one of the most common medical imaging procedures. Owing to its low cost and low radiation dose, a typical hospital produces hundreds to thousands of CXRs every day, which creates a substantial diagnostic workload. In 2015/16 the UK's public medical sector ordered more than 22.5 million X-ray images, accounting for over 55% of all medical imaging and dominating all other modalities such as computed tomography (CT) scans (4.5M) and MRI (3.1M) [8]. Of these X-ray images, 8 million were chest X-rays, which translates to thousands of CXR readings per radiologist per year. The shortage of radiologists is well documented in developed countries [19, 18], not to mention developing countries [1]. Compared with more modern imaging techniques such as CT and PET scans, X-rays pose a greater diagnostic challenge due to their low resolution and 2D projection. It is therefore important to develop computer-aided detection methods for chest X-rays to support clinicians.

  An important step in computer-aided detection on CXR images is organ segmentation. Segmentation of the lung fields and the heart provides rich structural information about shape irregularities and size measurements, which can be used to directly assess certain serious clinical conditions such as cardiomegaly (enlarged heart), pneumothorax (collapsed lung), pleural effusion, and emphysema. Furthermore, an accurate lung field mask improves the interpretability of computer-aided detection, which is important for clinical use.

  One of the main challenges in CXR segmentation is incorporating the implicit medical knowledge involved in contour determination. At the most basic level, the relative position of the lung fields and the heart constrains how the lung and heart masks can abut each other. Moreover, when medical experts annotate the lung fields, they look for certain consistent structures surrounding the lung fields (Figure 2). As shown in Figure 1, such prior knowledge helps resolve boundaries around regions that are unclear due to pathological conditions or poor imaging quality. A successful segmentation model must therefore use global structural information effectively to resolve local details.

  Unfortunately, unlike natural images, CXR training data with pixel-level annotations is very limited, because annotation is expensive and requires medical professionals. In addition, CXR images exhibit substantial variation across patient populations, pathological conditions, and imaging technology and operation. Finally, CXR images are grayscale and look very different from natural images, which may limit the transferability of existing models. Existing CXR organ segmentation methods typically rely on hand-crafted features, which can be fragile when applied to different patient populations, disease profiles, or image qualities. Furthermore, these methods do not explicitly balance local information against global structure in a principled way, which is essential for producing segmentation results suitable for diagnostic tasks.

  In this work, we propose the Structure Correcting Adversarial Network (SCAN) framework, which uses a critic network to guide a convolutional segmentation network toward accurate and realistic organ segmentation in chest X-rays. By using a convolutional network for organ segmentation, we sidestep the problems faced by existing methods based on ad hoc feature engineering. Our convolutional segmentation model alone achieves performance competitive with existing methods. However, due to the limited training data, the segmentation model alone cannot capture sufficient global structure to produce natural contours. To impose regularization based on physiological structure, we introduce a critic network that discriminates between masks synthesized by the segmentation network and ground truth annotations. The segmentation network and the critic network are trained jointly, end-to-end. Through this adversarial process, the critic network learns higher-order regularities and feeds effective global information back to the segmentation model, thereby producing realistic segmentation results.

  We demonstrate that even when trained on a small dataset, SCAN produces highly realistic and accurate segmentations without relying on any existing models or data from other domains. With the help of global structural information, our segmentation model resolves difficult boundaries that require substantial prior knowledge. Using intersection-over-union (IoU) as the evaluation metric, SCAN improves the segmentation model by 1.8% absolute, reaching 94.7% for the lung fields and 86.6% for the heart. Both are new state-of-the-art results for a single model and are competitive with human experts (94.6% for the lungs and 87.8% for the heart). We further show that when applied to a new, unseen dataset, SCAN is more robust than the vanilla segmentation model, outperforming it by 4.3%.

2. Related work

   Our review focuses on the two bodies of literature most relevant to our problem: lung field segmentation and semantic segmentation with convolutional neural networks.

  Lung field segmentation. Existing work on lung field segmentation can be roughly divided into three categories [30]. (1) Rule-based systems apply a set of predefined thresholds and morphological operations derived from heuristics [12]. (2) Pixel classification methods classify pixels as inside or outside the lung fields based on pixel intensities [37, 15, 16, 2]. (3) More recent methods are based on deformable models such as the Active Shape Model (ASM) and the Active Appearance Model [7, 6, 28, 29, 24, 32, 23, 33]. Their performance can vary considerably depending on tuning parameters and on whether the shape model is initialized close to the actual boundary. Likewise, the high contrast between the rib cage and the lung fields can trap the model in a local minimum. Our method uses a convolutional network trained end-to-end from image to pixel mask, without hand-crafted features. The proposed adversarial training further integrates prior structural knowledge into a unified framework.

  The current state-of-the-art method for lung field segmentation is registration-based [3]. To build a lung model for a test patient, [3] finds the patients most similar to the test patient in an existing database and linearly deforms their lung contours based on key point matching. This approach relies on the existing lung contours and on correct key point matching to model the test patient well, both of which can be brittle across different populations.

  Semantic segmentation with convolutional networks. The goal of semantic segmentation is to assign a predefined class to each pixel, which requires high-level visual understanding. Current state-of-the-art methods for semantic segmentation use fully convolutional networks (FCNs) [14, 35, 5, 13]. Recently, [17] applied adversarial training to semantic segmentation and observed some improvement. These works target natural images with color input and rely on transfer learning from models such as the VGG network [27] pre-trained on large-scale image classification [22]. We adapt the FCN to grayscale CXR images under the hard constraint of a very limited training set of 247 images. Our FCN departs from the conventional VGG architecture and can be trained without transfer learning from existing models or datasets.

  In addition, U-Net [21] and similar architectures are popular convolutional networks for biomedical segmentation and have been applied to neuronal structures [21] and histological images [4]. In this work, we propose adversarial training on top of an existing segmentation network to enhance the global consistency of the segmentation results.

  We also note a growing body of work applying neural networks end-to-end to CXR images [25, 34]. These models directly output clinical targets, such as disease labels, without explicit intermediate outputs to aid interpretation. Moreover, they usually require a large number of CXR images for training, which is not easy to obtain for many clinical tasks involving CXR images.

3. Problem definition

  We address the problem of segmenting the left lung field, the right lung field, and the heart on chest X-rays (CXRs) in the posterior-anterior (PA) view, in which the radiation passes through the patient from back to front. Because a CXR is a 2D projection of a 3D structure, organs overlap significantly, and care must be taken when defining the lung fields. We adopt the definition from [31]: the lung fields consist of all pixels for which the radiation passes through the lung but not through the following structures: the heart, the mediastinum (the opaque region between the two lungs), the area below the diaphragm, the aorta, and the superior vena cava (if visible) (Figure 2). The heart boundary is generally visible on both sides, while the top and bottom borders of the heart must be inferred due to occlusion by the mediastinum. As shown in Figure 1, this definition captures the common notion of the lung fields and the heart and covers the regions relevant to CXR reading in clinical settings.

4. Structure Correcting Adversarial Network

  We now describe in detail our method for semantic segmentation of the lung fields and the heart using the proposed Structure Correcting Adversarial Network (SCAN) framework. To suit the particular problem setting of CXR images, we developed our network architecture from scratch, following best practices and extensive experimentation. Using a dataset an order of magnitude smaller than common semantic segmentation datasets of natural images, our model can be trained end-to-end from scratch and generalizes well, without relying on existing models or datasets.

4.1 Adversarial training for semantic segmentation

  Adversarial training was first proposed in the context of generative modeling by the GAN framework [9]. A GAN consists of a generator network and a critic network engaged in an adversarial two-player game. The generator aims to learn the data distribution, while the critic estimates the probability that a sample comes from the training data rather than from the generator. The generator is trained to maximize the probability of the critic making a mistake, while the critic is optimized to minimize that probability. This adversarial process has been shown to drive the generator to produce highly realistic samples, such as images [20].
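
  For reference (the translation does not spell it out), the original GAN minimax objective from [9] can be written as

$$
\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big],
$$

  where G is the generator, D is the critic (discriminator), and p_z is a noise prior; D is trained to maximize V while G is trained to minimize it.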

  A key insight of the adversarial process is that the critic, which can itself be a complex neural network, learns to exploit higher-order inconsistencies in the samples produced by the generator. Through the interplay between the generator and the critic, the critic can guide the generator to produce samples more consistent with the higher-order structures present in the training samples, making the generated data more realistic.

  The higher-order consistency enforced by the critic is especially desirable for CXR segmentation. Although human anatomy varies substantially across individuals, it maintains stable relationships among physiological structures (Figure 2). Thanks to standardized imaging procedures, CXRs also provide a consistent view of these structures. We can therefore expect the critic to learn these higher-order structures and guide the segmentation network toward masks that are more consistent with the learned global structure.

  We propose to use adversarial training for segmenting CXR images. Figure 3 shows the overall SCAN framework, which couples the adversarial process with semantic segmentation. The framework consists of a segmentation network and a critic network that are trained jointly. The segmentation network makes pixel-level predictions of the target classes, playing the role of the generator in a GAN, but conditioned on the input image. The critic network, in turn, takes a segmentation mask as input and outputs the probability that the input mask is a ground truth annotation rather than a prediction by the segmentation network. The two networks can be trained jointly with a minimax scheme that alternates between optimizing the segmentation network and optimizing the critic network.
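
  As a concrete illustration of the alternating minimax training just described, here is a minimal PyTorch-style sketch. It is not the paper's code: the names seg_net, critic, and the weight lambda_adv are illustrative, and the loss weighting follows the common adversarial-segmentation recipe rather than the exact objective in Section 4.2.

```python
import torch
import torch.nn.functional as F

def train_step(seg_net, critic, opt_seg, opt_critic, image, gt_mask, lambda_adv=0.01):
    """One alternating update. `gt_mask` is a one-hot tensor of shape (N, C, H, W);
    `critic` is assumed to output a probability in [0, 1] per mask (see Section 4.4)."""
    # --- critic update: push ground truth masks toward 1, predicted masks toward 0 ---
    with torch.no_grad():
        pred_mask = torch.softmax(seg_net(image), dim=1)
    opt_critic.zero_grad()
    real_p = critic(gt_mask)
    fake_p = critic(pred_mask)
    critic_loss = (F.binary_cross_entropy(real_p, torch.ones_like(real_p))
                   + F.binary_cross_entropy(fake_p, torch.zeros_like(fake_p)))
    critic_loss.backward()
    opt_critic.step()

    # --- segmentation update: per-pixel loss plus an adversarial term that "fools" the critic ---
    opt_seg.zero_grad()
    logits = seg_net(image)
    pred_mask = torch.softmax(logits, dim=1)
    seg_loss = F.cross_entropy(logits, gt_mask.argmax(dim=1))
    adv_p = critic(pred_mask)
    adv_loss = F.binary_cross_entropy(adv_p, torch.ones_like(adv_p))
    (seg_loss + lambda_adv * adv_loss).backward()
    opt_seg.step()
    return seg_loss.item(), critic_loss.item()
```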

4.2 Training objective

  Let S and D denote the segmentation network and the critic network, respectively.
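
  A minimal sketch of the joint training objective in this notation, consistent with the GAN formulation above and with adversarial semantic segmentation [17] (the exact form and the weighting λ used in the original paper are assumptions here), is

$$
\min_{S}\max_{D}\;\sum_{i=1}^{N} J_s\big(S(x_i),\, y_i\big)\;-\;\lambda\Big[J_d\big(D(x_i, y_i),\, 1\big) + J_d\big(D(x_i, S(x_i)),\, 0\big)\Big],
$$

  where $(x_i, y_i)$ are training image-mask pairs, $J_s$ is the per-pixel multi-class cross-entropy between the predicted mask $S(x_i)$ and the ground truth $y_i$, $J_d(\hat{t}, t) = -t\ln\hat{t} - (1-t)\ln(1-\hat{t})$ is the binary cross-entropy of the critic's output, and $\lambda$ balances the per-pixel loss against the adversarial term.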

4.3 Segmentation network

  Our segmentation network is a fully convolutional network (FCN), which is also the core component of many state-of-the-art semantic segmentation models [14, 35, 5]. The success of the FCN is largely attributed to the excellent ability of convolutional networks to extract high-level representations suitable for dense classification. The FCN can be divided into two modules: a downsampling path and an upsampling path. The downsampling path consists of convolutional layers and max or average pooling layers, with an architecture similar to those used for image classification [27]. It extracts high-level semantic information, usually at a low spatial resolution. The upsampling path consists of convolutional and deconvolutional layers (also called transposed convolutions) and uses the output of the downsampling path to predict per-pixel scores for each class at the full resolution.

  Most FCNs are applied to color images with RGB channels and initialize their downsampling path with parameters trained on large-scale image classification [14]. CXR images, however, are grayscale, so the large model capacity of image classification networks designed to exploit the richer RGB input may be counterproductive. Furthermore, our FCN architecture has to be highly parsimonious, because our training set of 247 CXR images is several orders of magnitude smaller than training datasets in the natural image domain. Finally, our task focuses on segmenting only three classes (left lung, right lung, and heart), a much smaller label space than datasets with 20 object classes such as PASCAL VOC. A more streamlined model configuration is therefore highly advantageous in this setting.

  Figure 4 shows our FCN architecture. We find it beneficial to use far fewer feature maps than the conventional VGG-based downsampling path. Specifically, we start with 8 feature maps in the first layer, compared with 64 feature maps in the first layer of VGG [27]. To obtain sufficient model capacity, we instead go deep, with 20 convolutional layers. We also interleave 1×1 convolutions with 3×3 convolutions in the final layers to mimic the bottleneck design [10]. In total, the segmentation network contains 271k parameters, about 500 times fewer than the VGG-based FCN [14]. We use residual blocks [10] (Figure 4(b)) to aid optimization. The streamlined network architecture allows us to optimize it effectively without relying on any existing pre-trained model, which in any case is not available for grayscale images.
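
  As a rough illustration of such a parsimonious residual FCN, consider the following sketch. This is not the exact configuration of Figure 4: beyond the grayscale input and the 8 first-layer feature maps, the channel counts, depth, and the background class are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN, with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

class TinyFCN(nn.Module):
    """Illustrative parsimonious FCN: few feature maps, residual blocks, transposed-conv upsampling."""
    def __init__(self, num_classes=4):  # left lung, right lung, heart, background (assumed)
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(inplace=True),   # grayscale input, 8 feature maps
            ResidualBlock(8), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(inplace=True),
            ResidualBlock(16), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 1), nn.ReLU(inplace=True),            # 1x1 "bottleneck"-style conv
            ResidualBlock(32),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),       # per-pixel class scores
        )

    def forward(self, x):
        return self.up(self.down(x))
```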

4.4 Critic network

  Our critic network mirrors the construction of the segmentation network and is also a fully convolutional network. Figure 5 shows the architecture, omitting the intermediate layers that are identical to those of the segmentation network. In this way, the critic network has a model capacity similar to that of the segmentation network, with a similar field of view, which is important given the large objects in CXR images. Optionally, the original CXR image can be fed to the critic as an additional input channel; compared with [17], this is a more economical way of incorporating the image into the critic network. Preliminary experiments showed that including the original CXR image does not improve performance, so for simplicity we feed only the mask prediction to the critic network. In total, our critic network has 258k parameters, comparable to the segmentation network.
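
  A correspondingly small fully convolutional critic could look like the following sketch. It is again illustrative only: the layer sizes and the global-pooling probability head are assumptions, not the exact architecture of Figure 5; it only reflects the interface described above (mask in, probability of being a ground truth annotation out).

```python
import torch
import torch.nn as nn

class TinyCritic(nn.Module):
    """Illustrative fully convolutional critic: takes a (soft) segmentation mask and
    outputs the probability that it is a ground truth annotation."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_classes, 8, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1), nn.Sigmoid(),        # probability per mask, shape (N, 1)
        )

    def forward(self, mask):
        return self.head(self.features(mask))
```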

Source: blog.csdn.net/qq_36401512/article/details/103272791