Medical Image Analysis

NC2022: Federated learning enables big data for rare cancer boundary detection

Although machine learning (ML) has shown promise across disciplines, out-of-sample generalization remains a concern. This problem is currently addressed by sharing data across multiple sites, but such centralized processing is difficult to scale due to various constraints. Federated machine learning (FL) offers an alternative paradigm for achieving accurate and generalizable ML by sharing only numerical model updates. Here we present the largest FL study to date, involving data from 71 sites on 6 continents, to generate an automated tumor boundary detector for the rare disease glioblastoma, using the largest dataset reported in the literature (n = 6,314). We demonstrate that our method achieves a 33% improvement in delineating the operable tumor boundary and a 23% improvement for the full tumor extent, compared with a publicly trained model.

We anticipate that our research will:

  • 1) Enable more healthcare research to be informed by large and diverse data, ensuring meaningful outcomes for rare diseases and underrepresented populations;

  • 2) Facilitate further analysis of glioblastoma by publishing our consensus model;

  • 3) Demonstrate the effectiveness of FL at this scale and task complexity as a paradigm shift for multi-site collaboration, alleviating the need for data sharing.

  • mpMRI: multiparametric magnetic resonance imaging scan

The entire federated learning process takes a staged approach:

  1. Start with a "common initial model" (trained using data from 231 cases at 16 sites).
  2. This was followed by a "preliminary consensus model" (trained using data from 2,471 cases at 35 sites).
  3. Finally, a "final consensus model" (trained using data from 6,314 cases from 71 sites) was formed.
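Since only numerical model updates are shared, each staged consensus model is built by aggregating weights trained locally at each site. As a rough illustration of the idea (a minimal sketch in the style of weighted federated averaging, not the exact aggregation protocol used in the study; `site_weights` and `num_cases` are hypothetical names):

```python
import numpy as np

def federated_average(site_weights, num_cases):
    """Aggregate per-site model weights into a consensus model.

    site_weights: list of dicts mapping parameter name -> np.ndarray,
                  one dict per collaborating site (hypothetical format).
    num_cases:    list of per-site case counts, used as aggregation weights.
    """
    total = float(sum(num_cases))
    consensus = {}
    for name in site_weights[0]:
        # Weighted average of each parameter across sites,
        # proportional to the amount of local training data.
        consensus[name] = sum(
            w[name] * (n / total) for w, n in zip(site_weights, num_cases)
        )
    return consensus
```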

To quantitatively evaluate the performance of the trained model, 20% of the cases contributed by each participating site were excluded from model training and used as 'local validation data'. To further evaluate the generalizability of the model to unseen data, six sites did not participate in any training stage, representing an unseen "out-of-sample" population with a total of 590 cases. To facilitate further evaluation without burdening the collaborating sites, a subset of these cases (n = 332) was aggregated into a 'centralized out-of-sample' dataset. Training starts from a pretrained model (the publicly available initial model) rather than from a random initialization, for faster convergence. Model performance was quantitatively assessed using the Dice similarity coefficient (DSC), which measures the spatial agreement between the model's predictions and the reference standard for three tumor subregions (ET, TC, WT).

  • Dice similarity coefficient: a measure of the similarity of two samples, commonly used in medical image segmentation. It is defined as twice the size of the intersection of the two samples divided by the sum of their sizes. In this paper, the Dice similarity coefficient is used to evaluate model performance, measuring the spatial agreement between the model's predictions and the reference standard for the three tumor subregions (ET, TC, WT).
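A minimal NumPy sketch of the DSC for one binary subregion mask (assuming `pred` and `ref` are arrays of the same shape):

```python
import numpy as np

def dice_coefficient(pred, ref, eps=1e-7):
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    # 2 * |A ∩ B| / (|A| + |B|); eps guards against two empty masks.
    return 2.0 * intersection / (pred.sum() + ref.sum() + eps)
```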

  • The Wilcoxon signed-rank test is a nonparametric hypothesis test used to compare whether the medians of two related samples differ. It is suitable when the data do not follow a normal distribution and can be applied to both small and large samples. The null hypothesis is that the medians of the two related populations are equal; the alternative hypothesis is that they are not. Its basic idea is to rank the paired differences by magnitude and compare their signs to obtain a test statistic. If the P value of the statistic is below the significance level, the null hypothesis is rejected and the two medians are considered significantly different.
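For example, per-case DSC values from two models can be compared with SciPy (an illustrative sketch; the score arrays below are made-up numbers):

```python
from scipy.stats import wilcoxon

# Hypothetical per-case Dice scores of two models on the same validation cases.
dsc_model_a = [0.91, 0.85, 0.78, 0.88, 0.90, 0.83, 0.87]
dsc_model_b = [0.88, 0.84, 0.75, 0.89, 0.86, 0.80, 0.85]

stat, p_value = wilcoxon(dsc_model_a, dsc_model_b)
if p_value < 0.05:
    print(f"Medians differ significantly (p = {p_value:.3f})")
else:
    print(f"No significant difference (p = {p_value:.3f})")
```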

  • We leveraged the data loading and processing pipeline of the Generally Nuanced Deep Learning Framework (GaNDLF) in order to experiment with various data augmentation techniques. After data loading, we removed all-zero axial, coronal, and sagittal planes from the images and performed Z-score normalization on the non-zero image intensities.
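A rough NumPy sketch of this preprocessing (not the actual GaNDLF code), assuming `volume` is a 3D array with background voxels equal to zero:

```python
import numpy as np

def crop_zero_planes(volume):
    """Remove axial, coronal and sagittal planes that contain only zeros."""
    nonzero = np.nonzero(volume)
    slices = tuple(slice(idx.min(), idx.max() + 1) for idx in nonzero)
    return volume[slices]

def zscore_nonzero(volume):
    """Z-score normalization computed over non-zero intensities only."""
    out = volume.astype(np.float32).copy()
    mask = out != 0
    mean, std = out[mask].mean(), out[mask].std()
    out[mask] = (out[mask] - mean) / (std + 1e-8)
    return out

# Example usage: preprocessed = zscore_nonzero(crop_zero_planes(volume))
```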

  • Model architecture: 3D-ResUNet. The network had 30 base filters and a learning rate of lr = 5 × 10⁻⁵, optimized using the Adam optimizer. For the training loss, we used the generalized DSC score.

  • No penalties were used in the loss function, due to our use of ‘mirrored’ DSC loss
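As a rough PyTorch-style sketch of a multi-channel soft Dice loss over the three subregions (the paper's exact "mirrored" generalized DSC loss may differ from this simplified form):

```python
import torch

def soft_dice_loss(pred, target, eps=1e-7):
    """Soft multi-channel Dice loss.

    pred:   (N, C, D, H, W) sigmoid probabilities, one channel per subregion
            (e.g. ET, TC, WT).
    target: (N, C, D, H, W) binary reference masks.
    """
    dims = (0, 2, 3, 4)                      # sum over batch and spatial axes
    intersection = (pred * target).sum(dims)
    denominator = pred.sum(dims) + target.sum(dims)
    dice_per_channel = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice_per_channel.mean()     # average over channels
```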


This paper presents the setup and results of the Liver Tumor Segmentation Benchmark (LiTS), which was organized in conjunction with the 2017 IEEE International Symposium on Biomedical Imaging (ISBI) and the 2017 and 2018 International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI). The image dataset, created in collaboration with seven hospitals and research institutions, contains primary and secondary tumors that vary in size, appearance, and lesion-to-background contrast (high/low density). A total of 75 submitted liver and liver tumor segmentation algorithms were trained on 131 computed tomography (CT) volumes and tested on 70 unseen test images from different patients. We found that, across the three events, no single algorithm performed best for both liver and liver tumor segmentation. The Dice score of the best liver segmentation algorithm was 0.963, while for tumor segmentation the best algorithms achieved Dice scores of 0.674 (ISBI 2017) and 0.702 (MICCAI 2017), respectively.

Technical Challenges of Liver Segmentation:
Fully automated segmentation of the liver and its lesions remains challenging in many ways.

  1. First, changes in lesion-to-background contrast (Moghbel et al., 2017) may be caused by:
    • (a) differences in contrast agent,
    • (b) differences in contrast enhancement due to different injection timing,
    • and (c) different acquisition parameters (e.g. resolution, mAs and kVp exposure, reconstruction kernels).
  2. Second, the coexistence of different types of lesions (benign and malignant as well as tumor subtypes) , with variations in their image appearance, poses additional challenges for automated lesion segmentation.
  3. Third, the background signal of liver tissue can vary greatly in the presence of chronic liver disease, a common precursor to HCC. Many algorithms have been observed to struggle with disease-specific variability, including differences in the size, shape, and number of lesions, as well as treatment-induced changes to the shape and appearance of the liver itself (Moghbel et al., 2017).

The difference in liver and tumor appearance between two patients is shown in Figure 1, illustrating the challenge of generalizing to unseen test cases with different lesions.

Key contributions to fully automated liver and liver tumor segmentation:

  1. We generate a new public multi-center dataset of 201 abdominal CT volumes with reference segmentations of the liver and liver tumors.
  2. We present the set-up and a summary of our LiTS benchmarks in three grand challenges.
  3. We review, evaluate, rank, and analyze the resulting state-of-the-art algorithms and results.

Segmentation methods:
Published work on liver segmentation methods can be grouped into three categories based on:
(1) prior shape and geometric knowledge,
(2) intensity distribution and spatial context, and
(3) deep learning.


Methods based on intensity distribution and spatial context.

  1. A probabilistic atlas (PA) is an anatomical atlas with parameters learned from a training dataset. Park et al. proposed the first PA, using 32 abdominal CT series registered with mutual information and thin-plate splines as the deformation transformation, and segmenting with Markov random fields (MRF) (Park et al., 2003).
  2. Further proposed atlas-based methods differ in computing PA and how to incorporate PA into segmentation tasks.
  3. In addition, PA can include relationships between adjacent abdominal structures to define the anatomy surrounding the liver (Zhou et al., 2006).
  4. Multi-atlas methods improve hepatic and non-hepatic voxel classification by using B-spline transformation models for non-rigid registration (Slagmolen et al., 2007), dynamic atlas selection and label fusion (Xu et al., 2015), or k-nearest-neighbor-based classification of hepatic and non-hepatic voxels (van Rikxoort et al., 2007).
  • Graph cut methods provide an efficient initialization method for binary segmentation problems through adaptive thresholding (Massoptier and Casciaro, 2007) and superpixels.

Methods based on deep learning. Unlike the above methods, deep learning, especially convolutional neural networks (CNNs), is a data-driven approach that can be optimized end-to-end without manual feature engineering (Litjens et al., 2017). The U-shaped CNN architecture (Ronneberger et al., 2015) and its variants (Milletari et al., 2016; Isensee et al., 2020; Li et al., 2018b) are widely used in biomedical image segmentation, where their efficiency and robustness have been demonstrated across a wide range of segmentation tasks. The best-performing methods share a common multi-stage process:

  • Segmentation starts with a 3D CNN, and the resulting probability map is post-processed using a Markov random field (Dou et al., 2016).

Many early deep learning algorithms for liver segmentation combined neural networks with dedicated post-processing procedures: Christ et al. (2016) used fully convolutional neural networks with 3D conditional random fields, Hu et al. (2016) relied on a 3D CNN followed by a surface model, and Lu et al. (2017) used a CNN for the initial segmentation, refined by graph cut.

  • Image segmentation + conditional random field

A 3D fully convolutional network is a deep learning model for image segmentation that can be optimized end-to-end on 3D input data without manual feature engineering. A conditional random field is a probabilistic graphical model that introduces prior knowledge into the segmentation and improves accuracy by smoothing and refining the results. In the method of Christ et al., a fully convolutional network segments the liver images and a conditional random field is then used for post-processing, making the segmentation smoother and more accurate.

Liver Tumor Segmentation:

Compared with the liver itself, liver tumors exhibit a much wider range of morphology, size, and contrast. They can appear in almost any location, often have indistinct borders, and differences in contrast uptake introduce additional variability. Liver tumor segmentation is therefore considered the more challenging task. Published methods for liver tumor segmentation can be categorized into:
(1) thresholding and spatial regularization, (2) local features and learning algorithms, and (3) deep learning.

  • One of the methods is thresholding and spatial regularization.
  1. Thresholding is a simple yet effective tool to automatically separate tumors from the liver and background based on the difference in gray-level values between tumor voxels and those of the liver and background regions, first demonstrated by Soler et al. (2001).

  2. Thereafter, thresholding can be used to improve tumor segmentation results by histogram analysis (Ciecholewski and Ogiela, 2007), between-class maximum variance (Nugroho et al., 2008), and iterative algorithms (Abdel-Massieh et al., 2010).

  3. Spatial regularization techniques rely on prior image or morphological information, such as tumor size, shape, surface, or spatial position. Using this knowledge, regularization or penalty constraints can be introduced.

  4. Adaptive thresholding methods can be combined with model-based morphological processing for heterogeneous lesion segmentation (Moltz et al., 2008, 2009).

  5. Contour-based (Kass et al., 1988) tumor segmentation relies on shape and surface information and utilizes probabilistic models (Ben-Dan and Shenhav, 2008) or histogram analysis (Linguraru et al., 2012) to automatically create segmentation maps. The level set (Osher and Sethian, 1988) method allows numerical computation of tumor shape without parameterization.

    • Level set methods are combined with supervised pixel/voxel classification in 2D (Smeets et al., 2008) and 3D (Jiménez Carretero et al., 2011) for liver tumor segmentation.
  6. Methods using local features and learning algorithms .

    • Clustering methods include k-means (Massoptier and Casciaro, 2008) and fuzzy c-means clustering, using deformable models for segmentation refinement (Häme, 2008).
    • Among the supervised classification methods, there are level set methods based on fuzzy classification (Smeets et al., 2008), support vector machines combined with texture-based deformable surface models for segmentation and refinement (Vorontsov et al., 2014), AdaBoost based on texture features (Shimizu et al., 2008) and image intensity profiles (Li et al., 2006), logistic regression (Wen et al., 2009), and random forests for recursive classification and decomposition of superpixels (Conze et al., 2017).

deep learning

Before LiTS, deep learning methods were rarely used for the task of liver tumor segmentation.

Christ et al. (2016) were the first to use a U-Net architecture for liver and liver tumor segmentation, proposing a cascaded segmentation strategy combined with a 3D conditional random field for refinement. Many subsequent deep learning methods were developed and tested on the LiTS dataset.

Thanks to the availability of the LiTS public dataset, many new deep learning solutions for liver and liver tumor segmentation have been proposed. The U-Net architecture is widely used and modified to improve segmentation performance.

  • For example, nnU-Net (Isensee et al., 2020), first presented at the LiTS challenge at MICCAI 2018, has proved to be one of the best-performing methods for 3D image segmentation. Related work is discussed in the Results section.

However, that overall challenge also required participants to solve nine other tasks, including brain tumor, heart, hippocampus, lung, pancreas, prostate, liver vasculature, spleen, and colon segmentation. The algorithms were therefore not necessarily optimized solely for liver CT segmentation.

  • The studied cohort covered multiple types of liver neoplastic disease, including primary neoplastic disease (such as hepatocellular carcinoma and cholangiocarcinoma) and secondary liver tumors (such as metastases from colorectal, breast, and lung cancers).

Tumors have different lesion-to-background contrast (hyperdense or hypodense). The images represent a mix of pre- and postoperative abdominal CT scans and were acquired with different CT scanners and acquisition protocols, including imaging artifacts (e.g., metal artifacts) commonly seen in real-world clinical data.

  • The dataset is therefore considered very diverse in terms of resolution and image quality. In-plane image resolutions range from 0.56 mm to 1.0 mm, and slice thicknesses range from 0.45 mm to 6.0 mm. The number of axial slices varies from 42 to 1,026, and the number of tumors per case varies between 0 and 12. Tumor sizes range from 38 mm3 to 1,231 mm3. The incidence of tumors in the test set is higher than in the training set. A statistical test (p-value = 0.6) indicated no significant difference in liver volume between the training and test sets. The average tumor HU values in the training and test sets were 65 and 59, respectively. A statistical summary of the LiTS data is given in Table 3. The ratio of training to test cases is 2:1, and the two sets are similar in terms of center distribution; generalizability to unseen centers was therefore not tested in LiTS.

Evaluation Metrics

  1. Dice score
  2. Average symmetric surface distance (ASSD): a metric for evaluating medical image segmentation that measures the surface distance between the segmentation result and the reference annotation, based on d(v, S(A)), the distance of a voxel v to the surface S(A). It computes the distances between the two surfaces and averages them, and can be used to assess the boundary accuracy of a segmentation and the magnitude of segmentation errors. It is one of the most commonly used segmentation evaluation metrics.
    • "Average symmetric surface distance" is the mean distance between the two segmentation surfaces, i.e., the distances between the surface points of the two segmentations are summed and divided by the number of surface points.
    • "Maximum symmetric surface distance" is the maximum distance between the two segmentation surfaces, i.e., the distances between all surface points are computed and the largest one is taken as the result.
  3. "Relative volume difference" measures the error in volume and is often used to compare two volume measurements. It is computed as the difference between the predicted and reference volumes divided by the reference volume. This metric can be used to evaluate the accuracy of medical image segmentation algorithms and to compare different datasets or algorithms.

Given the clinical relevance of lesion detection, we included three detection metrics in additional analyses.

To avoid potential problems in the case of patients without tumors, metrics are calculated globally. In order to evaluate lesion-level metrics, there must be a known correspondence between predicted and reference lesions. Since all lesions are defined as a single binary map, the correspondence between the connected components of the predicted and reference masks has to be determined.

Each lesion is defined as a 3D connected component in the image. A lesion is considered detected if there is sufficient overlap between the predicted lesion and its corresponding reference lesion, i.e., the ratio of the intersection to the union of their respective segmentation masks is large enough. This allows counting the number of true positive, false positive and false negative detections and thus the precision and recall of lesion detection. These indicators are defined as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative lesion detections.
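A sketch of lesion-level matching with SciPy connected components (assuming a fixed IoU threshold such as 0.5; this is an illustrative implementation, not the benchmark's exact matching procedure):

```python
import numpy as np
from scipy import ndimage

def lesion_detection_metrics(pred, ref, iou_threshold=0.5):
    """Match predicted and reference lesions (3D connected components) by IoU
    and return detection precision and recall."""
    pred_labels, n_pred = ndimage.label(pred.astype(bool))
    ref_labels, n_ref = ndimage.label(ref.astype(bool))

    matched_ref = set()
    true_positives = 0
    for p in range(1, n_pred + 1):
        p_mask = pred_labels == p
        best_iou, best_ref = 0.0, None
        for r in range(1, n_ref + 1):
            r_mask = ref_labels == r
            inter = np.logical_and(p_mask, r_mask).sum()
            union = np.logical_or(p_mask, r_mask).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_ref = iou, r
        if best_iou >= iou_threshold and best_ref not in matched_ref:
            true_positives += 1
            matched_ref.add(best_ref)

    false_positives = n_pred - true_positives
    false_negatives = n_ref - len(matched_ref)
    precision = true_positives / (true_positives + false_positives + 1e-7)
    recall = true_positives / (true_positives + false_negatives + 1e-7)
    return precision, recall
```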

Algorithm and architecture

In this competition, 73 entries adopted a fully automated approach, while only one was semi-automatic (J. Ma et al.).
U-Net-derived architectures were widely adopted; only two automated methods used a modified VGG-net (J. Qi et al.) or a k-CNN (J. Lipkova et al.). Most entries adopted a coarse-to-fine approach, cascading multiple U-Nets at different stages to perform liver and liver tumor segmentation.

  • Additional residual connections and adjusting input resolution are the most common improvements to the base U-Net architecture.
  • Three schemes combine separate models into ensemble techniques.
  • In 2017, none of the entries directly used 3D methods at native image resolution due to the high computational complexity. However, some entries used 3D convolutional neural networks with small input patches for the tumor segmentation task.
  • Other approaches try to capture three-dimensional context by using a 2.5D architecture, i.e., feeding a stack of adjacent slices as a multi-channel input to the network, with the segmentation mask of the center slice of the stack as the network output.
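A rough sketch of the 2.5D idea (a hypothetical 5-slice stack feeding a 2D network that predicts the mask of the center slice):

```python
import numpy as np

def make_25d_input(volume, center_idx, context=2):
    """Stack 2*context+1 adjacent axial slices as channels of one 2.5D sample.

    volume:     3D CT array of shape (num_slices, H, W).
    center_idx: index of the slice whose mask the network will predict.
    """
    indices = np.clip(
        np.arange(center_idx - context, center_idx + context + 1),
        0, volume.shape[0] - 1,      # repeat border slices at the volume edges
    )
    return volume[indices]           # shape: (2*context+1, H, W), channels first
```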

Main Components of the Segmentation Models

  • Among most methods, data preprocessing with HU-value clipping, normalization, and standardization is the most common technique (see the sketch after this list).
  • Data augmentation is also widely used, mainly focusing on standard geometric transformations such as flipping, moving, scaling or rotating.
  • Individual submissions implement more advanced techniques such as histogram equalization and stochastic contrast normalization.
  • The most common optimizers are Adam and stochastic gradient descent with momentum; one submission relies on RMSProp.
  • Multiple loss functions are used for training, including standard and weighted cross-entropy, Dice loss, Jaccard loss, Tversky loss, and L2 loss.
  • Ensemble loss techniques combine multiple individual loss functions into one.
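A minimal sketch of the HU clipping and intensity normalization mentioned above (the clipping window is an illustrative choice, not a value prescribed by the benchmark):

```python
import numpy as np

def preprocess_ct(volume, hu_min=-100.0, hu_max=300.0):
    """Clip HU values to a soft-tissue window, then rescale and standardize."""
    clipped = np.clip(volume.astype(np.float32), hu_min, hu_max)
    # Min-max normalization to [0, 1] ...
    scaled = (clipped - hu_min) / (hu_max - hu_min)
    # ... followed by z-score standardization.
    return (scaled - scaled.mean()) / (scaled.std() + 1e-8)
```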

Post-processing.

Some post-processing methods are also used by most algorithms.

  • A common post-processing step is to extract connected tumor components and overlay the liver mask on the tumor segmentation to discard tumors outside the liver region.
  • More advanced methods include random forest classifiers, morphological filtering, specific shallow neural networks to eliminate false positives or custom algorithms for filling tumor cavities.
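A rough sketch of the common step of keeping only the largest connected liver component and masking out tumor predictions that fall outside it:

```python
import numpy as np
from scipy import ndimage

def postprocess(liver_pred, tumor_pred):
    """Keep the largest connected liver component and discard tumor voxels
    outside it."""
    labels, n = ndimage.label(liver_pred.astype(bool))
    if n > 0:
        sizes = np.bincount(labels.ravel())[1:]      # component sizes, background excluded
        liver_pred = labels == (np.argmax(sizes) + 1)
    tumor_pred = np.logical_and(tumor_pred.astype(bool), liver_pred)
    return liver_pred, tumor_pred
```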

Features of top-performing methods.

The best-performing methods at ISBI 2017 used cascaded U-Net approaches with short and long skip connections and 2.5D input images, together with:

  • weighted cross-entropy loss functions
  • a few ensemble learning techniques were employed by most of the top-performing methods,
  • some common pre- and post-processing steps such as HU-value clipping and connected component labeling

Some of the contestants who performed well at MICCAI 2017 (e.g., J. Zou) incorporated insights from ISBI 2017, including the idea of ensemble learning, adding residual connections, and more complex rule-based post-processing or classical machine learning algorithms.

  • Therefore, the main architectural differences compared to the ISBI submission are the higher usage of ensemble learning methods, higher incidence of residual connections, and more complex post-processing steps.
  • Another well-performing method proposed by X. Li et al. presents a hybrid insight, integrating the advantages of 2D and 3D networks into the 3D liver tumor segmentation task .

Technique trend and recent advances

  1. A significant advance was the adoption of 3D deep learning models in addition to 2D approaches.
  2. self-supervised pre-training frameworks to initialize 3D models for better representation than training them from scratch
  3. self-configuring pipeline to facilitate the model training and the automated design of network architecture
  4. added a 3D attention module for 3D segmentation models.
  5. focused on the special trait of liver and liver tumor segmentation and proposed a novel active contour-based loss function to preserve the segmentation boundary
  6. enhance edge information and cross-feature fusion for liver and tumor segmentation
  7. considered the varying lesion sizes and proposed a loss reweighting strategy to deal with size imbalance in tumor segmentation.
  8. attempted to deal with the heterogeneous image resolution with a multi-branch decoder
  9. An emerging trend is to leverage existing sparsely labeled images for multi-organ segmentation .
    • Huang et al. (2020) attempted joint training on a single organ dataset (liver, kidney, and pancreas).
    • Fang and Yan (2020) propose a pyramid-in and pyramid-out network to compress multi-scale features to reduce semantic gaps.
    • Finally, Yan et al. (2020) developed a general lesion detection algorithm to detect various lesions in CT images in a multi-task manner and proposed a strategy to mine missing annotations from partially labeled datasets .

Segmentation performance by lesion size. Overall, the submitted methods perform very well on the segmentation of large liver tumors, but struggle with the segmentation of small tumors (see Figure 8).

Many small tumors are only a few voxels in diameter; moreover, images in axial slices have a relatively high resolution of 512 × 512 pixels.

Therefore, these small structures are difficult to detect, because only a small number of surrounding pixels can potentially indicate a tumor boundary (see Figure 8). The situation is exacerbated by noise and artifacts in medical imaging, which are similar in size to small tumors, differ little in texture from the surrounding liver tissue, and have arbitrary shapes that are difficult to distinguish from actual liver tumors.

**In general, state-of-the-art methods perform well on volumes with large tumors and poorly on volumes with small tumors.** The worst results were achieved in examinations where a single small tumor (< 10 mm3) was present. The best results were achieved where the volume showed fewer than six tumors and the total tumor volume exceeded 40 mm3 (see Figure 8). In the appendix, we show the performance of all submitted methods for the three LiTS challenges, compared for each test volume and clustered by number of lesion occurrences and lesion size (see Figure A.10).


Effect of Contrast on Segmentation Quality

The segmentation quality of the methods is affected by the difference between tumor and liver HU values. Current state-of-the-art methods perform best on volumes showing higher contrast between liver and tumor, especially for lesions that are 40-60 HU denser than the background liver (see Figure 8). The worst results occur in cases with contrast below 20 HU (see Figure 8), including tumors that are less dense than the liver.

  • A larger mean difference in HU values makes it easier for the network to distinguish liver from tumor, as a simple threshold-like rule can be incorporated into the decision process. Interestingly, however, larger difference values do not always lead to better segmentation results.

HU values (Hounsfield units) describe tissue density in computed tomography (CT) imaging. The HU value is derived by comparing a tissue's attenuation to that of water, which is defined as 0 HU, with air at -1000 HU. HU values can therefore be used to distinguish tissues of different densities, such as tumor and liver. In medical image segmentation, HU values are commonly used to determine the boundaries of the liver and tumors.


Future Work

Furthermore, we propose providing liver tumors with multiple reference annotations from multiple annotators, because the segmentation of liver tumors is highly uncertain due to small structures and blurred boundaries (Schoppe et al., 2020). While most segmentation tasks in existing benchmarks are formulated as one-to-one mapping problems, this formulation does not fully address image segmentation where data uncertainty naturally exists. Modeling uncertainty in segmentation tasks is a growing trend (Mehta et al., 2020; Zhang et al., 2020b) that allows models to generate not just one but multiple plausible outputs, and will therefore enhance the applicability of automated methods in clinical practice. The published annotated dataset is not limited to benchmarking segmentation tasks but can also serve as data for recent shape modeling methods such as implicit neural functions (Yang et al., 2022; Kuang et al., 2022; Amiranashvili et al., 2021).

Given the size and diversity of the patient populations at the seven institutions participating in the LiTS benchmark dataset, we believe its value and contribution to medical image analysis will be appreciated in numerous directions.

  • One example is in the domain adaptation research direction , where LiTS datasets can be used to account for apparent differences/shifts in data distribution due to domain changes (e.g. acquisition settings) (Glocker et al., 2019; Castro et al., 2020).

  • Another recent and compelling use case is the research direction of federated learning, where the multi-institutional nature of the LiTS benchmark dataset can further contribute to federated learning simulation research and benchmarking (Sheller et al., 2018, 2020; Rieke et al., 2020; Pati et al., 2021). It targets potential solutions to LiTS-related tasks without sharing patient data across institutions. We believe federated learning is especially important because scientific maturity in this area could lead to a paradigm shift in multi-institutional collaboration. Additionally, it sidesteps technical, legal, and cultural data-sharing concerns, because patient data always remain within their home institution.
