【Synthetic Medical Data】SYNTHETICALLY ENHANCED: UNVEILING SYNTHETIC DATA'S POTENTIAL IN MEDICAL IMAGING RESEARCH


Paper information


  • SYNTHETICALLY ENHANCED: UNVEILING SYNTHETIC DATA’S POTENTIAL IN MEDICAL IMAGING RESEARCH

  • Bardia Khosravi, MD MPH MHPE1,2,*; Frank Li, PhD3,*; Theo Dapamede, MD PhD3; Pouria Rouzrokh, MD MPH MHPE1,2; Cooper Gamble1; Hari M. Trivedi, MD3; Cody C. Wiles, MD2; Andrew B. Sellergren, BA4; Saptarshi Purkayastha, PhD5; Bradley J. Erickson, MD PhD1,†; and Judy W. Gichoya, MD MS3,†

  • 1Department of Radiology, Mayo Clinic, Rochester, Minnesota, United States; 2Department of Orthopedics, Mayo Clinic, Rochester, Minnesota, United States; 3Department of Radiology, Emory University, Atlanta, Georgia, United States; 4Google Health, Google, Palo Alto, California, United States; 5School of Informatics and Computing, Indiana University - Purdue University Indianapolis, Indianapolis, IN, United States
    *Co-first author; †Co-senior author

  • ([email protected],[email protected])

Summary

Chest X-rays (CXR) are the most common medical imaging study and are used to diagnose multiple
medical conditions. This study examines the impact of synthetic data supplementation, using diffusion
models, on the performance of deep learning (DL) classifiers for CXR analysis. We employed three
datasets: CheXpert, MIMIC-CXR, and Emory Chest X-ray, training conditional denoising diffusion
probabilistic models (DDPMs) to generate synthetic frontal radiographs. Our approach ensured that
synthetic images mirrored the demographic and pathological traits of the original data. Evaluating the
classifiers’ performance on internal and external datasets revealed that synthetic data supplementation
enhances model accuracy, particularly in detecting less prevalent pathologies. Furthermore, models
trained on synthetic data alone approached the performance of those trained on real data. This
suggests that synthetic data can potentially compensate for real data shortages in training robust DL
models. However, despite promising outcomes, the superiority of real data persists.


  • Experiment: Effect of using synthetic data (chest X-ray images generated by a conditional denoising diffusion probabilistic model) to supplement training data on classifier performance.
  • Conclusions: The use of synthetic data can improve the accuracy of models in detecting uncommon pathologies, and models trained with synthetic data alone are close in performance to models trained with real data.

Intro

The introduction gives a comprehensive overview of the application of chest X-rays (CXR) and deep learning (DL) in medical imaging research.

The Importance and Challenges of Chest X-rays

  • Widely Used: As a primary imaging modality, CXR is widely used for the diagnosis of a variety of conditions ranging from acute respiratory distress to chronic pathologies such as lung cancer.
  • Fast and efficient triage: CXR is the most commonly performed diagnostic imaging test and is especially important for rapid triage in emergency settings.
  • Expert dependence: Although chest radiography has great potential for diagnosis and screening, its interpretation still requires radiologist expertise.
  • Healthcare resource bottlenecks: Rising demand and the limited availability of radiologists create bottlenecks in healthcare delivery, especially in underserved areas.

Application of artificial intelligence and deep learning in CXR

  • FDA-approved tool: Used to detect pathologies such as pneumothorax, pleural effusion, and rib fractures.
  • Model generalization issues: These DL-based models may not always generalize, and performance may degrade when applied to new populations.

Methods to improve model generalization ability

  • Method exploration: Approaches include increasing the size and diversity of the training sample, or training models jointly across institutions.
  • Data sharing challenges: Merging data from different institutions is difficult because of patient-privacy concerns.

Research on generative artificial intelligence

  • Content creation models: The goal is to develop models that can create realistic content (including text, images, video, and audio) that follows the training distribution.
  • Image generation models: These face a three-way trade-off among quality, diversity, and generation speed.

Applications and challenges of synthetic data

  • Potential of synthetic data: Possibly used to address model performance and generalization challenges.
  • Effects of Synthetic Data Augmentation: Overall model performance can be improved by synthesizing high-fidelity images as dataset augmentations.
  • Performance degradation concerns: Iterative training on synthetic data can lead to catastrophic interference, i.e., model forgetting.

Research objectives and methods

  • Research Purpose: To investigate the effects of synthetic data augmentation in medical imaging research and understand the factors that drive model development.
  • Methodology: First, train a conditional DDPM on a subset of the CheXpert dataset; then create a synthetic copy with the same demographic and pathological characteristics as the original dataset; finally, test multi-pathology classifiers trained on real and synthetic data on internal and external sources to explore the potential and limitations of synthetic data.

Method

2.1 Dataset description

Dataset collection: The study collected all available frontal chest X-rays from the CheXpert (CXP), MIMIC-CXR (MIMIC) and Emory Chest X-ray (ECXR) datasets.
Automated labeling: All three datasets were annotated with the same automated natural language processing (NLP) algorithm, the CheXpert labeler, which assigns each of 14 medical conditions to one of four categories: "present," "absent," "not mentioned," or "uncertain."
Data preprocessing: All images were preprocessed by resizing to 256 x 256 pixels with the aspect ratio preserved via padding, and by equalizing the image histogram to 256 bins.

Why pad to preserve the aspect ratio and equalize the image histogram to 256 bins?

Padding the image and equalizing its histogram to 256 bins is a common image processing combination, mainly for the following reasons:

Bit depth: Most modern images have 8-bit depth, meaning each channel can represent 256 intensity levels (from 0 to 255). A 256-bin histogram therefore fully captures the intensity information in the image.

Data accuracy: Equalizing the histogram over 256 bins improves image contrast and makes details clearer, because equalization stretches the intensity range of overly dark or bright regions, making the details in those regions more visible.

Maintaining the aspect ratio: The aspect ratio is the ratio of an image's width to its height. Histogram processing concerns intensity information rather than the actual size or shape of the image, so equalizing the histogram to 256 bins does not affect the aspect ratio; it is the padding step that brings the image to the square target size without distorting its proportions.

Performance considerations: Using a fixed number of bins (such as 256) simplifies the calculation process and allows the use of standardized algorithms and tools. This makes image processing more efficient and consistent.

Overall, padding and equalizing the histogram to 256 bins is an effective way to balance image quality, data accuracy, and processing efficiency.
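
As a concrete illustration, here is a minimal sketch of this preprocessing step, assuming OpenCV (the post does not name a library); `preprocess_cxr` is a hypothetical helper:

```python
import cv2
import numpy as np

def preprocess_cxr(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Resize a grayscale CXR so its longer side is `size`, pad the shorter
    side to preserve the aspect ratio, then equalize the 8-bit histogram."""
    h, w = img.shape
    scale = size / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))),
                         interpolation=cv2.INTER_AREA)
    nh, nw = resized.shape
    top, left = (size - nh) // 2, (size - nw) // 2
    # Pad symmetrically with zeros to reach a square size x size canvas.
    padded = cv2.copyMakeBorder(resized, top, size - nh - top,
                                left, size - nw - left,
                                borderType=cv2.BORDER_CONSTANT, value=0)
    # cv2.equalizeHist works on 8-bit images, i.e. a 256-bin histogram.
    return cv2.equalizeHist(padded.astype(np.uint8))
```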

2.2 Image generation

Image generation with DDPMs: The study investigates denoising diffusion probabilistic models (DDPMs), which combine a forward and a reverse diffusion process, to create synthetic images.
Forward diffusion process: A small amount of Gaussian noise is added at each step; as the number of time steps increases, the initial image gradually turns into isotropic Gaussian noise (see the sketch after this list).
Reverse diffusion process: Designed to estimate the noise added between consecutive steps; this is accomplished by training a deep learning model, often called a diffusion model.
Conditional model training: A generative model conditioned on gender, age, race, and the 14 pathology labels was trained on the CXPTr dataset.
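
For concreteness, here is a minimal PyTorch sketch of the closed-form forward process, assuming the standard linear beta schedule from the original DDPM formulation (the post does not specify the schedule):

```python
import torch

# Linear beta schedule over T steps (an assumption; not stated in the post).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_diffusion(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(x0)                    # isotropic Gaussian noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # per-sample cumulative product
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps  # the diffusion model learns to predict eps from (xt, t, cond)
```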

Here are the highlights:
Classifier-free guidance (CFG): A classifier-free guidance technique is used to make the generated images follow the conditioning variables. During training, CFG randomly swaps the actual class embeddings with a learned null embedding.
Effect of CFG scale: The impact of the CFG scale on downstream tasks was investigated by creating three CXPTr replicas with identical demographic and pathological labels, differing only in the CFG scale, set to {0, 4, 7.5} (see the sketch below).
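
The sketch below shows how CFG blends the conditional and unconditional noise predictions at sampling time; the `model` call signature is hypothetical, and libraries parameterize the guidance scale differently:

```python
import torch

def cfg_noise(model, xt, t, cond_emb, null_emb, scale: float) -> torch.Tensor:
    """One common CFG formulation at sampling time:
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_cond = model(xt, t, cond_emb)    # prediction with the class embedding
    eps_uncond = model(xt, t, null_emb)  # prediction with the learned null embedding
    return eps_uncond + scale * (eps_cond - eps_uncond)

# During training the condition is randomly dropped, so one network learns both
# conditional and unconditional denoising, e.g. with 10% probability:
# cond = null_emb if torch.rand(()) < 0.1 else cond_emb
```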

What are the characteristics of Gaussian noise?

Gaussian noise, also known as normal noise, is a common noise type in the fields of image processing and signal processing and has the following characteristics:

Statistical properties: Gaussian noise follows a Gaussian distribution (normal distribution), which means that the noise values are distributed around a mean value (usually 0), with the shape of the distribution determined by the standard deviation. The Gaussian distribution is symmetric and has a bell-shaped curve.
Randomness: Gaussian noise is a type of random noise whose value is random at every point. This randomness makes Gaussian noise appear as irregular granular shapes in images or signals.
Independence: Gaussian noise is usually independent of each other at different pixels or time points, that is, the noise value at one point will not affect the noise value at other points.
Additive: Gaussian noise is often considered additive, meaning that it is simply added to the true value of the signal or image. This property allows Gaussian noise to be processed and reduced by various linear methods.
Frequency characteristics: Gaussian noise is usually full-band in the frequency domain, which means that it affects all frequency components of an image or signal.
Colorless: Gaussian noise is considered "colorless" because it is evenly distributed across the spectrum, as opposed to "colored noise" such as pink noise or Brownian noise.
Energy distribution: In Gaussian noise, most of the energy is concentrated near the average value, and the energy decreases rapidly as it moves away from the average value.

Because of these properties, Gaussian noise is very useful in simulating real-world noise and in the testing and evaluation of computer vision and signal processing algorithms.
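
A tiny NumPy demonstration of the zero-mean, independent (i.i.d. per pixel), and additive properties described above:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, (256, 256))                  # stand-in for a normalized image
noise = rng.normal(loc=0.0, scale=0.1, size=img.shape)   # zero-mean, sigma = 0.1, i.i.d.
noisy = img + noise                                      # additive: simply summed with the signal

print(noise.mean(), noise.std())  # ~0.0 and ~0.1: energy concentrated near the mean
```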

Objectives and Experimental Design

Research purpose: To investigate the impact of synthetic data augmentation in medical imaging research and understand the factors contributing to model development.
Generation of a large-scale synthetic dataset: After determining the most suitable CFG scale, a large synthetic dataset was generated in which each real image was replicated into 10 synthetic variants, each keeping the same demographic and pathological attributes but using a different initialization seed to increase the diversity of the synthetic dataset.

2.3 Pathological classification

Model selection: All experiments were conducted using the ConvNeXt-base model pre-trained on natural image datasets.
Input size: An input size of 256 x 256 pixels was chosen, which has been shown to capture enough information to train a state-of-the-art supervised classifier.
Data augmentation: Standard online augmentation from the MONAI package is used, including horizontal and vertical flipping, rotation (±60 degrees), scaling (±10%), and translation (±12 pixels); see the sketch after this list.
Learning rate and weight decay: A learning rate of 0.00001 and a weight decay of 0.0003 are used, combined with the Lion optimizer and a binary cross-entropy loss.
Training stability: To further stabilize training, an exponential moving average (EMA) of the model weights is maintained with a decay factor of 0.9999.
Model selection criterion: The best model is selected by the lowest validation loss.
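
A sketch of this training setup, assuming timm for the pretrained ConvNeXt-base, the lion-pytorch package for the optimizer, and MONAI for the augmentations; the augmentation probabilities are assumptions, as the post does not state them:

```python
import math
import timm                    # assumption: pretrained ConvNeXt comes from timm
import torch
from lion_pytorch import Lion  # pip install lion-pytorch
from monai.transforms import Compose, RandAffine, RandFlip, RandRotate, RandZoom

# Online augmentations matching the description; prob values are assumptions.
train_transforms = Compose([
    RandFlip(prob=0.5, spatial_axis=1),              # horizontal flip
    RandFlip(prob=0.5, spatial_axis=0),              # vertical flip
    RandRotate(range_x=math.radians(60), prob=0.5),  # rotation within +/-60 degrees
    RandZoom(min_zoom=0.9, max_zoom=1.1, prob=0.5),  # scaling within +/-10%
    RandAffine(translate_range=(12, 12), prob=0.5),  # translation within +/-12 pixels
])

model = timm.create_model("convnext_base", pretrained=True,
                          in_chans=1, num_classes=14)
optimizer = Lion(model.parameters(), lr=1e-5, weight_decay=3e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()               # binary cross-entropy

# EMA of the weights with decay 0.9999 for training stability.
ema_model = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda ema, cur, n: 0.9999 * ema + (1 - 0.9999) * cur)
```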

2.3.1 Use synthetic data + real data

Purpose of the experiment: Test whether synthetic data from the same dataset can improve the performance of the model on the same test set distribution.
Experimental design: Supplement the real training set with synthetic data at different proportions (from 100% to 1000%, in 100% increments).
Experimental process: For example, 300% supplementation means adding three times as many synthetic images as original images and training the model on the combined set (see the sketch after this list).
Validation and training: Randomly select 10% of CXPTr for validation and use the rest for model training; synthetic images are never included in the validation set.
Baseline comparison: A model trained on purely real data (0% supplementation) serves as the baseline for comparison.
Performance testing: Test the performance of the resulting 11 models on CXPTs, MIMIC-CXR, and ECXR.
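
A minimal sketch of how such a supplemented training set could be assembled; `supplement` is a hypothetical helper, not code from the paper:

```python
from torch.utils.data import ConcatDataset, Dataset

def supplement(real_train: Dataset, synthetic_copies: list) -> Dataset:
    """Each element of `synthetic_copies` is one full synthetic replica of the
    real training set, so passing three copies corresponds to 300% supplementation."""
    return ConcatDataset([real_train, *synthetic_copies])
```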

2.3.2 Pure synthetic data

Experimental design: Evaluate the performance of a model trained using only synthetic data, simulating the sharing of synthetic data only to external agencies.
Purpose of the experiment: To establish the utility of synthetic data used alone, and to show the extent to which synthetic data can substitute for real data without sacrificing performance.
Experimental process: The same split as in the previous experiment, but all real data are excluded from the training set while the real validation set is retained.
Model training: Train 10 models, each using a different amount of synthetic data (100%-1000%), and evaluate their performance on CXPTs, MIMIC-CXR, and ECXR.

2.3.3 Synthetic data + external data set

Purpose of the experiment: To evaluate the generalization ability of models trained on a combination of real and synthetic data with different distributions.
Experimental design: Use MIMICTr as the training set (divided into 90% training, 10% validation) and mixed with different proportions of synthetic data generated based on CXPTr, similar to the previous experiment.
Generalization ability evaluation: Evaluate the impact of different proportions of synthetic data on the model's generalization ability on different data sets.
Performance Testing: The 10 models were evaluated based on their performance on CXPTs, MIMICTs and ECXR.

2.4 Assessment methods

Fréchet Inception Distance (FID): Used to evaluate the quality and diversity of the generated images. FID is computed with an InceptionV3 network (trained for natural image classification) by comparing the Fréchet distance between the penultimate-layer features of real and synthetic images.
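
For reference, a small sketch of the Fréchet distance between the two feature Gaussians (means and covariances of real and synthetic InceptionV3 features); this follows the standard FID formula, not code from the paper:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_r, sigma_r, mu_s, sigma_s) -> float:
    """FID between Gaussians fitted to InceptionV3 penultimate-layer features:
    ||mu_r - mu_s||^2 + Tr(Sigma_r + Sigma_s - 2 (Sigma_r Sigma_s)^(1/2))."""
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_s, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(sigma_r + sigma_s - 2.0 * covmean))
```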

  • Performance evaluation of pathology classifiers: Using the area under the receiver operating characteristic curve (AUROC) as the primary metric.

  • Confidence interval calculation: 95% confidence intervals (CI) are computed from 1000 bootstrap resamples, and models are compared with paired t-tests (a bootstrap sketch follows this list).

  • Correction for multiple comparisons: Bonferroni correction is used to control the Type I error rate (α); in all cases, α = 0.05 was taken as the significance level.

  • Label distribution understanding: Plot the pathology co-occurrence matrix for each data set and use the Pearson correlation coefficient to compare the similarity between them.

  • Inference speed measurements: Reported on an 80-GB A100 GPU.

  • Statistical analysis tools: All statistical analyses were performed using the scikit-learn package (v1.3.1).
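
A minimal sketch of the bootstrap CI computation described above, assuming binary labels and scikit-learn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """95% CI for AUROC from `n_boot` bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n, scores = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample with replacement
        if y_true[idx].min() == y_true[idx].max():  # need both classes present
            continue
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```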

This description of the assessment approach highlights its comprehensiveness and rigor. By using FID to evaluate the quality and diversity of synthetic images, combined with AUROC to evaluate the performance of pathology classifiers, this evaluation method provides a comprehensive understanding of the utility of synthetic data. In addition, the use of bootstrapping to calculate confidence intervals, paired t-tests for model comparisons, and Bonferroni correction for correction for multiple comparisons demonstrate the statistical rigor of the evaluation process. Finally, the study provides further insights into the properties and application potential of synthetic data by analyzing label distributions and comparing similarities between different datasets using Pearson correlation coefficients.

Results

1. Mixed experiment of synthetic data and real data

Experimental design: Use different proportions of synthetic data (100% to 1000%) mixed with real data for training the model.
Performance improvements: The results show that in some cases, adding synthetic data can significantly improve the model's performance on real test sets.
Best ratio: Specific proportions of synthetic data (such as 300% or 400%) provided the best performance gains.

2. Pure synthetic data experiment

Purpose of the experiment: Test the performance of a model trained using only synthetic data, and simulate a situation where external agencies can only access synthetic data.
Results: Models trained only on synthetic data performed close to models trained on real data on the test sets, showing that synthetic data has the potential to substitute for real data.

3. Mix experiments with data from different sources

Purpose of the experiment: To evaluate the generalization ability of the model when real data and synthetic data from different sources are mixed.
Generalization ability: The results show that mixing data from different sources improves the model's generalization on external test sets.

4. Image quality and diversity assessment

Evaluation using FID: Evaluate the quality and diversity of synthetic images by Fréchet Inception Distance (FID).
Quality and diversity: The synthetic images perform well on FID scores, indicating that the generated images have high quality and diversity.

5. Pathology classifier performance evaluation

Evaluation using AUROC: Use the area under the receiver operating characteristic curve (AUROC) as the main metric to evaluate the performance of pathology classifiers.
Classifier performance: The use of synthetic data improved the classifiers' AUROC to some extent, especially when synthetic data was mixed with real data.

Summary

These results demonstrate that synthetic data has significant application potential in medical image analysis. They can be used to improve model performance and generalization capabilities, especially in data-constrained scenarios. By exploring the use of synthetic data in different experimental settings, this study provides insights into the application of synthetic data in the field of medical imaging.

Discussion

1. Application of synthetic data in medical imaging

Performance gap problem: There is a performance gap when the model is tested on data from different sources. In the past, the use of synthetic data in medical imaging was limited by its low quality, but the emergence of new technologies, such as diffusion models, provides opportunities to create high-quality, diverse medical images.
The study found: Deep learning models trained on synthetic data can achieve performance levels comparable to those trained on real data, indicating the feasibility of synthetic datasets in medical imaging.

2. Improve model performance and generalization ability

Detection of less common pathologies: Supplementing real datasets with synthetic data can significantly enhance model performance and generalization capabilities, especially in detecting less common pathologies.

3. The role of CFG (classifier-free guidance)

Effect of CFG scale: Experiments found that synthetic images generated with a CFG scale of 0 were the most similar to real images, analogous to the truncation factor (Φ) in GANs. Effect of raising the CFG scale: Increasing the CFG scale degraded the model's performance in learning the subtler pathological signals present in real images.

4. Data breach and privacy issues

Risk of data leakage: Synthetic data generation carries the possibility of leaking training data, which is especially serious in the medical field, where patient anonymity is crucial.
Solution: Despite some experimental solutions, anonymization of synthetic data remains a frontier research topic.

5. Experimental limitations

Label abstraction error: The CheXpert labeler is used to provide the conditioning variables for image generation, so labeler errors may propagate into image quality and into the ground-truth labels used for classification.
Single CFG scale: Only one CFG scale image is used, and the impact of different CFG scale combinations is not explored.
Not Validated on Other Tasks: Other tasks such as segmentation and object detection have not been validated with similar rigor.
Unchanged disease distribution: The disease prevalence was not varied, so the study only demonstrates the potential of synthetic data under the original label distribution.

Article summary

This study shows that synthetic data is useful for training downstream classification models and can match or outperform classifiers trained on real data in many cases. The authors also present optimal hyperparameters for generating synthetic datasets and demonstrate generalization on two large datasets. Importantly, even small amounts of synthetic data can close the generalization gap of models trained on other data sources. However, the quality of real data still exceeds that of synthetic data, and collecting more real data should remain the preferred way to increase dataset size.

Origin: blog.csdn.net/cvxiayixiao/article/details/134523129