Deep Learning for Medical Image Segmentation: Tricks, Challenges and Future Directions


Foreword

This article walks through a review paper as an introduction to medical image segmentation, offering a more intuitive and comprehensive understanding of the tricks used in the field.

Original paper link: Deep Learning for Medical Image Segmentation: Tricks, Challenges and Future Directions


1. Abstract & Introduction

1.1. Abstract

Recent contributions to medical image segmentation mainly focus on improving network architectures, training strategies, and loss functions, while inadvertently overlooking marginal implementation details (also known as tricks).

The paper collects a series of tricks for medical image segmentation (MedISeg) at different model implementation stages (i.e., pre-training model, data pre-processing, data augmentation, model implementation, model inference, and result post-processing).

1.2. Introduction

In this paper, the MedISeg model is divided into six implementation stages, namely pre-training model, data preprocessing, data augmentation, model implementation, model inference, and result postprocessing.

For each trick, the paper experimentally explores its effectiveness using two typical segmentation baselines, 2D-UNet and 3D-UNet, on four medical image segmentation datasets: 2D ISIC 2018, 2D CoNIC, 3D KiTS19, and 3D LiTS.

[Figure: graphical representation of the investigated MedISeg tricks and their potential relationships]

The main contributions of this paper can be summarized as follows:

  • The paper collects a series of MedISeg tricks for different implementation stages and experimentally explores their effectiveness on consistent CNN baseline models.
  • The paper explicitly demonstrates the effectiveness of these tricks; its extensive and robust experimental results on 2D and 3D medical image datasets compensate for implementation oversights in MedISeg.
  • The authors open-source a robust MedISeg repository that includes a wealth of segmentation tricks, each of which is plug-and-play.
  • This landmark work will help follow-up work compare the experimental results of MedISeg models under a fair setting.
  • The work provides practical guidance for a wide range of future medical image processing challenges, especially segmentation.

2. Preliminaries

2.1. Baselines

  • 2D-UNet
    • 2D-UNet consists of an encoder network and a decoder network
      • The encoder network has four spatially hierarchical stages. Each stage consists of two 3×3 convolutional layers, followed by a ReLU activation function and a max pooling layer (stride = 2)
      • The decoder network takes the output of the encoder network as input and also has four stages, each corresponding to the encoder stage of the same spatial size
      • In each decoder stage, a 2D transposed convolution operator first upsamples the feature map by a factor of 2 through a bilinear interpolation operation
      • Two 3×3 convolutional layers and a ReLU activation function are then deployed in sequence
      • In the last decoder layer, a 1×1 convolutional layer maps the channels of the output feature map to the number of classes in the dataset
  • 3D-UNet
    • The network structure of 3D-UNet is almost the same as that of 2D-UNet; it consists of a 3D encoder network and a 3D decoder network
  • The difference between 3D-UNet and 2D-UNet in implementation is:
    • 2D convolutional layers are replaced by 3D convolutional layers
    • A lateral (skip) connection is added between encoder and decoder layers that have the same spatial and channel size
    • A normalization layer is implemented on the input image
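
To make the four-stage encoder/decoder layout above concrete, here is a minimal shape trace in plain Python. It is only a sketch: the base channel width of 64, channel doubling per stage, size-preserving (padded) convolutions, and the extra bottleneck step are assumptions for illustration and are not stated in the article.

```python
def unet_shapes(size, base_ch=64, stages=4, n_classes=2):
    """Trace (name, channels, spatial size) through a U-Net-like encoder/decoder.

    Assumptions (not from the article): padded 3x3 convs keep the spatial size,
    channels double at every encoder stage, and a bottleneck sits between the
    encoder and the decoder.
    """
    trace = []
    skips = []
    ch = base_ch
    # Encoder: two 3x3 convs + ReLU, then max pooling with stride 2 halves the size.
    for s in range(stages):
        trace.append((f"enc{s + 1}", ch, size))
        skips.append((ch, size))
        size //= 2
        ch *= 2
    trace.append(("bottleneck", ch, size))
    # Decoder: upsample by 2, merge the lateral connection of the same size,
    # then two 3x3 convs + ReLU bring the channels back to the skip width.
    for s in range(stages, 0, -1):
        skip_ch, skip_size = skips.pop()
        size *= 2
        assert size == skip_size, "input size must be divisible by 2**stages"
        ch = skip_ch
        trace.append((f"dec{s}", ch, size))
    # Final 1x1 conv maps the channels to the number of classes.
    trace.append(("head", n_classes, size))
    return trace


for name, channels, spatial in unet_shapes(256):
    print(f"{name:10s} channels={channels:5d} size={spatial}")
```

The trace shows why lateral connections need matching spatial sizes: each decoder stage must return to exactly the resolution of its encoder counterpart, which holds only when the input size is divisible by 2 to the power of the number of stages.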

2.2. Datasets

  • 2D ISIC 2018
    • ISIC 2018 is one of the most representative and challenging 2D skin lesion boundary segmentation datasets in the field of computer-aided diagnosis, which consists of 2,594 JPEG dermoscopic images and 2,594 PNG GT images


  • 2D CoNIC
    • The images are from the Lizard dataset, which includes six nuclear classes (namely "Epithelial cells", "Connective tissue cells", "Lymphocytes", "Plasma cells", "Neutrophils", and "Eosinophils") and one "background" class.
  • 3D KiTS19
    • A total of 210 annotated, high-quality 3D abdominal computed tomography scans of patients are publicly available
  • 3D LiTS
    • LiTS presents a benchmark for contrast-enhanced abdominal computed tomography. LiTS contains 130 scanned images for training and 70 scanned images for testing.

3. Methods and Experiments

3.1. Pre-Training Model

A pre-trained model (that is, pre-trained weights used to initialize the model before fine-tuning) provides favorable initial parameters, so training converges faster and the resulting model can generalize better. The choice of pre-trained model has a noticeable impact on performance.

The large-scale neural networks in use today often need tens of millions of samples to be trained from scratch to a usable level, and data in the medical field is hard to obtain (because of patient privacy and other issues), so pre-trained weights are often used in medical image segmentation tasks.

Pretrained weights can be divided into two broad categories: fully supervised weights and self-supervised weights.

  • PyTorch official weights
    • These pre-trained weights are provided in torchvision.models. They are obtained by training the corresponding backbone network on the ImageNet 1k dataset for single-label image classification, where 1k denotes that the dataset consists of 1000 classes of common scenes.
  • Model-oriented ImageNet 1k weights
    • In addition to PyTorch's official pre-trained weights, model creators usually publish pre-trained weights on ImageNet
    • For example, model-oriented ImageNet 1k weights can be obtained by training ResNet-50 on the ImageNet 1k dataset for image classification
  • Model-oriented ImageNet 21k weights
    • ImageNet 21k is a more general and comprehensive version of the dataset, with about 21,000 object categories in total, for (weakly supervised, semi-supervised, or fully supervised) image classification.
    • Therefore, weights trained on ImageNet 21k are more conducive to improving the recognition performance of downstream computer vision models
  • SimCLR weights
    • SimCLR demonstrates that introducing a learnable non-linear transformation between the image representation and the contrastive learning loss can greatly improve the representation quality of the model
    • It is implemented in three steps:
      • First, group the input images into image patches
      • Then, apply different data augmentation strategies to different batches of image patches
      • Finally, train the model to produce similar results for different augmentations of the same image patch and to push apart the results of other patches
  • MOCO weights
    • MOCO is one of the classic self-supervised contrastive learning methods
    • It is designed to address the inconsistency of sampled features in memory banks
  • Model genesis (ModelGe) weights
    • ModelGe is an advanced self-supervised pre-training technique, usually composed of four transformation operations (non-linear, local pixel shuffling, out-painting, and in-painting), for single-image restoration of computed tomography and magnetic resonance imaging images
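
The contrastive objective behind self-supervised weights such as SimCLR ("similar results for augmentations of the same patch, different results for other patches") can be sketched with a NumPy version of the normalized temperature-scaled cross-entropy (NT-Xent) loss. This is an illustrative sketch, not the paper's code: the function name, the temperature value of 0.5, and the toy embeddings are assumptions.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for N positive pairs.

    z_a[i] and z_b[i] are embeddings of two augmented views of the same patch.
    The temperature is an assumed hyperparameter.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors -> cosine similarity
    sim = z @ z.T / temperature                         # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                      # a view is never compared with itself
    # Row i's positive is the other view of the same patch.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()


rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.01 * rng.normal(size=views_a.shape)  # nearly identical views
print(nt_xent_loss(views_a, views_b))
```

The loss is low when the two views of each patch are close to each other and far from every other patch in the batch, which is the behaviour the three steps above train for.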


3.2. Data Pre-Processing

Because of the particular characteristics of 3D medical image data, data pre-processing is necessary for deep CNNs to obtain satisfactory recognition performance. This subsection mainly discusses the effectiveness of four common image pre-processing techniques on 3D-UNet: patching, oversampling (OverSam), resampling (ReSam), and intensity normalization (IntesNorm).

  • Patching
    • Some types of medical images (such as magnetic resonance images and pathological images) are often very large in spatial size, while the number of available samples is small. It is therefore impractical to train medical image segmentation models directly on these images.
    • A common practice is to resample each image into patches of smaller spatial size, so that the model needs less GPU memory and can be trained better.
    • The patch size is one of the most important factors affecting model performance.
  • OverSam
    • To address the class imbalance between positive and negative samples, an oversampling strategy is proposed.
    • The OverSam strategy is mainly used for minority sample classes.
    • Existing oversampling strategies include random oversampling, synthetic minority oversampling (SMOTE), borderline SMOTE, and adaptive synthetic sampling (ADASYN)
  • ReSam
    • The resampling strategy improves how well the dataset used by a machine learning model represents the underlying data. Since the available samples are sometimes limited and heterogeneous, better sub-sampled data can be obtained through random or non-random resampling strategies.
    • In its implementation, resampling mainly includes four steps:
      • Spacing interpolation
      • Window transformation
      • Obtaining the valid mask range
      • Generating sub-images
  • IntesNorm
    • IntesNorm is a specific normalization strategy, mainly used for medical images. There are usually two variants:
      • Z-score normalization for all modalities
      • Normalization specific to computed tomography images
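
Patching, oversampling, and intensity normalization from the list above can be combined in a few lines of NumPy. The sketch below makes several assumptions that are not in the article: the oversampling shown is simple foreground-biased patch sampling (not SMOTE or ADASYN), the computed tomography variant is taken to be percentile clipping followed by a z-score, the percentile bounds 0.5 and 99.5 are illustrative, and every patch is assumed to fit inside the volume.

```python
import numpy as np


def zscore(volume):
    """Z-score intensity normalization, usable for any modality."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)


def normalize_ct(volume, lo_pct=0.5, hi_pct=99.5):
    """Assumed CT variant: clip to intensity percentiles, then z-score."""
    lo, hi = np.percentile(volume, [lo_pct, hi_pct])
    return zscore(np.clip(volume, lo, hi))


def sample_patch(volume, mask, patch_size, fg_prob, rng):
    """Crop one patch; with probability fg_prob centre it on a foreground voxel.

    Raising fg_prob oversamples the minority (foreground) class.
    Assumes patch_size is no larger than the volume along every axis.
    """
    shape = np.array(volume.shape)
    size = np.array(patch_size)
    foreground = np.argwhere(mask > 0)
    if len(foreground) > 0 and rng.random() < fg_prob:
        centre = foreground[rng.integers(len(foreground))]
        start = centre - size // 2
    else:
        start = rng.integers(0, shape - size + 1)
    start = np.clip(start, 0, shape - size)      # keep the patch inside the volume
    window = tuple(slice(s, s + p) for s, p in zip(start, size))
    return volume[window], mask[window]


rng = np.random.default_rng(0)
ct = rng.normal(size=(32, 32, 32))
labels = np.zeros(ct.shape, dtype=np.uint8)
labels[1:3, 1:3, 1:3] = 1                        # a tiny foreground object
patch, patch_mask = sample_patch(normalize_ct(ct), labels, (16, 16, 16), 1.0, rng)
print(patch.shape, int(patch_mask.sum()))
```

Spacing interpolation for ReSam is left out on purpose, since it needs an interpolation routine from an external library.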


3.3. Data Augmentation

For medical images, data augmentation is usually used to cope with an insufficient number of training samples; it alleviates overfitting and gives the model stronger generalization ability and robustness.

Data augmentation schemes for MedISeg can be divided into two main categories: data augmentation based on geometric transformation (GTAug) and data augmentation based on generative adversarial networks (GANAug).

  • GTAug
    • GTAug is proposed to reduce the influence of geometric variations of objects in the training images, such as position, scale, and viewpoint
    • Commonly used GTAug operations include flipping, cropping, rotation, translation, color jittering, contrast adjustment, simulated low resolution, Gaussian noise injection, image mixing, random erasing, Gaussian blur, mixup, and cutmix
  • GANAug
    • The inherent premise of data augmentation is to introduce domain knowledge or other incremental information into the training dataset
    • GANAug can be viewed as a loss function that guides the network to generate realistic data close to the domain of the source dataset
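
A few of the GTAug operations listed above (flipping, rotation, Gaussian noise injection) can be sketched in NumPy. The key point for segmentation is that geometric transforms must be applied identically to the image and its mask, while intensity transforms touch the image only. The function name, the flip probabilities, and the noise level are assumptions for illustration.

```python
import numpy as np


def augment_pair(image, mask, rng, noise_std=0.05):
    """Randomly flip and rotate image and mask together; add noise to the image only.

    Expects arrays whose first two axes are height and width.
    """
    if rng.random() < 0.5:                           # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                           # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))                      # rotate by k * 90 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    # Gaussian noise injection changes intensities, so the mask stays untouched.
    image = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)


rng = np.random.default_rng(0)
img = rng.random((8, 12))
msk = (img > 0.7).astype(np.uint8)
aug_img, aug_msk = augment_pair(img, msk, rng)
print(aug_img.shape, int(aug_msk.sum()) == int(msk.sum()))
```

Because only flips and 90-degree rotations are used, the mask keeps exactly the same labels and pixel counts; transforms that interpolate (arbitrary rotation, scaling) would need nearest-neighbour interpolation for the mask.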


3.4. Model Implementation

A MedISeg model usually involves many implementation details, and each overlooked change can have a potential impact on performance.

This paper selects four classes of commonly used implementation tricks and explores their segmentation effectiveness: deep supervision (DeepS); class balance loss (CBL), including four loss functions (CBLDice, CBLFocal, CBLTvers, and CBLWCE); online hard example mining (OHEM); and instance normalization (IntNorm).

  • DeepS
    • Deep supervision is an auxiliary learning technique. It directly or indirectly adds auxiliary classifiers or segmenters to some intermediate hidden layers to supervise the backbone network, which helps with vanishing gradients and slow convergence. For image segmentation, this is usually achieved by adding an image-level classification loss.
  • CBL
    • CBL is often used to learn general class weights, i.e., the weight of each class depends only on the object category. Compared with some traditional segmentation loss functions (e.g., cross-entropy loss), CBL can improve the representation ability of the model on class-imbalanced datasets.
    • For the datasets used, CBL introduces the effective number of samples to represent the expected volume of the selected dataset, and weights the classes by their effective number of samples rather than by their raw number of samples.
  • OHEM
    • The core idea of OHEM is to first select some hard-to-learn samples (i.e., images, objects, and pixels) by means of a loss function. These selected hard samples all have a strong impact on the recognition task.
    • These samples are then applied to gradient descent during model training.
  • IntNorm
    • IntNorm is a popular normalization algorithm for recognition tasks that place higher demands on individual pixels. In its implementation, all elements of each individual sample and individual channel of the sample are considered when calculating the statistical normalization.
    • In the field of medical imaging, an important reason for using IntNorm is that the batch size is usually set to a small value during training (especially for 3D images).
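
The ideas behind CBL, OHEM, and IntNorm can each be written in a few lines of NumPy. The sketch below is hedged: the weighting uses the commonly cited effective-number formula (1 - beta^n) / (1 - beta) with an assumed beta of 0.999, the OHEM keep ratio of 0.25 is illustrative, and the soft Dice loss stands in for the Dice term that a CBLDice-style loss would weight. None of the function names come from the paper's repository.

```python
import numpy as np


def class_balanced_weights(counts, beta=0.999):
    """Per-class weights from the effective number of samples.

    effective_n = (1 - beta**n) / (1 - beta); weights are its inverse,
    rescaled so that they sum to the number of classes.
    """
    counts = np.asarray(counts, dtype=np.float64)
    effective = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective
    return weights / weights.sum() * len(counts)


def soft_dice_loss(prob, target, eps=1e-6):
    """1 - Dice overlap between a probability map and a binary target."""
    intersection = (prob * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)


def ohem_loss(pixel_losses, keep_ratio=0.25):
    """Average only the hardest pixels, i.e. those with the largest loss."""
    flat = np.sort(pixel_losses.ravel())[::-1]
    k = max(1, int(keep_ratio * flat.size))
    return flat[:k].mean()


def instance_norm(x, eps=1e-5):
    """Normalize each sample and each channel separately; x has shape (N, C, *spatial)."""
    axes = tuple(range(2, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


print(class_balanced_weights([10000, 100]))   # the rare class gets the larger weight
```

Because `instance_norm` computes its statistics per sample, its output does not depend on the batch size, which is why it suits the small batches typical of 3D training.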


3.5. Model Inference

Inference tricks are an important way to improve recognition performance. This paper mainly discusses two commonly used ones: test-time augmentation (TTA) and model ensembling (Ensemble).

  • TTA.
    • TTA is a popular data augmentation mechanism applied at the model inference stage to improve accuracy. It improves recognition performance without any training, so it has the potential to be a plug-and-play tool. It can also improve model calibration, which is beneficial for vision tasks.
  • Ensemble.
    • The Ensemble strategy combines multiple well-trained models and fuses their predictions on the test set through some aggregation mechanism, so that the final result draws on every trained model and overall generalization improves
    • Commonly used model ensemble methods include voting, averaging, stacking, and non-cross-stacking (blending)
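
Flip-based TTA together with averaging and voting ensembles can be sketched as follows. The sketch assumes each model is a plain Python callable that maps a 2D image to a probability map of the same shape; `toy_model` is a hypothetical stand-in, not a trained network, and the 0.5 threshold is an assumed default.

```python
import numpy as np


def predict_tta(model, image):
    """Average predictions over flipped copies of the image.

    Each prediction is flipped back before averaging so that all maps align.
    """
    predictions = []
    for axes in [(), (0,), (1,), (0, 1)]:
        augmented = np.flip(image, axis=axes) if axes else image
        pred = model(augmented)
        predictions.append(np.flip(pred, axis=axes) if axes else pred)
    return np.mean(predictions, axis=0)


def ensemble_average(models, image):
    """Averaging ensemble: mean of the (TTA) probability maps of all models."""
    return np.mean([predict_tta(m, image) for m in models], axis=0)


def ensemble_vote(models, image, threshold=0.5):
    """Voting ensemble: a pixel is foreground if a strict majority of models says so."""
    votes = np.stack([model(image) > threshold for model in models])
    return (votes.sum(axis=0) * 2 > len(models)).astype(np.uint8)


def toy_model(image):
    """Hypothetical stand-in for a trained segmenter: a pixel-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-image))


image = np.random.default_rng(0).normal(size=(6, 5))
print(ensemble_average([toy_model, toy_model], image).shape)
```

Neither function needs any training, which is what makes these tricks plug-and-play; the cost is one extra forward pass per augmentation and per model.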


3.6. Result Post-Processing

Post-processing aims to improve model performance through non-learnable methods, such as refining segmentation results by aggregating global information. The paper covers two result post-processing methods: all-but-largest-component suppression (ABL-CS) and removal of small areas (RSA).

  • ABL-CS
    • ABL-CS aims to remove some erroneous regions in the segmentation results based on the knowledge of the physical properties of organisms
  • RSA
    • In the field of medical image segmentation, the imaging protocol is usually constant, so the area of each instance segmentation mask is also kept constant. Based on this physical property, we can set a pixel-level threshold to remove some instance masks that are too small (i.e., below a given threshold) among the obtained segmentation masks.
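
Both post-processing methods rest on connected-component analysis of the predicted mask. The sketch below implements it for 2D binary masks with a small breadth-first search so that it depends only on NumPy. The choice of 4-connectivity, the function names, and the example threshold are assumptions; a production pipeline would more likely use a library labeling routine and handle 3D volumes.

```python
from collections import deque

import numpy as np


def label_components(mask):
    """Label 4-connected foreground regions of a 2D binary mask (0 = background)."""
    labels = np.zeros(mask.shape, dtype=np.int32)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                                  # already part of a component
        current += 1
        labels[seed] = current
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not labels[nr, nc]:
                    labels[nr, nc] = current
                    queue.append((nr, nc))
    return labels, current


def keep_largest_component(mask):
    """ABL-CS style: suppress every component except the largest one."""
    labels, n = label_components(mask)
    if n == 0:
        return np.zeros_like(mask)
    sizes = np.bincount(labels.ravel())[1:]           # skip the background count
    return (labels == sizes.argmax() + 1).astype(mask.dtype)


def remove_small_areas(mask, min_area):
    """RSA style: drop components with fewer than min_area pixels."""
    labels, n = label_components(mask)
    sizes = np.bincount(labels.ravel(), minlength=n + 1)
    keep = sizes >= min_area
    keep[0] = False                                   # background is never kept
    return keep[labels].astype(mask.dtype)


pred = np.zeros((6, 6), dtype=np.uint8)
pred[0:3, 0:3] = 1        # large region, area 9
pred[4, 0:2] = 1          # small region, area 2
pred[5, 5] = 1            # isolated pixel, area 1
print(int(keep_largest_component(pred).sum()), int(remove_small_areas(pred, 2).sum()))
```

ABL-CS fits tasks where a single organ is expected per scan, while RSA fits multi-instance tasks such as nuclei segmentation, where the threshold should reflect the smallest plausible instance.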



Origin blog.csdn.net/HoraceYan/article/details/128483969