MIA literature reading——The latest progress and clinical application of deep learning in medical image analysis [2022]

0 Abstract


Deep learning has gained widespread research interest for developing new medical image processing algorithms, and deep learning-based models have been very successful in various medical imaging tasks that support disease detection and diagnosis. Despite this success, further improvement of deep learning models in medical image analysis is largely hindered by the lack of large-scale and well-annotated datasets. Over the past five years, much research has focused on addressing this challenge. In this article, we review and summarize these recent studies to provide a comprehensive overview of the application of deep learning methods in various medical image analysis tasks. We focus on the latest progress and contributions of unsupervised and semi-supervised deep learning in medical image analysis across different application scenarios, including classification, segmentation, detection and image registration. We also discuss the main technical challenges and propose possible solutions for future research efforts.

1 Introduction


In current clinical practice, the accuracy of detection and diagnosis of cancer and/or many other diseases depends on the expertise of individual clinicians (e.g., radiologists, pathologists) (Kruger et al., 1972), which results in large variability among readers who read and interpret medical images. To address and overcome this clinical challenge, many computer-aided detection and diagnosis (CAD) solutions have been developed and tested, aiming to help clinicians read medical images more efficiently and make diagnostic decisions in a more accurate and objective manner. The scientific rationale for this approach is that the use of computer-assisted quantitative image feature analysis can help overcome many negative factors in clinical practice, including wide variation in clinician expertise, potential fatigue of human experts, and a lack of adequate medical resources.

Although early CAD schemes were developed in the 1970s (Meyers et al., 1964; Kruger et al., 1972; Sezaki and Ukena, 1973), progress on CAD schemes has accelerated since the mid-1990s (Doi et al., 1999) due to the development and integration of more advanced machine learning methods or models into CAD programs. For traditional CAD solutions, a common development process includes three steps: object segmentation, feature calculation, and disease classification. For example, Shi et al. (2008) developed a CAD scheme for mass classification in digital mammograms. The ROIs containing the target mass are first segmented from the background using an improved active contour algorithm (Sahiner et al., 2001). Then a large number of image features are computed to quantify the size, shape, edge geometry, texture and other characteristics of the lesions, converting the original pixel data into a representative feature vector. Finally, a classification model based on LDA (linear discriminant analysis) is applied to the feature vector to identify malignant tumors.

In contrast, for deep learning-based models, the hidden patterns within ROIs are gradually identified and learned through the hierarchical architecture of deep neural networks (LeCun et al., 2015). In this process, important attributes of the input image are gradually identified and amplified for certain tasks (such as classification or detection), while irrelevant features are attenuated and filtered out. For example, an MRI image depicting a suspicious liver lesion is represented as an array of pixels (Hamm et al., 2019), with each entry used as an input feature to the deep learning model. The first few layers of the model can initially capture basic lesion information, such as the shape, location and orientation of the tumor. The next batch of layers can identify and retain features relevant to the lesion's malignancy (e.g., shape, edge irregularities) while ignoring irrelevant variations (e.g., location). Relevant features are further processed and combined in a more abstract manner by subsequent higher layers. As the number of layers increases, higher-level feature representations can be achieved. Throughout the process, important features hidden in the original image are identified in a self-learning manner by general neural network models, eliminating the need for manually engineered features.

Due to these significant advantages, deep learning-related methods have become mainstream technologies in the field of CAD and are widely used in various tasks, such as disease classification (Li et al., 2020a; Shorfuzzaman and Hossain, 2021; Zhang et al., 2020a; Frid-Adar et al., 2018a; Kumar et al., 2016, 2017), ROI segmentation (Alom et al., 2018; Yu et al., 2019; Fan et al., 2020), medical object detection (Rijthoven et al., 2018; Mei et al., 2021; Nair et al., 2020; Zheng et al., 2015) and image registration (Simonovsky et al., 2016; Sokooti et al., 2017; Balakrishnan et al., 2018). Among the various deep learning techniques, supervised learning was the first to be used in medical image analysis. Although it has been successfully used in many applications (Esteva et al., 2017; Long et al., 2017), further deployment of supervised models is in many cases hindered mainly by the limited size of most medical datasets. Compared to conventional datasets in computer vision, medical image datasets typically contain relatively few images (e.g., fewer than 10,000), and in many cases only a small portion of the images are annotated by experts. To overcome this limitation, unsupervised and semi-supervised learning methods have received widespread attention in the past three years; these methods are able to (1) generate more labeled images for model optimization, (2) learn meaningful hidden patterns from unlabeled image data, and (3) generate pseudo labels for unlabeled data.

There have been many excellent review articles summarizing the application of deep learning in medical image analysis. Litjens et al. (2017) and Shen et al. (2017) reviewed relatively early deep learning techniques, which were mainly based on supervised methods. More recently, Yi et al. (2019) and Kazeminia et al. (2020) reviewed the application of generative adversarial networks in different medical imaging tasks. Cheplygina et al. (2019) investigated how semi-supervised learning and multi-instance learning can be used in diagnosis or segmentation tasks. Tajbakhsh et al. (2020) investigated various methods to deal with dataset limitations (e.g., scarce or weak annotations), particularly in medical image segmentation. In contrast, a major goal of this article is to illustrate how the field of medical image analysis, which is often bottlenecked by limited annotated data, can benefit from recent trends in deep learning. The survey has two notable characteristics: it is comprehensive and technically oriented. "Comprehensive" is reflected in three aspects. First, we highlight the application of various promising methods belonging to the "unsupervised" category, including self-supervised, unsupervised, and semi-supervised learning, while not losing sight of the continued importance of supervised methods. Second, we introduce the application of the above learning methods to four classic medical image analysis tasks (classification, segmentation, detection and registration), rather than covering only specific tasks. In particular, we discuss deep learning-based object detection in detail, which is rarely mentioned in recent review papers (2019 onwards). We focus on applications involving chest X-ray, mammography, CT and MRI images; these imaging types share many common characteristics and are interpreted by doctors in the same department (radiology). We also mention some general methods applied in other imaging fields such as histopathology that are potentially applicable to radiology images. Third, state-of-the-art architectures/models for these tasks are explained. For example, we summarize how transformers from natural language processing are adapted for medical image segmentation, which has not been covered in existing review papers. On the "technically oriented" side, we review recent advances in unsupervised methods in detail. In particular, self-supervised learning is rapidly emerging as a promising direction but has yet to be systematically reviewed in the context of medical vision. A broad audience may benefit from this survey, including researchers with expertise in deep learning, artificial intelligence, and big data, as well as clinicians and medical researchers.

This survey is organized as follows (Figure 1): Section 2 provides an in-depth overview of recent advances in deep learning, focusing on unsupervised and semi-supervised methods. Furthermore, three important performance-enhancing strategies are discussed, including attention mechanisms, domain knowledge, and uncertainty estimation. Section 3 summarizes the main contributions of deep learning techniques in four main tasks: classification, segmentation, detection and registration. Section 4 discusses the challenges of further improving the models and offers perspectives on future research directions for the large-scale application of deep learning-based medical image analysis models.

2 Overview of deep learning methods


Depending on whether labels are available for the training dataset, deep learning can be roughly divided into supervised learning, unsupervised learning and semi-supervised learning. In supervised learning, all training images are labeled and image-label pairs are used to optimize the model. For each test image, the optimized model generates a likelihood score to predict its class label (LeCun et al., 2015). With unsupervised learning, the model analyzes and learns underlying patterns or hidden data structures without labels. If only a small portion of the training data is labeled, the model learns input-output relationships from the labeled data and is strengthened by learning semantic and fine-grained features from the unlabeled data; this type of learning is defined as semi-supervised learning (van Engelen and Hoos, 2020). In this section, we first briefly introduce supervised learning, and then mainly review recent advances in unsupervised learning and semi-supervised learning, which can conveniently handle medical imaging tasks with limited annotated data. We introduce the popular frameworks of these two learning paradigms accordingly. Finally, we summarize three strategies that can be combined with different learning paradigms to improve the performance of medical image analysis: attention mechanisms, domain knowledge, and uncertainty estimation.

2.1 Supervised learning


The convolutional neural network (CNN) is a deep learning architecture widely used in medical image analysis (Anwar et al., 2018). A CNN mainly consists of convolutional layers and pooling layers. Figure 2 shows a simple CNN in a medical image classification task: the CNN directly takes images as input, performs transformations through convolutional layers, pooling layers, and fully connected layers, and finally outputs category-based image likelihoods.




In each convolutional layer $l$, a set of kernels $W = \{W_1, \dots, W_k\}$ is used to extract features from the input, and biases $b = \{b_1, \dots, b_k\}$ are added to generate a new feature map $W_i^l x_i^l + b_i^l$. A nonlinear transformation, the activation function $\sigma(\cdot)$, is then applied to obtain $x_k^{l+1} = \sigma(W_i^l x_i^l + b_i^l)$ as the input to the next layer. After the convolutional layer, a pooling layer is added to reduce the dimensionality of the feature map and thereby the number of parameters; average pooling and max pooling are the two common pooling operations. The process is repeated for the remaining layers. At the end of the network, a fully connected layer is usually used, and a sigmoid or softmax function generates the probability distribution over classes. The predicted probability distribution gives the label $\hat{y}$, and a loss function $L(\hat{y}, y)$ is computed, where $y$ is the actual label. Network parameters are iteratively optimized by minimizing the loss function.
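To make this pipeline concrete, here is a minimal sketch of such a CNN classifier in PyTorch. The layer sizes, the two-class output, and the single-channel 64x64 input are illustrative assumptions, not taken from any paper discussed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Minimal CNN: two conv + pool stages followed by a fully connected classifier."""
    def __init__(self, in_channels: int = 1, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # kernels W and biases b
            nn.ReLU(),                                             # nonlinearity sigma(.)
            nn.MaxPool2d(2),                                       # max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)     # assumes 64x64 inputs

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)                                  # class logits

# One optimization step minimizing L(y_hat, y); cross_entropy applies softmax internally.
model = SimpleCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
images, labels = torch.randn(4, 1, 64, 64), torch.randint(0, 2, (4,))
loss = F.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```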

2.2 Unsupervised learning

2.2.1 Autoencoders


Autoencoders are widely used for dimensionality reduction and feature learning (Hinton and Salakhutdinov, 2006). The simplest autoencoder, originally called the auto-associator (Bourlard and Kamp, 1988), is a neural network with only one hidden layer that learns a latent feature representation of the input data by minimizing the reconstruction loss between the input and its reconstruction from the latent representation. The shallow structure of a simple autoencoder limits its representation ability, whereas a deep autoencoder with more hidden layers can improve it. By stacking multiple autoencoders and optimizing them in a greedy layer-wise manner, deep autoencoders or stacked autoencoders (SAEs) can learn more complex nonlinear patterns than shallow autoencoders and thereby generalize better beyond the training data (Bengio et al., 2007). An SAE consists of an encoder network and a decoder network, which are usually symmetrical. To further force the model to learn useful latent representations with desirable properties, regularization terms such as the sparsity constraint in sparse autoencoders (Ranzato et al., 2007) can be added to the original reconstruction loss. Other regularized autoencoders include denoising autoencoders (Vincent et al., 2010) and contractive autoencoders (Rifai et al., 2011), both of which are designed to be insensitive to input perturbations.


An autoencoder is a model whose target output equals its input. The simplest autoencoder, composed of a fully connected neural network, has only a three-layer structure; the hidden layer in the middle is the part we care about. Taking the hidden layer as the boundary, the left side is the encoder and the right side is the decoder, so during training the input can be encoded and then decoded back to (approximately) its original form. The error between the original data and the output data is called the reconstruction error.


If we train an autoencoder on a set of data and then discard its decoder, we can use the remaining encoder to represent our data. Since the number of neurons in the hidden layer is far lower than in the input layer, we are using fewer features (neurons) to characterize the input data, thereby achieving dimensionality reduction.

  • Sparse autoencoder: an ordinary autoencoder whose hidden layer is given an additional $L_1$ regularization term, i.e., a training penalty, so that the features represented by the trained encoder are sparser and we obtain a small number of useful feature items.
  • Denoising autoencoder: the input is replaced by a noise-corrupted dataset while the output target remains the original data, so the autoencoder learns a denoising function (a minimal sketch of these ideas follows below).
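A minimal sketch of the encode-then-decode idea in PyTorch. The 784-dimensional input (e.g., a flattened 28x28 image), the 32-dimensional hidden layer, and the L1 weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAutoencoder(nn.Module):
    """One hidden layer: the encoder compresses the input, the decoder reconstructs it."""
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)              # low-dimensional code (dimensionality reduction)
        return self.decoder(z), z

model = SimpleAutoencoder()
x = torch.rand(16, 784)                  # a fake batch of flattened images
x_hat, z = model(x)
recon_loss = F.mse_loss(x_hat, x)        # reconstruction error to be minimized
# For a sparse autoencoder, add an L1 penalty on the latent code:
sparse_loss = recon_loss + 1e-3 * z.abs().mean()
```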

Unlike the classic autoencoders mentioned above, the variational autoencoder (VAE) (Kingma, 2014) works probabilistically to learn a mapping between the observation space $x \in \mathbb{R}^m$ and the latent space $z \in \mathbb{R}^n$ ($m \gg n$). As a latent variable model, the VAE formulates the problem as maximizing the log-likelihood of the observed samples, $\log p(x) = \log \int p(x|z)p(z)\,dz$, where $p(x|z)$ can be easily modeled using a neural network and $p(z)$ is a prior distribution (such as a Gaussian) over the latent space. However, the integral is intractable because it is impossible to sample the entire latent space; consequently, by Bayes' rule, the posterior distribution $p(z|x)$ is also intractable. To solve this problem, the authors of the VAE proposed that, in addition to using a decoder to model $p(x|z)$, an encoder learns an approximation $q(z|x)$. Finally, a tractable lower bound on $\log p(x)$ can be derived, known as the evidence lower bound (ELBO).

$\log p(x) \geq \mathrm{ELBO} = E_{q(z|x)}[\log p(x|z)] - KL[q(z|x)\,\|\,p(z)]$, where $KL$ denotes the Kullback-Leibler divergence. The first term can be understood as the reconstruction loss, measuring the similarity between the input image and the corresponding image reconstructed from the latent representation. The second term computes the divergence between the approximate posterior and the Gaussian prior (Tables 1-4).
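A minimal sketch of how the two ELBO terms are commonly computed for a Gaussian encoder $q(z|x)$ and a Bernoulli decoder. The dimensions, the single-linear-layer networks, and the Bernoulli likelihood are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)   # mean and log-variance of q(z|x)
        self.dec = nn.Linear(latent_dim, in_dim)       # models p(x|z)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    # First term: reconstruction loss E_q[log p(x|z)] (Bernoulli likelihood here).
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Second term: KL[q(z|x) || p(z)] with a standard Gaussian prior, in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl        # minimizing this maximizes the ELBO

vae = TinyVAE()
x = torch.rand(8, 784)
x_logits, mu, logvar = vae(x)
loss = negative_elbo(x, x_logits, mu, logvar)
```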

The VAE was later extended in different ways to learn more complex representations. Although its probabilistic working mechanism allows the decoder to generate new data, the VAE has no explicit control over the data generation process. Sohn et al. (2015) proposed the conditional VAE (CVAE), in which the probability distributions learned by the encoder and decoder are both conditioned on external information (e.g., the image class). This allows the VAE to generate structured output representations. Another line of work explores imposing more complex priors on the latent space. For example, Dilokthanakul et al. (2016) proposed the Gaussian mixture VAE (GMVAE), which uses a mixture-of-Gaussians prior to achieve higher modeling capacity in the latent space. We refer readers to a recent paper (Kingma and Welling, 2019) for more details on the VAE and its extensions.

2.2.2 Generative Adversarial Networks (GANs)


Generative adversarial networks (GANs) are a type of deep network used for generative modeling, first proposed by Goodfellow et al. (2014). In this architecture, a framework for estimating generative models is designed to draw samples directly from the desired underlying data distribution without explicitly defining the probability distribution. It consists of two models: a generator $G$ and a discriminator $D$. The generative model $G$ takes as input a random noise vector $z$ sampled from a prior distribution $P_z(z)$, usually a Gaussian or uniform distribution, and maps $z$ to the data space as $G(z, \theta_g)$, where $G$ is a neural network with parameters $\theta_g$. The fake samples, denoted $G(z)$ or $x_g$, resemble the real samples drawn from the training data distribution $P_r(x)$; both types of samples are fed to $D$. The discriminator is a second neural network, parameterized by $\theta_d$, which outputs the probability $D(x, \theta_d)$ that a sample comes from the training data rather than from $G$. The training process resembles a two-player minimax game: the discriminative network $D$ is optimized to maximize the log-likelihood of assigning the correct labels to fake and real samples, while the generative model $G$ is trained to maximize the log-likelihood of $D$ being wrong. Through this adversarial process, $G$ is expected to gradually estimate the underlying data distribution and generate realistic samples.
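A minimal sketch of one round of this two-player game, alternating a discriminator step and a generator step. The MLP architectures, data dimensions, and the use of the non-saturating generator loss common in practice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))          # generator G(z; theta_g)
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))   # discriminator D(x; theta_d)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.rand(32, 784)       # samples from the real distribution P_r(x)
z = torch.randn(32, 64)            # noise sampled from the prior P_z(z)
x_fake = G(z)

# Discriminator step: assign high probability to real samples, low to fake ones.
d_loss = F.binary_cross_entropy_with_logits(D(x_real), torch.ones(32, 1)) + \
         F.binary_cross_entropy_with_logits(D(x_fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to fool D into labeling fake samples as real.
g_loss = F.binary_cross_entropy_with_logits(D(x_fake), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```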

On this basis, the performance of GANs has been improved along two directions: (1) different loss (objective) functions and (2) conditioning. For the first direction, the Wasserstein GAN (WGAN) is a typical example. In WGAN, the Earth-Mover (EM) distance, or Wasserstein-1 distance (commonly known as the Wasserstein distance), is proposed to replace the Jensen-Shannon (JS) divergence of the original vanilla GAN and to measure the distance between the real data distribution and the synthetic data distribution (Arjovsky et al., 2017). The advantage for WGAN's critic is that it provides useful gradient information in regions where the JS divergence saturates and causes vanishing gradients. WGAN also improves the stability of learning and alleviates problems such as mode collapse.

Unconditional generative models cannot explicitly control the mode of the data being synthesized. To guide the data generation process, a conditional GAN (cGAN) is built by conditioning its generator and discriminator with additional information (e.g., class labels) (Mirza and Osindero, 2014). Specifically, the noise vector z and the class label c are jointly provided to G, while the real/fake data and the class label c together serve as the input of D. The conditioning information can also be an image or other attributes and is not limited to class labels. Furthermore, the Auxiliary Classifier GAN (ACGAN) proposes another strategy that uses label conditioning to improve image synthesis (Odena et al., 2017). Unlike the discriminator of cGAN, D in ACGAN is no longer provided with class-conditional information. In addition to separating real and fake images, D is also responsible for reconstructing the class labels. By forcing D to perform this additional classification task, ACGAN can readily generate high-quality images.

2.2.3 Self-supervised learning


Over the past few years, unsupervised representation learning has achieved great success in natural language processing (NLP), where large amounts of unlabeled data are available to pretrain models (e.g., BERT; Kenton and Toutanova, 2019) and learn useful feature representations. The feature representations are then fine-tuned in downstream tasks such as question answering, natural language reasoning, and text summarization. In computer vision, researchers have explored a similar pipeline: first training models in an unsupervised manner to learn rich and meaningful feature representations from raw unlabeled image data, and then fine-tuning these feature representations on labeled data for various downstream tasks such as classification, object detection, and instance segmentation. However, for quite some time this approach was not as successful as in NLP, and supervised pre-training instead remained the main strategy. Interestingly, this situation has been reversing over the past two years, with more and more studies showing that self-supervised pre-training is more effective than supervised pre-training.

In the recent literature, the term "self-supervised learning" is often used interchangeably with "unsupervised learning"; more precisely, self-supervised learning refers to a form of deep unsupervised learning in which the input-label pairs are created from the unlabeled data itself, without external supervision. An important motivation behind this technique is to avoid the cost of supervised tasks, which are often expensive and time-consuming because new labeled datasets must be built or high-quality annotations obtained in certain fields (such as medicine). Although labeled data are scarce and costly, there is often a large amount of cheap unlabeled data that remains untapped in many fields. Unlabeled data may contain valuable information that is either weak or absent in the labeled data. Self-supervised learning can harness the power of unlabeled data to improve the performance and efficiency of supervised tasks. Since self-supervised learning draws on a wider range of data than supervised learning, features learned in a self-supervised manner may generalize better in the real world. Self-supervision can be performed in two ways: pretext task-based methods and contrastive learning-based methods. Since methods based on contrastive learning have received widespread attention in recent years, we focus on more work in this direction.

A pretext (pre-training) task is designed to learn representative features for downstream tasks, but solving the pretext task itself is not of real interest (He et al., 2020). A pretext task learns representations by hiding specific information (e.g., channels, patches, etc.) of each input image and then predicting the missing information from the rest of the image. Examples include image inpainting (Pathak et al., 2016), colorization (Zhang et al., 2016), relative patch position prediction (Doersch et al., 2015), jigsaw puzzle solving (Noroozi and Favaro, 2016), rotation prediction (Gidaris et al., 2018), etc. However, the generalizability of the learned representations relies heavily on the quality of the hand-crafted pretext task (Chen et al., 2020a).

Contrastive learning relies on the so-called contrastive loss, which goes back at least as far as (Hadsell et al., 2006; Chopra et al., 2005). Many variations of this contrastive loss were later used (Oord et al., 2018; Chen et al., 2020a; Chaitanya et al., 2020). Essentially, both the original loss and its subsequent versions enforce a similarity measure that is maximized for positive (similar) pairs and minimized for negative (dissimilar) pairs so that the model can learn discriminative features. Below we will introduce two representative frameworks for contrastive learning, namely Momentum Contrast (MoCo) (He et al., 2020) and SimCLR (Chen et al., 2020a).

MoCo formulates contrastive learning as a dictionary look-up problem in which an encoded query should be similar to its matching key. As shown in Figure 3(a), given an image x, an encoder encodes the image to produce a feature vector, which is used as the query (q). Likewise, using another encoder, a dictionary can be built from a large number of image samples {x0, x1, x2, ...} via their features {k0, k1, k2, ...} (also called keys). In MoCo, an encoded query q and a key are considered similar if they come from different crops of the same image. Assuming there is a dictionary key (k+) that matches q, these two items are treated as a positive pair and the other keys in the dictionary are treated as negative pairs. The authors use InfoNCE (Oord et al., 2018) as the loss function for the positive pair, as follows:

$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$, where $\tau$ is a temperature hyper-parameter and the sum runs over one positive key and $K$ negative keys.

Built from a sampled subset of all images, a large dictionary is important for good accuracy. To make the dictionary larger, the authors maintain feature representations of previous image batches as a queue: new keys are added to the queue and old keys are dequeued. Therefore, the dictionary consists of encoded representations from the current batch and previous batches. However, this can cause the key encoder to update quickly, making the dictionary keys inconsistent, that is, they compare inconsistently with the encoded query. Therefore, the authors recommend using momentum updates on the key encoder to avoid rapid changes. This key encoder is called a momentum encoder.
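A minimal sketch of the two MoCo ingredients described above: the InfoNCE loss computed against one positive key plus a queue of negative keys, and the momentum update of the key encoder. The encoders (linear layers standing in for CNNs), feature dimension, queue size, temperature, and momentum value are illustrative assumptions, loosely in the style of the paper's pseudocode; the enqueue/dequeue bookkeeping is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, m, tau = 128, 0.999, 0.07
encoder_q = nn.Linear(784, dim)                      # query encoder (stand-in for a CNN)
encoder_k = nn.Linear(784, dim)                      # momentum (key) encoder
encoder_k.load_state_dict(encoder_q.state_dict())
queue = F.normalize(torch.randn(4096, dim), dim=1)   # dictionary of negative keys

x_q, x_k = torch.rand(32, 784), torch.rand(32, 784)  # two crops of the same images
q = F.normalize(encoder_q(x_q), dim=1)
with torch.no_grad():
    k = F.normalize(encoder_k(x_k), dim=1)           # positive keys, no gradient

l_pos = (q * k).sum(dim=1, keepdim=True)             # similarity with the matching key k+
l_neg = q @ queue.t()                                # similarities with negative keys
logits = torch.cat([l_pos, l_neg], dim=1) / tau
labels = torch.zeros(32, dtype=torch.long)           # the positive key sits at index 0
loss = F.cross_entropy(logits, labels)               # InfoNCE

# Momentum update: the key encoder changes slowly, keeping the dictionary consistent.
with torch.no_grad():
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)
```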

SimCLR is another popular contrastive learning framework. In this framework, two augmented images are considered a positive pair if they come from the same example; otherwise they form a negative pair. The agreement between the feature representations of positive pairs is maximized. As shown in Figure 3(b), SimCLR consists of four parts: (1) random image augmentation; (2) an encoder network (f(.)) that extracts feature representations from the augmented images; (3) a small neural network (a multilayer perceptron (MLP) projection head, g(.)) that maps the feature representations into a low-dimensional space; and (4) contrastive loss computation. The third component distinguishes SimCLR from its predecessors. Previous frameworks, such as MoCo, compute the contrastive loss directly on the feature representations instead of first mapping them into a low-dimensional space. This component further demonstrated its importance in obtaining satisfactory results, as shown with MoCo v2 (Chen et al., 2020b).

It is worth noting that since self-supervised contrastive learning is very new, the widespread application of recent advances such as MoCo and SimCLR in the field of medical image analysis has not yet been established at the time of writing this article. Nonetheless, given the promising results of self-supervised learning reported in the existing literature, we anticipate that research applying this new technique to analyze medical images may soon explode. Furthermore, self-supervised pre-training has great potential to become a powerful alternative to supervised pre-training.

2.3 Semi-supervised learning


Unlike unsupervised learning, which operates only on unlabeled data to learn meaningful representations, semi-supervised learning (SSL) combines labeled and unlabeled data during model training. In particular, SSL is suitable for scenarios where limited labeled data and large-scale unlabeled data are available. The two types of data should be related so that the additional information carried by the unlabeled data can compensate for the limited labeled data. It is reasonable to expect that unlabeled data will, on average, lead to performance improvements over using the limited labeled data alone. In fact, this goal has been explored for decades, and the 1990s saw the rise of SSL methods in text classification. The semi-supervised learning book (Chapelle et al., 2009) is a good source for readers to grasp the connection between SSL and classic machine learning algorithms. Interestingly, despite its potential positive value, the authors present empirical findings that unlabeled data sometimes worsens performance. However, this picture seems to have changed in the recent deep learning literature: an increasing number of works (mainly from the field of computer vision) report that deep semi-supervised methods often perform better than strong supervised baselines (Ouali et al., 2020). Even when the amounts of labeled and unlabeled data are varied, consistent performance improvements are still observed. At the same time, deep semi-supervised learning has been successfully applied in the field of medical image analysis, reducing annotation costs and achieving better performance. We classify popular SSL methods into three categories: (1) methods based on consistency regularization; (2) methods based on pseudo-labeling; and (3) methods based on generative models.

The methods in the first category share the same idea: the predictions for unlabeled examples should not change significantly when some perturbation is applied (e.g., adding noise, data augmentation). The loss function of such an SSL model generally consists of two parts. More specifically, given an unlabeled example $x$ and its perturbed version $\tilde{x}$, the SSL model outputs logits $f_\theta(x)$ and $f_\theta(\tilde{x})$. On unlabeled data, the goal is to give consistent predictions by minimizing the mean squared error $d(f_\theta(x), f_\theta(\tilde{x}))$, which leads to the consistency (unsupervised) loss $L_u$ on unlabeled data. On the labeled data, the cross-entropy supervised loss $L_s$ is computed. Examples of SSL models regularized by consistency constraints include Ladder Networks (Rasmus et al., 2015), the Π-Model (Laine and Aila, 2017), and Temporal Ensembling (Laine and Aila, 2017). A recent example is the Mean Teacher paradigm (Tarvainen and Valpola, 2017), which consists of a teacher model and a student model (Figure 4). The student model is optimized by minimizing $L_u$ on unlabeled data and $L_s$ on labeled data; the teacher model is the exponential moving average (EMA) of the student model and is used to guide the student model toward consistent training. Recently, several works such as Unsupervised Data Augmentation (UDA) (Xie et al., 2020) and MixMatch (Berthelot et al., 2019) have raised the performance of SSL to a new level.
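A minimal sketch of the consistency-plus-supervised objective and the EMA teacher update in the Mean Teacher style. The perturbation (additive Gaussian noise), the tiny MLP, the EMA rate, and the consistency weight are illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F
from torch import nn

student = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
teacher = copy.deepcopy(student)                 # exponential moving average of the student
alpha, lambda_u = 0.99, 1.0

x_l, y_l = torch.rand(16, 784), torch.randint(0, 10, (16,))   # labeled batch
x_u = torch.rand(64, 784)                                     # unlabeled batch
x_u_pert = x_u + 0.1 * torch.randn_like(x_u)                  # perturbed version of x_u

Ls = F.cross_entropy(student(x_l), y_l)                       # supervised loss on labeled data
with torch.no_grad():
    target = teacher(x_u)                                     # teacher prediction guides the student
Lu = F.mse_loss(student(x_u_pert), target)                    # consistency loss on unlabeled data
loss = Ls + lambda_u * Lu
loss.backward()

with torch.no_grad():                                         # EMA update of the teacher weights
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1 - alpha)
```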

In pseudo-labeling (Lee, 2013), the SSL model itself generates pseudo-annotations for unlabeled examples; the pseudo-labeled examples are then used together with the labeled examples to train the SSL model. The process is repeated over multiple iterations, with both the quality of the pseudo-labels and the performance of the model improving. The naive pseudo-labeling process can be combined with Mixup augmentation (Zhang et al., 2018a) to further improve the performance of SSL models (Arazo et al., 2020). Pseudo annotations can also be exploited in multi-view co-training (Qiao et al., 2018). For the labeled examples of each view, co-training learns a separate classifier and then uses that classifier to generate pseudo-labels for the unlabeled data; co-training maximizes the consistency of the pseudo-label distributions across the different views of the unlabeled examples.
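A minimal sketch of naive pseudo-labeling with a confidence threshold; in practice the loop is repeated as the model improves. The model, the 0.95 threshold, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
x_l, y_l = torch.rand(16, 784), torch.randint(0, 10, (16,))   # labeled examples
x_u = torch.rand(64, 784)                                     # unlabeled examples

with torch.no_grad():
    probs = model(x_u).softmax(dim=1)
    conf, pseudo_y = probs.max(dim=1)
    keep = conf > 0.95                        # keep only confident pseudo-labels

loss = F.cross_entropy(model(x_l), y_l)       # supervised loss on labeled data
if keep.any():                                # add pseudo-labeled data to the objective
    loss = loss + 0.5 * F.cross_entropy(model(x_u[keep]), pseudo_y[keep])
loss.backward()   # repeated over several rounds, pseudo-label quality gradually improves
```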

For the third category of methods, semi-supervised generative models such as GANs and VAEs focus more on solving the target task (such as classification) than on merely generating high-fidelity samples. Here we briefly explain the mechanism of the semi-supervised GAN. A simple way to adapt GANs to the semi-supervised setting is to modify the discriminator to perform an additional task. For example, in the image classification task, Salimans et al. (2016) and Odena (2016) changed the discriminator of DCGAN by forcing it to also act as a classifier. For unlabeled images, the discriminator functions like that of a normal GAN, providing the probability that the input image is real; for labeled images, the discriminator predicts their category in addition to the real/fake probability. However, Li et al. (2017) demonstrated that a single discriminator may not achieve optimal performance on both tasks simultaneously. They therefore introduced an additional classifier that is independent of the generator and discriminator. This new three-component architecture is called Triple-GAN.
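A minimal sketch of the modified-discriminator idea: one network outputs K class logits plus one "fake" logit, so labeled real images contribute a supervised classification loss while unlabeled and generated images contribute a real-vs-fake loss. This is a simplified illustration rather than the exact formulation of the cited papers; the architecture, the K+1 head, and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

K = 4                                                     # number of real image classes
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, K + 1))  # last logit = "fake"

x_lab, y_lab = torch.rand(8, 784), torch.randint(0, K, (8,))   # labeled real images
x_unl = torch.rand(32, 784)                                    # unlabeled real images
x_fake = torch.rand(32, 784)                                   # stand-in for generator output

def p_real(logits):
    # Probability that a sample is real = total probability mass on the K real classes.
    return 1.0 - logits.softmax(dim=1)[:, K]

loss_sup = F.cross_entropy(D(x_lab)[:, :K], y_lab)             # classify labeled real data
loss_unsup = F.binary_cross_entropy(p_real(D(x_unl)), torch.ones(32)) + \
             F.binary_cross_entropy(p_real(D(x_fake)), torch.zeros(32))
d_loss = loss_sup + loss_unsup                                  # discriminator/classifier objective
```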

2.4 Strategies to improve performance

2.4.1 Attention mechanism


Attention arises from primate visual processing mechanisms, which select subsets of relevant sensory information rather than using all available information for complex scene analysis (Itti et al., 1998). Inspired by this idea of focusing on specific parts of the input, deep learning researchers have integrated attention into advanced models in different domains. Attention-based models have achieved great success in fields related to natural language processing (NLP), such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017) and image captioning (Xu et al., 2015; You et al., 2016; Anderson et al., 2018). A prominent example is the Transformer architecture, which relies entirely on self-attention to capture global dependencies between inputs and outputs without the need for sequential computation (Vaswani et al., 2017). Attention mechanisms are also popular in computer vision tasks, such as natural image classification (Wang et al., 2017; Woo et al., 2018; Jetley et al., 2018) and segmentation (Chen et al., 2016; Ren and Zemel, 2017). When processing images, an attention module can adaptively learn "what" and "where" to attend, so that model predictions are conditioned on the most relevant image regions and features. Depending on how attended locations in the image are selected, attention mechanisms can be roughly divided into two categories: soft attention and hard attention. The former deterministically learns a weighted average of features over all locations, while the latter stochastically samples a subset of feature locations to attend to (Cho et al., 2015). Since hard attention is non-differentiable, soft attention, despite being computationally more expensive, has received more research effort. Based on this differentiable mechanism, different types of attention have been further developed, such as (1) spatial attention (Jaderberg et al., 2015), (2) channel attention (Hu et al., 2018a), (3) combined spatial and channel attention (Wang et al., 2017; Woo et al., 2018), and (4) self-attention (Wang et al., 2018). Readers are referred to the excellent review by Chaudhari et al. (2021) for more details on attention mechanisms.
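A minimal sketch of soft channel attention in the spirit of squeeze-and-excitation (Hu et al., 2018a): global average pooling produces a per-channel descriptor, a small bottleneck MLP produces per-channel weights, and the feature map is re-weighted. The reduction ratio and tensor shapes are illustrative assumptions.

```python
import torch
from torch import nn

class ChannelAttention(nn.Module):
    """Soft attention over channels: learn 'what' to emphasize in a feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),   # weights in (0, 1)
        )

    def forward(self, x):                       # x: (batch, channels, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze: global average pooling per channel
        return x * w[:, :, None, None]          # excite: re-weight each channel

feat = torch.randn(2, 32, 28, 28)               # a feature map from some CNN layer
attended = ChannelAttention(32)(feat)           # same shape, channels re-weighted
```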

2.4.2 Domain knowledge


Most mature deep learning models were originally designed to analyze natural images and may produce only suboptimal results when directly applied to medical imaging tasks (Zhang et al., 2020a). This is because natural images and medical images differ greatly in nature. First, medical images often exhibit high inter-class similarity, so a major challenge lies in extracting fine-grained visual features to capture the subtle differences that are important for making correct predictions. Second, typical medical image datasets are much smaller than benchmark natural image datasets, which contain tens of thousands to millions of images. This hinders the direct application of highly complex computer vision models in the medical field. Therefore, how to customize medical image analysis models remains an important issue. One possible solution is to integrate appropriate domain knowledge or task-specific attributes, which has been shown to facilitate the learning of useful feature representations and to reduce model complexity in medical imaging settings. In this review article, we mention various kinds of domain knowledge, such as anatomical information in MRI and CT images (Zhou et al., 2021, 2019a), three-dimensional spatial context in volumetric images (Zhang et al., 2017; Zhuang et al., 2019; Zhu et al., 2020a), multiple instances of data from the same patient (Azizi et al., 2021), patient metadata (Vu et al., 2021), radiological features (Shorfuzzaman and Hossain, 2021), text reports accompanying images (Zhang et al., 2020a), etc. Readers interested in how to integrate medical domain knowledge into network design can refer to the work of Xie et al. (2021a).

2.4.3 Uncertainty estimation


Reliability is a key issue in clinical settings with high safety requirements, such as cancer diagnosis. Model predictions are susceptible to factors such as data noise and inference errors, so uncertainty needs to be quantified before the results can be trusted (Abdar et al., 2021). Commonly used techniques for uncertainty estimation include Bayesian approximation (Gal and Ghahramani, 2016) and model ensembles (Lakshminarayanan et al., 2017). Bayesian methods such as Monte Carlo dropout (MC-dropout) (Gal and Ghahramani, 2016) revolve around approximating the posterior distribution of the neural network parameters. Ensemble techniques combine multiple models to measure uncertainty. Readers interested in uncertainty estimation are referred to the comprehensive review by Abdar et al. (2021).
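A minimal sketch of MC-dropout at test time: dropout is kept active during inference, several stochastic forward passes are averaged, and their spread serves as an uncertainty estimate. The network, dropout rate, and number of passes are illustrative assumptions.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(128, 2))
model.train()          # keep dropout active at inference time (the key trick of MC-dropout)

x = torch.rand(1, 784)                       # one test image (flattened)
with torch.no_grad():
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(30)])  # 30 stochastic passes

mean_pred = probs.mean(dim=0)                # final prediction
uncertainty = probs.std(dim=0)               # large spread = low confidence in this prediction
```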

3 Deep Learning Applications

3.1 Classification


Medical image classification is the central goal of computer-aided diagnosis (CADx), which aims to distinguish benign from malignant lesions or to identify certain diseases from input images (Shen et al., 2017; van Ginneken et al., 2011). Deep learning-based CADx solutions have achieved great success over the past decade. However, deep neural networks often rely on abundant annotated images to guarantee good performance, a requirement that many medical image datasets cannot easily satisfy. To alleviate the lack of large annotated datasets, many techniques have been used, with transfer learning undoubtedly being the leading paradigm. In addition to transfer learning, several other learning paradigms, including unsupervised image synthesis, self-supervised and semi-supervised learning, have shown great potential for performance enhancement under limited annotated data. We introduce the application of these learning paradigms to medical image classification in the following subsections.

3.1.1 Supervised classification


Starting from AlexNet (Krizhevsky et al., 2012), various end-to-end models with increasingly deeper networks and larger representation capacity have been developed for image classification, such as VGG (Simonyan and Zisserman, 2015), GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2016) and DenseNet (Huang et al., 2017). These models achieved excellent results, making deep learning mainstream not only for developing high-performance CADx solutions but also in other subfields of medical image processing.

However, the performance of deep learning models largely depends on the size of the training dataset and the quality of the image annotations. In many medical image analysis tasks, especially in 3D scenarios, building sufficiently large, high-quality training datasets can be challenging due to difficulties in data acquisition and annotation (Tajbakhsh et al., 2016; Chen et al., 2019a). Supervised transfer learning techniques (Tajbakhsh et al., 2016; Donahue et al., 2014) are commonly used to address the problem of insufficient training data and improve model performance: a standard architecture such as ResNet (He et al., 2016) is first pre-trained in a source domain on a large number of natural images (e.g., ImageNet (Deng et al., 2009)) or medical images, and the pre-trained model is then transferred to the target domain and fine-tuned with fewer training samples. Tajbakhsh et al. (2016) showed that sufficiently fine-tuned pre-trained CNNs perform at least as well as CNNs trained from scratch. In fact, transfer learning has become a cornerstone of image classification tasks across various modalities (de Bruijne, 2016), including CT (Shin et al., 2016), MRI (Yuan et al., 2019), mammography (Huynh et al., 2016), X-ray (Minaee et al., 2020), etc.

What is transfer learning?

  • Transfer learning exploits similarities between data, tasks, and models to apply what has already been learned to a new task.
  • Since this process takes place between two domains, the domain holding the existing knowledge and data (i.e., what is being transferred) is called the source domain, and the domain that receives this "experience" is called the target domain.

Transfer learning is not a specific model; it is more like a problem-solving strategy. There are many reasons to use it:

  • For example, there is too little data in the target domain, and you need help from the source domain with more labeled data.
  • Sometimes it is to save training time, and sometimes to enable personalized applications (a fine-tuning sketch follows this list).
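A minimal fine-tuning sketch using a torchvision ResNet-50 pre-trained on ImageNet, assuming a recent torchvision release. The two-class head, the choice to freeze all but the last stage, and the learning rate are illustrative assumptions.

```python
import torch
from torch import nn
from torchvision import models

# Source domain: ResNet-50 weights pre-trained on ImageNet.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Target domain: replace the classification head for a hypothetical 2-class medical task.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

# Option A: fine-tune everything; Option B (shown): freeze early layers to save time/data.
for name, param in backbone.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4
)
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(backbone(images), labels)
loss.backward()
optimizer.step()
```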

In the supervised classification paradigm, different types of attention modules are used to improve performance and provide better model interpretability (Zhou et al., 2019b). Guan et al. (2018) introduced an attention-guided CNN based on ResNet-50 (He et al., 2016). Attention heatmaps computed from the global X-ray image are used to suppress large irrelevant regions and highlight local regions containing discriminative cues for thoracic disease. The model effectively integrates global and local information and achieves good classification performance. In another study, Schlemper et al. (2019) incorporated attention modules into variants of VGG (Baumgartner et al., 2017) and U-Net (Ronneberger et al., 2015) for 2D fetal ultrasound image plane classification and 3D CT pancreas segmentation, respectively. Each attention module is trained to focus on a subset of local structures in the input image that contain salient features useful for the target task.

3.1.2 Unsupervised methods


Unsupervised image synthesis: Classic data augmentation (e.g., rotation, scaling, flipping, translation, etc.) simply but effectively creates more training instances for better performance (Krizhevsky et al., 2012). However, it cannot bring much new information to the existing training examples. Given that GANs have the advantage of learning hidden data distributions and generating realistic images, they have been used as a more sophisticated data augmentation method in the medical field.

What is GAN (Generative Adversarial Network)?

  • GAN, the Generative Adversarial Network, involves three ingredients: generation, discrimination, and adversarial training.
    • The generator is responsible for generating content (text, images, audio, etc.) from random vectors, depending on what you want to generate.
    • The discriminator is responsible for judging whether the received content is real, usually outputting a probability that represents the authenticity of the content.
    • "Adversarial" refers to GAN's alternating training process. Taking images as an example, fake images produced by the generator are fed to the discriminator together with the original real images so that it learns to distinguish between the two, giving high scores to real images and low scores to fake ones. Once the discriminator judges the existing images well, the generator is trained against the discriminator's scores and keeps producing "better" fake images until it can fool the discriminator. This process is repeated until the discriminator's predicted probability for any image approaches 0.5, i.e., it can no longer tell whether an image is real or fake, at which point training stops.
  • The purpose of a GAN is to generate content realistic enough to pass for the real thing.

Frid-Adar et al. (2018b) used DCGAN to synthesize high-quality samples to improve liver lesion classification on a limited dataset. This dataset contains only 182 liver lesions, including cysts, metastases, and hemangiomas. Since training GANs usually requires a large number of examples, the authors applied classic data augmentation (e.g., rotation, flipping, translation, scaling) to create nearly 90,000 examples. GAN-based synthetic data augmentation significantly improves classification performance, with sensitivity and specificity increasing from 78.6% and 88.4% to 85.7% and 92.4%, respectively. In their later work (Frid-Adar et al., 2018a), the authors further extended lesion synthesis from an unconditional setting (DCGAN) to a conditional setting (ACGAN). The generator of ACGAN is conditioned on side information (the lesion class), and the discriminator predicts the lesion class while new samples are synthesized. However, they found that the classification performance of ACGAN-based synthetic augmentation was weaker than that of the unconditional augmentation.

To alleviate data scarcity, especially the lack of positive cancer cases, Wu et al. (2018a) employed a conditional architecture (cGAN) to generate realistic lesions for mammogram classification. Traditional data augmentation methods are also used to generate enough samples to train the GAN. The generator is conditioned on malignant/non-malignant labels that control which type of lesion is created. For each non-malignant patch image, the segmentation mask of another malignant lesion is used to synthesize a malignant lesion onto it; for each malignant image, its lesion is removed and a non-malignant patch is synthesized. Although the GAN-based augmentation method achieves better classification performance than traditional data augmentation, the improvement is relatively small, less than 1%.

Classification based on self-supervised learning: Recent self-supervised learning methods have shown great potential in medical tasks that lack sufficient annotations (Bai et al., 2019; Tao et al., 2020; Li et al., 2020a; Shorfuzzaman and Hossain, 2021; Zhang et al., 2020a). This approach is suitable for situations where a large number of medical images are available but only a small portion of them are labeled. Accordingly, model optimization is divided into two steps, namely self-supervised pre-training and supervised fine-tuning. The model is initially optimized using unlabeled images to effectively learn good features that represent the image semantics (Azizi et al., 2021). The self-supervised pre-trained model then undergoes supervised fine-tuning to achieve faster and better performance in the subsequent classification task (Chen et al., 2020c). In practice, self-supervision can be created through pretext tasks (Misra and Maaten, 2020) or contrastive learning (Jing and Tian, 2020), as follows.

self-supervised learning

  • Self-supervised learning is a training method that does not require manual annotation and uses the inherent structure and information of the data to generate supervision signals to learn useful representations or models.
  • The core idea of self-supervised learning is to train based on the inherent structure and characteristics of the data. It uses some hidden structure or information in the input data to generate "pseudo labels", and then uses these "pseudo labels" to train the model.
  • For example, in the field of image processing, you can rotate an image by a certain angle as a self-supervised task, and then let the model predict the rotation angle of the image. In this process, the model autonomously learns the spatial relationship and feature extraction capabilities of the image without the need for manual annotation of labels.

Classification based on self-supervised pretext tasks utilizes common pretext tasks such as rotation prediction (Tajbakhsh et al., 2019) and Rubik's cube recovery (Zhuang et al., 2019; Zhu et al., 2020a). Chen et al. (2019b) argued that existing pretext tasks such as relative position prediction (Doersch et al., 2015) and local context prediction (Pathak et al., 2016) only lead to marginal improvements on medical image datasets; the authors therefore designed a pretext task based on context restoration. This new pretext task consists of two steps: corrupting the image by shuffling small patches within it, and then restoring the original image. The context-restoration pre-training strategy improves the performance of medical image classification. Tajbakhsh et al. (2019) utilized three pretext tasks, namely rotation prediction (Gidaris et al., 2018), colorization (Larsson et al., 2017) and WGAN-based patch reconstruction, to pre-train models for classification. After pre-training, the model is trained using the labeled examples. In the medical field, pretext-task-based pre-training is more effective than random initialization and transfer learning (ImageNet pre-training) for diabetic retinopathy classification.
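A minimal sketch of the rotation-prediction pretext task: pseudo-labels are created by rotating unlabeled images, and the network is pre-trained to predict the rotation before being fine-tuned on the labeled task. The tiny backbone and the 4-way rotation set are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

def make_rotation_batch(images):
    """Create pseudo-labeled data: rotate each image by 0/90/180/270 degrees."""
    rotated, labels = [], []
    for k in range(4):                                   # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
rotation_head = nn.Linear(16, 4)                         # discarded after pre-training

unlabeled = torch.rand(8, 1, 64, 64)                     # unlabeled medical images
x_rot, y_rot = make_rotation_batch(unlabeled)
loss = F.cross_entropy(rotation_head(encoder(x_rot)), y_rot)
loss.backward()      # afterwards, the encoder is fine-tuned on the labeled downstream task
```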

For classification based on self-supervised contrastive learning, Azizi et al. (2021) adopted the self-supervised learning framework SimCLR (Chen et al., 2020a) to train models (wider versions of ResNet-50 and ResNet-152) for dermatology classification and chest X-ray classification. They first used unlabeled natural images and then pre-trained the models using unlabeled dermatology images and chest X-rays. Feature representations are learned by maximizing the agreement between positive image pairs, which are either two augmented versions of the same image or multiple images from the same patient. The pre-trained models were fine-tuned using fewer labeled dermatology images and chest X-rays. These models achieved a 1.1% higher average AUC for chest X-ray classification than models pre-trained on ImageNet, and a 6.7% higher top-1 accuracy for dermatology classification. MoCo (He et al., 2020; Chen et al., 2020b) is another popular self-supervised learning framework for pre-training models for medical classification tasks, such as diagnosing COVID-19 from CT images (Chen et al., 2021a) and identifying pleural effusion in chest X-rays (Sowrirajan et al., 2021). Furthermore, research shows that self-supervised contrastive pre-training can benefit from the fusion of domain knowledge. For example, Vu et al. (2021) utilized patient metadata (patient number, image laterality, and study number) to construct and select positive pairs from multiple chest X-ray images for MoCo pre-training. Compared with a previous contrastive learning method (Sowrirajan et al., 2021) and ImageNet pre-training, this approach, using only 1% of the labeled data for pleural effusion classification, improved the average AUC by 3.4% and 14.4%, respectively.

3.1.3 Semi-supervised learning


Unlike self-supervised methods, which only learn useful feature representations from unlabeled data, semi-supervised learning integrates unlabeled data with labeled data in different ways to train the model for better performance. Madani et al. (2018a) used a GAN (Kingma et al., 2014) trained in a semi-supervised manner to classify heart disease in chest X-rays with limited labeled data. Unlike an ordinary GAN (Goodfellow et al., 2014), this semi-supervised GAN is trained using both unlabeled and labeled data. Its discriminator is modified to predict not only the authenticity of input images but also the image category (normal/abnormal) of real data. As the number of labeled examples was increased, the classifier based on the semi-supervised GAN consistently outperformed a supervised CNN. Semi-supervised GANs are also useful in other classification tasks with limited data, such as CT lung nodule classification (Xie et al., 2019a) and left ventricular hypertrophy classification in echocardiography (Madani et al., 2018b). In addition to semi-supervised adversarial methods, consistency-based semi-supervised methods such as the Π-Model (Laine and Aila, 2017) and Mean Teacher (Tarvainen and Valpola, 2017) have also been used to exploit unlabeled medical image data for better classification (Shang et al., 2019; Liu et al., 2020a).

3.2 Segmentation


Medical image segmentation, i.e., identifying the sets of pixels or voxels of lesions, organs, and other substructures against background regions, is another challenging task in medical image analysis (Litjens et al., 2017). Among the common image analysis tasks such as classification and detection, segmentation requires the strongest supervision (a large number of high-quality annotations) (Tajbakhsh et al., 2020). Since its introduction in 2015, U-Net (Ronneberger et al., 2015) has become perhaps the best-known architecture for segmenting medical images, and different U-Net variants have subsequently been proposed to further improve segmentation performance. From the recent literature, we observe that the combination of U-Net and vision transformers (Dosovitskiy et al., 2020) contributes to state-of-the-art performance. Furthermore, many methods based on semi-supervised and self-supervised learning have been proposed to alleviate the need for large annotated datasets. Therefore, in this section, we (1) review the original U-Net and its important variants and summarize useful performance-enhancing strategies; (2) introduce the combination of U-Net with transformers, as well as Mask R-CNN (He et al., 2017); and (3) cover self-supervised and semi-supervised segmentation methods. Since recent research has mainly focused on applying transformers to segment medical images in a supervised manner, we intentionally place the introduction of transformer-based architectures in the supervised segmentation section. However, it should be noted that this placement does not mean that transformer-based architectures cannot be used in semi-supervised or unsupervised settings, or for other medical imaging tasks.

3.2.1 Segmentation model based on supervised learning


U-Net and its variants: In convolutional networks, the high-level coarse-grained features learned by higher layers capture semantics beneficial for overall image classification; in contrast, the low-level fine-grained features learned by lower layers contain details useful for precise localization (i.e., assigning a class label to each pixel) (Hariharan et al., 2015), which is important for image segmentation. U-Net is built on the fully convolutional network (Long et al., 2015). The key innovation of U-Net is the so-called skip connection established between corresponding convolutional and deconvolutional layers, which successfully fuses features learned at different levels and improves segmentation performance. At the same time, the skip connections also help restore the network output to the same spatial resolution as the input. U-Net takes a two-dimensional image as input and generates multiple segmentation maps, each corresponding to one pixel class.
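A minimal sketch of the U-Net idea: an encoder path, a decoder path, and a skip connection that concatenates high-resolution encoder features with upsampled decoder features. This toy network has a single level of depth, and the channel sizes are illustrative assumptions.

```python
import torch
from torch import nn

class TinyUNet(nn.Module):
    """One-level U-Net: the skip connection fuses fine-grained encoder features."""
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())  # 32 = 16 skip + 16 up
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)   # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                          # low-level, high-resolution features
        b = self.bottleneck(self.down(e))        # high-level, coarse features
        u = self.up(b)                           # restore spatial resolution
        d = self.dec(torch.cat([u, e], dim=1))   # skip connection fuses both levels
        return self.head(d)                      # one segmentation map per class

seg_logits = TinyUNet()(torch.randn(1, 1, 64, 64))   # output shape: (1, 2, 64, 64)
```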

Building on this basic architecture, Drozdzal et al. (2016) further studied the impact of long and short skip connections on biomedical image segmentation, concluding that adding short skip connections is important for training deep segmentation networks. In another study, Zhou et al. (2018) argued that the plain skip connections between the encoder and decoder sub-networks of U-Net lead to the fusion of semantically dissimilar feature maps, and proposed to reduce this semantic gap. In their UNet++ model, the simple skip connections are replaced by nested, dense skip connections. The architecture outperforms U-Net and wide U-Net on four different medical image segmentation tasks.

In addition to redesigning the skip connections, Çiçek et al. (2016) replaced all 2D operations with 3D operations, extending the 2D U-Net to a 3D U-Net for volumetric segmentation of sparsely annotated images. Milletari et al. (2016) proposed V-Net for 3D MRI prostate volume segmentation. The main architectural difference between U-Net and V-Net is that the forward convolution unit (Figure 5(a)) becomes a residual convolution unit (Figure 5(c)), so V-Net is also called residual U-Net. To address the imbalance between the numbers of foreground and background voxels, a loss function based on the Dice coefficient is proposed. To address the scarcity of annotated volumes, the authors augment their training dataset with random nonlinear transformations and histogram matching. Gibson et al. (2018a) proposed the Dense V-Network, which improves the binary segmentation loss function of V-Net and supports multi-organ segmentation of abdominal CT images. Although the authors followed the V-Net architecture, they replaced its relatively shallow downsampling network with a sequence of three dense feature stacks. The combination of densely connected layers and the shallow V-Net structure proved important for improving segmentation accuracy, and the proposed model significantly improved the Dice score for all organs compared with the multi-atlas label fusion (MALF) method.

When designing a segmentation network based on U-Net, Alom et al. (2018) proposed to integrate the architectural advantages of the Recurrent Convolutional Neural Network (RCNN) (Ming and Xiaolin, 2015) and ResNet (He et al., 2016). In their first network (RU-Net), the authors replaced the plain convolution unit of U-Net with the recurrent convolutional layer (RCL) of RCNN (Figure 5(b)), which helps accumulate useful features to improve segmentation results. In their second network (R2U-Net), the authors further modified the RCL with ResNet's residual unit (Figure 5(d)), which learns a residual function via identity shortcut connections and thus allows very deep networks to be trained. Both models achieved better segmentation performance than U-Net and residual U-Net. Dense convolution blocks (Huang et al., 2017) have also demonstrated their advantages in improving liver and tumor CT volume segmentation (Li et al., 2018).

In addition to redesigned skip connections and modified architectures, U-Net-based segmentation methods also benefit from adversarial training (Xue et al., 2018; Zhang et al., 2020b), attention mechanisms (Jetley et al., 2018; Anderson et al., 2018; Oktay et al., 2018; Nie et al., 2018; Sinha and Dolz, 2021) and uncertainty estimation (Wang et al., 2019a; Yu et al., 2019; Baumgartner et al., 2019; Mehrtash et al., 2020). For example, Xue et al. (2018) developed an adversarial network for brain tumor segmentation that consists of two parts: a segmenter and a critic. The segmenter is a U-Net-like network that generates a segmentation map given an input image; the predicted map and the ground-truth segmentation map are fed into the critic network. Training these two components alternately eventually leads to good segmentation results. Oktay et al. (2018) proposed incorporating attention gates (AGs) into the U-Net architecture to suppress irrelevant features from background regions and highlight salient features propagated through the skip connections; the resulting Attention U-Net consistently outperforms U-Net in CT pancreas segmentation. Baumgartner et al. (2019) developed a hierarchical probabilistic model to estimate uncertainty in prostate MR and chest CT image segmentation. The authors employ variational autoencoders to infer uncertainty or ambiguity in expert annotations and use separate latent variables to model segmentation variation at different resolutions.
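
As an illustration of the attention-gate idea, the following sketch shows an additive attention gate in the spirit of Oktay et al. (2018). For simplicity it assumes the gating signal g and the skip feature x have already been resampled to the same spatial size (the original design includes extra resampling steps); the class name and channel arguments are illustrative.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)  # project skip features
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)    # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)         # scalar attention map

    def forward(self, x, g):
        # Additive attention: salient foreground locations receive weights close to 1,
        # irrelevant background locations are suppressed before the skip connection.
        att = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * att   # re-weighted skip features passed on to the decoder
```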

Transformers for segmentation: Transformers are a family of encoder-decoder network architectures originally developed for sequence-to-sequence processing in NLP (Chaudhari et al., 2021). A key sub-module is multi-head self-attention (MSA), in which multiple parallel self-attention layers generate multiple attention vectors for each input simultaneously. Unlike the convolution-based U-Net and its variants, Transformers rely on a self-attention mechanism, which has the advantage of learning complex, long-range dependencies from input images. In the context of medical image segmentation, there are two ways of adapting transformers: hybrid architectures and pure transformers. Hybrid methods combine a CNN with a transformer, while the latter does not involve any convolution operations.
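
A minimal sketch of one Transformer layer (MSA followed by an MLP, each with a residual connection) is given below, using PyTorch's built-in multi-head attention; it is a generic illustration of the mechanism, not any specific published segmentation model, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_dim=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, tokens):                        # tokens: (batch, n_patches, dim)
        h = self.norm1(tokens)
        attn_out, _ = self.msa(h, h, h)               # self-attention over all patch tokens
        tokens = tokens + attn_out                    # residual connection
        return tokens + self.mlp(self.norm2(tokens))  # MLP block with residual connection

# out = TransformerLayer()(torch.randn(2, 196, 256))  # 196 patch tokens of dimension 256
```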

Chen et al. (2021b) proposed TransUNet, the first transformer-based medical image segmentation framework. The architecture combines a CNN and a Transformer in a cascaded manner, where the advantages of one compensate for the limitations of the other. As mentioned before, U-Net and its convolution-based variants have achieved satisfactory results: thanks to the skip connections, low-level/high-resolution CNN features from the encoder, which contain precise localization information, are exploited by the decoder, resulting in better performance. However, due to the inherent locality of convolutions, these models are often weak at modeling long-range relationships. On the other hand, although a Transformer based on self-attention can easily capture long-range dependencies, the authors found that using a Transformer alone does not give satisfactory results, because it focuses on learning the global context and neglects the low-level details that contain important localization information. Therefore, the authors propose to combine low-level spatial information from CNN features with the global context from the Transformer. As shown in Figure 6(b), TransUNet adopts a skip-connected encoder-decoder design. The encoder consists of a CNN module and several Transformer layers. The input image is first divided into small patches and tokenized, and the CNN generates feature maps for the input patches. CNN features at different resolutions are passed to the decoder through skip connections, thereby preserving spatial localization information. Patch embedding and position embedding are then applied to the feature-map sequence, and the embedded sequence is fed into a series of Transformer layers to learn global relationships. Each Transformer layer consists of an MSA block (Dosovitskiy et al., 2020b; Vaswani et al., 2017) and a multilayer perceptron (MLP) block (Figure 6(a)). The hidden feature representation produced by the last Transformer layer is reshaped and gradually upsampled by the decoder, which outputs the final segmentation mask. TransUNet shows superior performance to other competing methods such as Attention U-Net on CT multi-organ segmentation tasks.

In another study, Zhang et al. (2021) adopted a different approach to combining a CNN and a Transformer. Instead of first using a CNN to extract low-level features and then passing them through Transformer layers, the TransFuse model combines a CNN branch and a Transformer branch in parallel. The Transformer branch consists of several layers and takes a sequence of embedded image patches as input to capture global contextual information. The output of the last layer is reshaped into a 2D feature map, which is then upsampled at three different scales to recover finer local details. Correspondingly, the CNN branch uses three ResNet-based blocks to extract features at three different scales, from local to global. Features of the two branches at the same resolution scale are selectively fused through an independent fusion module; the fused features capture low-level spatial context and high-level global context simultaneously. Finally, the multi-level fused features are used to generate the final segmentation mask. TransFuse achieves good results in prostate MRI segmentation.

In addition to 2D image segmentation, hybrid methods are also applicable to 3D scenarios. Hatamizadeh et al. (2022) proposed a U-Net-like architecture for volumetric segmentation of MRI brain tumors and the CT spleen. Similar to the 2D case, the 3D image is first divided into patches; linear embedding and positional embedding are applied to the resulting volume sequence, which is then fed to the encoder. The encoder consists of multiple Transformer layers that extract multi-scale global feature representations from the embedded sequence. Features extracted at different scales are upsampled to higher resolutions and merged with the multi-scale features of the decoder via skip connections. In another study, Xie et al. (2021b) studied how to reduce the computational and memory complexity of transformers in 3D multi-organ segmentation tasks. To achieve this goal, they replaced the original MSA module of the plain Transformer with a deformable self-attention module (Zhu et al., 2021a). This attention module focuses only on a small set of key locations instead of treating all locations equally, thus greatly reducing complexity. Furthermore, their proposed architecture, CoTr, is in the same spirit as TransUNet in that the CNN generates feature maps that are used as input to the Transformer; the difference is that the CNN in CoTr extracts multi-scale rather than single-scale feature maps.

For the pure-Transformer paradigm, Cao et al. (2021) proposed Swin-Unet, the first U-Net-like pure Transformer architecture for medical image segmentation. Swin-Unet has a symmetric encoder-decoder structure and does not use any convolution operations. The main components of the encoder and decoder are (1) Swin Transformer blocks (Liu et al., 2021) and (2) patch merging or patch expanding layers. The Swin Transformer block uses a shifted-window scheme, which provides better modeling capability and lower complexity when computing self-attention, so the authors use it to extract feature representations from the embedded image-patch sequences. Subsequent patch merging layers downsample the feature maps to lower resolutions, and these downsampled maps are passed through further Transformer blocks and patch merging layers. Similarly, the decoder also uses Transformer blocks for feature extraction, but its patch expanding layers upsample the feature maps to higher resolutions. As in U-Net, the upsampled feature maps are fused with the downsampled feature maps from the encoder via skip connections. Finally, the decoder outputs pixel-level segmentation predictions. This framework achieves satisfactory results in multi-organ CT and cardiac MRI segmentation tasks.

Note that, to ensure good performance and reduce training time, most transformer-based segmentation models introduced so far are pre-trained on large external datasets such as ImageNet. Interestingly, research shows that Transformers can also integrate high-level information and finer details by leveraging computationally efficient self-attention modules (Wang et al., 2020a) and new training strategies (Valanarasu et al., 2021), producing good results without pre-training. Furthermore, Hatamizadeh et al. (2022) and Xie et al. (2021b) found that pre-training did not improve performance when applying transformer-based models to 3D medical image segmentation.

Mask RCNN for segmentation: In addition to the above U-Net- and transformer-based methods, another architecture, Mask RCNN (He et al., 2017), was originally developed for pixel-level instance segmentation and has achieved good results in medical imaging tasks. Since it is closely related to Faster RCNN (Ren et al., 2015, 2017), a region-based CNN for object detection, the details of Mask RCNN and its relationship to the detection architecture are elaborated later. In short, Mask RCNN has (1) a Region Proposal Network (RPN), as in Faster RCNN, to produce high-quality region proposals (i.e., regions that may contain objects), (2) a RoIAlign layer to maintain the spatial correspondence between ROIs and their feature maps, and (3) in addition to the class and bounding-box prediction of Faster RCNN, a parallel branch for binary mask prediction. It is worth noting that a Feature Pyramid Network (FPN) (Lin et al., 2017a) is used as the backbone of Mask RCNN to extract multi-scale features. FPN has two pathways, bottom-up and top-down, to extract and merge features in a pyramid hierarchy. The bottom-up pathway extracts feature maps from high resolution (semantically weak features) to low resolution (semantically strong features), while the top-down pathway does the opposite. At each resolution, features generated by the top-down pathway are enhanced by features from the bottom-up pathway via lateral (skip) connections. This design makes FPN look like U-Net, but the main difference is that FPN makes predictions independently at all resolution scales instead of only one.
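
To make the role of RoIAlign concrete, the snippet below crops a fixed-size feature window for a region proposal while preserving its spatial correspondence to the feature map. It uses `torchvision.ops.roi_align`; the feature-map size, the example box coordinates, and the stride-16 scale are made-up illustrative values.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)                 # backbone/FPN feature map (stride 16 assumed)
boxes = torch.tensor([[0., 100., 120., 300., 360.]])   # (batch_idx, x1, y1, x2, y2) in image coords
roi_feats = roi_align(features, boxes, output_size=(7, 7),
                      spatial_scale=1 / 16.0, sampling_ratio=2, aligned=True)
print(roi_feats.shape)                                 # torch.Size([1, 256, 7, 7])
```

Each proposal is thus mapped to the same 7x7 feature grid, which the classification, box-regression, and mask branches can all consume.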

Wang et al. (2019b) proposed a volumetric attention (VA) module for 3D medical image segmentation within the Mask RCNN framework. This attention module exploits the contextual relationships of a 3D CT volume along the z direction. More specifically, a feature pyramid is extracted not only from the target image (3 adjacent slices with the target CT slice in the middle) but also from a series of neighboring images (also groups of 3 CT slices). The target pyramid and the neighboring feature pyramids are then concatenated at each level to form an intermediate pyramid that carries long-range relationships along the z-axis. Finally, spatial attention and channel attention are applied to the intermediate and target pyramids to form the final feature pyramid for mask prediction. Using this VA module, Mask RCNN achieves fewer false positives in segmentation. In another study, Zhou et al. (2019c) combined UNet++ and Mask RCNN to obtain Mask RCNN++. As mentioned before, UNet++ shows better segmentation results thanks to its redesigned nested and dense skip connections, so the authors used them to replace the ordinary skip connections of the FPN inside Mask RCNN. A large performance improvement was observed with the proposed model.

3.2.2 Segmentation model based on unsupervised learning


For medical image segmentation, in order to alleviate the need for large amounts of annotated training data, researchers have used generative models for image synthesis to increase the number of training samples (Zhang et al., 2018b; Zhao et al., 2019a). Meanwhile, harnessing the power of unlabeled medical images appears to be a more popular option: in contrast to difficult and expensive high-quality annotated datasets, unlabeled medical images are generally available, often in large quantities. Given a small medical image dataset with limited ground-truth annotations and a related but unlabeled large dataset, researchers have explored self-supervised and semi-supervised learning methods to learn useful and transferable feature representations from the unlabeled data; these are discussed in this section and the next, respectively.

Self-supervised pretext tasks: Since self-supervision through pretext tasks and contrastive learning can learn rich semantic representations from unlabeled datasets, self-supervised learning is often used to pre-train models so that downstream tasks (such as medical image segmentation) can be solved more accurately and effectively when only limited annotated examples are available (Taleb et al., 2020). Pretext tasks can be designed according to the application scenario or selected from traditional tasks used in computer vision. For the former type, Bai et al. (2019) designed a new pretext task that predicts anatomical positions for cardiac MR image segmentation. The features self-learned from the pretext task are transferred to a more challenging task, namely accurate ventricular segmentation. This method achieves higher segmentation accuracy than a standard U-Net trained from scratch, especially when only limited annotations are available.

For the latter type, Taleb et al. (2020) extended pretext tasks from 2D to 3D and studied the effectiveness of several pretext tasks (e.g., rotation prediction, jigsaw puzzles, relative patch localization) in 3D medical image segmentation. For brain tumor segmentation, they adopted the U-Net architecture and performed the pretext tasks on a large unlabeled dataset (~22,000 MRI scans) to pre-train the model, then fine-tuned the learned feature representations on a smaller labeled dataset (285 MRI scans). The 3D pretext tasks performed better than their 2D counterparts; more importantly, the proposed method sometimes outperformed supervised pre-training, indicating that the self-learned features generalize well.
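
As a simple illustration of such a pretext task, the sketch below trains an encoder to predict which of four rotations was applied to an unlabeled 2D slice, so the network must learn anatomy-aware features without any manual labels. The small CNN encoder, rotation head, and training step are all made-up stand-ins for a real segmentation backbone and training loop, not the setup of any cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())   # stand-in backbone
rotation_head = nn.Linear(16, 4)                                 # predicts k in {0, 90, 180, 270} degrees
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(rotation_head.parameters()))

def pretext_step(unlabeled_batch):                               # (N, 1, H, W) slices, no annotations
    k = torch.randint(0, 4, (unlabeled_batch.size(0),))          # random rotation label per slice
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                           for img, r in zip(unlabeled_batch, k)])
    loss = F.cross_entropy(rotation_head(encoder(rotated)), k)   # classify the applied rotation
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

After pre-training, the rotation head is discarded and the encoder weights initialize the downstream segmentation network.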

The performance of self-supervised pre-training can also be improved by adding other types of information. Hu et al. (2020) implemented a context encoder (Pathak et al., 2016) to perform semantic in-painting as a pretext task. They used DICOM metadata from ultrasound images as weak labels to improve the quality of the pre-trained features and to facilitate different downstream segmentation tasks.

Methods based on self-supervised contrastive learning: Early studies of this approach, such as the work of Jamaludin et al. (2017), adopted the original contrastive loss (Chopra et al., 2005b) to learn useful feature representations. In the past three years, as interest in self-supervised contrastive learning has surged, the contrastive loss has evolved from its initial version into more powerful variants (Oord et al., 2018) for learning expressive feature representations from unlabeled datasets. Chaitanya et al. (2020) argued that although the contrastive loss of Chen et al. (2020a) is suitable for learning image-level (global) feature representations, it does not guarantee learning the distinctive local representations that are important for per-pixel segmentation. They proposed a local contrastive loss to capture local features, thereby providing complementary information that improves segmentation performance. To the best of our knowledge, these authors were also the first to exploit the domain knowledge that structural similarities exist across volumetric medical images (e.g., CT and MRI) when computing the global contrastive loss. Their method significantly outperforms other semi-supervised and self-supervised methods in low-annotation MR image segmentation. Furthermore, the study shows that the proposed method can further benefit from data augmentation techniques such as Mixup (Zhang et al., 2018a).
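
For readers unfamiliar with the mechanics, the following is a minimal sketch of an InfoNCE-style global contrastive loss: two augmented views of the same image form a positive pair, while all other images in the batch serve as negatives. It is a generic illustration under simplifying assumptions, not the exact local/global formulation of Chaitanya et al. (2020).

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (N, N) cosine similarities between all pairs
    targets = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)   # pull positives together, push negatives apart
```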

3.2.3 Segmentation model based on semi-supervised learning


Semi-supervised consistency regularization: The mean teacher model is commonly used. Yu et al. (2019) introduced uncertainty estimation (Kendall and Gal, 2017) into the mean teacher framework to better segment the 3D left atrium from MR images. They argue that on an unlabeled dataset the outputs of the teacher model can be noisy and unreliable; therefore, in addition to generating target outputs, the teacher model is also modified to estimate the uncertainty of these outputs. An uncertainty-aware teacher model can provide more reliable guidance to the student model, and the student model can in turn improve the teacher model. The mean teacher model can also be improved by transformation-consistency strategies (Li et al., 2020b). In another study, Wang et al. (2020b) proposed a semi-supervised framework for segmenting COVID-19 pneumonia lesions from CT scans with noisy labels. Their framework is also based on the mean teacher model; instead of updating the teacher model with predefined values, they adaptively update it using a dynamic threshold on the student model's segmentation loss, and likewise the student model is adaptively updated by the teacher model. To handle noisy labels and foreground-background imbalance simultaneously, the authors developed a generalized version of the Dice loss. The segmentation network is designed in the same spirit as U-Net, but with changes such as new skip connections (Pang et al., 2019) and multi-scale feature representation (Chen et al., 2018a). Finally, the segmentation network with the Dice loss is combined with the mean teacher framework. This method is highly robust to label noise and achieves good results in pneumonia lesion segmentation.
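
The core of the mean teacher idea fits in a few lines: the teacher's weights are an exponential moving average (EMA) of the student's, and a consistency loss aligns their predictions on unlabeled images. The sketch below illustrates only this basic mechanism; the uncertainty masking and adaptive updates described above are omitted, and `student`/`teacher` stand for any segmentation networks with identical architecture.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.99):
    # Teacher weights track an exponential moving average of the student weights.
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)

def consistency_loss(student, teacher, unlabeled_batch):
    # Penalize disagreement between student and teacher predictions on unlabeled data.
    with torch.no_grad():
        teacher_pred = torch.softmax(teacher(unlabeled_batch), dim=1)
    student_pred = torch.softmax(student(unlabeled_batch), dim=1)
    return F.mse_loss(student_pred, teacher_pred)

# Typical usage: teacher = copy.deepcopy(student); after every optimizer step on the
# student (supervised + consistency loss), call ema_update(teacher, student).
```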

Semi-supervised pseudo-labeling: Fan et al. (2020) proposed a semi-supervised framework (Semi-InfNet) to address the lack of high-quality labeled data in CT image segmentation of COVID-19 lung infection. To generate pseudo-labels for unlabeled images, they first trained their model on 50 labeled CT images and used it to generate pseudo-labels for a small number of unlabeled images. The newly pseudo-labeled examples were then added to the original labeled training set, and the model was retrained to generate pseudo-labels for another batch of unlabeled images. This process is repeated until all 1600 unlabeled CT images are pseudo-labeled. Both labeled and pseudo-labeled examples are used to train Semi-InfNet, which significantly outperforms other state-of-the-art segmentation models such as UNet++. In addition to the semi-supervised learning strategy, three key components of the model are responsible for its good performance: the Parallel Partial Decoder (PPD) (Wu et al., 2019a), Reverse Attention (RA) (Chen et al., 2018b) and Edge Attention (EA) (Zhang et al., 2019). The PPD aggregates high-level features of the input image and generates a global map indicating the approximate location of the lung infection area; the EA module leverages low-level features to model boundary details, and the RA module further refines the rough estimate into an accurate segmentation map.
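
The iterative pseudo-labeling loop described above can be summarized in pseudocode-like Python. The helpers `train_model` and `predict_masks` are hypothetical placeholders for an ordinary supervised training routine and an inference pass; this is a schematic sketch of the strategy, not the authors' implementation.

```python
def iterative_pseudo_labeling(model, labeled_set, unlabeled_batches, train_model, predict_masks):
    """labeled_set: list of (image, mask); unlabeled_batches: successive chunks of unlabeled scans."""
    train_pool = list(labeled_set)                     # examples with real annotations
    for batch in unlabeled_batches:                    # e.g., chunks of the 1600 unlabeled CTs
        train_model(model, train_pool)                 # (re)train on everything labeled so far
        pseudo_masks = predict_masks(model, batch)     # model-generated segmentation maps
        train_pool += list(zip(batch, pseudo_masks))   # treat the predictions as pseudo-labels
    train_model(model, train_pool)                     # final training on real + pseudo labels
    return model
```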

Semi-supervised generative models: In one of the earliest works extending generative models to semi-supervised segmentation, Sedai et al. (2017) utilized two VAEs to segment the optic cup from retinal fundus images. The first VAE learns feature embeddings from a large number of unlabeled images via image reconstruction; the second VAE is trained on a smaller number of labeled images, mapping input images to segmentation masks. In other words, the authors use the first VAE to perform an auxiliary task (image reconstruction) on unlabeled data, which helps the second VAE better achieve the target task (image segmentation) on labeled data. To leverage the feature embeddings learned by the first VAE, the second VAE reconstructs both the segmentation mask and the latent representation of the first VAE. Leveraging this additional information from unlabeled images improves segmentation accuracy. In another study, Chen et al. (2019c) adopted a similar idea and introduced an auxiliary task on unlabeled data to perform image segmentation with limited labeled data. Specifically, the authors propose a semi-supervised segmentation framework consisting of a U-Net-like network for segmentation (the target task) and an autoencoder for reconstruction (the auxiliary task). Unlike the previous study, which trained two VAEs separately, the segmentation and reconstruction networks in this framework share the same encoder. Another difference is that the foreground and background parts of the input image are reconstructed/generated separately and obtain their respective segmentation labels through an attention mechanism. This semi-supervised segmentation framework outperforms its peers (e.g., fully supervised CNNs) under different labeled/unlabeled data splits.

In addition to the above methods, researchers have also explored incorporating domain-specific prior knowledge to customize semi-supervised frameworks for better segmentation performance. Such priors vary greatly and include anatomical priors (He et al., 2019), atlas priors (Zheng et al., 2019), topological priors (Clough et al., 2020), semantic constraints (Ganaye et al., 2018) and shape constraints (Li et al., 2020c), among others.

3.3 Detection


A natural image may contain objects belonging to different categories, and each object category may contain several instances. In the field of computer vision, object detection algorithms are used to detect and identify whether there are any instances of certain object categories in an image (Sermanet et al., 2014; Girshick et al., 2014; Russakovsky et al., 2015). Previous surveys (Shen et al., 2017; Litjens et al., 2017) reviewed the successful applications of frameworks proposed before 2015, such as OverFeat (Sermanet et al., 2014; Ciompi et al., 2015), RCNN (Girshick et al., 2014) and models based on fully convolutional networks (FCN) (Long et al., 2015; Dou et al., 2016; Wolterink et al., 2016). By comparison, we aim to summarize applications of more recent object detection frameworks (since 2015), such as Faster RCNN (Ren et al., 2015), YOLO (Redmon et al., 2016) and RetinaNet (Lin et al., 2017b). In this section, we first briefly review several recent milestone detection frameworks, including single-stage and two-stage detectors. Since these detection frameworks are typically used in supervised and semi-supervised settings, we introduce them under those learning paradigms. We then describe the application of these frameworks to the detection of specific lesion types and to universal lesion detection. Finally, we introduce unsupervised lesion detection based on GANs and VAEs.

3.3.1 Supervised and semi-supervised lesion detection


Detection framework overview: The RCNN framework (Girshick et al., 2014) is a multi-stage pipeline. Although RCNN achieves impressive results in object detection, it has several disadvantages: the multi-stage pipeline makes training slow and difficult to optimize, and extracting features for each region proposal individually makes training expensive in terms of disk space and time and also slows down testing (Girshick, 2015). These deficiencies inspired several recent milestone detectors, which can be divided into two groups (Liu et al., 2020b): (1) two-stage detection frameworks (Girshick, 2015; Ren et al., 2015, 2017; Dai et al., 2016), which include a separate module for generating region proposals (predicted class probabilities and bounding-box coordinates) before bounding-box identification; and (2) single-stage detection frameworks (Redmon et al., 2016; Redmon and Farhadi, 2017; Liu et al., 2016; Lin et al., 2017b; Law and Deng, 2020; Duan et al., 2019), which predict bounding boxes in a unified manner without a separate region-proposal step. In an image, region proposals are collections of candidate regions or bounding boxes that may contain objects (Liu et al., 2020b).

Two-stage detectors: Unlike RCNN, the Fast RCNN framework (Girshick, 2015) is an end-to-end detection pipeline that employs a multi-task loss to jointly classify region proposals and regress bounding boxes. Region proposals in Fast RCNN are generated on shared convolutional feature maps rather than on the original image, which speeds up computation, and a region-of-interest pooling layer warps all region proposals to the same size. These adjustments make Fast RCNN both more accurate and faster, but its speed remains bottlenecked by the inefficient computation of region proposals. In the Faster RCNN framework (Ren et al., 2015, 2017), a Region Proposal Network (RPN) replaces the selective search method and efficiently generates high-quality region proposals from anchor boxes. Anchor boxes are a set of predetermined candidate boxes with different sizes and aspect ratios designed to capture specific categories of objects (Ren et al., 2015); since then, anchor boxes have played a leading role in top detection frameworks. Mask RCNN (He et al., 2017) is closely related to Faster RCNN but was originally designed for pixel-level object instance segmentation. Mask RCNN also has an RPN to propose candidate object bounding boxes; the new framework extends Faster RCNN by adding a branch that outputs a binary object mask in parallel with the existing branch that predicts classes and bounding-box offsets. Mask RCNN uses a Feature Pyramid Network (FPN) (Lin et al., 2017a) as the backbone to extract features at various resolution scales. In addition to instance segmentation, Mask RCNN can also be used for object detection with good accuracy and speed.

Single-stage detectors: Redmon et al. (2016) proposed YOLO, a single-stage framework; instead of using a separate network to generate region proposals, they treated object detection as a simple regression problem, directly predicting object categories and bounding-box coordinates with a single network. YOLO also differs from region-proposal-based frameworks (e.g., Faster RCNN) in that it learns features globally from the entire image rather than from local regions. Although faster and simpler, YOLO has more localization errors and lower detection accuracy than Faster RCNN. Later, the authors proposed YOLOv2 and YOLO9000 (Redmon and Farhadi, 2017) to improve performance by integrating different techniques, including batch normalization, well-chosen anchor boxes, fine-grained features and multi-scale training. Lin et al. (2017b) argued that the main reason for the lagging performance of single-stage detectors is the imbalance between foreground and background classes (i.e., training is dominated by a large number of easy examples from the background). To solve this class imbalance problem, they proposed a new focal loss that weakens the influence of easy examples and strengthens the contribution of hard examples; the proposed framework (RetinaNet) showed higher detection accuracy than the state-of-the-art two-stage detectors of the time. Law and Deng (2020) proposed CornerNet and pointed out that the common use of anchor boxes in object detection frameworks (especially single-stage detectors) can cause problems such as an extreme imbalance between positive and negative examples, slow training, and the introduction of additional hyperparameters. Instead of designing a set of anchor boxes to detect bounding boxes, the authors formulate bounding-box detection as detecting a pair of keypoints (the top-left and bottom-right corners) (Newell et al., 2017; Tychsen-Smith and Petersson, 2017). Nonetheless, CornerNet generates a large number of incorrect bounding boxes because it cannot fully utilize the discriminative information within the cropped regions (Duan et al., 2019). Duan et al. (2019) proposed CenterNet, built on CornerNet, which uses a keypoint triplet (a pair of corners plus a center keypoint) to detect each object. Unlike CornerNet, CenterNet can extract more discriminative visual patterns within each proposed region, thereby effectively suppressing inaccurate bounding boxes (Duan et al., 2019).
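
For reference, the binary focal loss of Lin et al. (2017b) can be written as follows; easy, well-classified (mostly background) examples are down-weighted by the factor (1 - p_t)^gamma so that hard examples dominate training, while alpha balances foreground and background. This is a standard re-implementation sketch, not the RetinaNet authors' code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape; targets are 0 (background) or 1 (object)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # down-weight easy examples
```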

Detection of specific types of medical objects (e.g., lesions): Common computer-aided detection (CADe) tasks include the detection of lung nodules (Gu et al., 2018; Xie et al., 2019b), breast masses (Akselrod-Ballin et al., 2017; Ribli et al., 2018), lymph nodes (Zhu et al., 2020b) and sclerosis lesions (Nair et al., 2020), among others. General detection frameworks were originally designed for generic object detection in natural images and cannot guarantee satisfactory performance for lesion detection in medical images, for two main reasons: (1) lesions can be very small compared with natural objects; and (2) lesions and non-lesions often have similar appearance (e.g., texture and intensity) (Tao et al., 2019; Tang et al., 2019). To provide good detection performance in the medical domain, these frameworks need to be adapted in different ways, such as incorporating domain-specific features, uncertainty estimation, or semi-supervised learning strategies, as shown below.

In the fields of radiology and histopathology, incorporating domain-specific characteristics has always been a popular choice. In radiology, the three-dimensional spatial context inherent in volumetric images (such as CT scans) has been exploited in many studies (Roth et al., 2016; Dou et al., 2017; Yan et al., 2018a; Liao et al., 2019). For example, in the pulmonary nodule detection task, Ding et al. (2017) argued that the original Faster RCNN (Ren et al., 2015) with a VGG-16 backbone (Liu and Deng, 2015) could not capture representative features of small pulmonary nodules; they therefore introduced a deconvolution layer at the end of Faster RCNN to recover the fine-grained features that are important for detecting small objects. On the deconvolved feature map, FPN is applied to propose nodule candidate regions from two-dimensional axial slices. To reduce the false-positive rate, the authors propose to let the classification network see the full context of candidate nodules: instead of a 2D CNN, they chose a 3D CNN to exploit the 3D context of the candidate regions, so that more distinctive features can be captured for nodule recognition. This method ranked first in nodule detection on the LUNA16 benchmark dataset (Setio et al., 2017). Zhu et al. (2018) also considered the 3D characteristics of lung CT images and designed a 3D Faster RCNN for nodule detection. To learn nodule features effectively, the 3D Faster RCNN has a U-Net-like structure (Ronneberger et al., 2015) and is built with compact dual-path blocks (Chen et al., 2017). It is worth noting that although 3D CNNs are effective in improving detection performance, they also have disadvantages compared with 2D CNNs, including consuming more computing resources and requiring more effort to obtain 3D bounding-box annotations (Yan et al., 2018a; Tao et al., 2019). In a recent study, Mei et al. (2021) established a large dataset (PN9) containing more than 40,000 annotated lung nodules for training CNN-based 3D models. The authors improved the model's ability to detect both large and small pulmonary nodules by exploiting the correlation between multiple consecutive CT slices. Given a slice group, a module based on non-local operations (Wang et al., 2018) is adopted to obtain long-range dependencies across different positions and channels of the feature maps. Furthermore, since each shallow ResNet block generates feature maps carrying useful spatial information at the same scale, the authors reduce false-positive nodule candidates by merging multi-scale features produced by three different blocks.

In the field of histopathology, Rijthoven et al. (2018) proposed an improved version of YOLOv2 (Redmon and Farhadi, 2017) for lymphocyte detection in whole-slide images (WSI). Based on prior knowledge of lymphocytes (e.g., average size, no overlap), the authors simplified the original 23-layer YOLO network by retaining only a few layers. Since brown regions without lymphocytes in WSIs are known in advance to contain many hard negative samples, the authors also designed a sampling strategy that forces the detection model to focus on these hard negatives during training. This method improves the F1-score by 3% and is 4.3 times faster. In later work, Swiderska-Chadaj et al. (2019) modified the YOLO architecture to detect lymphocytes in a more diverse WSI dataset of breast, prostate and colon cancer; however, its performance was not as good as that of a U-Net-based detection approach, which first classifies each pixel and then uses post-processing to produce detection results. The modified YOLO architecture was also the least robust to different staining techniques.

Recently, semi-supervised methods have been used to improve medical object detection (Gao et al., 2020; Qi et al., 2020). For example, Wang et al. (2020c) developed a generalized version of the original focal loss (Lin et al., 2017b) to handle soft labels when computing the semi-supervised loss function. They modified the semi-supervised method MixMatch (Berthelot et al., 2019) in two ways to make it suitable for 3D medical image detection. An FPN is first applied to unlabeled CT images (without lesion annotations) to generate pseudo-labeled object instances. The pseudo-labeled examples are then mixed with examples carrying ground-truth annotations via Mixup augmentation. The original Mixup augmentation (Zhang et al., 2018a) was designed for classification tasks, where labels are image categories; the authors adapted this augmentation technique to lesion detection, where annotations take the form of bounding boxes. In lung nodule detection, the semi-supervised learning method shows significant performance gains over supervised learning baselines.
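
For context, the basic Mixup operation that produces such soft labels is sketched below: two samples and their (possibly pseudo) labels are blended with a Beta-distributed weight. This is the generic classification-style formulation; extending it to bounding-box annotations, as Wang et al. (2020c) did, requires additional design choices not shown here.

```python
import numpy as np

def mixup(x1, y1, x2, y2, beta=1.0):
    """x1, x2: images/volumes as arrays or tensors; y1, y2: (possibly pseudo) label targets."""
    lam = np.random.beta(beta, beta)         # mixing coefficient sampled from Beta(beta, beta)
    x_mix = lam * x1 + (1 - lam) * x2        # blended inputs
    y_mix = lam * y1 + (1 - lam) * y2        # blended soft labels
    return x_mix, y_mix, lam
```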

Furthermore, uncertainty estimation is another useful technique for detecting small objects (Ozdemir et al., 2017; Nair et al., 2020). For example, in multiple sclerosis lesion detection, uncertainty mainly arises from small lesions and lesion boundaries, and Nair et al. (2020) explored the use of uncertainty estimation to improve detection performance. Specifically, four uncertainty measures are computed: the prediction variance learned from the training data (Kendall and Gal, 2017), the variance of Monte Carlo (MC) samples, predictive entropy, and mutual information. Thresholds on these measures are used to filter out the most uncertain lesion candidates, thereby improving detection performance.
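
Two of these measures are straightforward to compute with Monte Carlo dropout, as in the minimal sketch below: dropout is kept active at test time, the model is run T times, and the variance and predictive entropy of the sampled probabilities serve as voxel-wise uncertainty maps. The `model` argument stands for any detection/segmentation network with dropout layers; this is an illustration, not the cited authors' code (note that `model.train()` also affects other layers such as batch normalization, which a careful implementation would handle separately).

```python
import torch

def mc_dropout_uncertainty(model, image, T=10, eps=1e-8):
    model.train()                                   # keep dropout stochastic at test time
    with torch.no_grad():
        samples = torch.stack([torch.sigmoid(model(image)) for _ in range(T)])
    mean_p = samples.mean(dim=0)                    # averaged lesion probability
    variance = samples.var(dim=0)                   # MC sample variance
    entropy = -(mean_p * torch.log(mean_p + eps)
                + (1 - mean_p) * torch.log(1 - mean_p + eps))  # predictive entropy
    return mean_p, variance, entropy
```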

Universal lesion detection: Traditional lesion detectors focus on specific lesion types, but there is growing research interest in identifying and localizing different types of lesions across the whole human body at once (Yan et al., 2018a, 2019; Tao et al., 2019; Yan et al., 2020; Cai et al., 2020; Li et al., 2020d). DeepLesion (Yan et al., 2018b, 2018c) is a large and comprehensive dataset (32K lesions) covering various lesion types such as pulmonary nodules, liver tumors, abdominal masses and pelvic masses. Tang et al. (2019) proposed ULDor, a universal lesion detection method based on Mask RCNN. Mask RCNN requires ground-truth lesion masks; however, the DeepLesion dataset does not contain such annotation masks. Using RECIST (Response Evaluation Criteria In Solid Tumors) annotations (Eisenhauer et al., 2009), the authors approximate the true mask by fitting an ellipse to each lesion area. Additionally, the model is retrained with hard negative examples to reduce false positives. Yan et al. (2019) further improved universal lesion detection with a multi-task detector (MULAN) that performs joint lesion detection, tagging and segmentation. Previous studies have shown that combining different tasks can provide complementary information and thereby improve the performance of each individual task (Wu et al., 2018b; Tang et al., 2019). MULAN is built on Mask RCNN (He et al., 2017) and has three head branches: the detection branch predicts whether each proposed region is a lesion and regresses the bounding box; the tagging branch predicts 185 tags for each lesion proposal (such as body part, lesion type, intensity and shape); and the segmentation branch outputs a binary (lesion/non-lesion) mask for each proposed region. MULAN significantly outperforms previous lesion detection models such as ULDor (Tang et al., 2019) and 3DCE (Yan et al., 2018a). Furthermore, Yan et al. (2020) recently showed that learning from heterogeneous lesion datasets and partial labels can also improve detection performance.

In addition to the above strategies, attention mechanisms are another effective way to improve lesion detection. Tao et al. (2019) trained a universal lesion detector on the DeepLesion dataset, introducing attention mechanisms (Wang et al., 2017; Woo et al., 2018) to integrate 3D contextual and spatial information into an R-FCN-based detection architecture (Dai et al., 2016). The contextual attention module outputs a vector representing the importance of features learned from different axial CT slices, so the detection framework can adaptively aggregate features from different slices (i.e., enhance relevant contextual features); the spatial attention module outputs a weight matrix that amplifies the discriminative areas of the feature maps, so that richer and more representative features of small lesions can be learned. Despite using far fewer slices, the proposed method shows significant performance improvements. Li et al. (2019) proposed an FPN-based architecture whose attention module integrates clinical knowledge. In clinical practice, radiologists commonly examine multiple CT windows to accurately diagnose lesions. The authors first use three FPNs to generate feature maps from three frequently used windows and then use an attention module (Woo et al., 2018) to re-weight the feature maps from the different windows. To further improve performance, prior knowledge of lesion locations is also incorporated.

We observe that two-stage detectors are still quite common in both lesion-type-specific and universal lesion detection, thanks to their high performance and robustness; however, the separate region-proposal step may hinder the development of streamlined CADe schemes. Several recent studies have shown that single-stage detectors can also achieve good detection performance (Pisov et al., 2020; Lung et al., 2021; Zhu et al., 2021b). We anticipate that, if advanced anchor-free single-stage detectors (such as CenterNet (Duan et al., 2019)) are properly adapted to the particularities of medical images, they will attract more attention in the long run and may even become a better choice than two-stage detectors for developing new CADe solutions.

3.3.2 Unsupervised lesion detection (lesion type not pre-specified)


As mentioned in the previous subsection, both type-specific and universal lesion detection require a certain amount of supervision to train a single-stage or two-stage detector. To establish this supervision, the lesion type must be specified before training, and once trained, the detector cannot detect lesion types that were not included in the training dataset. In contrast, unsupervised lesion detection does not require ground-truth annotations and therefore does not require the lesion type to be specified in advance. Unsupervised detection has the potential to detect any type of lesion (Baur et al., 2021), although its performance is not comparable to fully supervised or semi-supervised methods. Nonetheless, it can be used to establish a rough detection of suspicious areas and to provide candidate imaging biomarkers.

To avoid potential confusion, we make two clarifications. First, the methods presented in this subsection originate from "unsupervised anomaly detection", because in medical images it is natural to treat lesions such as brain tumors as anomalies; the term "anomaly detection" will therefore be used frequently in this context. Second, in the literature "anomaly detection" often appears together with the term "anomaly segmentation" (Baur et al., 2021). This is because they are essentially two closely linked tasks: once an abnormal region is detected in an image, a segmentation map can be obtained by applying a binarization threshold to the detection map. In other words, what works for one task usually works for the other, so readers will also encounter the term "anomaly segmentation".

The core assumption of unsupervised anomaly detection is that an unsupervised model can capture the underlying distribution of the normal parts of an image (such as healthy tissue and anatomical structures), while abnormal parts such as tumors deviate from this canonical distribution and can therefore be detected. Commonly used models for estimating the canonical distribution are mainly derived from VAEs and GANs, and the success of these unsupervised models has mostly been demonstrated on MRI. Notably, Baur et al. (2021) reviewed various autoencoder-based anomaly segmentation methods for brain MR images, conducted a thorough comparison of these models, and provided many interesting insights into successful applications. An important conclusion of that paper is that recovery-based methods generally perform better than reconstruction-based methods when running time is not considered. In contrast to this comprehensive review, we briefly introduce reconstruction-based methods and focus on recent work related to recovery-based detection.

In the reconstruction-based paradigm, an AE or AE-based model projects the image into a low-dimensional latent space and then reconstructs the original image from its latent representation. Only healthy images are used for training, and the model is optimized to produce low pixel-wise reconstruction error. When unhealthy images are passed through the model, the reconstruction error remains low in normal areas but becomes high in abnormal areas. Uzunova et al. (2019) used a CVAE to learn latent representations of healthy image patches. In addition to the reconstruction error, they further assumed a large distance between the latent representations of healthy and unhealthy patches. Combining these two distances, the CVAE-based model gave reasonable segmentation results on tumor MRI. It is worth noting that the authors used the relative position of patches as the condition to incorporate local context into the CVAE; such position-dependent conditions provide additional prior information on healthy and unhealthy tissue and improve performance.
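
The reconstruction-based idea reduces to a few lines at inference time, as in the sketch below: an autoencoder trained only on healthy images reconstructs normal tissue well, so the pixel-wise reconstruction error of a new image highlights candidate abnormal regions. The `autoencoder` argument and the threshold value are illustrative placeholders for any trained AE/VAE-style model and a tuned cut-off.

```python
import torch

def anomaly_map(autoencoder, image, threshold=0.1):
    autoencoder.eval()
    with torch.no_grad():
        reconstruction = autoencoder(image)
    error = (image - reconstruction).abs()       # per-pixel reconstruction error
    detection = error > threshold                # binarize to obtain a rough segmentation map
    return error, detection
```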

In recovery-based paradigms, the target to be recovered is either (1) the optimal latent representation or (2) the healthy counterpart of the abnormal input image. Both GAN-based and VAE-based methods have been applied, but GANs are usually used for the first type, latent-representation recovery. While a GAN's generator can easily map latent vectors back to images, it lacks the ability to perform the inverse mapping from images to the latent space, which is important for computing anomaly scores. This is the key issue addressed by many works applying GANs to anomaly detection. In a seminal work, Schlegl et al. (2017) proposed AnoGAN to obtain this inverse mapping. The authors first pre-trained a GAN (generator and discriminator) on healthy images to learn the canonical distribution and kept the model weights fixed. Given an input image (normal or abnormal), gradient descent is then performed in the latent space (with respect to the latent variables) to recover the corresponding optimal latent representation. More specifically, the optimization is guided by a combination of two losses, a residual loss and a discrimination loss. The residual loss, like the reconstruction error mentioned earlier, measures the pixel-wise dissimilarity between the real input image and the image produced by the generator from the latent variables. At the same time, both images are fed into the discriminator network, and an intermediate layer is used to extract their features; the difference between these intermediate feature representations gives the discrimination loss. Finally, after optimizing the latent variables, the authors use the two losses to compute an anomaly score indicating whether the input image contains an abnormal region. AnoGAN performs well, but the iterative optimization is time-consuming. In follow-up work, Schlegl et al. (2019) proposed the more efficient f-AnoGAN model by introducing an additional encoder that performs a fast inverse mapping from image space to latent space. As with AnoGAN, they first pre-trained a WGAN on healthy images and again kept the model weights fixed. The generator with fixed weights is then used, without further training, as the decoder of an autoencoder, while the combination of the two loss functions introduced in AnoGAN is used to train the autoencoder's encoder. Once fully trained, the encoder network can efficiently map images to the latent space in a single forward pass. Slightly earlier than f-AnoGAN, Baur et al. (2018) proposed AnoVAEGAN, which combines a VAE and a GAN for fast inverse mapping. In this framework, the generator of the GAN and the decoder of the VAE are the same network, and the encoder of the VAE learns the inverse mapping; thus three components, encoder, decoder and discriminator, need to be trained. The loss function differs from those of AnoGAN and f-AnoGAN but still includes a reconstruction error. Furthermore, compared with these two patch-based models, AnoVAEGAN takes the entire MR image as input, allowing it to capture and exploit global context that may be valuable for anomaly segmentation.
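
The AnoGAN-style latent optimization can be sketched as follows: with the generator G and a discriminator feature extractor frozen after training on healthy images, gradient descent is run on the latent code z so that G(z) matches the query image, and the residual and discrimination losses give an anomaly score. `G`, `D_features`, and the hyperparameters below are illustrative assumptions, not the published implementation.

```python
import torch

def anogan_score(G, D_features, query, z_dim=128, steps=200, lam=0.1, lr=0.01):
    """G: frozen generator; D_features: frozen discriminator feature extractor; query: (1,C,H,W)."""
    z = torch.randn(1, z_dim, requires_grad=True)            # latent code to be optimized
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        fake = G(z)
        residual = (query - fake).abs().sum()                      # pixel-wise residual loss
        disc = (D_features(query) - D_features(fake)).abs().sum()  # intermediate-feature loss
        loss = (1 - lam) * residual + lam * disc
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()    # a high final score suggests the query contains abnormal regions
```

The per-pixel residual of the final `G(z)` against the query can likewise be thresholded to localize the anomaly, which is what links detection and segmentation in this setting.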

For the second type, recovering the healthy counterpart of the input image means that if the input contains abnormal regions, they should be removed in the recovered version while the remaining normal regions are preserved. A pixel-level dissimilarity map between the input image and the recovered image can then be obtained, from which anomalies are detected. Successful recovery often relies on maximum a posteriori (MAP) estimation: the posterior being maximized consists of the canonical distribution of healthy images and a data-consistency term (Chen et al., 2020d). The canonical distribution can be modeled by a VAE or its variants, trained by maximizing the ELBO, an approximation of the original VAE objective (Kingma and Welling, 2014); the data-consistency term controls how similar the recovered image is to the input image. For detecting brain tumors in MR images, You et al. (2019) first used a GMVAE to capture the distribution of lesion-free MR images and adopted the total variation norm for data-consistency regularization. These two elements jointly guide the optimization in MAP estimation, iteratively recovering healthy counterparts of anomalous inputs. Recently, Chen et al. (2021c) argued in their follow-up work that the ELBO may not be a good approximation of the original VAE loss; this inaccurate loss may lead to learning an inaccurate canonical distribution, causing the gradients computed during iterative optimization to deviate from the true direction. To solve this problem, the authors propose to replace the gradient of the ELBO with the derivative of a local Gaussian distribution. When detecting glioblastoma and glioma in MR images, this method achieves higher accuracy at a low false-positive rate compared with other methods. In addition, unlike most previous works that rely on 2D MR slices, the authors incorporate 3D information into the training of the VAE to further improve performance.

3.4 Registration


Registration is the process of aligning two or more images into one coordinate system so that their contents match, and it is an important step in many medical image analysis tasks. Image registration can be divided into two groups: rigid and deformable (non-rigid). In rigid registration, all image pixels undergo the same simple transformation (such as a rotation), whereas deformable registration aims to establish a non-uniform mapping between images. In recent years, there have been an increasing number of deep learning applications in this area, especially in deformable image registration. Following the organization of the review by Haskins et al. (2020), the deep learning-based medical image registration methods in our survey are divided into three groups: (1) deep iterative registration; (2) supervised registration; and (3) unsupervised registration. Interested readers can refer to several other excellent review papers (Fu et al., 2020; Ma et al., 2021b) for a more comprehensive coverage of registration methods.

3.4.1 Deep iterative registration


In deep iterative registration, a deep learning model learns a measure that quantifies the similarity between a target/moving image and a reference/fixed image; the learned similarity measure is then used with a traditional optimizer to iteratively update the registration parameters of a classic (i.e., non-learning-based) transformation framework. For example, Simonovsky et al. (2016) used a 5-layer CNN to learn a metric that evaluates the similarity between aligned 3D brain MRI T1-T2 image pairs, and then incorporated the learned metric into a continuous optimization framework to perform deformable registration. This deep learning-based measure outperforms manually defined similarity measures such as mutual information for multimodal registration (Simonovsky et al., 2016). In essence, this work is most closely related to the method of Cheng et al. (2018), which uses an FCN pre-trained with stacked denoising autoencoders to estimate the similarity of 2D CT-MR patch pairs; the main differences between the two works lie in the network architecture (CNN vs. FCN), the application scenario (3D vs. 2D) and the training strategy (from scratch vs. pre-training). For T1-T2 weighted MR images and CT-MR images, Haskins et al. (2019) argued that learning a good similarity measure is relatively easy because these multi-modal images share largely similar views or simple intensity mappings. They extended the deep similarity metric to a more challenging scenario, 3D MR-TRUS prostate image registration, where there are large appearance differences between the two imaging modalities.

In summary, deep similarity learning avoids the need to manually define similarity measures and is useful for establishing pixel-to-pixel and voxel-to-voxel correspondences. Deep similarity remains an important research direction and is often mentioned interchangeably with terms such as "metric learning" and "descriptor learning" (Ma et al., 2021b). Note that reinforcement learning-related methods can also be used to quantify image similarity implicitly, but we do not expand on this topic because reinforcement learning is beyond the scope of this article. More advanced deep similarity-based methods (e.g., adversarial similarity) are reviewed in the unsupervised registration subsection.

3.4.2 Supervised registration


Despite the success of deep iterative registration, the process of learning a similarity measure and then iteratively optimizing within a classical registration framework is too slow for real-time registration. In contrast, some supervised registration methods directly predict the deformation field/transformation in a single step without iterative optimization. These methods typically require ground-truth transformation/deformation fields, which can be synthesized/simulated (Uzunova et al., 2017), manually annotated, or obtained from classical registration frameworks. For 3D deformable image registration, Sokooti et al. (2017) developed a multi-scale CNN-based model to directly predict the displacement vector field (DVF) between image pairs. To make their training dataset larger and more diverse, they first artificially generated DVFs with different spatial frequencies and amplitudes and then applied data augmentation to the generated DVFs, resulting in approximately one million training examples. The trained model registers deformed images in one shot, with performance close to that of a conventional B-spline registration method.

In addition to supervision from ground-truth deformation fields, image similarity measures are sometimes incorporated to provide additional guidance for more accurate registration; this combination is called "dual supervision". In a recent study, Fan et al. (2019a) developed a dual-supervised, dual-guided training strategy for brain MR image registration. Under the ground-truth guidance, the difference between the ground-truth and predicted deformation fields is computed; under the image-similarity guidance, the predicted deformation field is used to compute the difference between the template image and the deformed image. The former guidance makes the network converge faster, and the latter further refines training and yields more accurate registration results.

3.4.3 Unsupervised registration


Registration based on unsupervised learning has received widespread attention in recent years (Zhao et al., 2019b; Kim et al., 2019), mainly for two reasons: (1) obtaining ground-truth deformation fields from traditional registration methods is cumbersome; and (2) the types of deformation available for model training are limited, resulting in unsatisfactory performance on unseen images. In one of the early works on unsupervised registration, Wu et al. (2016) argued that supervised learning-based registration methods cannot generalize well to new data; they adopted convolutional stacked autoencoders (Lee et al., 2011) to extract features from fixed and moving images to improve registration performance.

Balakrishnan et al. (2018) proposed an unsupervised registration model (VoxelMorph, Figure 7) that requires no supervision information (e.g., ground-truth registration fields or anatomical landmarks). The model consists of two parts: a convolutional U-Net and a spatial transformer network (STN). The authors treat 3D MR brain volume registration as a parametric function and model it with the U-Net architecture. The input to the encoder is the concatenation of the moving and fixed images, and the decoder outputs the registration field. Using a spatial transformer network (Jaderberg et al., 2015), the learned registration field is used to warp the moving image into a reconstructed version of the fixed image. By minimizing the difference between the reconstructed and fixed images, VoxelMorph updates its parameters to generate the desired deformation field. This unsupervised registration framework runs orders of magnitude faster than, yet performs competitively with, the classic registration algorithm Symmetric Normalization (SyN) (Avants et al., 2008). In a later paper (Balakrishnan et al., 2019), the authors extended VoxelMorph to exploit auxiliary segmentation information (anatomical segmentation maps), and the extended model showed higher registration accuracy. Earlier studies had shown that accurate cross-modal registration can be achieved using only auxiliary anatomical information, without ground-truth voxel-level transformations (Hu et al., 2018c, 2018d). Note that using segmentation information of corresponding anatomical structures is often referred to as "weakly supervised registration".
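
A minimal 2D sketch of this kind of unsupervised objective is shown below: a network predicts a dense displacement field, the moving image is warped with a differentiable spatial transformer (grid_sample), and the loss combines an image-similarity term (MSE here) with a smoothness penalty on the field. This is an illustration of the principle rather than the authors' implementation; `registration_net`, the 2D setting, and the weight `lam` are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """moving: (N,1,H,W); flow: (N,2,H,W) displacements in pixels (dx, dy)."""
    n, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().expand(n, h, w, 2)   # identity sampling grid
    new = grid + flow.permute(0, 2, 3, 1)                             # add predicted displacements
    new_x = 2 * new[..., 0] / (w - 1) - 1                             # normalize coords to [-1, 1]
    new_y = 2 * new[..., 1] / (h - 1) - 1
    return F.grid_sample(moving, torch.stack((new_x, new_y), dim=-1), align_corners=True)

def registration_loss(registration_net, moving, fixed, lam=0.01):
    flow = registration_net(torch.cat([moving, fixed], dim=1))        # predict (N,2,H,W) field
    warped = warp(moving, flow)                                       # spatial-transformer warping
    similarity = F.mse_loss(warped, fixed)                            # image-similarity term
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()        # smoothness penalty on the
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()        # deformation field
    return similarity + lam * (dx + dy)
```

Because the warping step is differentiable, the similarity and smoothness terms can be backpropagated straight into the registration network, which is what removes the need for ground-truth deformation fields.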

DLIR is another well-known unsupervised registration framework (de Vos et al., 2019) that extends previous work (de Vos et al., 2017). DLIR has four stages that perform image registration step by step. The first stage is designed for affine image registration (AIR), and the remaining three stages are used for deformable image registration (DIR). In the AIR stage, a CNN takes fixed and moving images as input pairs and outputs predicted affine transformation parameters, thereby producing affinely aligned image pairs. In the subsequent DIR stage, these aligned image pairs serve as input to a new CNN whose output is a B-spline displacement vector field used as the deformation field. This field yields deformably registered image pairs, and the registration results can be further refined through the remaining two DIR stages.

The unsupervised registration frameworks above all design their loss functions from manually defined similarity measures and certain regularization terms. For example, VoxelMorph's loss function includes a similarity measure (mean squared error or cross-correlation (Avants et al., 2008)) to quantify the voxel correspondence between the warped and fixed images, and a regularization term to control the spatial smoothness of the predicted deformation (Balakrishnan et al., 2019). Although classical similarity measures are effective for single-modal registration, they are less successful than deep similarity measures in most multi-modal registration settings. To this end, deep similarity metrics learned in an unsupervised manner have been proposed to obtain better multi-modal registration results. A notable example is the adversarial similarity proposed by Fan et al. (2019b). Specifically, the authors proposed an unsupervised adversarial network with a U-Net-based generator and a CNN-based discriminator. The generator takes two input volumes (the moving and fixed images) and outputs a deformation field, while the discriminator compares a negative image pair (the fixed image and the moving image warped with the predicted field) against a positive image pair (the fixed image and a reference image) to judge whether they are well registered. Using feedback from the discriminator, the generator is trained to produce deformations accurate enough to fool the discriminator. This unsupervised adversarial similarity network achieved satisfactory results in single-modal brain MRI registration and multi-modal pelvic image registration.
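For completeness, here is a hedged sketch of the handcrafted loss described at the beginning of this paragraph: an MSE similarity term (normalized cross-correlation is the other common choice) plus a gradient-based smoothness penalty on the predicted displacement field; the weight `lam` is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(flow):
    """flow: (B, 3, D, H, W). Penalize spatial gradients of the displacement field."""
    dz = flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :]
    dy = flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :]
    dx = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    return (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()

def registration_loss(warped, fixed, flow, lam=0.01):
    # Similarity between warped moving and fixed images + smoothness of the field.
    return F.mse_loss(warped, fixed) + lam * smoothness_loss(flow)
```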

4 Discussion

4.1 Better combine deep learning and medical image analysis

4.1.1 From a task-specific perspective


Advances in medical image analysis using deep learning follow a lagging but similar timeline to computer vision. However, due to the differences between medical and natural images, directly applying computer vision methods may not produce satisfactory results. To achieve good performance, challenges unique to medical imaging tasks need to be addressed. For classification tasks, the key to success lies in extracting features that are highly discriminative with respect to the target categories. This is relatively easy for domains with large differences between classes (e.g., accuracy on many public chest X-ray datasets often exceeds 90%), but can be difficult for domains with high inter-class similarity. For example, the performance of mammogram classification is generally not very good (e.g., 70-80% accuracy is typically seen on private datasets) because it is difficult to capture the distinguishing features of breast tumors in the presence of overlapping, heterogeneous fibroglandular tissue (Geras et al., 2019). The concept of fine-grained visual classification (FGVC) (Yang et al., 2018) aims to identify subtle differences between visually similar objects and may be suitable for learning distinctive features under high inter-class similarity. However, it should be noted that benchmark FGVC datasets were deliberately collected so that all image samples consistently exhibit high inter-class similarity. Therefore, methods developed and evaluated on these datasets may not be easily applicable to medical datasets, where only a portion, but not all, of the images exhibit a high degree of inter-class similarity. Nonetheless, we believe that FGVC methods, if appropriately modified, will be valuable for learning highly discriminative feature representations in medical image classification. Other possible ways to enhance feature discrimination include attention modules, combining local and global features, domain knowledge, and so on.
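As one concrete, generic example of the attention modules mentioned above (not tied to any particular paper in this review), a squeeze-and-excitation-style block reweights feature channels using global context, which may help emphasize discriminative features when inter-class similarity is high.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze spatial context, then reweight (excite) channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                                # excitation: channel reweighting
```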


As can be seen from the bounding-box prediction process, medical object detection is more complicated than classification. Detection inherits the challenges of classification and faces additional ones, especially class imbalance and the detection of small-scale objects (such as small lung nodules). One-stage detectors generally perform as well as two-stage detectors when detecting large objects, but have a more difficult time with small objects. Existing research has shown that using multi-scale features in one- and two-stage detectors can greatly alleviate this problem. A simple and effective method is the image pyramid (Liu et al., 2020b), which independently extracts features from multiple scales of the same image. This approach can help scale up small objects for better performance, but it is computationally expensive and slow; it remains suitable for medical detection tasks that do not require high speed. Another useful but faster method is the feature pyramid, which utilizes multi-scale feature maps from different convolutional layers. Although there are various ways to build feature pyramids, the rule of thumb is to fuse high-resolution feature maps with strong, high-level semantics. This plays an important role in detecting small objects, as shown in FPN (Lin et al., 2017a).

As a convolutional network goes from shallow to deep, semantic information becomes richer, but the feature maps become smaller and the resolution lower. A common solution is to connect shallow, high-resolution feature maps with deep, high-level ones, so that shallow information is passed to deeper layers and deep feature maps no longer tend to miss small targets.
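A minimal sketch of this shallow/deep fusion step in an FPN-style pyramid: a deep, low-resolution map is channel-reduced, upsampled, and added to a laterally projected shallow, high-resolution map; the channel sizes below are illustrative assumptions, not a specific detector's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """Fuse a high-resolution shallow map with an upsampled, semantically rich deep map."""
    def __init__(self, c_shallow, c_deep, c_out=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_shallow, c_out, kernel_size=1)   # shallow, high-res map
        self.reduce = nn.Conv2d(c_deep, c_out, kernel_size=1)       # deep, low-res map
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, shallow_feat, deep_feat):
        top_down = F.interpolate(self.reduce(deep_feat),
                                 size=shallow_feat.shape[-2:], mode="nearest")
        return self.smooth(self.lateral(shallow_feat) + top_down)

# Example: fuse a (B, 512, 50, 50) shallow map with a (B, 2048, 25, 25) deep map.
fused = FPNFuse(512, 2048)(torch.randn(1, 512, 50, 50), torch.randn(1, 2048, 25, 25))
```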

Class imbalance occurs because the detector needs to evaluate a large number of candidate regions, but only a few contain the object of interest. In other words, the class distribution is heavily biased towards negative examples (e.g., background regions), most of which are easy negatives. The presence of a large number of easy negatives can overwhelm the training process, leading to poor detection results. Two-stage detectors handle this class imbalance better than one-stage detectors because most negative proposals are filtered out in the region proposal stage. For one-stage detectors, recent research shows that abandoning the dominant anchor-box design can alleviate class imbalance to a large extent (Duan et al., 2019). However, most medical object detection methods are still anchor-based; in the near future, we expect to see more exploration of anchor-free one-stage detectors in medical object detection.
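One widely used remedy for this easy-negative imbalance, not named in this paragraph but standard in one-stage detectors, is the focal loss, which down-weights well-classified examples; a hedged binary sketch follows, with `alpha` and `gamma` set to commonly quoted defaults.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: (N,) raw anchor scores; targets: (N,) float 0/1 labels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()       # easy examples contribute little
```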


Medical image segmentation combines the challenges of classification and detection. As in detection, class imbalance is a common problem in 2D and 3D medical segmentation tasks. Another related challenge is segmenting small lesions (such as multiple sclerosis lesions on MRI) and small organs (such as the pancreas on abdominal CT), and the two challenges are often intertwined. These problems have been alleviated to a large extent by adapting the metrics/losses, such as the Dice coefficient (Milletari et al., 2016), generalized Dice (Sudre et al., 2017), and combinations with focal losses (Abraham and Khan, 2019). However, these metrics are region-based (i.e., segmentation error is computed per pixel). This can discard valuable information about structure, shape and contour, which is important for later diagnosis/prognosis. Therefore, we believe it is necessary to develop non-region-based metrics that provide information complementary to region-based ones for better segmentation performance. There are currently few studies in this area (Kervadec et al., 2019), and we hope to see more in the future.
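A hedged sketch of a soft Dice loss of the kind cited above (Milletari et al., 2016) for binary segmentation; in practice it is often combined with cross-entropy or focal terms, as the paragraph notes.

```python
import torch

def soft_dice_loss(probs, targets, eps=1e-6):
    """probs: (B, 1, H, W) predicted foreground probabilities; targets: same shape, 0/1."""
    probs = probs.flatten(1)
    targets = targets.flatten(1)
    intersection = (probs * targets).sum(dim=1)
    union = probs.sum(dim=1) + targets.sum(dim=1)
    dice = (2 * intersection + eps) / (union + eps)   # per-sample soft Dice coefficient
    return 1 - dice.mean()
```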

Furthermore, strategies such as combining local and global context, attention mechanisms, multi-scale features, and anatomical cues often help improve segmentation accuracy for both large and small objects. Here, we would like to highlight the great potential of Transformers, owing to their powerful ability to model long-range dependencies. Although long-range dependencies help achieve accurate segmentation, most CNN-based methods do not explicitly focus on this aspect. There are roughly two types of dependencies, namely intra-slice dependencies (pixel relationships within a CT or MRI slice) and inter-slice dependencies (pixel relationships between CT or MRI slices) (Li et al., 2020e). Recent studies have shown that Transformer-based methods are powerful in both cases (Chen et al., 2021b; Valanarasu et al., 2021). The application of vision Transformers in medical image segmentation, especially 3D segmentation, is still in its infancy, and more experimental work is likely to appear soon.
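The long-range modeling ability comes from self-attention, in which every token (e.g., an image patch or slice embedding) attends to every other token; the minimal single-head sketch below is generic and not tied to any of the cited architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention over a sequence of tokens."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (B, N, dim) token embeddings
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)   # (B, N, N)
        return self.proj(attn @ v)                          # each token mixes global context
```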


Medical image registration differs significantly from the previous tasks, as it aims to find pixel- or voxel-level correspondence between two images. A unique challenge is the difficulty of obtaining reliable ground-truth registrations, which are either generated synthetically or produced by traditional registration algorithms. Unsupervised methods show great promise in addressing this problem. However, many unsupervised registration frameworks (e.g., de Vos et al., 2019) consist of multiple stages that register images in a coarse-to-fine manner. Despite good performance, multi-stage frameworks increase computational complexity and make training difficult. It would be preferable to develop registration frameworks with as few stages as possible that can be trained end-to-end.


4.1.2 On the perspectives of different learning paradigms


Although deep learning has achieved great success across different tasks in radiological image analysis, further performance improvements have been mainly hindered by the need for large annotated datasets. Supervised transfer learning can greatly alleviate this problem by initializing the weights of the model for the target task with the weights of a model pre-trained on related or unrelated datasets (e.g., ImageNet). In addition to the widely used transfer learning, there are two other promising directions: (1) using GAN models to expand labeled datasets; (2) using self-supervised and semi-supervised learning models to mine information from large amounts of unlabeled medical images.
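A minimal sketch of the transfer-learning recipe mentioned above, assuming a recent torchvision version (older releases use `pretrained=True` instead of the `weights` argument) and a hypothetical two-class target task.

```python
import torch.nn as nn
import torchvision

# Load an ImageNet-pretrained backbone and replace the head for the target task.
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)   # hypothetical 2-class medical task

# Optionally freeze everything except the last stage and the new head for fine-tuning.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False
```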

GANs show great promise in medical image synthesis and semi-supervised learning, but one challenge is how to establish a strong connection between the GAN's generator and the target task (e.g., classifier, detector, segmenter). Without this connection, the performance gain over traditional data augmentation (e.g., rotation, rescaling, and flipping) may be marginal (Wu et al., 2018a). The connection between generator and classifier can be strengthened by using semi-supervised GANs, where the discriminator is modified into a classifier (Salimans et al., 2016). Several training strategies can also be employed: identifying a "bad" generator that can significantly contribute to good semi-supervised classification (Dai et al., 2017), or jointly optimizing the triplet of generator, discriminator, and classifier (Li et al., 2017). It makes sense to explore new methods that effectively establish the connection between the generator and specific medical imaging tasks to obtain better performance. Furthermore, GANs typically require at least thousands of training examples to converge, which limits their applicability to small medical datasets. This challenge can be partially addressed by combining classical data augmentation with adversarial learning (Frid-Adar et al., 2018a, 2018b). In addition, if a relatively large number of medical images with structural, textural, and semantic similarity to the target dataset is available, pretraining the generator and/or discriminator may contribute to faster convergence and better performance (Rubin et al., 2019). Meanwhile, some recent augmentation mechanisms, such as differentiable augmentation (Zhao et al., 2020) and adaptive discriminator augmentation (Karras et al., 2020), enable GANs to generate high-fidelity images effectively under limited-data conditions, but they have not yet been applied to medical image analysis tasks. We expect these new methods to also show good performance in future research in the field of medical image analysis.
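To make the generator-classifier connection more concrete, the sketch below shows a discriminator-as-classifier loss in the spirit of Salimans et al. (2016). It uses the common implementation trick of fixing the "fake" logit to zero so that the discriminator only outputs K real-class logits; this is a hedged sketch under that assumption, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_labeled, labels, logits_unlabeled, logits_fake):
    """All logits have shape (N, K): K real-class scores; the 'fake' logit is fixed to 0."""
    # Supervised term: ordinary cross-entropy on labeled real images.
    supervised = F.cross_entropy(logits_labeled, labels)
    # Unlabeled real images should be recognized as real: minimize -log p(real).
    lse_real = torch.logsumexp(logits_unlabeled, dim=1)
    loss_unlabeled = (F.softplus(lse_real) - lse_real).mean()
    # Generated images should be recognized as fake: minimize -log p(fake).
    lse_fake = torch.logsumexp(logits_fake, dim=1)
    loss_fake = F.softplus(lse_fake).mean()
    return supervised + loss_unlabeled + loss_fake
```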

Self-supervision can be constructed through pretext tasks or contrastive learning, but contrastive learning appears to be the more promising research direction. On the one hand, directly reusing pretext tasks from computer vision (e.g., jigsaw puzzles) is often insufficient to ensure robust feature representations for radiological images. On the other hand, designing new pretext tasks can be difficult and requires careful engineering. Instead of designing various pretext tasks, self-supervised contrastive learning trains the network to capture meaningful feature representations by forcing them to remain invariant across different augmented views, which may outperform supervised transfer learning on downstream tasks such as medical image classification and segmentation. Although the performance of self-supervised contrastive learning is encouraging, its application in radiological image analysis is still exploratory, and how to use this new learning paradigm sensibly remains an open question. To this end, we make suggestions from the following three aspects.

(1) Make full use of the respective advantages of contrastive learning and supervised learning. From existing research, we find that most medical image analysis studies adopt two separate steps: contrastive pre-training on unlabeled data and supervised fine-tuning on labeled data. In the pre-training stage, most studies rely on relatively large unlabeled datasets to ensure that high-quality, transferable features are learned, which then yield better performance after fine-tuning with limited labeled data. However, reliance on large amounts of unlabeled data is problematic for tasks where such data is scarce. To broaden the scope of applications, high-quality feature representations need to be learned from less unlabeled data. One possible approach is to unify the two separate steps into one so that label information can be exploited during contrastive learning. This is somewhat reminiscent of semi-supervised learning, which utilizes both unlabeled and labeled data to achieve better performance. More specifically, class labels can guide the construction of positive and negative pairs so that images from the same class are aligned more closely in a lower-dimensional representation space (Khosla et al., 2020). Features learned in this way require less unlabeled data and are less redundant than features learned purely through self-supervision (i.e., without any class labels).

(2) Exploit specific characteristics of contrastive learning to obtain better performance. For example, one study demonstrated that contrastive learning benefits more from blocks of multiple similar examples than from simple pairs (Saunshi et al., 2019). This insight may be well suited to learning transferable features from 3D CT and MRI volumes, whose adjacent slices exhibit continuous anatomical similarity.

(3) Customize data augmentation strategies for downstream tasks that are sensitive to augmentation. In most existing contrastive learning frameworks, the combination of different data augmentation strategies is crucial for learning representative features. For example, SimCLR applies three types of transformations to unlabeled images, namely random cropping, color distortion, and Gaussian blur (Chen et al., 2020a). However, some commonly used augmentation techniques may not be suitable for medical images. In radiology, where most images are grayscale, color distortion strategies may not be appropriate.
Furthermore, in cases where the fine-grained details of unlabeled medical images carry important information, using Gaussian blur in the pre-training stage may destroy that detail and reduce the quality of the learned feature representations. It is therefore important to choose appropriate data augmentation strategies to ensure satisfactory downstream performance. In addition, self-supervised contrastive pre-training is currently hampered by the high computational cost of large models (e.g., ResNet-50 (4×), ResNet-152 (2×)), which require large numbers of multi-core TPUs (Chen et al., 2020a). Therefore, developing new models or training strategies that improve computational efficiency should be an important direction. For example, Reed et al. (2022) proposed a hierarchical pre-training strategy that made the self-supervised pre-training process converge 80 times faster and improved accuracy across different tasks.
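For reference, a hedged sketch of the NT-Xent contrastive loss used by SimCLR (Chen et al., 2020a, cited above): two augmented views of each image form the positive pair and every other sample in the batch serves as a negative; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (B, d) projected embeddings of two augmented views of the same B images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, d)
    sim = z @ z.t() / temperature                            # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                        # exclude self-similarity
    n = z.size(0)
    # The positive for sample i is its other augmented view (i + B or i - B).
    targets = torch.cat([torch.arange(n // 2, n), torch.arange(0, n // 2)])
    return F.cross_entropy(sim, targets)
```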

Like self-supervised contrastive learning, recent semi-supervised methods such as FixMatch (Sohn et al., 2020) rely heavily on advanced data augmentation strategies to achieve good performance. To facilitate the application of semi-supervised learning in medical image analysis, it is necessary to develop appropriate augmentation strategies in a dataset-driven and/or task-driven manner. "Dataset-driven" means finding the best augmentation strategy for the specific dataset of interest. In the past, this was not easy to achieve because the parameter search space was enormous (e.g., about 10^34 possible augmentation strategies, as shown by Cubuk et al. (2020)). Recently, automatic data augmentation strategies such as RandAugment (Cubuk et al., 2020) have been proposed to significantly reduce the search space. However, the concept of automatic augmentation remains largely unexplored in medical image analysis. "Task-driven" means finding a suitable augmentation strategy for a specific task (e.g., MRI prostate segmentation) across multiple datasets. This can be seen as an extension of dataset-driven augmentation and is therefore more challenging, but it can help algorithms developed on one dataset generalize better to other datasets for the same task.
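As a small illustration, recent torchvision releases (0.11 and later) ship a RandAugment transform; the snippet below is a hedged example of plugging it into a training pipeline, with the caveat that the operation set and magnitude would likely need tailoring for grayscale medical images, as discussed above.

```python
import torchvision.transforms as T

# Randomly sample two operations per image at a shared magnitude, then convert to a tensor.
train_transform = T.Compose([
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
])
# train_transform would then be passed to the labeled/unlabeled dataset definitions.
```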

Another issue is the potential performance degradation caused by violating the basic assumption of semi-supervised learning, namely that labeled and unlabeled data come from the same distribution. In fact, distribution mismatch is a common problem when semi-supervised methods are applied to medical image analysis. Consider the following example: in the task of segmenting COVID-19 lung infections from CT slices, suppose the labeled CT volumes contain a relatively balanced number of infected and uninfected slices, while the available unlabeled CT volumes may contain no or only a few infected slices, or may contain not only COVID-19 infections but also other disease categories (such as tuberculosis) that do not appear in the labeled images. What happens if the distribution of unlabeled data does not match that of the labeled data? Existing research shows that this can cause the performance of semi-supervised methods to drop dramatically, sometimes even below simple supervised baselines (Oliver et al., 2018; Guo et al., 2020). It is therefore necessary to adapt semi-supervised algorithms to tolerate the distribution mismatch between labeled and unlabeled medical data. As a related field, "domain adaptation" may provide insights into achieving this goal.

4.1.3 Looking for better architecture and processes


The continued success of deep learning in medical image analysis stems not only from different learning paradigms (unsupervised, semi-supervised) but perhaps to an even greater extent from the architectures/models proposed over time. Looking back, we see non-trivial improvements closely tied to "architectural" advances such as AlexNet (Krizhevsky et al., 2012), residual connections (He et al., 2016), skip connections (Ronneberger et al., 2015), and self-attention (Dosovitskiy et al., 2020). As Yuille and Liu (2021) point out, "Given this history of progress, it is certainly possible that better neural architectures themselves can overcome many of the current limitations". We discuss two aspects that may help in finding better architectures. First, biological and cognitive heuristics will continue to play an important role in architectural design. Deep neural networks were originally inspired by the architecture of the brain. In recent years, inspired by the visual attention mechanism of primates, the concept of attention has been successfully applied in natural language processing and computer vision, allowing models to focus on the important parts of the input data and thereby achieve excellent performance. A prominent example is the Transformer family based on self-attention (Dosovitskiy et al., 2020). Compared with mainstream CNN-based models, the Transformer architecture is better at capturing global/long-range dependencies between input and output sequences. Furthermore, the inductive biases inherent in CNNs (e.g., translation equivariance and locality) are much weaker in Transformers (Dosovitskiy et al., 2020). In addition to the attention mechanism, many other biological or cognitive mechanisms, such as the dynamic hierarchical structure of human language or one-shot learning of new objects and concepts without gradient descent (Marblestone et al., 2016), may provide inspiration for designing more powerful architectures. Second, automating architecture engineering may help develop better architectures. Most architectures currently in use come from human experts, and the design process is iterative and error-prone. Partly for this reason, models used for medical image analysis are mainly adapted from models developed in computer vision. To avoid the need for manual design, researchers have proposed automated architecture engineering, one related field being neural architecture search (NAS) (Zoph and Le, 2017). However, most existing NAS research is limited to image classification (Elsken et al., 2019), and truly revolutionary models that bring fundamental change have not yet emerged from this process (Yuille and Liu, 2021). Nonetheless, NAS remains a direction worth exploring.

On a broader level, a pipeline with automated configuration capabilities is desirable. Although architecture engineering still faces many difficulties, developing automated pipelines that can automatically configure their subcomponents (e.g., selecting and adapting an appropriate architecture from existing ones) for better performance would benefit radiological image analysis. Currently, deep learning-based pipelines typically involve several interdependent subcomponents, such as image pre- and post-processing, adapting and training network architectures, selecting appropriate losses, data augmentation methods, and so on. There are often too many design options for experimenters to manually identify the best pipeline. Furthermore, a high-performance pipeline configured on one dataset for a specific task (e.g., CT images from one hospital) may perform poorly on another dataset for the same task (e.g., CT images from a different hospital). Therefore, pipelines that can automatically configure their subcomponents are needed to speed up empirical design. Examples in this direction include NiftyNet (Gibson et al., 2018b), a modular pipeline for different medical applications, and nnU-Net (Isensee et al., 2021), designed specifically for medical image segmentation. We expect more research to come from this track.

4.1.4 Integrate domain knowledge


Domain knowledge is an important but sometimes overlooked aspect of medical image analysis that can provide insights for developing high-performance deep learning algorithms. As mentioned earlier, most models used in medical vision are adapted from models developed for natural images; however, due to unique challenges (e.g., high inter-class similarity, limited labeled data, label noise), medical images are often more difficult to handle. Used properly, domain knowledge helps reduce time and computational costs and thus mitigates these problems. For researchers with a strong deep learning background, it is relatively easy to exploit weak domain knowledge, such as anatomical information in MRI and CT images (Zhou et al., 2021, 2019a), multi-instance data from the same patient (Azizi et al., 2021), patient metadata (Vu et al., 2021), radiological characteristics, and the text reports accompanying the images (Zhang et al., 2020a). On the other hand, we observe that it can be more difficult to effectively integrate the strong domain knowledge with which radiologists are familiar. One example is the identification of breast cancer from mammograms. Four mammograms are obtained for each patient: craniocaudal (CC) and mediolateral oblique (MLO) views of both the left (L) and right (R) breasts. In clinical practice, bilateral differences (such as LCC vs. RCC) and same-side correspondences (such as LCC and LMLO) are important clues for radiologists to detect suspicious areas and determine malignancy. Currently, there are few methods that reliably and accurately exploit this expert knowledge. Therefore, more research effort is needed to make the most of strong domain knowledge.

4.2 Towards large-scale application of deep learning in clinical settings


Although deep learning is widely used in academic and industrial research institutions to analyze medical images, it has not yet had the significant impact on clinical practice that we expected. COVID-19 is the first global pandemic of the deep learning era, and this gap was clearly reflected in the early stages of the fight against the epidemic. Because of its widespread medical, social, and economic consequences, the pandemic can largely be viewed as a major test of the current state of deep learning in clinical translation. Shortly after the outbreak, researchers around the world applied deep learning techniques, mainly to chest X-ray and CT images of patients with suspected infection, aiming to diagnose/prognose the disease accurately and efficiently. Many deep learning and machine learning methods were developed for this purpose. However, after a systematic review of more than 200 predictive models from 169 studies published up to July 1, 2020, Wynants et al. (2020) concluded that all of these models had a high or unclear risk of bias and that none of them were therefore suitable for clinical use: each model reported moderate or excellent performance, but the optimistic results were heavily biased by model overfitting, poor evaluation, improper use of data sources, and similar issues. Another review (Roberts et al., 2021) reached a similar conclusion: after reviewing 62 studies selected from 415, the authors concluded that, owing to methodological flaws and/or underlying biases, none of the identified deep learning or machine learning models could be used for clinical diagnosis/prognosis of COVID-19.

Beyond the example of COVID-19, the high risk of bias in deep learning methods is a recurring problem across different medical image analysis tasks and applications (Nagendran et al., 2020), which severely limits the potential of deep learning in clinical radiology. While quantifying potential bias is difficult, it can be reduced if handled correctly. Below we summarize three main aspects that may bias results and present our recommendations.

4.2.1 Image dataset


Data is the foundation of deep learning. In the field of medical vision, increasingly large medical image datasets (typically at least hundreds of images) have been or are being developed to facilitate the training and testing of new algorithms. A notable example is the annual MICCAI challenges, in which benchmark datasets for different diseases (e.g., cancer) are released, greatly contributing to the advancement of medical vision. However, we need to be cautious about the potential bias caused by relying on a single public dataset, which may suffer from community-wide overfitting as the community strives to achieve state-of-the-art performance on it (Roberts et al., 2021). This problem has been recognized by many researchers, so it is now common to see several public and/or private datasets used to test the performance of new algorithms more thoroughly. In this way, community-wide bias is reduced, but not to the extent required for large-scale clinical application.

Community-wide bias can be further reduced by incorporating additional data to train and test the model. One direct way to introduce new data is, of course, data curation: the continuous creation of large, diverse datasets through collective work with experts. In addition, we recommend a less direct but effective approach: integrating fragmented private datasets where ethical and legal regulations allow. The medical image analysis community may have the overall impression that there is a constant lack of large, representative, labeled data. However, this is only partially true. Many established public datasets are limited in size and variety due to time and cost constraints. On the other hand, rich medical image sources (labeled and unlabeled) of different sizes and difficulty levels already exist, but inconveniently exist "in the form of islands" (Yang et al., 2019). Due to factors such as privacy protections and political complexities, most existing data sources are confidential and scattered among different institutions in different countries. It is therefore desirable to exploit the unifying potential of private datasets, and even personal data, without compromising patient privacy. One promising approach is federated learning (Li et al., 2020f), which allows models to securely access sensitive data. Federated learning can collaboratively train deep learning algorithms on multi-institutional data without exchanging data between participating institutions (Rieke et al., 2020). Although this technology comes with new challenges, it helps to learn algorithms that are less biased, more general, more robust, and better performing, and thus better suited to clinical application.
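A minimal, hypothetical sketch of one federated averaging (FedAvg-style) round, illustrating the idea that each institution trains a copy of the global model on its own data and only the weights, never the images, are aggregated; `local_train_fn` and the institution data loaders are assumptions for illustration.

```python
import copy

def federated_round(global_model, institution_loaders, local_train_fn):
    """One FedAvg round. institution_loaders: list of local DataLoaders;
    local_train_fn(model, loader) trains the model in place on local data."""
    states, sizes = [], []
    for loader in institution_loaders:
        local_model = copy.deepcopy(global_model)
        local_train_fn(local_model, loader)          # training never leaves the institution
        states.append(local_model.state_dict())
        sizes.append(len(loader.dataset))
    total = float(sum(sizes))
    averaged = {}
    for key in states[0]:
        weighted = sum(s[key].float() * (n / total) for s, n in zip(states, sizes))
        averaged[key] = weighted.to(states[0][key].dtype)
    global_model.load_state_dict(averaged)           # only weights are exchanged, never images
    return global_model
```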

4.2.2 Performance evaluation


Most medical image analysis research papers report model performance through commonly used metrics, such as accuracy and AUC for classification tasks and the Dice coefficient for segmentation tasks. Although these metrics readily quantify the technical performance of a proposed method, they often do not reflect clinical applicability. Ultimately, clinicians care about whether the use of an algorithm brings beneficial changes to patient care, rather than the performance improvements reported in a paper (Kelly et al., 2019). Therefore, in addition to applying the necessary metrics, we believe it is important for research teams to collaborate with clinicians on algorithm evaluation.

We briefly mention two possible directions for establishing collaborative evaluation. First, involve clinicians in sharing ideas on open clinical questions, in paper writing, and even in the peer-review process at conferences and journals. For example, the Machine Learning for Healthcare (MLHC) conference provides a research track and a clinical track for members from diverse communities to exchange insights. Second, measure whether clinician performance and/or efficiency can be improved with the help of deep learning algorithms. Several studies have explored using model outputs as a "second opinion" to support the clinician's final interpretation. For example, in the task of predicting breast cancer from mammograms, McKinney et al. (2020) evaluated the complementary role of a deep learning model. They found that the model could correctly identify many cancer cases that radiologists missed. Furthermore, in a double-reading process (standard practice in the UK), the model significantly reduced the workload of the second reader while maintaining performance comparable to the consensus opinion.

4.2.3 Reproducibility


The rapid development of computer vision is closely related to a research culture that values reproducibility. In medical image analysis, more and more researchers are choosing to make their code public, which greatly helps avoid duplicated work. More importantly, good reproducibility helps deep learning algorithms gain the trust and confidence of a wider group of people (such as researchers and clinicians), which benefits large-scale clinical application. To make results more reproducible, we recommend paying special attention to describing the data selection in the paper. It is not uncommon to see different studies selecting different subsets of samples from the same public dataset, which increases the difficulty of reproducing the results stated in a paper. Baltatzis et al. (2021) showed in a case study on pulmonary nodule classification that specific data selections can make a proposed model appear superior: an advanced model with additional features may perform worse than a simple baseline when the data sample changes. Therefore, it is necessary to clearly explain the data selection process to make results more reproducible and convincing.

In short, deep learning is a rapidly developing technology with broad application prospects across medical image analysis tasks such as disease classification, segmentation, detection, and image registration. Despite significant research progress, many technical challenges and pitfalls (Roberts et al., 2021) remain in developing deep learning-based CAD schemes with a high degree of scientific rigor. More research effort is therefore needed to overcome these shortcomings before deep learning-based CAD solutions are widely accepted by clinicians.
