[Study Notes] Application of Deep Learning in Medical Image Registration "Deep Learning in Medical Image Registration: A Survey"

Medical image registration transforms different medical images into a common coordinate system by matching image content. It is essential when dealing with image pairs acquired from different viewpoints, at different times, or with different equipment (CT, MR, US, etc.). Traditionally, registration was performed manually by experienced professionals. Manual registration is time-consuming and labor-intensive, and it can introduce large errors. Automatic registration emerged to improve both the efficiency and the reliability of registration. Non-deep-learning registration methods remained very popular until AlexNet was introduced in 2012. Since then, deep learning has been widely adopted because of its strong performance in object detection, feature extraction, image segmentation, image classification, image annotation, and image reconstruction.
Initially, deep learning was used to improve the performance of iterative, intensity-based registration. Later, scholars applied reinforcement learning to registration, which is an intuitive fit for the task. The need for faster registration then drove the development of one-step transformation estimation, and the lack of gold standard data in turn drove the development of unsupervised one-step transformation estimation. The main difficulty for unsupervised methods is choosing an appropriate function to measure image similarity. Proposed solutions include similarity measures based on information theory, segmentations of anatomical structures, and generative adversarial networks.
[Figure: the different deep learning approaches to registration.]
These notes cover the following three topics:
1. Deep iterative registration;
2. Supervised transformation estimation;
3. Unsupervised transformation estimation.

1. Deep iterative registration

Intensity-based automatic registration requires a similarity metric and an optimizer that updates the transformation parameters to maximize the similarity between the images. Common similarity metrics include the sum of squared differences (SSD), cross-correlation (CC), mutual information (MI), normalized cross-correlation (NCC), and normalized mutual information (NMI).
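
As a rough illustration of three of these metrics, here is a minimal NumPy sketch. The function names and the 32-bin histogram are my own choices for illustration and are not taken from the survey.

```python
import numpy as np

def ssd(fixed, moving):
    """Sum of squared differences (lower means more similar)."""
    d = np.asarray(fixed, dtype=float) - np.asarray(moving, dtype=float)
    return float(np.sum(d * d))

def ncc(fixed, moving, eps=1e-12):
    """Normalized cross-correlation in [-1, 1] (higher means more similar)."""
    f = np.asarray(fixed, dtype=float) - np.mean(fixed)
    m = np.asarray(moving, dtype=float) - np.mean(moving)
    return float(np.sum(f * m) / (np.sqrt(np.sum(f * f) * np.sum(m * m)) + eps))

def mutual_information(fixed, moving, bins=32):
    """Mutual information (in nats) estimated from a joint intensity histogram."""
    joint, _, _ = np.histogram2d(np.ravel(fixed), np.ravel(moving), bins=bins)
    pxy = joint / joint.sum()                 # joint probability
    px = pxy.sum(axis=1, keepdims=True)       # marginal of the fixed image
    py = pxy.sum(axis=0, keepdims=True)       # marginal of the moving image
    nz = pxy > 0                              # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

SSD and NCC assume that corresponding tissue has similar (or linearly related) intensities, which is why they are mostly used for unimodal pairs, while MI only assumes a statistical dependency between intensities.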

1. Registration based on deep similarity metrics

[Figure: flow chart of registration based on deep similarity metrics.]
In the flow chart, solid lines indicate data flows used during both training and testing, and dashed lines indicate data flows needed only during training.

1.1 Single-modal registration

Some scholars use a convolutional stacked autoencoder (CAE) to extract features from unimodal images for deformable registration and then optimize NCC with gradient descent; some scholars use only a 3D CNN to estimate the registration (mapping) error between images; some scholars compared CNN descriptors with hand-crafted MRF-based self-similarity descriptors and found that the CNN descriptors perform worse than the MRF descriptors but can provide complementary information.

1.2 Multimodal Registration

Hand-crafted similarity metrics have had little success in multimodal image registration. So some scholars proposed using a stacked denoising autoencoder to learn the similarity metric and found that it works better than NMI and local cross-correlation (LCC); some scholars use a CNN to estimate the dissimilarity between aligned multimodal images and use gradient descent to update the parameters of the deformation field; in addition, some scholars use a five-layer neural network as the similarity metric (for rigid registration) and optimize it with Powell's method; some scholars use a CNN to predict the target registration error (TRE) and, because the learned metric lacks convexity, explore the solution space with an evolutionary algorithm before applying a traditional optimization method; some scholars use long short-term memory (LSTM) spatial co-transformer networks for iterative registration, where each iteration consists of three steps: image warping, residual parameter prediction, and parameter composition.
The two kinds of registration above show that deep learning can be applied to challenging registration tasks. However, some results suggest that learned image similarity metrics are best used as supplementary information in the unimodal case, and iterative methods are not suitable for real-time registration.
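
To make the iterative scheme concrete (similarity metric plus optimizer, repeated until convergence), here is a toy sketch. It is my own simplification: the transformation is an integer translation applied with np.roll, the metric is NCC rather than a learned metric, and the optimizer is greedy hill climbing. A learned metric would simply take the place of the ncc function.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two arrays of the same shape."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))

def register_translation(fixed, moving, max_iters=100):
    """Greedy hill climbing over integer shifts (dy, dx) that maximizes NCC."""
    shift = (0, 0)
    best = ncc(fixed, moving)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(max_iters):
        # Evaluate the metric for every neighbouring shift (one step per axis).
        candidates = [(shift[0] + dy, shift[1] + dx) for dy, dx in steps]
        scores = [ncc(fixed, np.roll(moving, c, axis=(0, 1))) for c in candidates]
        k = int(np.argmax(scores))
        if scores[k] <= best:      # no neighbour improves the metric: converged
            break
        shift, best = candidates[k], scores[k]
    return shift, best
```

Every iteration re-evaluates the metric several times, which is the reason iterative methods are slow compared with a single forward pass of a network.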

2. Registration based on reinforcement learning

Reinforcement learning replaces the predefined optimization algorithm with a trained agent. It is usually applied to rigid registration but can also handle non-rigid registration.
Some scholars use a greedy supervised algorithm and an attention-driven hierarchical strategy for end-to-end training, and their rigid registration results are better than MI-based registration and semantic registration using probability maps; some scholars use Q-learning with contextual information to determine the depth of the projected image, adopt a dueling network, and distinguish between terminal and non-terminal rewards; some scholars use multiple agents with an auto-attention mechanism to observe multiple regions, demonstrating the effectiveness of the multi-agent approach; some scholars register with a low-resolution deformation model (which limits the dimensionality of the action space) and use fuzzy action control to influence stochastic action selection.
The weakness of reinforcement learning is its limited ability to handle high-resolution deformation fields: non-rigid deformation makes the action space complex.
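
The agent loop can be sketched as follows. This is only a toy under my own assumptions: the state is a vector of three rigid parameters (tx, ty, rotation), the action set moves one parameter by one unit, and oracle_q_values is a hypothetical stand-in for a trained Q-network that scores each action by how much it reduces the distance to a known target (similar in spirit to a greedy supervised target). A real agent would predict these values from the images.

```python
import numpy as np

# Discrete action set: +1 or -1 unit on one rigid parameter (tx, ty, rotation).
ACTIONS = np.vstack([np.eye(3), -np.eye(3)])

def oracle_q_values(params, target):
    """Hypothetical stand-in for a Q-network: reduction in distance to the target."""
    current = np.linalg.norm(target - params)
    return np.array([current - np.linalg.norm(target - (params + a)) for a in ACTIONS])

def run_agent(start, target, max_steps=50):
    """Repeatedly apply the action with the highest Q-value until none helps."""
    params = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)
    trajectory = [params.copy()]
    for _ in range(max_steps):
        q = oracle_q_values(params, target)
        best = int(np.argmax(q))
        if q[best] <= 0:           # no action improves alignment: terminal state
            break
        params = params + ACTIONS[best]
        trajectory.append(params.copy())
    return params, trajectory
```

With three parameters there are only six actions. A dense deformation field would need an action per control point and direction, which illustrates why the action space becomes the bottleneck for non-rigid registration.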

2. Supervised transformation estimation

Iterative operation slows registration down, especially for high-dimensional deformable registration. This motivates predicting the transformation in a single step. Fully supervised registration requires gold standard data to define the loss function.

1. Fully Supervised Transform Estimation

Replacing the iterative procedure with a neural network greatly reduces registration time. However, because fully connected (FC) layers are costly in a high-dimensional solution space, conventional convolutional neural networks are usually not used to predict deformable transformations. Networks that predict deformation fields are instead fully convolutional, so the dimensionality of the solution space does not introduce additional computational constraints.

1.1 Rigid registration

Some scholars use a CNN to predict the rigid transformation matrix with a hierarchical regression method that divides the six transformation parameters into three groups (x- and y-axis 1 mm displacement and 1° rotation), and they use transformed data as the gold standard. The efficiency gain comes from performing registration with a forward pass instead of an optimization algorithm. Some scholars train an affine image registration network (AIRNet) with the MSE between the gold standard affine transformation and the predicted transformation; some scholars use a residual regression network (for initial registration), a correction network (to extend the capture range of registration), and a bivariant geodesic distance based loss function; in addition, scholars insert a pairwise domain adaptation (PDA) module into a pre-trained CNN, where the module alleviates the difference between synthetic and real data; other scholars use a CNN to regress the rigid transformation parameters for T1- and T2-weighted MRI and propose both a unimodal and a multimodal scheme. In the unimodal scheme, the convolution parameters that extract low-level features are shared; in the multimodal scheme, they are learned separately.
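
A minimal sketch of what the regression target and loss look like in the rigid case, simplified to 2D with three parameters. The function names, parameter ranges, and the 2D simplification are my own assumptions, and the network itself is omitted.

```python
import numpy as np

def rigid_matrix(tx, ty, theta_deg):
    """3x3 homogeneous matrix of a 2D rigid transform (rotation, then translation)."""
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def sample_training_pair(rng, max_shift=10.0, max_angle=15.0):
    """Draw a random known transform; applying it to an aligned image pair
    would give a synthetic sample whose gold standard parameters are known."""
    params = np.array([rng.uniform(-max_shift, max_shift),
                       rng.uniform(-max_shift, max_shift),
                       rng.uniform(-max_angle, max_angle)])
    return params, rigid_matrix(*params)

def parameter_mse(pred, truth):
    """MSE between predicted and gold standard transformation parameters."""
    return float(np.mean((np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)) ** 2))
```

Because the transform is sampled rather than annotated, training data of this kind is cheap to produce. The price is the gap between synthetic and real data that the domain adaptation module mentioned above tries to reduce.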

1.2 Deformable Registration

Some scholars use U-Net and FCN networks to predict the deformation field, with large deformation diffeomorphic metric mapping (LDDMM) providing the basis; some scholars then use a modified U-Net to estimate the reference transformation for a given image pair, with SSD as the loss function; some scholars use a CNN to map input patches to their displacement vectors, and all the displacement vectors together constitute the deformation field. They also use the similarity between the input images to guide training, and they use an equalized active-points guided sampling strategy so that patches with higher gradient magnitudes and displacement values are more likely to be sampled for training. Some scholars use a CNN for motion-correction registration, and some scholars quantify the uncertainty of deformable registration with a low-rank Hessian approximation of the variational Gaussian distribution of the transformation parameters.
Several methods focus on generating training data. Some scholars augment the dataset with DVFs and process the data with a multi-scale CNN: feature maps obtained at different scales are brought to the same size and combined (as two channels) as training input. Others use a 3D CNN with multi-scale random transformations, which reduces the amount of gold standard medical image data needed for supervision. Unlike the methods above, some scholars use statistical appearance models (SAMs) to generate gold standard data and then train a FlowNet network; this works better than training a CNN on randomly generated data. Some scholars use a CNN to learn plausible deformations for generating gold standard data.
The limitation of supervised transformation prediction is that registration quality is bounded by the quality of the annotated dataset, which in turn depends on the expertise of the annotator (few professionals have this skill). Synthetic gold standard data alleviates the problem to some extent, but it is essential to ensure that the synthetic data resembles clinical data.
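
The deformable case predicts a dense displacement vector field (DVF) instead of a few parameters. The following sketch shows how a DVF warps an image and how a supervised loss compares a predicted DVF with a gold standard one. The (2, H, W) layout, the border clamping, and the function names are assumptions made for this illustration.

```python
import numpy as np

def warp(image, dvf):
    """Warp a 2D image with a displacement field dvf of shape (2, H, W).

    Output pixel (y, x) samples the input at (y + dvf[0], x + dvf[1]) with
    bilinear interpolation; sample positions are clamped to the image border.
    """
    h, w = image.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(yy + dvf[0], 0, h - 1)
    x = np.clip(xx + dvf[1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def dvf_mse(pred_dvf, true_dvf):
    """Supervised loss: mean squared error between predicted and gold standard DVF."""
    return float(np.mean((pred_dvf - true_dvf) ** 2))
```

Synthetic training samples can be produced the same way: warp an image with a known DVF and use that DVF as the label.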

2. Dual/Weakly Supervised Transform Estimation

Dual supervision trains the model with both gold standard data and a quantitative similarity metric. Weak supervision designs the loss function from the overlap of segmentations of corresponding anatomical structures.
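
The two kinds of loss can be sketched like this. The weights alpha and beta, the use of MSE for both terms of the dual loss, and the function names are my own assumptions for illustration.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice overlap of two binary masks (1 means perfect overlap)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return float(2.0 * np.sum(a & b) / (np.sum(a) + np.sum(b) + eps))

def weak_loss(warped_label, fixed_label):
    """Weak supervision: only segmentation overlap drives the training."""
    return 1.0 - dice(warped_label, fixed_label)

def dual_loss(pred_dvf, true_dvf, warped, fixed, alpha=1.0, beta=1.0):
    """Dual supervision: a gold standard term plus an image similarity term."""
    gold_term = np.mean((pred_dvf - true_dvf) ** 2)
    similarity_term = np.mean((warped - fixed) ** 2)
    return float(alpha * gold_term + beta * similarity_term)
```

The weak loss never looks at image intensities, which is why label-driven training also works when the two images come from different modalities.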


2.1 Dual supervision

Some scholars use a hierarchical, dual-supervised strategy with a modified U-Net (using "gap filling", i.e. convolutional layers inserted after the ends of the U-Net, and coarse-to-fine guidance). The loss uses two similarities: the similarity between the predicted deformation field and the gold standard field, and the similarity between the warped image and the fixed image. Some scholars are inspired by GANs: a generator predicts the rigid transformation, and a discriminator distinguishes images aligned with the gold standard transformation from images aligned with the predicted transformation. The loss function combines the Euclidean distance to the gold standard with an adversarial loss.

2.2 Weak supervision

Some scholars train with label similarity, using a local network (which predicts a local dense deformation field) and a global network (which predicts a global affine transformation with 12 degrees of freedom). The local network takes as input the moving image transformed by the output of the global network, combined with the fixed image. This work was followed by an end-to-end framework.
Some scholars simultaneously maximize label similarity and minimize an adversarial loss term in order to obtain more realistic predictions; some scholars combine labels and similarity metrics, building on both dual and weak supervision: for cardiac motion tracking, the loss function is constructed from segmentation overlap and an edge-based normalized gradient field distance.
One-step prediction is already a major breakthrough for deep learning based registration, but it is highly dependent on data. Dual and weak supervision effectively reduce the demand for gold standard data (datasets with known transformations), but manual labeling is still required. Weak supervision also allows similarity to be quantified in the multimodal case.

3. Unsupervised Transform Estimation

3.1 Similarity metric based unsupervised transformation estimation

Compared with the dual supervision pipeline above, the similarity metric based pipeline has no gold standard loss term, which is what makes it unsupervised.
To overcome the difficulty of obtaining gold standard data, some scholars proposed using an FCN for deformable registration, with NCC plus regularization terms (e.g., smoothness constraints) as the loss function. Many hand-crafted similarity metrics are unsuitable for the multimodal case, so these methods are used for unimodal registration. Some scholars train an FCN with NCC and use the predicted DVF to warp the moving image, with the Elastix toolbox as the conventional counterpart; some scholars use a multi-stage, multi-scale method for unimodal registration, training the network with NCC and a bending energy regularization term: it first predicts an affine transformation and then performs coarse-to-fine deformation with a B-spline transformation model; some scholars minimize an upper bound of the SSD between the warped image and the fixed image; some scholars perform coarse-to-fine unsupervised deformable registration with MSE as the loss function; some scholars use eight fully connected layers to register from a learned latent representation, obtain the deformation field through an embedding, and use the sum of absolute errors (SAE) as the loss function; another scholar uses a CNN for both linear and local registration, estimating the affine transformation and the deformation, with MSE plus a regularization term as the loss function; some scholars use a neural network to learn the relationship between similarity metrics and TRE in order to make registration more reliable.
Later, some scholars cast deformation prediction as variational inference: diffeomorphic integration is combined with a spatial transformer layer, the network predicts a velocity field, and scaling and squaring layers integrate the velocity field to obtain the deformation field, again with MSE and regularization terms as the loss function. The FAIM framework uses Inception modules, a low-capacity model, and residual connections instead of skip connections, and it performs well compared with VoxelMorph. Another scholar uses a transfer learning based method with a U-Net-like framework for feature extraction and transformation estimation and NCC as the loss function. Although multimodal similarity is difficult to measure, some scholars have proposed a 3D CNN model consisting of a feature extractor and a deformation field generator, trained with pixel intensity and gradient information.
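
The step that integrates a velocity field into a deformation field is usually done by scaling and squaring. Below is a small NumPy sketch of that idea for a stationary 2D velocity field. The (2, H, W) layout, the number of steps, and the border clamping are my assumptions, and this is not the implementation of any of the cited frameworks.

```python
import numpy as np

def sample(field, y, x):
    """Bilinear sampling of a 2D array at float coordinates, clamped to the border."""
    h, w = field.shape
    y = np.clip(y, 0, h - 1)
    x = np.clip(x, 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * field[y0, x0] + wx * field[y0, x1]
    bottom = (1 - wx) * field[y1, x0] + wx * field[y1, x1]
    return (1 - wy) * top + wy * bottom

def integrate_velocity(velocity, steps=6):
    """Scaling and squaring: stationary velocity field (2, H, W) -> displacement field."""
    _, h, w = velocity.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    u = velocity / (2 ** steps)            # scaling: start from a tiny displacement
    for _ in range(steps):                 # squaring: compose the map with itself
        y, x = yy + u[0], xx + u[1]
        u = u + np.stack([sample(u[0], y, x), sample(u[1], y, x)])
    return u
```

Each squaring step doubles the integration time, so 6 steps correspond to 64 small Euler steps at the cost of 6 compositions.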

Some scholars use intra-modality image similarity to supervise multimodal deformable registration: the loss function is the NCC between the moving image warped by the gold standard deformation field and the moving image warped by the predicted transformation. Another scholar uses the Inverse-Consistent Deep Network (ICNet) to learn symmetric diffeomorphic transformations that align each MR image to the same space (as I understand it, there is only one fixed image). It uses an inverse-consistency regularization term and an anti-folding regularization term, so that a strong smoothness constraint does not cause the deformation field to fold, and it uses MSE as the similarity metric.
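
Folding can be checked with the Jacobian determinant of the deformation: wherever it is zero or negative, the mapping is no longer one-to-one. The sketch below only illustrates this check for a 2D displacement field of shape (2, H, W). It is my own example, not the anti-folding term used by ICNet.

```python
import numpy as np

def jacobian_determinant(dvf):
    """Jacobian determinant of phi(p) = p + u(p) for a displacement field u (2, H, W)."""
    duy_dy, duy_dx = np.gradient(dvf[0])   # derivatives of the y-displacement
    dux_dy, dux_dx = np.gradient(dvf[1])   # derivatives of the x-displacement
    return (1.0 + duy_dy) * (1.0 + dux_dx) - duy_dx * dux_dy

def folding_fraction(dvf):
    """Fraction of pixels where the deformation folds (determinant <= 0)."""
    return float(np.mean(jacobian_determinant(dvf) <= 0))
```

An identity deformation gives a determinant of 1 everywhere, and a regularizer can penalize the pixels where the determinant drops to zero or below.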
The following methods are unsupervised and GAN-based. Some scholars use a GAN to implicitly learn the density function of the range of plausible deformations. In addition to NMI, the loss adds the structural similarity index measure (SSIM) and a feature perceptual loss term, and it is composed of conditional and cyclic constraints. Some scholars also use a GAN for registration in which the discriminator assesses alignment quality; this performs better than training with gold standard data, SSD, or CC on the datasets tested. Another scholar uses a GAN to perform segmentation and registration jointly. It takes three inputs (the fixed image, the moving image, and the segmentation mask of the fixed image) and outputs the segmentation mask of the transformed image and the deformation field. Three discriminators assess the deformation field, the warped image, and the segmentation, respectively, using cycle consistency and the Dice coefficient.
Some scholars use a CNN to learn the optimal parameterization of the deformation with multi-grid B-splines and L1-norm regularization, using SSD as the similarity metric and L-BFGS-B as the optimizer. It converges faster than traditional L1-norm regularized multi-grid registration.
Most unsupervised methods are suited to unimodal registration, and more unsupervised multimodal registration methods still need to be developed.
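
Most of the unimodal methods in this section share the same loss structure: an image similarity term plus a regularization term on the deformation field. A minimal sketch follows, assuming NCC as the similarity, a first-order smoothness penalty, and a weight lam that I picked arbitrarily.

```python
import numpy as np

def ncc(a, b, eps=1e-12):
    """Normalized cross-correlation of two arrays of the same shape."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.sqrt(np.sum(a * a) * np.sum(b * b)) + eps))

def smoothness(dvf):
    """Mean squared finite difference of a displacement field of shape (2, H, W)."""
    dy = np.diff(dvf, axis=1)
    dx = np.diff(dvf, axis=2)
    return float(np.mean(dy ** 2) + np.mean(dx ** 2))

def unsupervised_loss(warped, fixed, dvf, lam=0.1):
    """No gold standard term: negative similarity plus weighted regularization."""
    return -ncc(warped, fixed) + lam * smoothness(dvf)
```

The weight lam trades alignment accuracy against the plausibility of the deformation. A value that is too high suppresses large but correct deformations.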

3.2 Feature-based Unsupervised Transformation Prediction

[Figure: flow chart of unsupervised feature-based registration.]
Compared with the previous flow chart, the unsupervised feature-based pipeline adds a feature extractor that maps the input images into a feature space, which makes the transformation parameters easier to predict.

3.2.1 Single modality registration

Some scholars train an autoencoder to reconstruct the fixed image, using as the loss function the L2 distance between the reconstructed fixed image and the corresponding warped image plus several regularization terms; some scholars use a tensor-based MIND method with a principal component analysis network for both unimodal and multimodal registration; other scholars use stochastic latent space learning to bypass spatial regularization, with a conditional variational autoencoder ensuring that the parameter space follows the prescribed probability distribution. Their loss function is the negative log-likelihood of the fixed image given the latent representation and the warped image, plus the KL divergence.
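
For the variational autoencoder loss, the KL term has a closed form when the encoder outputs a diagonal Gaussian and the prescribed prior is a standard normal. Both are assumptions of this sketch, as are the Gaussian likelihood and the weights sigma and beta.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def cvae_loss(fixed, warped, mu, log_var, sigma=1.0, beta=1.0):
    """Gaussian negative log-likelihood (up to a constant) plus a weighted KL term."""
    nll = np.sum((np.asarray(fixed) - np.asarray(warped)) ** 2) / (2.0 * sigma ** 2)
    return float(nll + beta * kl_to_standard_normal(mu, log_var))
```

The KL term plays the role that an explicit spatial regularizer plays elsewhere: it keeps the latent code, and therefore the decoded deformation, close to what the prior considers plausible.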

3.2.2 Multimodal registration

One registration method combines unsupervised feature extraction with affine transformation parameter regression using an already trained network, with the Dice coefficient as the cost function.
More research on multimodal registration is expected, especially for images with significant appearance differences.

4. Research trends and future directions

4.1 Deep Adversarial Image Registration

Adversarial networks can provide a learned similarity metric through the discriminator, can ensure that predicted transformations are realistic, and can perform image translation (turning a multimodal problem into a unimodal one). Some scholars use a discriminator to distinguish aligned from misaligned image pairs, but this requires pre-aligned image pairs. Because the discriminator is trained to give all misaligned pairs the same label, it probably cannot model a spectrum of misalignment. The deformation field predicted for deformable registration has a high probability of being unrealistic. Usually an L2 norm, gradient, or Laplacian regularization term is added to the loss function, but this limits the magnitude of the predicted deformation, so some scholars propose GAN-like networks in which the discriminator constrains the deformation prediction. Image translation also makes it possible to use unimodal similarity metrics: if image translation is performed as a preprocessing step, commonly used similarity metrics can define the loss function.
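
How a discriminator acts as a learned similarity metric can be seen from the two losses involved. The sketch below uses the standard binary cross-entropy GAN losses on discriminator output probabilities. The function names are hypothetical, and the specific papers may use other adversarial formulations.

```python
import numpy as np

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and a 0/1 label."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1 - eps)
    return float(np.mean(-(label * np.log(prob) + (1 - label) * np.log(1 - prob))))

def discriminator_loss(p_aligned, p_predicted):
    """Pre-aligned pairs are labelled 1; pairs aligned by the predicted transform, 0."""
    return bce(p_aligned, 1.0) + bce(p_predicted, 0.0)

def generator_adversarial_loss(p_predicted):
    """The registration network is rewarded when its pairs are scored as aligned."""
    return bce(p_predicted, 1.0)
```

The binary labels in discriminator_loss are exactly the limitation noted above: every misaligned pair gets the label 0, no matter how far off it is.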

4.2 Registration based on reinforcement learning

Reinforcement learning based methods are more intuitive and can imitate the way physicians perform registration. The main challenge is the high dimensionality of the transformation space in deformable registration, a problem that reinforcement learning based methods are being developed to address.

4.3 Raw Image Domain Registration

There are enough examples to show that deep learning can map data from the raw data domain to the reconstructed image domain, which opens the way to registration that starts from the raw data.

5. Summary

Deep learning registration methods keep multiplying, but each faces its own challenges. The common ones are the lack of multimodal similarity metrics, the lack of sufficiently large datasets, the lack of gold standard data, and the difficulty of quantifying the uncertainty of model predictions. Resampling and interpolation are often overlooked by researchers and deserve more attention.

For study purposes only; this post will be removed on request in case of infringement.


Origin blog.csdn.net/qq_53312564/article/details/122895279