Deep learning methods for medical image fusion: A review

Abstract

Image fusion methods based on deep learning have been a research hotspot in the field of computer vision in recent years.
This paper reviews these methods from five aspects:
first, it explains the principles and advantages of image fusion methods based on deep learning;
second, it summarizes image fusion methods from both end-to-end and non-end-to-end perspectives. According to the different tasks that deep learning performs in the feature processing stage, non-end-to-end image fusion methods are divided into two categories: deep learning for decision mapping and deep learning for feature extraction.
According to different network types, end-to-end image fusion methods are divided into three categories:
image fusion methods based on convolutional neural networks, image fusion methods based on generative adversarial networks, and image fusion methods based on encoder-decoder networks;
third, the application of deep learning-based image fusion methods in the field of medical imaging is summarized from two aspects: methods and datasets; fourth, fourteen commonly used evaluation metrics in the field of medical image fusion are organized;
fifth, the main challenges in medical image fusion are discussed in terms of both datasets and fusion methods, and future research directions are outlined. This article systematically summarizes deep learning-based image fusion methods, which provides positive guidance for further research on multi-modal medical images.

Introduction

Image fusion algorithms can be divided into two categories: transform domain algorithms and spatial domain algorithms.
Transform domain-based algorithms are usually based on multi-scale transform (MST) theory, such as the Laplacian pyramid (LP), wavelet transform (WT), curvelet transform (CVT) and non-subsampled contourlet transform (NSCT). The steps of these methods are as follows: first decompose the source images into coefficients, then fuse the coefficients through fusion rules, and finally reconstruct the fused image through the inverse transform. In addition to MST methods, some feature-space-based methods have also been proposed in recent years, such as independent component analysis (ICA) and sparse representation (SR). However, these methods have some shortcomings: the fusion rules must be designed by hand, the source images require registration, and the reconstruction step can also degrade image quality.
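To make the transform-domain pipeline concrete, below is a minimal Laplacian-pyramid fusion sketch in Python (OpenCV + NumPy). The max-absolute rule for the detail layers, averaging for the base layer, and the pyramid depth are common illustrative choices, not taken from any specific method cited here.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Decompose an 8-bit grayscale image into band-pass layers plus a low-frequency residual."""
    pyr, current = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyr.append(current - up)          # band-pass detail layer
        current = down
    pyr.append(current)                   # low-frequency residual (base layer)
    return pyr

def fuse_pyramids(pyr_a, pyr_b):
    """Fusion rule: max-absolute coefficient for detail layers, average for the base layer."""
    fused = [np.where(np.abs(a) >= np.abs(b), a, b) for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    return fused

def reconstruct(pyr):
    """Inverse transform: collapse the pyramid from coarse to fine."""
    current = pyr[-1]
    for layer in reversed(pyr[:-1]):
        current = cv2.pyrUp(current, dstsize=(layer.shape[1], layer.shape[0])) + layer
    return np.clip(current, 0, 255).astype(np.uint8)

# Example with two co-registered grayscale sources of equal size:
# fused = reconstruct(fuse_pyramids(laplacian_pyramid(img_a), laplacian_pyramid(img_b)))
```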
Spatial domain-based algorithms do not need to convert the source images into another feature domain; they have good application prospects and can be divided into block-based, region-based and pixel-based fusion algorithms. Block-based algorithms usually divide the image into blocks, measure activity levels such as the spatial frequency or the sum-modified Laplacian of each block, and then fuse the image blocks. In these algorithms, the block size has a great influence on the results and is difficult to choose. Region-based algorithms decompose the input images into regions according to certain criteria, then measure the saliency of the corresponding regions, and finally combine the most salient regions into a fused image; however, the accuracy of image segmentation has a great impact on the performance of the algorithm. Pixel-based algorithms directly generate fusion decision maps through activity level measurement strategies, and several pixel-based spatial domain methods have been proposed, such as multi-scale weighted gradient fusion (MWGF), image fusion with guided filtering (GFF) and dense SIFT. These methods operate on individual pixels during fusion and ignore the similarity between neighboring pixels.
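As a toy illustration of the block-based spatial-domain idea, the sketch below uses spatial frequency as the activity measure and a winner-take-all rule per block; the block size and the assumption that image dimensions are multiples of the block size are choices made only for this example.

```python
import numpy as np

def spatial_frequency(block):
    """Spatial frequency: combined row-wise and column-wise gradient energy of a block."""
    rf = np.sqrt(np.mean(np.diff(block, axis=1) ** 2))   # row frequency
    cf = np.sqrt(np.mean(np.diff(block, axis=0) ** 2))   # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def block_fusion(img_a, img_b, block=16):
    """For each block, keep the source whose block has the higher spatial frequency.

    Assumes img_a and img_b are co-registered 2-D arrays whose height and width
    are multiples of the block size."""
    fused = img_a.astype(np.float32).copy()
    h, w = img_a.shape
    for i in range(0, h, block):
        for j in range(0, w, block):
            a = img_a[i:i+block, j:j+block].astype(np.float32)
            b = img_b[i:i+block, j:j+block].astype(np.float32)
            fused[i:i+block, j:j+block] = a if spatial_frequency(a) >= spatial_frequency(b) else b
    return fused
```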

In recent years, the development of deep learning has promoted the progress of image fusion. The powerful feature extraction and data representation capabilities of deep learning make the prospects for image fusion very promising. Deep learning methods learn fusion models with good generalization from a large amount of data, which makes the fusion process more robust and overcomes the shortcomings of manual feature selection, such as being time-consuming, costly, and prone to human error, showing strong development potential.

Image fusion based on deep learning is divided into three stages: the feature processing stage, the feature fusion stage, and the feature reconstruction stage.
The specific process is as follows: first, feature information or decision maps are obtained through the deep learning network; then, they are fused through the fusion strategy; finally, the fused features are reconstructed to obtain the fused image. Due to the powerful capabilities of deep learning networks in feature extraction and information expression, the quality of fused images can be significantly improved.
This article divides deep learning-based fusion methods into two categories: non-end-to-end image fusion methods and end-to-end image fusion methods.

Non-end-to-end fusion method

Non-end-to-end image fusion methods refer to the application of deep learning networks in the feature processing stage before the fusion stage. The process is as follows: first, the source image is processed through the deep learning network to obtain feature information or decision map; then, the features are fused according to the fusion rules; finally, the fused features are reconstructed to obtain the final fused image.
Effective feature processing is a prerequisite for high-quality fusion. The development of image representation theory has a great impact on the progress of image fusion, driving the further improvement of fusion rules. This section introduces non-end-to-end methods from two aspects: decision mapping based on deep learning and feature extraction based on deep learning.

Decision mapping based on deep learning

First, the source image is divided into blocks, these blocks are used as inputs to the network, and a classification task is constructed to determine the category of each block; second, through linear convolution, nonlinear activation and spatial pooling of the feature maps, the feature maps from different stages are merged to output a decision map containing the feature information of the source images; third, the decision map is post-processed; finally, the decision map is combined with fusion rules to obtain the final fused image.
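A minimal sketch of this patch-classification pipeline is given below, assuming a trained classifier. The toy network, the patch size, and the 0.5 binarization threshold are illustrative assumptions, not any specific published design.

```python
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    """Toy CNN that scores a patch pair: output near 1 -> take source A, near 0 -> take source B."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, patch_pair):           # patch_pair: (N, 2, p, p)
        x = self.features(patch_pair).flatten(1)
        return torch.sigmoid(self.head(x))   # (N, 1) score per patch

@torch.no_grad()
def decision_map_fusion(model, img_a, img_b, p=16):
    """Classify each patch pair, binarize the score into a decision, and fuse patch by patch.

    img_a and img_b are 2-D tensors (H, W) of the co-registered source images."""
    h, w = img_a.shape
    fused = torch.zeros_like(img_a)
    for i in range(0, h - p + 1, p):
        for j in range(0, w - p + 1, p):
            pair = torch.stack([img_a[i:i+p, j:j+p], img_b[i:i+p, j:j+p]]).unsqueeze(0)
            take_a = model(pair).item() >= 0.5          # binarized decision for this block
            fused[i:i+p, j:j+p] = img_a[i:i+p, j:j+p] if take_a else img_b[i:i+p, j:j+p]
    return fused
```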
The first category is CNN-based decision mapping. Wang et al. input the decomposed high-frequency subbands into a CNN to generate a decision map and used the CNN as the fusion rule for the frequency subbands. This rule is not only adaptive but also replaces traditional rules that require manual design. The decision maps of the low-frequency and high-frequency subbands are then fused separately, and finally the fusion coefficients are inverse-transformed to obtain the final fused image.
The second category is decision mapping based on ResNet. To address the difficulty of estimating blur levels at focus boundaries, the source image is input into a CNN composed of convolution blocks and residual blocks to extract shallow and deep features, their corresponding weight maps are obtained, and dot-product and weighted-sum operations are performed on them to obtain the fused image. This method can exploit the complementary information present in the source images.
The third category is DenseNet-based decision mapping. Gai et al. input the source image blocks into DenseNet to obtain a score map, then obtained the decision map through binarization, and finally used fusion rules to obtain the fused image. In the feature processing stage, DenseNet can make full use of the feature information of the image and effectively solve the classification problem underlying the fused image's decision map.
The fourth category is decision mapping based on U-Net. To improve the global feature encoding ability of U-Net, a global feature pyramid extraction module (GFPE) and a global attention connection upsampling module (GACU) are introduced to effectively extract and utilize global semantic and edge information; the final decision map is estimated from the contextual relationships between pixels in the feature map, and a pixel-by-pixel weighted average strategy is then used to obtain the fused image.
The fifth category is GAN-based decision mapping. Guo et al. proposed an image fusion method based on cGAN, called fusion GAN. This method treats the image fusion task as a translation problem from source images to a decision map, and uses the least-squares GAN objective to improve the training stability of fusion GAN and obtain accurate confidence maps for focus region detection.

Feature Extraction Based on Deep Learning

The process of feature extraction in deep learning is: first, input the source image into the deep learning network for feature extraction, then fuse the feature information of each output layer through fusion rules, and finally obtain the fused image through the reconstruction process. The input of the network is the source image, and the output is feature information. These processes are shown in Figure 3. Deep learning methods have stronger feature extraction capabilities than traditional methods and are widely used in the field of image fusion.
The first category is CNN-based feature extraction, which extracts low-level and high-level features of the source images to obtain candidate fused images, and uses a maximum-selection strategy to generate the final fused image from the candidates, completing the reconstruction of the fused image.
The second category is feature extraction based on ResNet. Use ResNet50 as the feature extraction module to extract depth features from the source image, then normalize the depth features to obtain an initial weight map, and finally use a weighted average strategy to reconstruct the fused image.
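The following is a minimal sketch of the ResNet-based feature-extraction pipeline described in the second category, using torchvision's pretrained ResNet50. The choice of intermediate stage, the channel-wise L1-norm activity measure, and softmax weighting are assumptions made for the example, not the exact design of the cited method.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
# Use an intermediate stage (up to layer2) as the deep-feature extractor; this cut point is illustrative.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:6])

@torch.no_grad()
def resnet_weight_fusion(img_a, img_b):
    """img_a, img_b: (1, 3, H, W) tensors in [0, 1], co-registered source images
    (grayscale medical images can be replicated to three channels)."""
    feats = [feature_extractor(img) for img in (img_a, img_b)]
    # Activity measure: channel-wise L1 norm of the deep features, upsampled to image size.
    acts = [f.abs().sum(dim=1, keepdim=True) for f in feats]
    acts = [F.interpolate(a, size=img_a.shape[-2:], mode='bilinear', align_corners=False) for a in acts]
    # Softmax across the two sources gives per-pixel weight maps that sum to one.
    w = torch.softmax(torch.cat(acts, dim=1), dim=1)
    return w[:, 0:1] * img_a + w[:, 1:2] * img_b     # weighted-average reconstruction
```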
The third category is extraction based on DenseNet. Zhang et al. used DenseNet to extract features through dense connections and reuse features, achieving better performance than CNN with fewer parameters and computational costs, and finally reconstructed the fused image through an average fusion strategy. Using fewer network layers retains more details of the fused image.
The fourth category is feature extraction based on the attention mechanism. A multi-scale residual pyramid attention network (MSRPAN) has been proposed: compared with the residual attention mechanism, MSRPAN adds multi-scale information, and compared with the pyramid attention mechanism, it strengthens feature extraction, giving it better feature extraction and expression capabilities.

End-to-End Image Fusion Approach

In non-end-to-end image fusion methods, the features that are optimal in the feature extraction stage do not always yield the best final fusion result.

End-to-end image fusion means that the input of the network is the source image and the output is the fused image. The entire learning process is not divided into sub-processes, and the deep learning model learns the mapping from the source image to the fused image. End-to-end image fusion methods include CNN-based image fusion methods, GAN-based image fusion methods, and encoder-decoder network-based image fusion methods. A review of end-to-end image fusion methods is given.

Image fusion method based on convolutional neural network (CNN)

By designing the network structure and loss function, feature extraction, feature fusion and image reconstruction are realized implicitly, which avoids the limitations of manually designed fusion rules. The process of the CNN-based image fusion method is: first input the source images into the CNN for processing, then fuse the processed features, and finally reconstruct the fused features through deconvolution. In this process, no intermediate results need to be output, and the CNN learns a direct mapping from input to output. Compared with traditional image fusion algorithms, CNNs can adapt to the image fusion task by learning appropriate convolution filter parameters, and the parameters of the CNN model can be optimized through end-to-end training. This section summarizes CNN-based image fusion methods from four aspects: single-level feature fusion methods, multi-level feature fusion methods, ResNet-based image fusion methods and DenseNet-based image fusion methods. These processes are shown in Figure 4.

Single-stage feature fusion method

First, the features of the source image are extracted through multiple convolution blocks, then the features output by the last convolution layer are fused, and finally the features are reconstructed through multiple deconvolution blocks to obtain the final fused image.
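A minimal sketch of this single-level pipeline follows: a shared convolutional encoder, fusion of only the last-layer features, and a transposed-convolution decoder. The element-wise maximum fusion rule and the layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleLevelFusionNet(nn.Module):
    """Encode each source with shared conv blocks, fuse only the final feature map, then decode."""
    def __init__(self, ch=1, width=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(ch, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width, ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, a, b):                 # a, b: (N, ch, H, W), H and W divisible by 4
        fa, fb = self.encoder(a), self.encoder(b)
        fused = torch.maximum(fa, fb)        # illustrative fusion rule on the final features
        return self.decoder(fused)
```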

UFA-FUSE uses convolution blocks to extract image features from the source images and then performs feature fusion through an attention mechanism. Finally, the fused image features are input into cascaded convolution blocks to reconstruct the fused image. This method avoids generating an intermediate decision map that must be refined through post-processing, and realizes image fusion directly.

In order to improve the quality of image fusion, Multi-scale MobileNet based Fusion (MMF) extracts high-dimensional features of the input images through the Multi-scale Mobile Block (MMB) and combines these high-dimensional features to generate the fused image.

Multi-level feature fusion

First, the features of the source images are extracted through convolution blocks; then the features of the corresponding layers extracted by each convolution layer are fused; finally, the fused features of all layers are combined to generate the final fused image. Multi-level feature fusion can make full use of image features.
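A minimal two-level sketch of this idea is shown below: features are fused at each encoder level and the decoder combines the deep fused features with the shallow ones through a skip connection. The per-level maximum rule and the concatenation-based combination are assumptions for the example.

```python
import torch
import torch.nn as nn

class MultiLevelFusionNet(nn.Module):
    """Fuse source features at every encoder level, then decode with a level-wise skip connection."""
    def __init__(self, ch=1, width=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(ch, width, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1)
        self.out = nn.Sequential(nn.Conv2d(width * 2, ch, 3, padding=1), nn.Sigmoid())

    @staticmethod
    def fuse(fa, fb):
        return torch.maximum(fa, fb)                 # per-level fusion rule (illustrative)

    def forward(self, a, b):
        a1, b1 = self.enc1(a), self.enc1(b)          # level-1 (shallow) features
        a2, b2 = self.enc2(a1), self.enc2(b1)        # level-2 (deep) features
        f1, f2 = self.fuse(a1, b1), self.fuse(a2, b2)
        d = torch.relu(self.up(f2))                  # upsample the deep fused features
        return self.out(torch.cat([d, f1], dim=1))   # combine with the shallow fused features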

In HPCFNet, paired images are first input into the Siamese CNN, and then the feature maps of the convolutional layer are hierarchically integrated through the paired channel fusion (PCF) module to generate channel-by-channel fusion feature maps, and then through the reverse A Reverse Spatial Attention (RSA) module adjusts the fused feature maps. PCF first combines feature maps of the same level through Cross Feature Stack (CFS), and then fuses channel pairs through Parallel Atrous Group Convolution (PAGC) module to obtain multi-scale feature representation.

Image Fusion Method Based on Residual Neural Network

In CNN-based fusion methods, shallow features are gradually diluted as the number of network layers increases, reducing the fusion effect. ResNet-based fusion methods can make better use of the extracted feature information, and the fused image can retain more details of the source images. These methods can be divided into two categories: global residual connections for image fusion and residual blocks for image fusion.
The third category is multi-scale residual blocks. A convolutional layer with a smaller receptive field can extract low-frequency features but not high-frequency features, while a convolutional layer with a larger receptive field can extract more important image features. Song et al. designed the multi-scale dilated residual block (MDRB), which extracts multi-scale features through two parallel convolution kernels and feeds the features into two convolution kernels with different dilation rates, expanding the receptive field while keeping the computational cost low.
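Below is a sketch of a multi-scale dilated residual block in the spirit of MDRB. The two dilation rates, the concatenation of the branches, and the 1x1 projection are assumptions for illustration, not the exact MDRB design.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedResBlock(nn.Module):
    """Parallel branches with different dilation rates widen the receptive field at low cost;
    a residual connection preserves the input features."""
    def __init__(self, ch):
        super().__init__()
        self.branch1 = nn.Conv2d(ch, ch, 3, padding=1, dilation=1)   # smaller receptive field
        self.branch2 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)   # larger receptive field
        self.proj = nn.Conv2d(2 * ch, ch, 1)                         # merge the two scales
        self.act = nn.ReLU()

    def forward(self, x):
        multi = torch.cat([self.act(self.branch1(x)), self.act(self.branch2(x))], dim=1)
        return self.act(x + self.proj(multi))                        # residual connection
```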

The residual attention block is implemented by adding an attention mechanism to the residual block, which is weighted according to the importance of the source feature maps. The residual connection enables the attention mechanism to learn the weight of each channel globally, greatly enhancing the generality of attention blocks. Mustafa et al. introduced a residual self-attention block to fuse and refine features; its output is the weighted sum of the original local features and the attention maps, so it contains both self-attention information and global context information.
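The sketch below illustrates the general form of a residual self-attention block (attended global features added back to the input). The query/key channel reduction and the learnable residual weight are common conventions assumed here, not details taken from the cited work.

```python
import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    """Non-local style self-attention; the attended features are added back to the input,
    so the block mixes local detail with global context."""
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 2, 1)
        self.key = nn.Conv2d(ch, ch // 2, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))              # learnable residual weight

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)           # (N, HW, C/2)
        k = self.key(x).flatten(2)                             # (N, C/2, HW)
        attn = torch.softmax(q @ k, dim=-1)                    # (N, HW, HW) spatial attention map
        v = self.value(x).flatten(2).transpose(1, 2)           # (N, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)   # attended global features
        return x + self.gamma * out                            # residual connection
```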

Image Fusion Method Based on Dense Neural Network (DenseNet)

The image fusion method based on DenseNet refers to adding dense connections to CNN, or replacing convolution blocks with dense blocks. These processes are shown in Figure 7. DenseNet is able to utilize shallow low-complexity information to obtain a smoother decision function and therefore has better generalization performance [42]. Compared with residual blocks, dense blocks have stronger dense connection mechanisms. Through dense connections, the features of each layer can be fully utilized to ensure that the fused image contains more multi-scale and multi-level features of the source image, thus alleviating the problem of vanishing gradients.
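For reference, a minimal dense block sketch is given below: each layer receives the concatenation of all earlier feature maps, which is the dense connection mechanism described above. The growth rate and number of layers are arbitrary example values.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all earlier feature maps as input,
    so shallow features are reused directly by every deeper layer."""
    def __init__(self, in_ch, growth=16, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(nn.Conv2d(ch, growth, 3, padding=1), nn.ReLU()))
            ch += growth                     # input width grows with every dense connection

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)       # all features, reused via dense connections
```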

Image fusion method based on generative adversarial network

Since GAN [46] was proposed in 2014, it has been widely used in the imaging field due to its flexibility and excellent performance. The GAN-based image fusion process can be viewed as an adversarial game between the source images and the fused image. More specifically, GAN-based image fusion methods use a discriminator to force the generator to produce a fusion result whose probability distribution is consistent with the target distribution, thereby implicitly realizing feature extraction, fusion and image reconstruction. This enables the fused image to obtain feature information from both source images simultaneously. These processes are shown in Figure 8. Image fusion methods based on GAN can be divided into three categories: methods based on the classic GAN, methods based on dual-discriminator GANs, and methods based on multiple GANs.
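A minimal sketch of the adversarial training recipe common to GAN-based fusion is shown below. The choice of which source image the discriminator sees as "real", the L1 content terms, and the equal loss weighting are illustrative assumptions, not the objective of any particular cited method; `generator` and `discriminator` are user-defined modules.

```python
import torch
import torch.nn as nn

def train_step(generator, discriminator, opt_g, opt_d, img_a, img_b):
    """One adversarial step: D tries to separate a source image from the fused output,
    while G tries to fool D and stay close to both sources (content loss)."""
    bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
    fused = generator(torch.cat([img_a, img_b], dim=1))

    # --- update discriminator: real = one source image, fake = fused output ---
    opt_d.zero_grad()
    d_real = discriminator(img_a)
    d_fake = discriminator(fused.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # --- update generator: adversarial term + content terms toward both sources ---
    opt_g.zero_grad()
    d_fake = discriminator(fused)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + l1(fused, img_a) + l1(fused, img_b)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```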

Image fusion method based on encoder-decoder network

The image fusion method based on a single encoder-decoder network refers to a fusion network containing one encoder and one decoder. The process is: first, the concatenated source images are input into the encoder to extract features; then, the encoded features are fused; finally, the fused features are decoded to obtain the final fused image.

DenseFuse is a typical encoder-decoder image fusion network, in which the encoder is composed of convolutional blocks and a dense block. The dense block better retains the deep features of the encoder, ensuring that the fusion rule can use more important features. The output of the fusion layer is the input of the decoder, which consists of four convolutional layers.
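A minimal DenseFuse-style sketch following the description above (conv block plus dense block encoder, an addition fusion layer, and a four-layer convolutional decoder). The channel widths and the choice of addition as the fusion strategy are assumptions made for this example.

```python
import torch
import torch.nn as nn

class DenseFuseLike(nn.Module):
    """Encoder = conv block + dense block; fusion layer = element-wise addition of the encoded
    features of the two sources; decoder = four convolutional layers."""
    def __init__(self, ch=1):
        super().__init__()
        self.conv_in = nn.Sequential(nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU())
        self.dense = nn.ModuleList([
            nn.Sequential(nn.Conv2d(16 + 16 * i, 16, 3, padding=1), nn.ReLU()) for i in range(3)
        ])
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, ch, 3, padding=1),
        )

    def encode(self, x):
        feats = [self.conv_in(x)]
        for layer in self.dense:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)                # 4 x 16 = 64 channels

    def forward(self, a, b):
        fused = self.encode(a) + self.encode(b)       # addition fusion strategy
        return self.decoder(fused)
```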

Image fusion method based on dual encoder-decoder network

The source images are concatenated at the input of a single encoder-decoder fusion network, but images of different modalities contain different detailed information and require their features to be extracted in different ways, so a dual encoder-decoder fusion network has been proposed. Each encoder extracts the visual features of one image, and the advantage is that the distinct information of the two modal images can be fully utilized.
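A minimal sketch of the dual-encoder idea follows: one unshared encoder per modality and a single decoder. The concatenation of the encoded features and the layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DualEncoderFusionNet(nn.Module):
    """One unshared encoder per modality, so each can specialize in its own detail statistics;
    a single decoder reconstructs the fused image from the concatenated features."""
    def __init__(self, ch=1, width=32):
        super().__init__()
        def make_encoder():
            return nn.Sequential(
                nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            )
        self.enc_a, self.enc_b = make_encoder(), make_encoder()
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, a, b):
        return self.decoder(torch.cat([self.enc_a(a), self.enc_b(b)], dim=1))
```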

In image fusion tasks, more than two images often need to be fused, but most existing methods target the fusion of only two images. In order to preserve more detailed information from various image types while fusing multimodal medical images, a multi-encoder-decoder fusion network has been proposed, which obtains better fusion results through information sharing.
