Learning TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation

Abstract

Medical image segmentation, the prerequisite of numerous clinical needs, has significantly prospered with recent advances in convolutional neural networks (CNNs). However, CNNs exhibit general limitations in modeling explicit long-range relations, and existing cures, which resort to building deep encoders with aggressive downsampling operations, lead to redundant, deepened networks and loss of localized details. Hence, the segmentation task awaits a better solution that improves the efficiency of modeling global contexts while maintaining a strong grasp of low-level details. In this paper, we propose a novel parallel-in-branch architecture, TransFuse, to address this challenge. TransFuse combines Transformers and CNNs in a parallel style, where both global dependencies and low-level spatial details can be captured efficiently in a much shallower manner. In addition, a novel fusion technique, the BiFusion module, is proposed to efficiently fuse the multi-level features from both branches. Extensive experiments demonstrate that TransFuse achieves the newest state-of-the-art results on both 2D and 3D medical image sets, including polyp, skin lesion, hip, and prostate segmentation, with a significant reduction in parameters and improvement in inference speed.

Introduction

CNN

  • CNNs have achieved excellent performance in many medical image segmentation tasks, such as multi-organ segmentation, liver lesion segmentation, and brain tumor segmentation, showing their powerful ability to model task-specific feature representations
  • Disadvantages: a major problem with CNNs is their inefficiency in capturing global context. Expanding the receptive field requires stacking consecutive downsampling-convolution operations, which makes the network very deep. This process also loses local information, and such detail is essential for dense prediction tasks

Transformer

  • Disadvantages: the Transformer also has its own limitations: it does not model fine-grained features well, which matters especially for medical images where detailed features are important, and it lacks spatial inductive bias when modeling local information

TransFuse

  • Can effectively capture low-level spatial features and high-level semantic features
  • No need to build a very deep network, which alleviates problems such as vanishing gradients and ineffective feature reuse
  • Model efficiency and inference speed are greatly improved, which also benefits deployment in the cloud or on end devices


Proposed Method

two parallel branches

  • CNN branch

    • Gradually increases the receptive field, encoding features from local to global
  • Transformer branch

    • Start with global self-attention and recover local details at the end
  • two benefits

    • Effectively exploits the respective strengths of CNNs and Transformers: global information is captured without building a very deep network, while accurate low-level detail is preserved
    • During feature extraction, BiFusion exploits the different properties of the CNN and Transformer features simultaneously, producing a better fusion (a minimal skeleton of the parallel design follows this list)
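
To make the parallel-in-branch design concrete, here is a minimal PyTorch skeleton (my own sketch, not the authors' code): the two branches run side by side and BiFusion modules, sketched later in this post, merge their features at each scale.

```python
import torch.nn as nn

class ParallelInBranch(nn.Module):
    """Sketch of TransFuse's parallel-in-branch layout (modules are placeholders)."""
    def __init__(self, cnn_branch, transformer_branch, bifusion_modules, head):
        super().__init__()
        self.cnn = cnn_branch            # e.g. a ResNet-34 trunk
        self.trans = transformer_branch  # e.g. a DeiT encoder + PUP decoder
        self.fuse = nn.ModuleList(bifusion_modules)  # one BiFusion per scale
        self.head = head                 # segmentation head over fused maps

    def forward(self, x):
        g = self.cnn(x)    # CNN feature maps, ordered global -> local
        t = self.trans(x)  # Transformer feature maps at matching scales
        f = [fuse(ti, gi) for fuse, ti, gi in zip(self.fuse, t, g)]
        return self.head(f)
```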

Transformer branch

  • The H×W×3 input is split into patches with patch size 16 (an (H/16)×(W/16) grid); each patch is flattened and linearly projected, and learnable position embeddings are added before the sequence is fed to the Transformer.

  • The Transformer encoder consists of L layers, each containing multi-head self-attention (MSA) and an MLP

    • $SA(z_i) = \mathrm{softmax}\left(\frac{q_i k^T}{\sqrt{D_h}}\right)v$
  • The output of the Transformer encoder is sent to the decoder, which adopts the progressive upsampling (PUP) method, similar to SETR

  • First, the output $z^L$ is reshaped back to a 2D feature map with $D_0$ channels; then two consecutive upsampling-convolution layers progressively restore the spatial resolution, yielding feature maps at different scales that are fused with the CNN branch's feature maps (a minimal sketch follows)
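
A minimal sketch of this branch, assuming patch size 16; the layer sizes are illustrative and not the paper's exact DeiT configuration.

```python
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=8, heads=6):
        super().__init__()
        self.grid = img_size // patch                         # H/16 = W/16
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)   # patchify + linear projection
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))  # learnable positions
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)    # L layers of MSA + MLP
        # two consecutive upsample-conv stages (PUP), as in SETR
        self.up1 = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                                 nn.Conv2d(dim, 256, 3, padding=1), nn.ReLU())
        self.up2 = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                                 nn.Conv2d(256, 128, 3, padding=1), nn.ReLU())

    def forward(self, x):
        z = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        z = self.encoder(z)
        t0 = z.transpose(1, 2).reshape(-1, z.size(2), self.grid, self.grid)  # back to 2D
        t1 = self.up1(t0)    # 1/8 resolution
        t2 = self.up2(t1)    # 1/4 resolution
        return [t0, t1, t2]  # multi-scale maps for fusion with the CNN branch
```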

CNN branch

  • ResNet is used as the CNN branch
  • The outputs of the first four blocks are kept and then fused with the Transformer features to obtain the fused representation (a sketch follows)
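
A sketch of this branch using torchvision's ResNet-34; exactly which stages are tapped is my reading of the design, so treat the details as assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet34

class CNNBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # older torchvision versions use resnet34(pretrained=True) instead
        r = resnet34(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3

    def forward(self, x):
        x = self.stem(x)      # 1/4 resolution
        g2 = self.layer1(x)   # 1/4, shallow: rich local detail
        g1 = self.layer2(g2)  # 1/8
        g0 = self.layer3(g1)  # 1/16, deeper: more global context
        return [g0, g1, g2]   # ordered to match the Transformer maps
```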

BiFusion Module

  • Fusion of features extracted by CNN and Transformer

    • channel attention

      • $\hat{t}^i = \mathrm{ChannelAttn}(t^i)$

      • SE-Block

    • spatial attention

      • $\hat{g}^i = \mathrm{SpatialAttn}(g^i)$

      • A CBAM-style block acts as a spatial filter to enhance local details and suppress irrelevant regions, since low-level CNN features can be noisy

    • 3x3 convolution

      • $\hat{b}^i = \mathrm{Conv}(\hat{t}^i W_1^i \odot \hat{g}^i W_2^i)$

      • The Hadamard (element-wise) product models fine-grained interactions between the features of the two branches

    • residual connection

      • $f^i = \mathrm{Residual}([\hat{b}^i, \hat{t}^i, \hat{g}^i])$
    • An attention gate (AG) combines the fused maps across scales to generate the final segmentation result (a BiFusion sketch follows this list)

      • $\hat{f}^{i+1} = \mathrm{Conv}([\mathrm{Up}(\hat{f}^i), \mathrm{AG}(f^{i+1}, \mathrm{Up}(\hat{f}^i))])$
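
Putting the four steps together, here is a minimal BiFusion sketch; the channel counts, reduction ratio, and kernel sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiFusion(nn.Module):
    def __init__(self, ch_t, ch_g, ch_out, r=4):
        super().__init__()
        # channel attention on t: SE block (squeeze via GAP, excite via two 1x1 convs)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch_t, ch_t // r, 1), nn.ReLU(),
            nn.Conv2d(ch_t // r, ch_t, 1), nn.Sigmoid())
        # spatial attention on g: CBAM-style gate from avg+max channel maps
        self.sa = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # projections W1, W2 and the 3x3 conv around the Hadamard product
        self.w1 = nn.Conv2d(ch_t, ch_out, 1)
        self.w2 = nn.Conv2d(ch_g, ch_out, 1)
        self.conv = nn.Conv2d(ch_out, ch_out, 3, padding=1)
        # residual block over the concatenated [b, t, g]
        self.proj = nn.Conv2d(ch_out + ch_t + ch_g, ch_out, 1)
        self.res = nn.Sequential(
            nn.Conv2d(ch_out, ch_out, 3, padding=1),
            nn.BatchNorm2d(ch_out), nn.ReLU(),
            nn.Conv2d(ch_out, ch_out, 3, padding=1))

    def forward(self, t, g):
        t_hat = t * self.se(t)                           # channel attention
        s = torch.cat([g.mean(1, keepdim=True),
                       g.amax(1, keepdim=True)], dim=1)
        g_hat = g * self.sa(s)                           # spatial attention
        b = self.conv(self.w1(t_hat) * self.w2(g_hat))   # Hadamard interaction
        h = self.proj(torch.cat([b, t_hat, g_hat], dim=1))
        return h + self.res(h)                           # residual fusion f^i
```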

Loss Function

  • Weighted IoU loss
  • Weighted BCE loss (the total loss combines the two; a sketch of this pair follows)
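
A common realization of this weighted IoU + weighted BCE pair is the structure loss popularized by PraNet-style code; the 31×31 boundary-weighting window below is an assumption borrowed from that line of work, not spelled out in this post.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """pred: raw logits (B,1,H,W); mask: binary ground truth (B,1,H,W)."""
    # up-weight pixels whose 31x31 neighborhood disagrees with them (boundaries)
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    p = torch.sigmoid(pred)
    inter = ((p * mask) * weit).sum(dim=(2, 3))
    union = ((p + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```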

Experiments and Results

Data Acquisition

  • polyp segmentation

    • Kvasir, CVC-ClinicDB, CVC-ColonDB, EndoScene and ETIS

      • 352×352
  • Skin lesion segmentation

    • 2017 International Skin Imaging Collaboration skin lesion segmentation dataset (ISIC2017)

      • 192×256
  • Hip Segmentation

  • prostate segmentation

    • volumetric Prostate Multi-modality MRIs from the Medical Segmentation Decathlon

      • 320 × 320

Implementation Details

  • TransFuse-S

    • ResNet-34 (R34) and 8-layer DeiT-Small (DeiT-S)
  • TransFuse-L

    • Res2Net-50 and 10-layer DeiT-Base (DeiT-B)
  • TransFuse-L*

    • ResNetV2-50 and ViT-B

Evaluation Results

  • Results of Polyp Segmentation

    • mean Dice

      • The Dice coefficient is a set-similarity measure, commonly used to compute the similarity between two samples; its value lies in [0, 1]
    • mean IoU (mIoU); a minimal implementation of both metrics appears at the end of this subsection

    • Table 1 shows the comparison results for polyp segmentation. TransFuse-S/L reaches SOTA compared with other CNN-based networks, with about 20% fewer parameters than PraNet and others, and better real-time performance.

    • The pre-trained TransFuse-L* also outperforms SETR and TransUNet
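
For reference, a minimal implementation of the two metrics above for binary masks; the smoothing constant is an assumption to avoid division by zero.

```python
import numpy as np

def dice_and_iou(pred, gt, eps=1e-8):
    """pred, gt: binary numpy arrays of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou  # both in [0, 1]; mDice/mIoU average these over the test set
```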


  • Results of Skin Lesion Segmentation

    • In the skin lesion segmentation experiment, the performance metrics are the Jaccard index, Dice coefficient, and pixel-wise accuracy.
    • In Table 2, TransFuse outperforms UNet++ (which also relies on a pre-trained R34 backbone) without any preprocessing or postprocessing


  • Results of Hip Segmentation

    • Hausdorff Distance (HD)
    • Average Surface Distance (ASD)
    • Table 3: comparison results for hip segmentation, where the pelvis, left femur, and right femur need to be segmented. Compared with UNet++ and HRNet, TransFuse performs better on HD and ASD, which shows that the proposed TransFuse can effectively capture fine structures and generate clearer, more accurate contours.


  • Results of Prostate Segmentation

    • nnUNet currently ranks first for prostate segmentation
    • Table 4: comparison of TransFuse and nnUNet. Compared with nnUNet-3d, TransFuse-S not only performs better but also reduces the parameter count by 41% and increases throughput by 50%.


  • Ablation Study

    • Table 5 reports the ablation study of the parallel branch design, and Table 6 the ablation of BiFusion.
    • The parallel combination of the CNN and Transformer branches performs best, and BiFusion's combination of spatial attention, channel attention, and the Hadamard-product interaction improves performance.


Conclusion

Review comments

Official review comments: https://miccai2021.org/openaccess/paperlinks/2021/09/01/496-Paper0016.html

Rating: 6, 6, 6

Reference: http://t.csdn.cn/d2JR8

Origin: blog.csdn.net/wahahaha116/article/details/126624831