MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation


Foreword

Paper: http://arxiv.org/abs/2303.09975
Code: None

The lack of large-scale annotated datasets makes it difficult for medical image segmentation to reach performance comparable to natural image segmentation. Convolutional networks have a stronger inductive bias (for an explanation, see the blog post "[Machine Learning] A Brief Discussion of Inductive Bias"; roughly, built-in assumptions that make the architecture well suited to images), so they are easier to train to good performance. Recently, the ConvNeXt architecture attempted to modernize ConvNets by imitating Transformers. In this work, a modern, scalable convolutional architecture is designed, tailored to the challenges of data-scarce medical settings. The authors propose MedNeXt, a Transformer-inspired large-kernel segmentation network, which introduces:
(1) a fully ConvNeXt 3D encoder-decoder network for medical image segmentation;
(2) residual ConvNeXt upsampling and downsampling blocks that preserve semantic richness across scales;
(3) a novel technique to iteratively increase kernel size by upsampling small-kernel networks, preventing performance saturation on limited medical data;
(4) compound scaling of MedNeXt at multiple levels (depth, width, kernel size).
This yields state-of-the-art performance on 4 tasks spanning CT and MRI modalities and varying dataset sizes, making MedNeXt a modernized deep architecture for medical image segmentation.


1. Introduction

Transformers are widely used, but their limited inductive bias means they need large annotated datasets to maximize performance gains. To retain the inherent inductive bias of convolutions while borrowing the architectural improvements of Transformers, the recently proposed ConvNeXt re-established the competitive performance of convolutional networks for natural image processing.
The ConvNeXt architecture uses a Transformer-style inverted bottleneck, consisting of a depthwise layer, an expansion layer, and a contraction layer, plus large depthwise kernels to replicate long-range representation learning; trained on huge datasets, it outperforms Transformer-based networks. In contrast, stacking small convolution kernels in the style of VGGNet is still the dominant technique for designing convolutional networks for medical image segmentation. Out-of-the-box data-efficient solutions such as nnUNet, which use variants of the standard UNet, remain effective across a wide range of tasks.
ConvNeXt combines the long-range spatial learning capabilities of Vision and Swin Transformer with the inductive bias inherent in ConvNets.
The inverted bottleneck design allows us to expand the width (increase channels) independently of the kernel size.
Proper use can lead to the following advantages:
(1) learning long-range spatial dependencies through large convolution kernels
(2) scaling multiple network levels simultaneously.
Achieving these requires combating the tendency of large networks to overfit to limited training data.
Recently, large convolution kernels have been applied to medical image segmentation: a large-kernel 3D-UNet decomposes its kernels into depthwise and depthwise-dilated components to improve organ and brain-tumor segmentation, exploring kernel scaling while keeping the number of layers and channels constant. 3D-UX-Net adopts ConvNeXt: the Transformer blocks of SwinUNETR are replaced by ConvNeXt blocks, achieving high performance on multiple segmentation tasks.
However, these works use ConvNeXt-style blocks only in standard convolutional encoders, which limits their benefits.
In this work we maximize the potential of the ConvNeXt design while uniquely addressing the challenge of limited datasets in medical image segmentation.

Contributions:

  1. An architecture composed purely of ConvNeXt blocks
  2. Residual inverted bottlenecks, including for resampling
  3. UpKern, a large-kernel initialization technique
  4. Compound scaling of depth, width, and kernel size

2. Proposed Method

2.1 Fully ConvNeXt 3D Segmentation Architecture

We exploit these advantages in MedNeXt by adopting the overall ConvNeXt design as the building block of a 3D-UNet-like framework. The same ConvNeXt block is also used for upsampling and downsampling, making this the first medical segmentation architecture built entirely from ConvNeXt blocks.
MedNeXt blocks have three layers that mimic the Transformer block; for a C-channel input, they are:

  1. Depthwise Convolution Layer: a depthwise convolution with kernel size k×k×k, followed by a normalization layer (GroupNorm), with C output channels. The depthwise nature allows large kernels in this layer to replicate the large attention windows of Swin Transformer while limiting computation, delegating the "heavy lifting" to the expansion layer. (That is, the k×k×k convolution with groups=channels fuses spatial information, while the following 1×1×1 convolution fuses channel information; the depthwise parameter count is small, so increasing the channel count does not blow up the computation.)

  2. Expansion Layer: an overcomplete 1×1×1 convolutional layer with CR output channels, followed by a GELU activation. A larger value of R lets the network scale in the width direction, while the 1×1×1 kernel limits computation. This layer effectively decouples width scaling from the receptive-field scaling of the previous layer.
    Figure 1. (a) Architecture of MedNeXt. The network has 4 encoder layers, 4 decoder layers, and 1 bottleneck layer. MedNeXt blocks are also present in the upsampling and downsampling layers. Deep supervision is used at each decoder layer, with lower loss weights at lower resolutions (the decoder outputs on the right of the figure are used to compute the loss). All residual connections are additive, and convolutions are padded to preserve tensor size. (b) The upsampling-kernel technique (UpKern) initializes one of a pair of identically configured MedNeXt architectures from the other. (c) MedNeXt's performance on the leaderboard.

  3. Compression Layer: a convolutional layer with 1×1×1 kernels and C output channels that performs channel compression on the feature maps.
    MedNeXt preserves the inductive bias inherent in convolutional networks, making training on sparse medical datasets easier. Our fully ConvNeXt architecture also supports width scaling (more channels) and receptive-field scaling (larger kernels) at both standard and resampling layers. Together with depth scaling (more layers), we explore these 3 orthogonal types of scaling to design a compound-scalable MedNeXt for effective medical image segmentation.

2.2 Resampling with Residual Inverted Bottlenecks

The original ConvNeXt design uses a separate downsampling layer consisting of a standard strided convolution; the equivalent upsampling layer would use a standard strided transposed convolution. However, this cannot take advantage of width- or kernel-based ConvNeXt scaling during resampling. We improve on this by extending the inverted bottleneck to the resampling blocks: in the downsampling and upsampling MedNeXt blocks, the first (depthwise) layer is made a strided convolution or a strided transposed convolution, respectively. MedNeXt thus takes full advantage of Transformer-like inverted bottlenecks to preserve rich semantic information at lower spatial resolutions in all of its components, which should benefit dense medical image segmentation tasks.
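A quick sanity check of the resampling arithmetic, using the standard output-size formulas for strided and transposed convolutions (a sketch, not code from the paper):

```python
# Output length along one spatial axis of a strided convolution.
def conv_out_size(n, k, s, p):
    return (n + 2 * p - k) // s + 1

# Output length along one axis of a strided transposed convolution.
def conv_transpose_out_size(n, k, s, p, output_padding=0):
    return (n - 1) * s - 2 * p + k + output_padding

# A stride-2, 3x3x3 conv with padding 1 halves a 64-voxel axis...
assert conv_out_size(64, 3, 2, 1) == 32
# ...and the matching transposed conv (with output_padding=1) restores it.
assert conv_transpose_out_size(32, 3, 2, 1, output_padding=1) == 64
```

Note that the transposed convolution needs `output_padding=1` to recover an even-sized axis exactly, a detail that matters when pairing encoder and decoder stages.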

2.3 UpKern: Large Kernel Convolutions without Saturation

Large convolution kernels approximate the large attention windows of Transformers, but remain prone to performance saturation. With the significantly smaller data volumes of medical image segmentation tasks, saturation can be a real problem for large-kernel networks. We borrow an idea from Swin Transformer V2, where a network with a large attention window is initialized from another network trained with a smaller attention window.
UpKern allows us to iteratively increase kernel size by trilinearly upsampling convolution kernels (represented as tensors) of incompatible sizes, initializing a large-kernel network from a compatible pretrained small-kernel network. This gives MedNeXt a simple yet effective initialization technique that helps large-kernel networks overcome performance saturation in the relatively limited data scenarios common to medical image segmentation.

2.4 Compound Scaling of Depth, Width and Receptive Field

Simultaneous scaling at multiple levels (depth, width, receptive field, resolution, etc.) provides benefits beyond scaling at any single level. The computational cost of scaling kernel sizes indefinitely in 3D networks quickly becomes prohibitive, which leads us to investigate simultaneous scaling at different levels instead. Consistent with Figure 1a, our scaling tests the number of blocks (B), expansion ratio (R), and kernel size (k), corresponding to depth, width, and receptive-field size.
We further explore large kernel sizes and experiment with k = {3, 5} for each configuration, maximizing performance through compound scaling of the MedNeXt architecture.
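To make the three scaling axes concrete, here is a small pure-Python sketch of how per-stage parameter counts grow as B, R, and k increase together (the configurations are illustrative only, not the paper's actual MedNeXt variants):

```python
# Per-block weights for a MedNeXt block: depthwise + expansion + compression
# (biases and normalization ignored).
def block_params(C, R, k):
    return C * k**3 + 2 * C * C * R

# A stage of B identical blocks scales depth.
def stage_params(C, R, k, B):
    return B * block_params(C, R, k)

# Compound scaling: grow width (R), depth (B), and receptive field (k) at once.
small_cfg = stage_params(32, 2, 3, 2)   # R=2, k=3, B=2
large_cfg = stage_params(32, 4, 5, 4)   # R=4, k=5, B=4
assert small_cfg == 9_920
assert large_cfg == 48_768
```

Note that the k³ term lives only in the cheap depthwise layer, so in parameter terms kernel scaling is far milder than width or depth scaling; the prohibitive cost mentioned above is chiefly compute and memory, which also grow with k³ per voxel.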

3. Experimental Design

Table: MedNeXt's 5-fold cross-validation (CV) results with kernel sizes {3, 5} outperform 7 baselines, including convolutional, Transformer-based, and large-kernel networks.

The unified framework provides a common testbed for all networks, without favoring any network in patch size, spacing, augmentation, training, or evaluation.

This dataset diversity demonstrates the effectiveness of our method across imaging modalities and training-set sizes.


4. Results

  1. The residual inverted bottleneck, especially in the upsampling and downsampling layers, is functionally important to MedNeXt for medical image segmentation (MedNeXt-B resampling vs. standard resampling); without these modified blocks, performance is significantly lower. This is likely because they preserve semantic richness in the feature maps during resampling.
  2. Training large-kernel networks for medical image segmentation is difficult: the benefit of large kernels is not seen in MedNeXts trained from scratch (MedNeXt-B, UpKern vs. From Scratch). UpKern improves the performance of 5 × 5 × 5 kernels on BTCV and AMOS22, whereas without it large-kernel performance is indistinguishable from small-kernel performance.
  3. The performance gain of large kernels is due to the combination of UpKern with larger kernels, not merely a longer effective training schedule (UpKern vs. Train 2×), since a trained MedNeXt-B with 3 × 3 × 3 kernels, retrained a second time, cannot match its large-kernel counterpart.
    This highlights that the MedNeXt modifications successfully transfer the ConvNeXt architecture to medical image segmentation. We further benchmark the MedNeXt architecture on all 4 datasets against our baselines, including convolutional, Transformer-based, and large-kernel baselines. We discuss the effectiveness of MedNeXt at several levels.

Summary

Compared to natural image analysis, medical image segmentation lacks architectures that benefit from scaled-up networks, due to inherent domain challenges such as limited training data. MedNeXt is a scalable, Transformer-inspired, fully ConvNeXt 3D segmentation architecture tailored for high performance on limited medical image datasets. We demonstrate state-of-the-art performance of MedNeXt against 7 strong baselines on 4 challenging tasks. Furthermore, similar to what ConvNeXt did for natural images, we present the compound-scalable MedNeXt design as an effective modernization of standard convolution blocks for building deep networks for medical image segmentation.


Origin blog.csdn.net/goodenough5/article/details/129840902