Hot off the press! New work from the nnU-Net team | MedNeXt: king of the new generation of segmentation architectures, topping multiple leaderboards!

This article was first published on the WeChat public account CVHub and may not be reproduced on other platforms in any form. It is for learning and exchange only; violators will be held accountable!

Title: MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation

Paper: https://arxiv.org/pdf/2303.09975.pdf

Introduction

ConvNeXt is a pure convolutional model whose design is inspired by Vision Transformers. This paper builds on it to design a modern, scalable convolutional architecture, MedNeXt, optimized specifically for medical image segmentation.

MedNeXt is a Transformer-inspired, large-kernel segmentation network built on a ConvNeXt-like, purely 3D encoder-decoder architecture. In this work, the authors design residual ConvNeXt-style upsampling and downsampling blocks to preserve semantic richness across scales, and introduce a new technique (UpKern, introduced below) that iteratively increases kernel size by upsampling the kernels of a trained small-kernel network, preventing performance saturation on limited medical data.

Finally, by compound scaling of the MedNeXt architecture along different axes (depth, width, kernel size), the method achieves state-of-the-art performance on four tasks spanning CT and MRI modalities and different dataset sizes. It can fairly be called the new king of modern deep architectures for medical image segmentation!

It is worth mentioning that this paper comes from the original nnU-Net research team, which makes its conclusions all the more convincing!

Background

Difficulties and challenges

Current deep learning algorithms can perform well on narrowly defined tasks (e.g., brain tumor segmentation), but often generalize poorly to unseen tasks (e.g., liver vessel segmentation). An algorithm that, given some training data, could adaptively handle any new segmentation task without manual intervention would be of great value to medical computer-aided diagnosis systems.

Below, let us first look at the more common difficulties and challenges in medical image segmentation tasks:

  • The amount of data is often small, because acquiring labels is expensive
  • Class imbalance is common due to disease characteristics (e.g., low prevalence, or lesions occupying only a small portion of an organ)
  • Accurate segmentation of small targets has high clinical value but is correspondingly harder to achieve
  • Different acquisition devices introduce variability into the data

Although the field of medical image segmentation produces countless new "SOTA" techniques every year, for the reasons above the existing network architectures are often neither universal nor robust. A truly SOTA architecture should satisfy the following conditions:

  • The algorithm performs medical image segmentation accurately and generalizes well to new, unseen datasets
  • The algorithm achieves consistently state-of-the-art performance across a variety of different segmentation tasks
  • Trained, accurate models can be made available out of the box to non-AI professionals (e.g., commercially)

nnU-Net

nnU-Net famously won the Medical Segmentation Decathlon (MSD) challenge early on:

MSD Challenge

The MSD challenge comprises ten datasets: the first phase covers segmentation of the brain, heart, hippocampus, liver, lung, pancreas, and prostate, while the second phase covers the colon, hepatic vessels, and spleen.

Strictly speaking, nnU-Net is not a new network architecture but a medical image segmentation framework that integrates multiple tricks together with data preprocessing and postprocessing. Its core idea is to analyze the training dataset and derive a fully automatic, dynamically adaptive segmentation pipeline for each task. The authors argue that image preprocessing, network topology, and postprocessing matter more than the network structure itself, and the nnU-Net framework determines these most important parts automatically (adaptive tuning).

The nnU-Net network structure is based on the U-Net schema with the following modifications (a sketch of the resulting conv block follows the list):

  1. leaky ReLU instead of ReLU;
  2. instance normalization instead of regular batch normalization;
  3. strided convolutions for downsampling.
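Taken together, these modifications amount to a basic conv block like the following minimal sketch, assuming PyTorch (channel counts and kernel size are illustrative, not nnU-Net's exact configuration):

```python
import torch.nn as nn

def nnunet_style_block(in_ch, out_ch, stride=1):
    """Minimal sketch of an nnU-Net-style 3D conv block:
    strided convolution (replaces pooling), instance norm, leaky ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.LeakyReLU(negative_slope=0.01, inplace=True),
    )

down = nnunet_style_block(32, 64, stride=2)  # stride 2 performs the downsampling
```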

For data augmentation, it mainly uses affine transformations, nonlinear deformations, intensity transformations (similar to gamma correction), mirroring along all axes, and random cropping.
The loss function combines Dice loss with cross entropy, and Adam is used as the optimizer. Finally, the authors devise an ensemble strategy that uses cross-validation on the training set to automatically find the optimal combination of four different architecture variants for a given task.
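As a concrete illustration of the compound loss, here is a minimal sketch of Dice + cross entropy, assuming PyTorch (illustrative, not nnU-Net's exact implementation):

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """Compound Dice + cross-entropy loss.
    logits: (B, C, D, H, W) raw network outputs
    target: (B, D, H, W) integer class labels
    """
    ce = F.cross_entropy(logits, target)

    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()  # to (B, C, D, H, W)

    dims = (0, 2, 3, 4)  # reduce over batch and spatial dims, per class
    inter = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    soft_dice = (2 * inter + eps) / (denom + eps)

    return ce + (1 - soft_dice.mean())
```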

Of course, the above describes a very early version. After several years of continuous iteration, nnU-Net has grown even stronger, and thankfully the authors keep updating and maintaining the repository.

In the two years after this challenge, nnU-Net was entered in dozens of segmentation tasks and won top results in many of them; for example, it won the prestigious BraTS 2020 challenge. nnU-Net's success confirms the hypothesis that methods which perform well across multiple tasks generalize well to previously unseen tasks and can outperform custom-designed solutions. nnU-Net has since become the de facto state of the art in medical image segmentation and is widely used by other researchers: among the top 15 algorithms in the Kidney and Kidney Tumor Segmentation Challenge 2019, 8 were based on nnU-Net; in the COVID-19 lung CT lesion segmentation challenge, nine of the top ten algorithms were built on nnU-Net; and roughly 90% of the 2020 winners of various challenges built their solutions on top of it. And so on.

Of course, such a thousand-mile horse (nnU-Net) calls for a worthy counterpart (MedNeXt). Now let us formally introduce the method.

Method

Fully ConvNeXt 3D Segmentation Architecture

As is well known, ConvNeXt inherits many important design ideas from Transformers; its core idea is to enlarge the network's receptive field to learn global features while limiting computational cost. In this work, the authors propose MedNeXt, a new 3D-UNet-like network architecture built on the ConvNeXt design pattern:

MedNeXt macro and block architecture

The figure above shows the overall MedNeXt architecture. As shown, the network has 4 symmetric encoder and decoder stages (very U-Net indeed), with a bottleneck layer in the middle, and MedNeXt blocks are also used in each upsampling and downsampling layer. Each decoder stage uses deep supervision, with lower loss weights at lower resolutions.

For more on the idea of deep supervision, see CVHub's earlier feature on multi-task learning:

LibMTL: A PyTorch library for multi-task learning

As you can see, the deep supervision design here is essentially the same idea covered there. Let us now focus on the MedNeXt block (the yellow part of the architecture diagram above), a new ConvNeXt-like module composed of three important components: a Depthwise Convolution Layer, an Expansion Layer, and a Compression Layer. We introduce each in turn.

Depthwise Convolution Layer

The DW layer consists of a depthwise convolution with kernel size k × k × k and C output channels, followed by normalization. Here the authors use channel-wise GroupNorm to stay robust to small batch sizes, rather than the LayerNorm common in ConvNeXt and most Transformer architectures. Thanks to the depthwise nature of the convolution, large kernels in this layer can replicate the large attention windows of Swin Transformers while effectively limiting the amount of computation.
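In PyTorch terms, this layer might look like the following minimal sketch (C and k are illustrative values):

```python
import torch.nn as nn

C, k = 32, 5  # illustrative channel count and kernel size

dw_layer = nn.Sequential(
    # groups=C makes the 3D convolution depthwise: one k x k x k filter per channel
    nn.Conv3d(C, C, kernel_size=k, padding=k // 2, groups=C),
    # channel-wise GroupNorm (one group per channel), robust to small batch sizes
    nn.GroupNorm(num_groups=C, num_channels=C),
)
```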

Expansion Layer

Mirroring the corresponding design in Transformers, the expansion layer performs channel widening, where R is the expansion ratio, and introduces a GELU activation. Note that a large R lets the network scale horizontally (in width), while the 1 × 1 × 1 convolution kernel limits the computational cost. This layer thus effectively separates width scaling from the receptive-field scaling handled by the previous layer.
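Continuing the sketch, the expansion layer is a pointwise convolution that widens the channels by R, followed by GELU (values again illustrative):

```python
import torch.nn as nn

C, R = 32, 4  # illustrative base channels and expansion ratio

expansion_layer = nn.Sequential(
    nn.Conv3d(C, R * C, kernel_size=1),  # 1x1x1 kernel keeps the widening cheap
    nn.GELU(),
)
```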

Compression Layer

At the end of the module is a convolution layer with 1 × 1 × 1 convolution kernel and C output channels to perform channel-wise compression of feature maps.
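Putting the three layers together with a residual connection gives a complete MedNeXt-style block. The following is a hedged sketch based on the description above, not the authors' official implementation:

```python
import torch
import torch.nn as nn

class MedNeXtBlockSketch(nn.Module):
    """Sketch of a MedNeXt block: depthwise conv + GroupNorm (receptive field),
    1x1x1 expansion + GELU (width), 1x1x1 compression, residual connection."""

    def __init__(self, C=32, k=5, R=4):
        super().__init__()
        self.dw = nn.Conv3d(C, C, kernel_size=k, padding=k // 2, groups=C)
        self.norm = nn.GroupNorm(num_groups=C, num_channels=C)
        self.expand = nn.Conv3d(C, R * C, kernel_size=1)
        self.act = nn.GELU()
        self.compress = nn.Conv3d(R * C, C, kernel_size=1)

    def forward(self, x):
        out = self.norm(self.dw(x))
        out = self.act(self.expand(out))
        return x + self.compress(out)  # residual connection

x = torch.randn(1, 32, 16, 16, 16)
assert MedNeXtBlockSketch()(x).shape == x.shape
```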

In general, MedNeXt is a purely convolutional architecture that preserves the inherent inductive bias of ConvNets, making it easier to train on sparse medical datasets. In addition, following ConvNeXt, the paper designs a scaling strategy along three orthogonal axes so the overall architecture can achieve more efficient and robust segmentation performance across different datasets.

Resampling with Residual Inverted Bottlenecks

The original ConvNeXt architecture uses a standalone downsampling layer built from standard strided convolutions, and conversely applies transposed convolutions for upsampling. However, this plain design does not fully exploit the architecture's advantages. This paper therefore improves on it by extending the inverted bottleneck to the resampling blocks in MedNeXt.

Concretely, the downsampling and upsampling MedNeXt blocks are obtained by giving the first DW layer a strided or transposed convolution, respectively, as shown in the green and blue parts of the figure above. Furthermore, to ease gradient flow, the authors add residual connections using 1 × 1 × 1 convolutions or transposed convolutions with a stride of 2. In this way, MedNeXt makes full use of the Transformer-like inverted bottleneck, preserving richer semantic information and the required spatial resolution at low computational cost, which greatly benefits dense prediction tasks such as medical image segmentation.
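A hedged sketch of the downsampling variant follows from this description (the upsampling variant would swap the strided convolutions for transposed ones); it is illustrative, not the official code:

```python
import torch
import torch.nn as nn

class MedNeXtDownBlockSketch(nn.Module):
    """Sketch of a MedNeXt downsampling block: the first (depthwise) conv is
    strided, compression doubles the channels, and a strided 1x1x1 conv on the
    residual path keeps gradient flow easy."""

    def __init__(self, C=32, k=5, R=4):
        super().__init__()
        self.dw = nn.Conv3d(C, C, kernel_size=k, stride=2, padding=k // 2, groups=C)
        self.norm = nn.GroupNorm(num_groups=C, num_channels=C)
        self.expand = nn.Conv3d(C, R * C, kernel_size=1)
        self.act = nn.GELU()
        self.compress = nn.Conv3d(R * C, 2 * C, kernel_size=1)
        self.res = nn.Conv3d(C, 2 * C, kernel_size=1, stride=2)  # residual path

    def forward(self, x):
        out = self.norm(self.dw(x))
        out = self.compress(self.act(self.expand(out)))
        return self.res(x) + out

x = torch.randn(1, 32, 16, 16, 16)
print(MedNeXtDownBlockSketch()(x).shape)  # torch.Size([1, 64, 8, 8, 8])
```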

UpKern: Large Kernel Convolutions without Saturation

Everyone knows that enlarging the convolution kernel enlarges the network's receptive field, which should improve performance. Note, however, that this is only the theoretical receptive field, not the effective one. So sad~~~

Much recent work has therefore explored the magic of large convolution kernels. To the best of this author's limited knowledge, the largest seen so far goes up to 61 × 61 (interested readers can search "CVHub" for our earlier coverage). Back to ConvNeXt itself: its kernels only reach 7 × 7, and according to the original work, performance "saturates" if they grow further. So, for data-scarce tasks such as medical image segmentation, how can we effectively apply and exploit the advantages of this architecture? Here is how the authors did it.

To solve this problem, the authors borrow inspiration from Swin Transformer V2 [20], where a network with a large attention window is initialized from another network trained with a smaller attention window: instead of training from scratch, the existing bias matrix is spatially interpolated to the larger size as a pre-training step. Subsequent experiments fully verify the effectiveness of adapting this idea.

Upsampled Kernel (UpKern) & Performance

As shown in the figure above, the authors "customize" the convolution kernels to overcome performance saturation. UpKern lets us iteratively increase kernel size by initializing a large-kernel network from a compatible pretrained small-kernel network: kernels of incompatible size (stored as tensors) are upsampled to the larger size by 3D linear (trilinear) interpolation, while all other layers (including normalization layers) whose tensors match in size are initialized by copying the pretrained weights unchanged.
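The following minimal sketch captures the UpKern idea, assuming PyTorch: trilinearly upsample pretrained conv kernels whose spatial size differs, and copy every shape-compatible tensor unchanged (illustrative, not the authors' exact code):

```python
import torch.nn.functional as F

def upkern_init(large_net, small_net):
    """Initialize a large-kernel network from a pretrained small-kernel one."""
    large_sd = large_net.state_dict()
    for name, w_small in small_net.state_dict().items():
        w_large = large_sd[name]
        if w_small.shape == w_large.shape:
            large_sd[name] = w_small  # same size (incl. norm layers): copy as-is
        elif w_small.dim() == 5:  # 3D conv weight (out, in, kD, kH, kW)
            # trilinear ("3D linear") upsampling of the kernel to the larger size
            large_sd[name] = F.interpolate(
                w_small, size=w_large.shape[2:], mode="trilinear")
    large_net.load_state_dict(large_sd)
```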

In summary, this gives MedNeXt a simple but effective initialization technique that helps large-kernel networks overcome performance saturation in the limited-data scenarios common in medical image segmentation.

Compound Scaling of Depth, Width and Receptive Field

The following table shows how the authors apply compound scaling strategies across the different dimensions to maximize performance:

It can be seen that, compared with conventional upsampling and downsampling modules, the proposed design adapts better to different tasks.

Experiments

The paper carries out experiments on the AMOS22 and BTCV datasets to demonstrate the effectiveness of the proposed method, and also shows that naively applying a plain ConvNeXt architecture cannot beat existing segmentation baselines (e.g., nnUNet).

As the table above shows, MedNeXt achieves SOTA performance on four mainstream medical image segmentation datasets without additional training data. Despite heterogeneity of task (brain and kidney tumors, organs), modality (CT, MRI), and training set size (BTCV: 18 samples vs BraTS21: 1000 samples), MedNeXt-L consistently outperforms state-of-the-art baselines such as nnUNet. Moreover, with UpKern and 5 × 5 × 5 convolutional kernels, the fully compound-scaled MedNeXt improves further still, with comprehensive gains in organ segmentation (BTCV, AMOS) and tumor segmentation (KiTS19, BraTS21).

Furthermore, on the official BTCV leaderboard, MedNeXt handily beats nnUNet; notably, it is by far one of the leading methods trained with supervision only and no additional training data (DSC: 88.76, HD95: 15.34). Likewise, on the AMOS22 dataset, MedNeXt not only surpasses nnUNet but occupies the top spot (DSC: 91.77, NSD: 84.00)! Finally, MedNeXt also achieves strong performance on the other two datasets, KiTS19 and BraTS21, all thanks to its excellent architectural design.

Summary

Because of inherent domain challenges such as limited training data compared with natural image tasks, medical image segmentation has lacked architectures (such as ConvNeXt) that benefit from network scaling. This paper presents MedNeXt, a highly scalable ConvNeXt-like family of 3D segmentation architectures that outperforms seven other top methods, including the very strong nnUNet, on limited medical image datasets. Designed as an effective replacement for standard convolution blocks, it can serve as a new benchmark architecture for medical image segmentation!

Final words

If you are also interested in the full stack of artificial intelligence and computer vision, we strongly recommend following the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, in-depth interpretations of cutting-edge papers across many fields, plus mature industrial solutions, every day! Welcome to add the editor's WeChat: cv_huber, and let's discuss more interesting topics together!
