This article was first published on the WeChat public account CVHub and may not be reproduced on other platforms in any form. It is for learning and communication only; violators will be held accountable!
Title: MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation
Paper: https://arxiv.org/pdf/2303.09975.pdf
Introduction
ConvNeXt is a pure convolutional model whose design is inspired by Vision Transformers. This paper improves on it to design a modern, scalable convolutional architecture, MedNeXt, specifically optimized for medical image segmentation.
MedNeXt is a Transformer-inspired large-kernel segmentation network built on a ConvNeXt-like pure 3D encoder-decoder architecture. In this work, the authors design residual ConvNeXt-style upsampling and downsampling blocks to preserve semantic richness across scales, and apply a new technique that iteratively increases the kernel size of a small-kernel network via upsampling, preventing performance saturation on limited medical data.
Finally, by scaling MedNeXt along different axes (depth, width, kernel size), the method achieves state-of-the-art performance on four tasks spanning CT and MRI modalities and different dataset sizes, making it arguably a king among modern deep architectures for medical image segmentation!
It is worth mentioning that this article comes from the original nnUNet research team, which makes its conclusions all the more convincing (and entertaining)!
Background
Difficulties and challenges
Current deep learning algorithms can achieve good performance on specific, well-defined tasks (such as brain tumor segmentation), but they may generalize poorly to unseen tasks (such as liver vessel segmentation). An algorithm that, given some training data, could adaptively complete the segmentation of any new task without manual intervention would be of great significance for medical computer-aided diagnosis systems.
Below, let us first look at the more common difficulties and challenges in medical image segmentation tasks:
- Due to the high cost of label acquisition, datasets are often small
- Sample imbalance is common due to disease characteristics (e.g., prevalence, and lesions usually occupying a small portion of an organ)
- Accurate segmentation of small targets has high clinical value but is also the hardest to achieve
- Different acquisition devices introduce variation into the data
Although thousands of new "SOTA" techniques appear in the field of medical image segmentation every year, for the reasons above, existing network architectures in the field are often neither universal nor robust. A truly SOTA architecture should satisfy the following conditions:
- It accurately performs medical image segmentation tasks and generalizes well to new, untrained datasets
- It achieves consistently strong performance across a variety of different segmentation tasks
- Accurate trained models can be made available to and used by non-AI professionals
nnUNet

nnUNet became well known early on when it entered the Medical Segmentation Decathlon (MSD) and won the championship:
The MSD Challenge contains a total of ten datasets: the first stage covers segmentation of brain, heart, hippocampus, liver, lung, pancreas, and prostate, while the second stage covers colon, liver vessels, and spleen.
Strictly speaking, nnU-Net is not a new network architecture but a medical image segmentation framework that integrates multiple tricks together with data preprocessing and postprocessing. Its core idea is to analyze the training dataset and derive a fully automatic, dynamically adaptive segmentation pipeline to better complete each task. The authors argue that image preprocessing, network topology, and postprocessing matter more than the network structure itself, and the nnU-Net framework can determine these most important parts automatically (adaptive tuning).
nnU-Net's network structure is based on the U-Net schema with the following modifications:
- leaky ReLU instead of ReLU;
- instance normalization instead of regular batch normalization (BN);
- strided convolution for downsampling;
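The first two choices are easy to state concretely. Below is a minimal numpy sketch of leaky ReLU and instance normalization; the function names and the channels-first (N, C, D, H, W) layout are illustrative assumptions, not nnU-Net code:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU: keeps a small slope for negative inputs instead of zeroing them."""
    return np.where(x >= 0, x, negative_slope * x)

def instance_norm(x, eps=1e-5):
    """Instance normalization for an (N, C, D, H, W) tensor: each (sample, channel)
    pair is normalized over its own spatial axes, so the statistics do not depend
    on the batch size, unlike batch normalization."""
    axes = tuple(range(2, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(2, 3, 4, 4, 4)
y = instance_norm(leaky_relu(x))
print(y.shape)  # (2, 3, 4, 4, 4)
```

The batch-size independence of instance normalization is what makes it attractive for 3D medical volumes, where memory limits often force very small batches.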
In addition, for data augmentation, affine transformations, nonlinear deformations, intensity transformations (similar to gamma correction), mirroring along all axes, and random cropping are mainly used.
For the loss function, a combination of Dice and cross-entropy is used, with Adam as the optimizer. Finally, the authors devise an ensemble strategy that leverages cross-validation on the training set to automatically find the optimal combination among four different architectures for a specific task.
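As a concrete illustration, here is a minimal numpy sketch of a binary Dice plus cross-entropy combination. It is a simplified stand-in, not nnU-Net's actual loss code: nnU-Net uses a multi-class formulation, and the equal weighting of the two terms below is an assumption:

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss on foreground probabilities (binary case, flattened)."""
    inter = (probs * target).sum()
    denom = probs.sum() + target.sum()
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def bce_loss(probs, target, eps=1e-7):
    """Binary cross-entropy, clipped for numerical stability."""
    p = np.clip(probs, eps, 1.0 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def dice_ce_loss(probs, target):
    # Equal weighting of the two terms (an assumption for this sketch).
    return soft_dice_loss(probs, target) + bce_loss(probs, target)

probs = np.array([0.9, 0.8, 0.2, 0.1])
target = np.array([1.0, 1.0, 0.0, 0.0])
loss = dice_ce_loss(probs, target)
print(loss)
```

The Dice term directly optimizes region overlap, which counters the class imbalance mentioned above, while the cross-entropy term gives smoother per-voxel gradients.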
Of course, these are very early design choices. After several years of continuous iteration, nnUNet has only grown stronger, and thankfully the authors continue to update and maintain the repository.
In the two years after this challenge, nnU-Net was entered into dozens of segmentation tasks and won top results in several of them; winning the prestigious BraTS Challenge 2020 was especially impressive. The success of nnU-Net confirms the hypothesis that methods which perform well on multiple tasks generalize well to previously unseen tasks and can outperform custom-designed solutions. nnU-Net has since become the most advanced medical image segmentation method and has been reused by other researchers in many segmentation challenges: 8 of the top 15 algorithms in the Kidney and Kidney Tumor Segmentation Challenge 2019 were based on nnU-Net, nine of the top ten in the COVID-19 lung CT lesion segmentation challenge were nnU-Net designs, and around 90% of the 2020 winners of various challenges built their solutions on top of it.
Therefore, a swift horse like nnUNet naturally calls for its Bole, the legendary judge of horses, to match it: MedNeXt. Now let us formally introduce the method.
Method
Fully ConvNeXt 3D Segmentation Architecture
As is well known, ConvNeXt inherits many important design ideas from Transformers; its core idea is to enlarge the receptive field of the network to learn global features while limiting computational cost. In this work, the authors propose MedNeXt, a new 3D-UNet-like network architecture based on ConvNeXt design patterns:
The figure above shows the overall MedNeXt architecture. As shown, the network has 4 symmetric encoder and decoder stages each (very much U-Net), with a bottleneck layer embedded in the middle. MedNeXt blocks are also used in each upsampling and downsampling layer, and each decoder stage is deeply supervised, with lower loss weights at lower resolutions.
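The article does not spell out the exact resolution weights, but a common convention (used, e.g., by nnU-Net) halves the loss weight at each coarser resolution and then normalizes. A small sketch under that assumption:

```python
import numpy as np

def deep_supervision_weights(num_levels):
    """Hypothetical weighting: halve the weight at each coarser resolution,
    then normalize so the weights sum to 1 (a common convention, not
    necessarily the exact scheme used by MedNeXt)."""
    w = np.array([0.5 ** i for i in range(num_levels)])
    return w / w.sum()

def deep_supervision_loss(per_level_losses):
    """Weighted sum of the segmentation losses computed at each decoder
    resolution, ordered from full resolution down to the coarsest."""
    losses = np.asarray(per_level_losses, dtype=float)
    return float((deep_supervision_weights(len(losses)) * losses).sum())

print(deep_supervision_weights(4))
```

This way the full-resolution output dominates the gradient while coarse outputs still receive a training signal.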
For more on the idea of deep supervision, refer to CVHub's earlier feature on multi-task learning:
As can be seen, the deep supervision mechanism follows exactly the idea presented there. The focus below is the MedNeXt module (the yellow part of the framework diagram above), a ConvNeXt-like block with three important components: a Depthwise Convolution Layer, an Expansion Layer, and a Compression Layer. The three parts are introduced separately below.
Depthwise Convolution Layer
The DW layer consists mainly of a depthwise convolution with kernel size k × k × k, followed by normalization, with C output channels. Here the authors use channel-wise GroupNorm to stay stable under the potential impact of small batch sizes, rather than the LayerNorm commonly used in ConvNeXt and most Transformer architectures. The depthwise nature of the convolution lets large kernels in this layer replicate the large attention windows of Swin Transformers while effectively limiting the amount of computation.
Expansion Layer
Following Transformer-style designs, the expansion layer is mainly responsible for channel scaling: R is the expansion ratio, and a GELU activation is introduced. A large R lets the network scale in width, while the 1 × 1 × 1 kernels limit the amount of computation. Notably, this layer effectively decouples width scaling from the receptive-field scaling of the previous layer.
Compression Layer
At the end of the module is a convolution layer with a 1 × 1 × 1 kernel and C output channels, which compresses the feature maps back along the channel dimension.
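To make the data flow of the three components concrete, here is a minimal numpy sketch of the block: depthwise k × k × k convolution, channel expansion C → R·C with GELU, compression back to C, plus the residual connection. It is a toy illustration, not the paper's implementation: GroupNorm is omitted, the tanh approximation of GELU is used, and all names and shapes are assumptions:

```python
import numpy as np

def depthwise_conv3d(x, w):
    """Naive 'same'-padded depthwise 3D convolution.
    x: (C, D, H, W); w: (C, k, k, k); each channel is filtered independently."""
    C, D, H, W = x.shape
    k = w.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for d in range(D):
            for h in range(H):
                for j in range(W):
                    out[c, d, h, j] = (xp[c, d:d+k, h:h+k, j:j+k] * w[c]).sum()
    return out

def pointwise_conv3d(x, w):
    """1x1x1 convolution = a per-voxel linear map over channels. w: (C_out, C_in)."""
    return np.einsum('oc,cdhw->odhw', w, x)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mednext_block(x, w_dw, w_exp, w_comp):
    """DW conv (spatial mixing) -> expand C -> R*C with GELU -> compress to C,
    with a residual connection. Normalization is omitted to keep the sketch short."""
    h = depthwise_conv3d(x, w_dw)
    h = gelu(pointwise_conv3d(h, w_exp))
    h = pointwise_conv3d(h, w_comp)
    return x + h

C, R, k = 2, 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 4, 4, 4))
y = mednext_block(x,
                  rng.standard_normal((C, k, k, k)) * 0.1,
                  rng.standard_normal((R * C, C)) * 0.1,
                  rng.standard_normal((C, R * C)) * 0.1)
print(y.shape)  # (2, 4, 4, 4)
```

Note the division of labor: only the depthwise layer mixes spatially (so growing k is cheap), while all channel mixing happens in the 1 × 1 × 1 layers.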
Overall, MedNeXt is a purely convolutional architecture that preserves the inherent inductive bias of ConvNets, making it easier to train on sparse medical datasets. In addition, following ConvNeXt, to better scale and expand the overall network architecture, this paper designs a scaling strategy along 3 orthogonal axes to achieve more efficient and robust medical image segmentation across different datasets.
Resampling with Residual Inverted Bottlenecks
The original ConvNeXt architecture uses separate downsampling layers consisting of standard strided convolutions, and conversely transposed convolutions for upsampling. However, this plain design does not fully exploit the advantages of the architecture. This paper therefore extends the inverted bottleneck to MedNeXt's resampling blocks. Concretely, upsampling and downsampling MedNeXt blocks are obtained by inserting a strided or transposed convolution, respectively, in the first DW layer, as shown in the green and blue parts of the figure above. Furthermore, to ease gradient flow, the authors add residual connections using 1 × 1 × 1 convolutions or transposed convolutions with a stride of 2. In this way, MedNeXt makes full use of Transformer-like inverted bottlenecks, retaining richer semantic information at the required spatial resolution at a lower computational cost, which is very beneficial for dense-prediction medical image segmentation tasks.
UpKern: Large Kernel Convolutions without Saturation
As everyone knows, increasing the kernel size enlarges the receptive field of the network, which should improve performance. Note, however, that this is only the theoretical receptive field, not the effective one.
Much recent work has therefore been exploring the magic of large convolution kernels; to the author's limited knowledge, the largest seen so far goes up to 61 x 61 (interested readers can search "CVHub" for it). Returning to ConvNeXt itself, its kernel size only reaches 7 x 7, and according to the original work performance "saturates" if it is increased further. So, for data-scarce tasks such as medical image segmentation, how can we effectively apply and leverage the advantages of this architecture? Here is how the authors did it.
To solve this problem, the authors borrow inspiration from Swin Transformer V2 [20], where a network with a large attention window is initialized from another network trained with a smaller attention window: instead of training from scratch, the existing bias matrix is spatially interpolated to the larger size as a pretraining step, and subsequent experiments fully verified the effectiveness of this approach.
As shown in the figure above, the authors "customize" this idea for convolution kernels to overcome the performance saturation problem. UpKern iteratively increases the kernel size by initializing a large-kernel network from a compatible pretrained small-kernel network, trilinearly upsampling any kernels (represented as tensors) whose sizes are incompatible. All other layers with identical tensor sizes (including normalization layers) are initialized by copying the pretrained weights unchanged.
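UpKern's key operation, trilinear resizing of a kernel tensor whose size does not match, can be sketched as follows. This is illustrative numpy code under stated assumptions (align-corners-style sampling), not the authors' implementation; layers with matching shapes would simply be copied:

```python
import numpy as np

def _resize_linear(a, new_len, axis):
    """Linear interpolation along one axis, sampling the endpoints exactly
    (align-corners style), so corner weights are preserved."""
    old_len = a.shape[axis]
    if old_len == new_len:
        return a
    pos = np.linspace(0, old_len - 1, new_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    frac = pos - lo
    a_lo = np.take(a, lo, axis=axis)
    a_hi = np.take(a, hi, axis=axis)
    shape = [1] * a.ndim
    shape[axis] = new_len
    frac = frac.reshape(shape)
    return a_lo * (1 - frac) + a_hi * frac

def upkern_resize(kernel, new_size):
    """Trilinear upsampling of a (k, k, k) kernel to (K, K, K): apply
    linear interpolation along each of the three spatial axes in turn."""
    out = kernel
    for axis in range(3):
        out = _resize_linear(out, new_size, axis)
    return out

small = np.arange(27, dtype=float).reshape(3, 3, 3)  # pretrained 3x3x3 kernel
large = upkern_resize(small, 5)                      # init for a 5x5x5 network
print(large.shape)  # (5, 5, 5)
```

Initializing this way gives the large-kernel network a sensible starting point, instead of asking it to relearn spatial filters from scratch on scarce medical data.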
In summary, the above operations bring a simple but effective initialization technique to MedNeXt, which can help large convolutional kernel networks overcome performance saturation in relatively limited data scenarios common in medical image segmentation.
Compound Scaling of Depth, Width and Receptive Field
The following table shows how the authors use compound scaling strategies across the different axes (depth, width, kernel size) to maximize performance:
It can be seen that compared with the conventional upsampling and downsampling modules, the method in this paper can better adapt to different tasks.
Experiments
Related experiments are carried out on the AMOS22 and BTCV datasets (among others) to demonstrate the effectiveness of the proposed method, and also to show that directly applying a plain ConvNeXt architecture cannot beat existing segmentation baselines (e.g. nnUNet).
As can be seen from the table above, MedNeXt achieves SOTA performance on four mainstream medical image segmentation datasets without additional training data. Despite heterogeneity in task (brain and kidney tumors, organs), modality (CT, MRI), and training-set size (BTCV: 18 samples vs BraTS21: 1000 samples), MedNeXt-L consistently outperforms state-of-the-art algorithms such as nnUNet. In addition, with UpKern and 5 × 5 × 5 convolutional kernels, MedNeXt with full compound scaling improves further, making comprehensive gains in both organ segmentation (BTCV, AMOS) and tumor segmentation (KiTS19, BraTS21).
Furthermore, on the official leaderboard, MedNeXt handily beats nnUNet on the BTCV task. Notably, it is by far one of the leading methods trained with supervision only and without additional training data (DSC: 88.76, HD95: 15.34). Likewise, on the AMOS22 dataset, MedNeXt not only surpasses nnUNet but occupies the top spot (DSC: 91.77, NSD: 84.00)! Finally, MedNeXt also achieves strong performance on the other two datasets, KiTS19 and BraTS21, all thanks to its excellent architectural design.
Summary
Due to inherent domain challenges such as limited training data compared with natural image tasks, medical image segmentation has lacked architectures (such as ConvNeXt) that benefit from scaling the network. This paper presents a highly scalable ConvNeXt-like family of 3D segmentation architectures that outperforms seven other top methods on limited medical image datasets, including the very strong nnUNet. Designed as an effective replacement for standard convolution blocks, MedNeXt can serve as a benchmark for new network architectures in the field of medical image segmentation!
Final Words
If you are also interested in the full stack of artificial intelligence and computer vision, you are strongly encouraged to follow the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, in-depth interpretations of cutting-edge scientific papers and mature industrial solutions across many fields every day! Feel free to add the editor's WeChat account cv_huber to discuss more interesting topics together!