【arXiv2306】1M parameters are enough? A lightweight CNN-based model for medical image segmentation


Paper: https://arxiv.org/abs/2306.16103

Code: https://github.com/duong-db/U-Lite

Interpretation: The U-Net family welcomes its smallest member, U-Lite: an 800K-parameter model that overtakes larger networks in performance (Zhihu, zhihu.com)

Summary

Convolutional neural networks (CNNs) and Transformer-based models are widely used in medical image segmentation because of their ability to extract high-level image features and capture the important aspects of an image. However, there is often a trade-off between the need for high accuracy and the desire for low computational cost. Models with more parameters can in theory achieve better performance, but they also incur higher computational complexity and memory usage, which makes them impractical to deploy.

This paper seeks a lightweight U-Net-based model, named U-Lite, that preserves or even improves performance. U-Lite is designed around the principle of depthwise separable convolution, which exploits the strengths of CNNs while eliminating a large number of parameters. Specifically, axial depthwise convolutions with a 7×7 kernel are proposed in both the encoder and the decoder to enlarge the model's receptive field. To further improve performance, several axial atrous (dilated) depthwise convolutions with 3×3 kernels are used as parallel branches in the bottleneck.

Overall, U-Lite contains only 878K parameters, 35 times fewer than the traditional U-Net and many times fewer than other modern Transformer-based models. Compared with other state-of-the-art architectures, the proposed model sharply reduces computational complexity while achieving impressive performance on medical segmentation tasks.

Introduction

U-Net is a classic and efficient medical segmentation model with many variants, such as UNet++, ResUNet++, DoubleU-Net, and Attention U-Net. In recent years, Vision Transformers and MLP-like architectures have also been widely adopted. Among Transformer-based medical segmentation models, TransUNet is considered one of the more accurate and efficient. The Pyramid Vision Transformer (PVT) serves as the backbone of many high-performance models, such as MSMA-Net and Polyp-PVT.

At the same time, MLP-like architectures have been a focus of research. These models use simple linear (MLP) layers to encode features along each dimension of the feature map. AxialAtt-MLP-Mixer achieves very good performance on many medical image datasets by replacing the token-mixing step of MLP-Mixer with axial attention. Unlike CNNs, Transformer- and MLP-based models focus mainly on the global receptive field of the image, so their computational complexity is high and their training process is heavy.

As a result, many of these studies produce models whose large parameter counts bring heavy computation and slow inference.

To address this, several lightweight architectures have been attempted, such as Mobile U-Net, DSCA-Net, and MedT. This paper rethinks an efficient lightweight architecture for medical segmentation tasks, exploring how far a compact model can go while remaining effective.

There are three main contributions:

  1. Based on the concept of depthwise separable convolution, an axial depthwise convolution module is proposed. This module addresses a problem every complex architecture faces: expanding the model's receptive field while avoiding a heavy computational burden.
  2. U-Lite is proposed, a lightweight and simple CNN-based architecture. U-Lite is one of the few models that surpasses the recent highly efficient compact network UNeXt in both performance and parameter count.
  3. Remarkable results are achieved on medical segmentation datasets.

U-Lite Method

The authors follow U-Net's symmetric encoder-decoder architecture and design U-Lite efficiently, so that the model exploits the strengths of CNNs while keeping the number of parameters as small as possible.

To this end, the paper proposes an axial depthwise convolution module, shown in Figure 2. In operation, an input image of shape (3, H, W) is fed through three stages: the encoder stage, the bottleneck stage, and the decoder stage. U-Lite follows a hierarchical structure in which the encoder extracts features from the input image at six different levels.

The bottleneck and decoder process these features and upscale them back to the original shape to obtain the segmentation mask. Skip connections are used between the encoder and the decoder. Despite its simple design, U-Lite still performs well on segmentation tasks thanks to the contribution of the axial depthwise convolution module.

Axial Depthwise Convolution Module

Swin Transformer reduces the computational complexity of the Transformer by restricting self-attention computation to non-overlapping local windows of size 7×7. ConvNeXt carries this modification over to CNNs and uses depthwise convolutions with a 7×7 kernel, improving performance.

Vision Permutator utilizes linear projections to encode feature representations separately along the height and width dimensions.

The paper asks: what happens if Vision Permutator's cross-shaped global receptive field is replaced with a local version, just as Swin Transformer localized ViT's global attention?

Therefore, the authors propose the axial depthwise convolution module as a combination of the Vision Permutator design and convolution. Following the official implementation, for an input feature map X the module computes

AxialDW(X) = X + DW_{1×7}(X) + DW_{7×1}(X),

where DW_{1×7} and DW_{7×1} are depthwise convolutions with 1×7 and 7×1 kernels, respectively.
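Concretely, this amounts to two parallel depthwise convolutions, one height-wise and one width-wise, whose outputs are added back to the input. Below is a minimal PyTorch sketch modeled on the module in the official repository; the class and argument names here are our own:

```python
import torch
import torch.nn as nn

class AxialDW(nn.Module):
    """Axial depthwise convolution: parallel kx1 and 1xk depthwise
    convolutions whose outputs are summed with the input (residual)."""
    def __init__(self, dim: int, kernel: int = 7, dilation: int = 1):
        super().__init__()
        # groups=dim makes each convolution depthwise: one filter per channel
        self.dw_h = nn.Conv2d(dim, dim, kernel_size=(kernel, 1),
                              padding="same", groups=dim, dilation=dilation)
        self.dw_w = nn.Conv2d(dim, dim, kernel_size=(1, kernel),
                              padding="same", groups=dim, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The height-wise and width-wise branches together form a local
        # cross-shaped receptive field around each pixel
        return x + self.dw_h(x) + self.dw_w(x)


x = torch.randn(1, 16, 64, 64)
print(AxialDW(16)(x).shape)  # torch.Size([1, 16, 64, 64]) -- shape preserved
```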

Encoder Block and Decoder Block

The encoder and decoder blocks are designed as follows:

  1. Follow the depthwise separable convolution design. This is the key to building lightweight models: depthwise separable convolutions deliver performance comparable to traditional convolutions while using far fewer parameters, reducing computational complexity and making the model more compact (a quick parameter comparison is sketched after this list).
  2. Limit the use of unnecessary operators. Only plain MaxPooling and UpSampling layers are used; parameter-hungry operators such as transposed convolutions are not required. The pointwise convolution operator plays two roles at once: it encodes features along the depth of the feature map while flexibly changing the number of channels.
  3. Each encoder or decoder block employs a batch normalization layer and ends with a GELU activation function. The authors compared batch normalization with layer normalization and found little difference between them. GELU is chosen because it gave better accuracy than ReLU and ELU in their experiments.
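To make point 1 concrete, the parameter savings of a depthwise separable convolution over a standard one are easy to verify. A small sketch with illustrative sizes (64 channels, 7×7 kernel, not values from the paper):

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

c, k = 64, 7  # illustrative channel count and kernel size
standard = nn.Conv2d(c, c, k, padding="same")        # full 7x7 convolution
separable = nn.Sequential(
    nn.Conv2d(c, c, k, padding="same", groups=c),    # depthwise 7x7
    nn.Conv2d(c, c, 1),                              # pointwise 1x1
)
print(n_params(standard), n_params(separable))  # 200768 vs 7360, ~27x fewer
```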

The encoder and decoder structure of U-Lite is shown in Figure 4.
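Putting these pieces together, here is a hedged PyTorch sketch of what an encoder and decoder block might look like, reusing the AxialDW module sketched above. The exact ordering of normalization and activation is our assumption, not necessarily the official implementation:

```python
import torch
import torch.nn as nn
# AxialDW is the axial depthwise convolution sketched earlier

class EncoderBlock(nn.Module):
    """Sketch: axial DW conv -> BN -> pointwise conv -> GELU, then
    2x2 max pooling. Returns the pooled output plus the pre-pooling
    feature for the skip connection."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.dw = AxialDW(in_ch, kernel=7)     # spatial mixing (depthwise)
        self.bn = nn.BatchNorm2d(in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)  # channel mixing (pointwise)
        self.act = nn.GELU()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.act(self.pw(self.bn(self.dw(x))))
        return self.pool(skip), skip


class DecoderBlock(nn.Module):
    """Sketch: upsample, concatenate the skip feature, then the same
    axial DW -> BN -> pointwise -> GELU sequence. No transposed
    convolutions are needed."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2)
        cat_ch = in_ch + skip_ch
        self.dw = AxialDW(cat_ch, kernel=7)
        self.bn = nn.BatchNorm2d(cat_ch)
        self.pw = nn.Conv2d(cat_ch, out_ch, 1)
        self.act = nn.GELU()

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)
        return self.act(self.pw(self.bn(self.dw(x))))
```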


Bottleneck Block

To further improve U-Lite's performance, the paper applies axial dilated (atrous) depthwise convolutions with a kernel size of 3 in the bottleneck block (Figure 4), using dilation rates d = 1, 2, 3. The authors use axial dilated convolutions with kernel size 3 for two reasons:

  • A kernel of size 3 is better suited to the spatial shape of the deepest features, whose height and width have been reduced several times;
  • Dilated convolutions with different dilation rates capture multi-scale spatial representations of the high-level features in the later stages, giving better performance.

To further reduce the number of learnable parameters, a pointwise convolutional layer is placed at the beginning of the bottleneck block. This reduces the channel count of the final encoder features before they are fed to the axial dilated depthwise convolutions.
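A hedged sketch of the bottleneck under these constraints, again reusing the AxialDW module from above; how the three dilated branches are merged is our assumption, and the official code may combine them differently:

```python
import torch.nn as nn
# AxialDW is the axial depthwise convolution sketched earlier

class BottleneckBlock(nn.Module):
    """Sketch: a pointwise conv first shrinks the channel count, then
    three axial dilated DW convs (kernel 3, d = 1, 2, 3) capture
    multi-scale context; their outputs are summed here for simplicity."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.pw_in = nn.Conv2d(in_ch, out_ch, 1)  # reduce channels up front
        self.branches = nn.ModuleList(
            [AxialDW(out_ch, kernel=3, dilation=d) for d in (1, 2, 3)]
        )
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.pw_in(x)
        x = sum(branch(x) for branch in self.branches)
        return self.act(self.bn(x))
```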

Experiment

The paper evaluates U-Lite on medical image segmentation datasets, where it achieves competitive performance while using far fewer parameters and much less computation than U-Net and Transformer-based baselines; the full result tables and qualitative figures are in the paper.



Origin: blog.csdn.net/m0_61899108/article/details/131924021