LeViT-UNet: Effective integration of transformer encoder and CNN decoder

LeViT-UNet [2] is a medical image segmentation architecture that uses a transformer as its encoder, which enables it to learn long-range dependencies more efficiently. It is faster than conventional U-Nets while still achieving state-of-the-art segmentation performance.

LeViT-UNet [2] achieves better performance than competing methods on several challenging medical image segmentation benchmarks, including the Synapse Multi-Organ Segmentation dataset (Synapse) and the Automated Cardiac Diagnosis Challenge dataset (ACDC).

LeViT-UNet Architecture

LeViT-UNet's encoder is built from LeViT blocks, which are designed to learn global features efficiently and effectively. The decoder is built from convolutional blocks.

The encoder extracts feature maps from input images at multiple resolutions. These feature maps are upsampled, concatenated and then passed to the decoder via skip connections. Skip connections allow the decoder to access high-resolution local features from the encoder, helping to improve segmentation performance.
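The upsample-and-concatenate step can be sketched as below. This is a minimal illustration of the idea, not the authors' code; the feature-map shapes and interpolation mode are assumptions chosen for a 224x224 input.

```python
import torch
import torch.nn.functional as F

def fuse_transformer_stages(stage_feats):
    """Upsample the transformer feature maps from every stage to the largest
    resolution and concatenate them along the channel axis, so the decoder
    receives one multi-scale global feature map (illustrative helper)."""
    target = stage_feats[0].shape[-2:]  # spatial size of the highest-resolution stage
    upsampled = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                 for f in stage_feats]
    return torch.cat(upsampled, dim=1)

# Example: three transformer stages at decreasing resolutions (shapes are assumed)
feats = [torch.randn(1, 128, 14, 14),
         torch.randn(1, 256, 7, 7),
         torch.randn(1, 384, 4, 4)]
fused = fuse_transformer_stages(feats)  # -> (1, 768, 14, 14)
```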

This design enables the model to combine the advantages of transformers and CNNs: transformers are good at learning global features, while CNNs are good at learning local features. By combining the two, LeViT-UNet achieves good segmentation performance while remaining relatively efficient.

LeViT encoder

The encoder adopts LeViT [1], which mainly consists of two parts: a convolutional block and transformer blocks. The convolutional block reduces resolution by applying four 3x3 convolutions with stride 2 to the input image; each layer halves the spatial resolution (a 16x reduction overall) while extracting increasingly abstract features. The transformer blocks then take the feature maps produced by the convolutional block and learn global features.
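A rough sketch of such a convolutional stem is shown below. The channel widths follow the LeViT-128S configuration and the activation function is an assumption; this is an illustration of the four stride-2 convolutions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def conv_stem(in_ch=3, widths=(16, 32, 64, 128)):
    """Four 3x3 convolutions with stride 2: each layer halves the spatial
    resolution, so a 224x224 input becomes 14x14 (16x reduction overall).
    Channel widths follow LeViT-128S; BN + Hardswish are assumptions."""
    layers, prev = [], in_ch
    for w in widths:
        layers += [nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(w),
                   nn.Hardswish()]
        prev = w
    return nn.Sequential(*layers)

stem = conv_stem()
out = stem(torch.randn(1, 3, 224, 224))  # -> (1, 128, 14, 14)
```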

The features from the convolutional block and the transformer blocks are concatenated in the final stage of the encoder, so the encoder captures both local and global information. Local features are important for recognizing small and detailed objects in an image, while global features are important for recognizing the overall structure of an image. By combining local and global features, the encoder is able to generate more accurate segmentations.

According to the number of channels fed into the first transformer block, three LeViT encoders are used: LeViT-128S, LeViT-192 and LeViT-384.
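The three variants differ mainly in the channel widths of their transformer stages. The values below are as reported in the LeViT paper [1]; treat the exact stage widths as assumptions, since the LeViT-UNet paper identifies the variants only by the first-stage channel count.

```python
# Channel widths of the three transformer stages for each LeViT encoder
# variant (from the LeViT paper [1]; head counts and depths omitted).
LEVIT_ENCODER_WIDTHS = {
    "LeViT-128S": (128, 256, 384),
    "LeViT-192":  (192, 288, 384),
    "LeViT-384":  (384, 512, 768),
}
```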

CNN decoder

LeViT-UNet's decoder concatenates encoder features via skip connections, which gives it access to high-resolution local features, and employs a cascaded upsampling strategy to progressively recover the resolution using convolutional layers. It consists of a series of upsampling stages, each followed by two 3x3 convolutional layers, batch normalization (BN) and ReLU.
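A minimal sketch of one such decoder stage follows. The upsampling mode and channel sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One cascaded-upsampling stage: upsample, concatenate the encoder skip
    feature, then apply two 3x3 conv + BN + ReLU layers (illustrative sketch)."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # recover resolution from the previous layer
        x = torch.cat([x, skip], dim=1)  # fuse high-resolution encoder features
        return self.conv(x)

block = DecoderBlock(in_ch=256, skip_ch=64, out_ch=128)
y = block(torch.randn(1, 256, 14, 14), torch.randn(1, 64, 28, 28))  # -> (1, 128, 28, 28)
```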

Experimental results

Implementation details: data augmentation with random flips and rotations; Adam optimizer with a learning rate of 1e-5 and weight decay of 1e-4; input image size 224x224; batch size 8; 350 training epochs for Synapse and 400 for ACDC.
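These settings translate to roughly the following training setup in PyTorch. The model is a stand-in placeholder and the augmentation helper is a hedged sketch of "random flip and rotation", not the authors' exact pipeline.

```python
import torch
import torch.nn as nn
from torch.optim import Adam

# Hyperparameters listed above (from the LeViT-UNet paper [2])
IMG_SIZE, BATCH_SIZE = 224, 8
LR, WEIGHT_DECAY = 1e-5, 1e-4
EPOCHS = {"Synapse": 350, "ACDC": 400}

model = nn.Conv2d(1, 9, 1)  # placeholder stand-in for a LeViT-UNet model
optimizer = Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

def augment(img, label):
    """Random horizontal flip and random 90-degree rotation (assumed scheme)."""
    if torch.rand(1) < 0.5:
        img, label = torch.flip(img, dims=[-1]), torch.flip(label, dims=[-1])
    k = int(torch.randint(0, 4, (1,)))
    return torch.rot90(img, k, dims=[-2, -1]), torch.rot90(label, k, dims=[-2, -1])
```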

LeViT-UNet outperforms existing models and is significantly faster than TransUNet, which also incorporates transformer blocks into a CNN.

The qualitative segmentation results of four different methods were compared: TransUNet, UNet, DeepLabv3+ and LeViT-UNet. The other three methods are more likely to under-segment or over-segment organs. For example, the stomach is under-segmented by TransUNet and DeepLabv3+ (shown by the red arrow in the third panel of the upper row), and over-segmented by UNet (shown by the red arrow in the fourth panel of the second row).

Compared with the other methods, the output of the proposed model is relatively smooth, suggesting that it has an advantage in boundary prediction.

References:

[1] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze, LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, 2021

[2] Guoping Xu, Xingrong Wu, Xuan Zhang, Xinwei He, LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, 2021

https://avoid.overfit.cn/post/474870d5912d4cb3aeade0b47c1a97e3

Author: Golnaz Hosseini


Origin blog.csdn.net/m0_46510245/article/details/131529775