IPMI 2023 | New work from Hao Chen's team at the Hong Kong University of Science and Technology | CTO: Rethinking the role of boundary detection in medical image segmentation

This article was first published on the WeChat public account CVHub. Unauthorized reprinting or sale to other platforms is strictly prohibited; violators will be held accountable.

Title: Rethinking Boundary Detection in Deep Learning Models for Medical Image Segmentation
Paper: https://arxiv.org/pdf/2305.00678.pdf
Code: https://github.com/xiaofang007/CTO

Introduction

This paper proposes CTO (Convolution, Transformer, and Operator), a novel network architecture that combines convolutional neural networks, vision Transformers, and explicit boundary detection operators to achieve high-precision image segmentation with an optimal balance between accuracy and efficiency.

CTO follows the standard encoder-decoder segmentation paradigm, where the encoder network adopts the popular CNN backbone structure to capture local semantic information, and uses a lightweight ViT auxiliary network to integrate long-range dependencies. To enhance the boundary learning ability, this paper further proposes a boundary-guided decoder network, which uses the boundary mask obtained from a dedicated boundary detection operation as explicit supervision to guide the decoding learning process.

The method is evaluated on six challenging medical image segmentation datasets, and the results show that CTO achieves state-of-the-art accuracy while being competitive in model complexity.

Background

Operators are basic building blocks of traditional digital image processing; among them, the boundary detection operator is a core element and also the focus of this paper. Commonly used boundary detection operators fall into two categories:

  • First-derivative operators (such as Roberts, Prewitt, and Sobel)
  • Second-derivative operators (such as the Laplacian)

In recent years, boundary detection operators have also been widely used in pixel-level computer vision tasks, such as image manipulation detection and camouflaged object detection. In this paper, the boundary detection operator serves as an explicit mask extractor that guides an implicit feature-learning model for medical image segmentation; the contribution lies in utilizing intermediate-layer feature maps to synthesize high-quality boundary predictions without additional information.

Method

Framework

Figure: The overall architecture of CTO.

As shown in the figure above, CTO follows the encoder-decoder paradigm and employs skip connections to aggregate low-level features from the encoder into the decoder. The encoder network is composed of a mainstream CNN and an auxiliary ViT, while the decoder network employs a boundary detection operator to guide its learning process.

For the encoder, the authors design a dual-stream encoder that combines a convolutional neural network and a lightweight vision Transformer to capture local feature dependencies within the image and long-range dependencies between image patches, respectively. This combination incurs little extra computational overhead; a hypothetical sketch follows.
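To make the dual-stream idea concrete, here is a minimal, hypothetical PyTorch sketch of how a CNN backbone and a lightweight ViT branch could be combined. The class name and the fusion-by-addition choice are assumptions for illustration, not the paper's exact design.

```python
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Hypothetical sketch: a CNN stream for local features plus a
    lightweight ViT stream for long-range dependencies. Both streams are
    assumed to return lists of same-shaped multi-scale feature maps."""

    def __init__(self, cnn_backbone: nn.Module, vit_branch: nn.Module):
        super().__init__()
        self.cnn = cnn_backbone   # e.g. a Res2Net-style backbone
        self.vit = vit_branch     # e.g. a lightweight ViT branch

    def forward(self, x):
        cnn_feats = self.cnn(x)   # multi-scale local feature maps
        vit_feats = self.vit(x)   # global-context feature maps
        # Fuse the two streams scale by scale (element-wise addition here;
        # the paper's actual fusion may differ).
        return [c + v for c, v in zip(cnn_feats, vit_feats)]
```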

For the decoder, an operator-guided decoder is adopted: boundary detection operators (e.g., Sobel) generate boundary masks that guide the learning process, and the whole model is trained end-to-end.

Dual-Stream Encoder

The Mainstream Convolution Stream

To capture local feature dependencies, CTO first constructs the convolution stream. The paper chooses the powerful and efficient Res2Net as the backbone network, which consists of a convolutional stem and four residual blocks.

Res2Net is a convolutional neural network variant published a few years ago by Professor Mingming Cheng's team at Nankai University, aiming to enhance the receptive field and feature representation ability of the network. It introduces multi-scale receptive fields by redesigning the connections inside the residual module. In a traditional residual module, features stay at a single scale as they pass from one module to the next. Res2Net instead introduces a new structural unit, the "Res2Block", which contains multiple branches, each with a different receptive field. This multi-branch structure captures features at different scales, strengthening the network's ability to represent information across scales and helping it capture both fine details and global context, thereby improving performance on image analysis and computer vision tasks. A minimal sketch of such a block is given below.
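The following PyTorch sketch illustrates the hierarchical split-and-add pattern only; the block name, `scales=4`, and the plain 3x3 convolutions are simplifying assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn

class Res2Block(nn.Module):
    """Sketch of a Res2Net-style multi-scale residual unit. Channels are
    split into `scales` groups; each group after the first goes through a
    3x3 conv and also receives the previous group's output, so later
    groups see progressively larger receptive fields."""

    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1)
             for _ in range(scales - 1)]
        )

    def forward(self, x):
        splits = torch.chunk(x, self.scales, dim=1)
        outs = [splits[0]]                    # first group: identity branch
        prev = None
        for i, conv in enumerate(self.convs):
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = torch.relu(conv(inp))
            outs.append(prev)
        return torch.cat(outs, dim=1) + x     # residual connection
```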

The Assistant Transformer Stream

Second, the authors design an auxiliary stream based on a lightweight vision Transformer, LightViT, which aims to capture long-range feature dependencies between image patches at different scales. Specifically, it consists of multiple parallel lightweight Transformer blocks that take feature maps of different scales as input. All Transformer blocks share a similar structure, consisting of a patch embedding layer and Transformer encoding layers.

The patch embedding layer of LightViT converts each input feature map into embedding vectors, turning the spatial dimensions into a sequence dimension. In this way, each feature map can be treated as a token sequence and processed by the Transformer blocks. The Transformer encoding layers then model the tokens with a self-attention mechanism to capture long-range dependencies between patches. By introducing self-attention, LightViT effectively models interactions among patches and extracts the global context of the image.
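The sketch below illustrates this spatial-to-sequence conversion, with a standard PyTorch Transformer encoder layer standing in for LightViT's blocks; it shows the mechanism only, and the real LightViT layers are more elaborate (they are lightweight by design).

```python
import torch.nn as nn

class FeatureMapAttention(nn.Module):
    """Flatten a CNN feature map into a token sequence, run global
    self-attention over it, and fold it back into a feature map. A
    stand-in for LightViT's patch embedding + Transformer encoding."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # d_model must be divisible by num_heads.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, C): space -> sequence
        tokens = self.encoder(tokens)            # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```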

LightViT is designed such that the network can capture the long-range feature dependencies between image patches at different scales, thus boosting the performance of image analysis tasks. Thanks to the lightweight Transformer block, LightViT reduces the computational and storage overhead of the model while maintaining efficient performance. This makes LightViT an effective tool for applications such as medical image analysis.

Boundary-Guided Decoder

The boundary-guided decoder uses a gradient operator module to extract boundary information of foreground objects. Then, through the boundary optimization module, the boundary-enhanced features are integrated with the features of the multi-level encoder, aiming to simultaneously characterize the intra-class and inter-class consistency in the feature space, and enrich the representation ability of the features. This approach enables the decoder to better utilize boundary information when generating segmentation results, resulting in more accurate segmentation results.

Boundary Enhanced Module (BEM)

The Boundary Enhanced Module takes high-level and low-level features as input, extracts boundary information, and filters out boundary-irrelevant information. To this end, the authors apply Sobel operators in the horizontal direction $G_x$ and the vertical direction $G_y$ to obtain gradient maps. Specifically, two parameter-fixed $3 \times 3$ convolutions with stride 1 are used, defined as:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$$

We then apply these two convolutions to the input feature map, yielding the gradient maps $M_x$ and $M_y$. Next, the gradient maps are normalized by the sigmoid function and fused with the input feature map to obtain the boundary-enhanced feature map $F_e$:

$$F_e = \sigma(M_{xy}) \otimes F$$

where $\otimes$ denotes element-wise multiplication, $\sigma$ is the sigmoid function, $F$ is the input feature map, and $M_{xy}$ is the concatenation of $M_x$ and $M_y$ along the channel dimension. The edge-enhanced feature maps are then fused directly with simple stacked convolutional layers. Finally, the output feature map is supervised by the ground-truth boundary map, which removes edge responses inside objects and yields the boundary-enhanced features.
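A hedged PyTorch sketch of the BEM computation described above: depthwise convolutions with fixed Sobel kernels produce $M_x$ and $M_y$, which are concatenated along the channel dimension, reduced by a fusion convolution, squashed by a sigmoid, and multiplied element-wise onto the input features. The fusion layer and channel sizes are assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standard 3x3 Sobel kernels (fixed, non-learnable).
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
SOBEL_Y = SOBEL_X.t()

class BoundaryEnhancedModule(nn.Module):
    """Sketch of the BEM idea: fixed Sobel convs -> gradient maps ->
    sigmoid attention -> element-wise fusion with the input features."""

    def __init__(self, channels: int):
        super().__init__()
        # One Sobel kernel per channel, applied depthwise with stride 1.
        self.register_buffer("kx", SOBEL_X.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer("ky", SOBEL_Y.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, feat):
        c = feat.shape[1]
        mx = F.conv2d(feat, self.kx, padding=1, groups=c)  # horizontal gradients M_x
        my = F.conv2d(feat, self.ky, padding=1, groups=c)  # vertical gradients M_y
        mxy = torch.cat([mx, my], dim=1)                   # channel-wise concat M_xy
        return torch.sigmoid(self.fuse(mxy)) * feat        # boundary-enhanced F_e
```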

Boundary Inject Module (BIM)

The boundary-enhanced features obtained from the BEM can serve as prior knowledge to improve the representation ability of the features generated by the encoder. The paper therefore proposes the BIM, which introduces a dual-path boundary fusion scheme to strengthen the representation of both foreground and background. Specifically, BIM receives two inputs: the channel-wise concatenation of the boundary-enhanced features with the corresponding encoder features, and the features from the previous decoder layer. These inputs are fed into two separate paths that enhance foreground and background representations, respectively. For the foreground path, the two inputs are concatenated along the channel dimension and passed through a series of Conv-BN-ReLU (convolution, batch normalization, ReLU activation) layers to obtain foreground features. For the background path, a background attention component is designed to selectively focus on background information.
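A minimal sketch of the dual-path fusion described above. The Conv-BN-ReLU stacks follow the text; the reversed-sigmoid background attention is a plausible stand-in for the paper's background attention component, which is not fully specified here.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class BoundaryInjectModule(nn.Module):
    """Sketch of BIM: `boundary_feat` is the channel-wise concat of BEM
    output and encoder features; `dec_feat` comes from the previous
    decoder layer. Two paths enhance foreground and background."""

    def __init__(self, channels: int):
        super().__init__()
        self.fg_path = conv_bn_relu(2 * channels, channels)  # foreground path
        self.bg_path = conv_bn_relu(2 * channels, channels)  # background path
        self.out = conv_bn_relu(2 * channels, channels)

    def forward(self, boundary_feat, dec_feat):
        x = torch.cat([boundary_feat, dec_feat], dim=1)
        fg = self.fg_path(x)
        # Background attention: weight background features by the regions
        # the foreground response suppresses (an assumption).
        bg = self.bg_path(x) * (1.0 - torch.sigmoid(fg))
        return self.out(torch.cat([fg, bg], dim=1))
```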

Loss Function

CTO is a multi-task model covering interior segmentation and boundary segmentation, and the paper defines an overall loss function to jointly optimize the two tasks.

The overall loss is composed of the main interior segmentation loss $L_{seg}$ and the boundary loss $L_{bnd}$. Note that the boundary detection loss only considers the predictions from the BEM, which takes the encoder's high-level and low-level feature maps as input. For the main image segmentation loss, the authors adopt a deep supervision strategy, obtaining predictions from features at different levels of the decoder.

Interior Segmentation Loss

$L_{seg}$ is the weighted sum of the cross-entropy loss $L_{CE}$ and the mean Intersection-over-Union (mIoU) loss $L_{mIoU}$.

Boundary Loss

The boundary loss $L_{bnd}$ adopts the Dice loss to account for the class imbalance between foreground and background pixels in boundary detection:

$$L_{bnd} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$$

where $p_i$ and $g_i$ are the predicted and ground-truth boundary values at pixel $i$, and $\epsilon$ is a smoothing constant. (The standard Dice-loss form is shown here; the paper's exact formulation may place the smoothing terms differently.)
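Putting the loss section together, here is a small PyTorch sketch of a CTO-style objective for the binary case: cross-entropy plus a soft-IoU term for the interior segmentation loss, and a Dice term for the boundary loss. The weighting `lam` and the smoothing constants are assumptions; the paper's multi-class and deep-supervision details are omitted.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred, target, eps: float = 1.0):
    """Dice loss on sigmoid probabilities; robust to the heavy
    foreground/background imbalance of boundary maps."""
    p = torch.sigmoid(pred).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (p.sum(1) + t.sum(1) + eps)).mean()

def soft_iou_loss(pred, target, eps: float = 1.0):
    """Soft IoU loss, a stand-in for the paper's L_mIoU term."""
    p = torch.sigmoid(pred).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(dim=1)
    union = p.sum(1) + t.sum(1) - inter
    return (1.0 - (inter + eps) / (union + eps)).mean()

def cto_style_loss(seg_logits, seg_gt, bnd_logits, bnd_gt, lam: float = 1.0):
    """L = L_seg + lam * L_bnd, with L_seg = L_CE + L_mIoU (binary sketch)."""
    l_seg = (F.binary_cross_entropy_with_logits(seg_logits, seg_gt)
             + soft_iou_loss(seg_logits, seg_gt))
    l_bnd = soft_dice_loss(bnd_logits, bnd_gt)
    return l_seg + lam * l_bnd
```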

Experiments

This section compares CTO with multiple SOTA methods, including U-Net, ResUNet, VNet, ViT, TransUNet, and Swin-Unet, on the following mainstream benchmark datasets.

ISIC 2016 & PH2

CTO achieves 91.89% on the Dice coefficient and 85.18% on IoU, which are 0.05% and 0.88% higher than the previous state-of-the-art, respectively.

ISIC 2018

With 5-fold cross-validation, CTO achieves 91.2% on the Dice coefficient and 84.5% on the IoU metric, which are 1.8% and 2.3% higher than the previous state-of-the-art, respectively. Moreover, on the LiTS17 dataset, CTO achieves 91.50% on Dice and 84.59% on IoU, outperforming the state-of-the-art methods by 0.26% and 0.45%, respectively.

CoNIC

3D MISeg

On the BTCV dataset, CTO achieves 81.10% on Dice and an HD of 18.75, surpassing the state-of-the-art methods. On organs with blurred boundaries in particular, such as the pancreas and the stomach, the model achieves significant Dice gains of 4.70% and 3.60%, respectively. Notably, CTO is also strong in model efficiency: with comparable FLOPs and parameter counts, it delivers competitive performance improvements.

Summary

This study proposes a novel network architecture named CTO for medical image segmentation. Compared with advanced medical image segmentation architectures, CTO strikes a better balance between recognition accuracy and computational efficiency. The paper's contribution lies in utilizing intermediate feature maps to synthesize high-quality boundary supervision masks without additional information. Experiments on six publicly available datasets show that CTO outperforms state-of-the-art methods and verify the effectiveness of its individual components.


