LViT: Language meets Vision Transformer in Medical Image Segmentation

Paper link: https://arxiv.org/abs/2206.14718

Code link: GitHub - HUANGLIZI/LViT: This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation" (IEEE Transactions on Medical Imaging/TMI)

Summary

Deep learning has been widely used in medical image segmentation. However, the performance of existing medical image segmentation models is limited by the difficulty of obtaining sufficient high-quality labeled data, since annotation is prohibitively expensive. To alleviate this limitation, we propose a novel text-enhanced medical image segmentation model, LViT (Language meets Vision Transformer). In LViT, medical text annotations are incorporated to compensate for the quality deficiencies of the image data. Furthermore, in semi-supervised learning, the textual information can guide the generation of higher-quality pseudo-labels. We also propose an Exponential Pseudo-label Iteration mechanism (EPI) to help the Pixel-Level Attention Module (PLAM) preserve local image features in the semi-supervised LViT setting. In our model, the LV (Language-Vision) loss is designed to supervise the training of unlabeled images directly with textual information. For evaluation, we construct three multimodal medical segmentation datasets (image + text) containing X-ray and CT images. Experimental results show that the proposed LViT achieves good segmentation performance in both fully supervised and semi-supervised settings.

Background

1) Unlike natural images, the boundaries between different regions in medical images are often blurred, and the gray-value differences near the boundaries are small, making it difficult to extract high-precision segmentation boundaries. High-quality medical image data is difficult to obtain, while medical text records and image data are naturally complementary, so text information can make up for the quality deficiencies of medical image data.

2) To address the shortage of labeled data, some methods go beyond traditional supervised learning by training with both labeled data and the more widely available unlabeled data, such as semi-supervised learning [5], [8] and weakly supervised learning [9]. However, the learning effect depends heavily on the quality of the pseudo-labels.

Contributions

1) How to use existing image-text information to improve segmentation performance;

sol: We propose the LViT model (Fig. 1(b)), which is innovative in how it processes images and text. In LViT, an embedding layer is used instead of a text encoder to obtain text feature vectors, which reduces the number of model parameters. In addition, the hybrid CNN-Transformer structure with the Pixel-Level Attention Module (PLAM) can better incorporate textual information, using the Transformer to encode global features while preserving the local features of the CNN.

2) How to make full use of text information to ensure the quality of pseudo-labels.

sol: We design an Exponential Pseudo-label Iteration mechanism (EPI), which aims to cross-utilize the label information of labeled data and the latent information of unlabeled data. EPI indirectly combines text information to gradually improve pseudo-labels in the form of an exponential moving average (EMA) [10]. Furthermore, the LV (Language-Vision) loss is designed to directly use textual information to supervise training on unlabeled medical images. To verify the performance of LViT, we construct three multimodal medical image segmentation datasets containing CT images (MosMedData+ [11], [12] and ESO-CT) and X-rays (QaTa-COV19 [13]). The results show that LViT has better segmentation performance, with a Dice score of 74.57% and mIoU of 61.33% on MosMedData+, a Dice score of 83.66% and mIoU of 75.11% on QaTa-COV19, and a Dice score of 71.53% and mIoU of 59.94% on the ESO-CT dataset. It is worth noting that LViT trained with only 1/4 of the training-set labels can still match the performance of fully supervised segmentation methods.

Related work

Semantic Segmentation of Medical Images

1. FCN

2. UNet

3. UNet++

Vision-language models

1. CLIP

2. ViLT, which harnesses the power of interaction layers to process visual features without requiring a separate deep visual embedder. It is a pure Transformer model without convolution or regional supervision for extracting local features, which makes it unsuitable for medical image segmentation with blurred boundaries.

3. VLT, a framework that achieves referring segmentation by facilitating deep interaction among multimodal information and fusing linguistic and visual features.

4. LAVT, the Language-Aware Vision Transformer framework, adopts an early-fusion scheme that integrates language features into visual features through a pixel-word attention mechanism, and can effectively use Transformer encoders to model multimodal context.

LViT: only an embedding layer is used to transform the text features, which requires fewer parameters and lowers the computational cost. In addition, the hybrid CNN-Transformer can capture global and local features at the same time.

Attention mechanisms

1. RAN

2. CBAM

LViT: We propose PLAM to compensate for the lack of attention to local features caused by self-attention. It also helps the convolutional layers produce more effective local feature representations. To address the computational cost, we use a unified encoder to encode visual and linguistic features instead of separate encoders.

Method

LViT model

The LViT model is a double U-shaped structure consisting of a U-shaped CNN branch and a U-shaped Transformer (ViT) branch.

The CNN branch serves as the information input source and as the segmentation head that produces the prediction output, while the ViT branch merges image and text information, exploiting the Transformer's ability to process cross-modal information. After the text is simply vectorized, the text vector and the image vector are merged and fed into the U-shaped ViT branch for processing; the same text processing is required during the model inference phase. The fused features of the corresponding sizes are then passed back to each layer of the U-shaped CNN branch for the final segmentation prediction. In addition, a Pixel-Level Attention Module (PLAM) is placed at the skip connections of the U-shaped CNN branch. With PLAM, LViT can retain as much local feature information of the image as possible.

 (1) U-shape CNN Branch

The U-shaped CNN branch receives the image as input and acts as the segmentation head that outputs the prediction mask.
Each CNN module is composed of Conv, BatchNorm (BN), and ReLU activation layers. The image features are downsampled with MaxPool between DownCNN modules, and a concatenation operation is performed between UpCNN modules.

The process of each CNN module is described by Eqns. 1 and 2: Y_DownCNN,i denotes the input of the i-th DownCNN module, which becomes Y_DownCNN,i+1 after passing through the i-th DownCNN module and a MaxPool layer, i.e. Y_DownCNN,i+1 = MaxPool(DownCNN(Y_DownCNN,i)). In addition, we design a CNN-ViT interaction module that aligns features from the ViT branch using operations such as upsampling; the reconstructed ViT features are combined with the CNN features through a residual connection to form the CNN-ViT interactive features. To further improve the segmentation of local features, PLAM is placed at the skip connections: the CNN-ViT interactive features are fed into PLAM, and the resulting features are passed to the UpCNN modules to provide information layer by layer.
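To make the data flow concrete, here is a minimal PyTorch sketch of a Conv-BN-ReLU DownCNN block and the CNN-ViT interaction step described above. Module names, channel sizes, and the residual-fusion details are illustrative assumptions, not the official implementation.

```python
# Minimal sketch, assuming the ViT features have already been reshaped to (B, C, H', W')
# with the same channel count as the CNN features at that scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownCNN(nn.Module):
    """Conv-BN-ReLU block; MaxPool downsamples between DownCNN modules."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        y = self.block(x)        # Y_DownCNN,i at the current scale
        return y, self.pool(y)   # pooled output feeds the next DownCNN module

def cnn_vit_interact(cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
    """Upsample ViT features to the CNN feature-map size and fuse by residual addition."""
    vit_feat = F.interpolate(vit_feat, size=cnn_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
    return cnn_feat + vit_feat   # CNN-ViT interactive feature, later refined by PLAM
```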

(2) U-shape ViT Branch

Mirroring the U-shaped CNN branch, the U-shaped ViT branch is designed to merge image features and text features. As shown in Fig. 2(a), the first-layer DownViT module receives the text features produced by BERT-Embed [42] and the image features produced by the first-layer DownCNN module. The pre-trained BERT-Embed model is the BERT_12_768_12 model, which converts a single word into a 768-dimensional word vector.

The cross-modal feature merging operation is expressed by the corresponding formula: x_img,i denotes the image feature from DownCNN, x_text denotes the text feature, and PatchEmbed turns Y_DownCNN,i into the embedded feature x_img,i. ViT denotes the Transformer encoder [39], i.e. Y = ViT(x) = ViT_2(ViT_1(x)).

ViT consists of a Multi-Head Self-Attention (MHSA) module and an MLP layer, where LN denotes the layer normalization. The CTBN block, which includes a Conv layer, a BatchNorm layer, and a ReLU activation layer, aligns the feature dimensions of x_img,1 and x_text. The subsequent DownViT modules in layers 2, 3, and 4 receive both the features of the previous DownViT module and the features of the DownCNN module of the corresponding layer, as shown in Equation 7.
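For reference, a standard pre-norm Transformer block matching the description Y = ViT(x) = ViT_2(ViT_1(x)) could be sketched as follows; the head count and MLP ratio are illustrative assumptions, not the paper's settings.

```python
# Sketch of one ViT block: ViT_1 = LN + MHSA with residual, ViT_2 = LN + MLP with residual.
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # ViT_1: multi-head self-attention sub-block with residual connection
        h = self.ln1(x)
        x = x + self.mhsa(h, h, h, need_weights=False)[0]
        # ViT_2: MLP sub-block with residual connection
        x = x + self.mlp(self.ln2(x))
        return x
```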

For i = 1, 2, 3, features of the corresponding size are sent back to the CNN-ViT interaction module through the UpViT modules and merged with the features of the DownCNN module of the corresponding layer. This maximizes the extraction of global image features while avoiding oscillations in model performance caused by inaccurate text annotations.

(3) Pixel-Level Attention Module (PLAM)

PLAM aims to preserve the local features of the image and to further fuse the semantic features from the text. It also enhances the ability of the convolutional layers to generate strong local feature representations. Referring to CBAM [36], our PLAM uses parallel branches for Global Average Pooling (GAP) and Global Max Pooling (GMP). We also incorporate concatenation and addition operations: the addition operation helps merge corresponding channel features with similar semantics and saves computation, while the concatenation operation integrates feature information more directly and helps preserve the original features of each part. After concatenating the feature information, an MLP and a multiplication operation are used to align the feature sizes.
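A minimal sketch of a PLAM-style attention block, assuming a CBAM-like recipe (parallel GAP/GMP branches, concatenation, an MLP, and a multiplication to reweight the input); this is an illustration, not the official PLAM code.

```python
# PLAM-style sketch: channel reweighting from pooled descriptors.
import torch
import torch.nn as nn

class PLAMSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)   # Global Average Pooling branch
        self.gmp = nn.AdaptiveMaxPool2d(1)   # Global Max Pooling branch
        self.mlp = nn.Sequential(            # MLP maps the concatenated descriptor back to C weights
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W) CNN-ViT interactive feature
        b, c, _, _ = x.shape
        desc = torch.cat([self.gap(x).view(b, c), self.gmp(x).view(b, c)], dim=1)
        weights = self.mlp(desc).view(b, c, 1, 1)
        return x * weights                   # reweight local features channel-wise
```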

In general, our PLAM differs from the Pixel-Word Attention Module (PWAM) in LAVT [27] in several ways. First, PLAM alleviates the preference for global features introduced by the Transformer by enhancing local features, whereas PWAM aims to align visual and language representations through cross-attention. Second, in terms of implementation, PLAM uses a combination of channel attention and spatial attention, while PWAM uses a cross self-attention mechanism. Overall, PLAM aims to enhance local features to improve segmentation performance on medical images.

Exponential Pseudo-label Iteration (EPI) mechanism

In this section, we propose the Exponential Pseudo-label Iteration mechanism (EPI), which extends LViT to the semi-supervised setting. In EPI, the pseudo-labels are iteratively updated following the idea of EMA [10], as shown in Fig. 3(a) and Eqn. 8.

In the formula, P_{t-1} denotes the prediction of model M_{t-1}, and the momentum parameter β is set to 0.99. Note that P_{t-1} is an N-dimensional prediction vector, where N is the number of categories and each dimension gives the predicted probability of a category. EPI therefore gradually refines the model's segmentation prediction for each unlabeled pixel and is robust to noisy labels, because the pseudo-labels predicted by one generation of the model are not simply taken as the target of the next generation, which avoids a drastic deterioration of pseudo-label quality. (A proof is given in the original paper and is omitted here.)

LV (Language-Vision) Loss

To further utilize textual information to guide pseudo-label generation, we design the LV (Language-Vision) loss, as shown in Fig. 3(b). Generally speaking, the positions of human organs in medical images are relatively fixed, so structured textual information can be used to form corresponding masks (contrast labels). The cosine similarity between texts is computed as shown in Equation 16.

Here, x_text,p denotes the text feature vector corresponding to the pseudo-label, and x_text,c denotes the text feature vector corresponding to the contrast label. According to TextSim, the contrast text with the highest similarity is selected and the segmentation mask corresponding to that text is retrieved; the label similarity is then the cosine similarity between the predicted segmentation pseudo-label and the contrast label, as shown in Equations 17 and 18.

In the formula, x_img,p denotes the pseudo-label feature vector and x_img,c denotes the contrast-label feature vector. Compared with Euclidean distance, cosine similarity is less sensitive to absolute values and reflects similarity more qualitatively, which fits the motivation of our task: contrast labels mainly provide label information about approximate locations rather than refined boundaries.
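A sketch of the similarity computations described above. The tensor shapes and the use of 1 − LabelSim as the penalty are assumptions for illustration, not the paper's exact Eqns. 16-18.

```python
# TextSim picks the most similar contrast report; LabelSim compares the pseudo-label
# with that report's mask via cosine similarity.
import torch
import torch.nn.functional as F

def text_sim(x_text_p: torch.Tensor, x_text_c: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the pseudo-label text feature (D,) and each contrast text feature (K, D)."""
    return F.cosine_similarity(x_text_p.unsqueeze(0), x_text_c, dim=-1)   # (K,)

def lv_loss(x_img_p: torch.Tensor, x_text_p: torch.Tensor,
            x_img_c: torch.Tensor, x_text_c: torch.Tensor) -> torch.Tensor:
    """Select the contrast label whose text is most similar, then penalize dissimilarity
    between the flattened pseudo-label and that contrast label (assumed form: 1 - LabelSim)."""
    best = torch.argmax(text_sim(x_text_p, x_text_c))                     # closest contrast report
    label_sim = F.cosine_similarity(x_img_p.flatten(), x_img_c[best].flatten(), dim=0)
    return 1.0 - label_sim
```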

Therefore, the primary purpose of the LV loss is to prevent cases with significant differences from being mis-segmented or mislabeled. We only use the LV loss when the data is unlabeled, because when the data is labeled the contrast label does not help much in improving performance. In the unlabeled setting, the LV loss provides consistent supervision and avoids a drastic deterioration in pseudo-label quality. It is worth noting that the pseudo-labels and contrast labels in LViT aim to address different problems than the masked contrastive learning in VLT [29].

First, pseudo-labels and contrast labels are designed for semi-supervised learning, while masked contrastive learning aims to explore the knowledge of different linguistic representations related to a single object. Second, LViT determines whether cases are similar by computing text similarity, while VLT does so by extracting text features. However, in the medical domain, determining the similarity between radiology reports through implicit feature extraction is difficult because different radiology reports may differ in only a few words.

Therefore, structured formats are often used to differentiate reports. Furthermore, unlike masked contrastive learning, we design the Exponential Pseudo-label Iteration (EPI) mechanism to guarantee the quality of pseudo-labels with text information, cross-utilizing the label information of labeled data and the latent information of unlabeled data.

Proof of the Superiority of CNN-Transformer Structure

Unlike previous vision-and-language work, the LViT model is innovative in how it processes images and text: instead of using a text encoder, it exploits the interaction between the CNN and ViT branches to extract features.

The proof involves many formulas and is not reproduced here; see the original paper.

Experiments

Datasets

1) MosMedData+

Contains 2,729 CT scans of lung infections

2) QaTa-COV19

The dataset consists of 9,258 COVID-19 chest X-rays, with manual annotations of COVID-19 lesions provided for the first time. The authors extend QaTa-COV19 with text annotations for training vision-language models, making it the first COVID-19 dataset with text annotations, built with the help of professionals. The text annotations focus on whether both lungs are infected, the number of lesion areas, and the approximate locations of the infected areas.

3) ESO-CT

Consists of 286 cases

Loss function

L_Dice = Dice loss

L_CE = cross-entropy loss

For unlabeled data, an additional LV loss term L_LV is introduced with weight α = 0.1; for labeled data, α = 0. Segmentation performance is evaluated using Dice and mIoU, and an early stopping mechanism is adopted during the training phase.

Here N denotes the number of pixels and C the number of categories (C is set to 1 in our experiments). p_ij denotes the predicted probability that pixel i belongs to class j, and y_ij indicates whether pixel i belongs to class j: y_ij = 1 if pixel i belongs to class j, and 0 otherwise.
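A sketch of how the overall loss might be assembled for the binary (C = 1) case, with α = 0.1 for unlabeled data and α = 0 for labeled data; the equal weighting of the Dice and cross-entropy terms is an assumption based on the text above.

```python
# Loss-combination sketch: Dice + cross-entropy, plus an alpha-weighted LV term for unlabeled data.
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """pred, target: (B, 1, H, W); pred holds probabilities, target holds {0, 1} masks or pseudo-labels."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(pred: torch.Tensor, target: torch.Tensor,
               lv_term: torch.Tensor, labeled: bool) -> torch.Tensor:
    alpha = 0.0 if labeled else 0.1
    ce = F.binary_cross_entropy(pred, target)   # cross-entropy for the single-class case
    return dice_loss(pred, target) + ce + alpha * lv_term
```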

Evaluation metrics

The Dice score and mIoU metrics are used to evaluate the performance of our LViT model and other SOTA methods.
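For completeness, a small sketch of the two metrics on binary masks (thresholding predictions at 0.5 is an assumed convention).

```python
# Dice and IoU metric sketch for binary segmentation masks.
import torch

def dice_score(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> float:
    pred = (pred > 0.5).float()
    inter = (pred * gt).sum()
    return ((2 * inter + eps) / (pred.sum() + gt.sum() + eps)).item()

def iou_score(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> float:
    pred = (pred > 0.5).float()
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return ((inter + eps) / (union + eps)).item()
```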

Implementation details

Framework: PyTorch

Hardware: the operating system is Ubuntu 16.04.12 LTS, the CPU is an Intel(R) Xeon(R) Gold 5218, the GPU setup is two TESLA V100 32G cards, and the memory capacity is 128 GB.

learning rate:

The initial learning rate of the QaTa-COV19 dataset is set to 3e-4

The initial learning rate of the MosMedData+ dataset is set to 1e-3.

We also use an early-stopping mechanism: training stops if the model's performance does not improve within 50 epochs. Since the datasets differ in size, different batch sizes are used; the default batch size for the QaTa-COV19 dataset is 24.

Experimental results

Not much to say here; see the result figures and tables in the paper.

Ablation experiment

The ablation study evaluates the following aspects:

1. Effectiveness of supervised components

There is no significant benefit from using L_LV on labeled data.

2. Model size

LViT with text annotations has only 1.7M more parameters and 0.1G more computation than LViT-w, while the improvement in segmentation performance from the text information is significant.

If the dataset distribution differs significantly and image segmentation is challenging, increasing the model size can improve performance. It is also worth noting that as the model size increases, the performance jitter decreases, indicating that the model becomes more robust.

3. Hyperparameters

Hyperparameters have a greater impact on model performance than model size.

4. Ablation Study of Text Encoder and Embedding Layer

One set of experiments focuses on existing well-structured texts, while the other focuses on poorly structured texts.

Using a text encoder requires almost three times as many parameters and as much computation as using a text embedding layer.

However, despite the increased complexity, the model's performance does not improve, and even decreases on well-structured reports. This finding supports the decision to use a text embedding layer in the LViT model.

It is worth noting that for poorly structured reports, the performance of the model with the text embedding layer is slightly lower than that with the text encoder.

We attribute this difference to the better encoding ability and robustness of the text encoder when dealing with more diverse radiology reports. However, the resulting parameter and computational costs make it not cost-effective.

5. Semi-supervised

These experiments cover two label ratios, 25% and 50%, to explore how performance varies with the proportion of labels.

Our proposed LViT model achieves better segmentation performance than the other methods, whether or not textual information is included in the pipeline. This is attributed to the exponential pseudo-label iteration mechanism and the LV loss.


Interpretability

Grad-CAM is used for visualization.


Source: https://blog.csdn.net/Scabbards_/article/details/131976535