LoVT: joint learning of local representations for medical images and reports

Paper: https://arxiv.org/abs/2112.02889

GitHub: https://github.com/philip-mueller/lovt (Localized representation learning from Vision and Text, LoVT)

Summary

Abstract: Contrastive learning has been shown to be effective for pre-training image models on unlabeled data and has shown promising results on tasks such as medical image classification. Using paired text (such as radiology reports) during pre-training improves the results further. However, most existing methods target image classification as the downstream task and may not be optimal for localized tasks such as semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), which, to our knowledge, is the first text-supervised pre-training method targeting localized medical imaging tasks. The method combines instance-level image-report contrastive learning with local contrastive learning of image region and report sentence representations. We evaluate LoVT and commonly used pre-training methods on an evaluation framework of 18 localized chest radiograph tasks from 5 public datasets. LoVT performs best on 10 of the 18 tasks, making it the method of choice for localized tasks.

background

1) High-quality annotated data for medical images is scarce.

2) Rule-based natural language processing (NLP) models such as CheXpert [37] extract labels from these reports, allowing large datasets to be created automatically, but they have some obvious limitations and can generally only be used for classification. They generate labels for the report as a whole (and thus for the paired image), but local tasks such as semantic segmentation or object detection require labels associated with specific image regions, so these labels cannot be used for them. Also, rule-based NLP models have to be created manually and do not generalize to different classification tasks or even to different report-writing styles [37]. Instead of generating classification labels, the reports can also be used directly in pre-training methods, as first proposed with ConVIRT [96]. There, the semantic information contained in the report serves as weak supervision to pre-train the image model, which is then fine-tuned on labeled data for the downstream task; this improves results or reduces the number of labeled samples required. However, such methods are not designed for local downstream tasks.

contributions

1) We propose a local contrastive loss that aligns local representations of sentences and image regions while encouraging spatial smoothness and sensitivity.

2) We split each report into sentences and each image into regions (i.e., patches), compute sentence and region representations, and align them using an attention mechanism and our proposed local contrastive loss.

3) We use attention pooling on region and sentence representations to compute global (i.e. per-image and per-report) representations, and then use a global contrastive loss to align them.

4) We propose Localized representation learning from Vision and Text (LoVT), a pre-training method that extends ConVIRT [96] with our proposed ideas and outperforms it on most localized downstream tasks.

5) We evaluate our approach, trained on MIMIC-CXR [42, 41, 40, 26], on the downstream evaluation framework [58], which consists of 18 localized chest radiograph tasks, including object detection and semantic segmentation, on 5 public datasets.
We compare it to several self-supervised and text-supervised methods, as well as to transfer from classification, in more than 1400 evaluation runs. Our method LoVT proves to be the most successful, outperforming all other methods on 10 of the 18 tasks.

related work

Contrastive learning

Most contrastive learning methods only use instance-level contrast, i.e., each view of an image is represented by a single vector. While the resulting representations are well suited for global downstream tasks, they are not designed for local downstream tasks.

Therefore, a number of methods using region-level contrast have recently emerged, i.e., they operate on representations of image regions. Unlike our method, these methods do not use paired text.

Self-supervised representation learning methods

Pre-training image models for downstream tasks using accompanying text:

The relevant literature is listed directly below:

[67]Learning transferable visual models from natural language supervision

[39] Scaling up visual and vision-language representation learning with noisy text supervision

[96] Contrastive learning of medical visual representations from paired images and text.

[12] Learning visual representations from textual annotations

[73] Learning visual representations with caption annotations

[51] Learning data-efficient visual representations from localized textual supervision

Image captioning task (generation task):

VirTex [12] and ICMLM [73]

Multi-view contrastive learning:

ConVIRT [96], CLIP [67] and ALIGN [39] are more suitable for discriminative downstream tasks and use the NT-Xent loss on image and text views. The main difference between these methods is the dataset they are trained on: ConVIRT is trained on chest x-rays, while the other methods use natural images. Furthermore, CLIP uses attention pooling to compute image representations from feature maps, while the other methods use the image encoder's default pooling.

Our method follows a similar framework but adds a local contrastive loss for better performance on localized tasks. Furthermore, it encodes the entire report instead of sampling individual sentences, and uses attention pooling in the image and text encoders.

Local mutual information methods

These methods perform contrastive learning on report sentences and image regions, but they target classification rather than localized tasks and thus encourage neither contrast between regions nor spatial smoothness.

method

We randomly sample one of the images associated with a given report and divide it into 7 × 7 equally sized regions. More precisely, we resize the image to 224 × 224 pixels, feed it into a convolutional neural network, and use the resulting feature map of size 7 × 7 as the region representations.

A language model encodes the report's tokens into contextualized vector representations (taking into account their meaning in the overall report), from which we compute sentence representations.

A many-to-many alignment model is then used to compute cross-modal representations from the unimodal representations, i.e., image-region representations from sentence representations and vice versa.

We argue that by aligning cross-modal and unimodal representations, the image region representations are encouraged to contain the high-level semantics present in the reports.

Model overview

It is a two-tower (dual-encoder) model.

Each training sample $x_i$ is a pair of an image $x^I_i \in \mathbb{R}^{224 \times 224}$ and an associated report $x^R_i$ consisting of $M_i$ sentences. $x^I_i$ and $x^R_i$ are encoded into two global representations (one for the image and one for the report) and multiple local representations per sample, corresponding to image regions and report sentences, respectively. An attention-based alignment model then computes cross-modal representations (i.e., sentence representations from image regions and vice versa), which are aligned with the local unimodal representations using a local contrastive loss. Furthermore, the global representations are aligned using a global contrastive loss. The encoders and the alignment model are jointly trained on batches of image-report pairs $x_i$. The details of the model and loss functions are described in the following sections.

Encoders

image

Each image $x^I_i$ is encoded into $K = H \times W$ (we use $K = 7 \times 7$) region representations $y^I_{i,k} \in \mathbb{R}^{d_I}$ using the image encoder $E^I$, where $k$ is the index of the image region and $d_I$ is the dimensionality of the image-region representation space.

We use a ResNet50 and take the feature map before global average pooling as the region representations.
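To make this concrete, here is a minimal PyTorch sketch of how a 224×224 image yields the 7×7 grid of region representations; it assumes a standard torchvision ResNet50 with the classification head and global pooling removed, which may differ in details from the encoder configuration used in the paper:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageRegionEncoder(nn.Module):
    """Encodes an image into K = 7x7 region representations (sketch)."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # load pre-trained weights in practice
        # Keep everything up to (and excluding) global average pooling,
        # so the output is the 7x7 feature map.
        self.stem = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x):
        # x: (B, 3, 224, 224) -> feature map (B, 2048, 7, 7)
        feats = self.stem(x)
        # Flatten spatial dims: (B, K = 49, d_I = 2048) region representations
        return feats.flatten(2).transpose(1, 2)

# Example: a batch of two images gives 49 region vectors of dimension 2048 each.
regions = ImageRegionEncoder()(torch.randn(2, 3, 224, 224))
print(regions.shape)  # torch.Size([2, 49, 2048])
```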

text

The report encoder $E^R$ encodes each report $x^R_i$ into $M_i$ sentence representations $y^R_{i,m} \in \mathbb{R}^{d_R}$, where $M_i$ is the number of sentences in report sample $i$, $m$ is the sentence index, and $d_R$ is the dimensionality of the sentence representation space. Note that while $K$ is constant, $M_i$ may differ for each sample. Any model that encodes sentences into vector representations can be used as $E^R$.

We use BERT base, which jointly encodes the tokens of all concatenated sentences of a report; max pooling is then applied over the token representations of each sentence to obtain its sentence representation.
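A hedged sketch of this sentence-encoding step is shown below; the bert-base-uncased checkpoint and the way sentence boundaries are tracked are assumptions for illustration only (the text above only states that BERT base is used):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Model variant is an assumption; a domain-specific BERT could be used instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode_report_sentences(sentences):
    """Jointly encode all sentences of one report, then max-pool the token
    representations belonging to each sentence (illustrative sketch)."""
    # Tokenize per sentence so we know which tokens belong to which sentence.
    per_sentence_ids = [tokenizer(s, add_special_tokens=False)["input_ids"] for s in sentences]
    input_ids = [tokenizer.cls_token_id]
    sentence_of_token = [-1]  # -1 marks special tokens
    for m, ids in enumerate(per_sentence_ids):
        input_ids += ids
        sentence_of_token += [m] * len(ids)
    input_ids.append(tokenizer.sep_token_id)
    sentence_of_token.append(-1)

    tokens = torch.tensor([input_ids])
    with torch.no_grad():
        hidden = bert(tokens).last_hidden_state[0]   # (num_tokens, 768)
    sent_ids = torch.tensor(sentence_of_token)

    # Max pooling over the token representations of each sentence m.
    return torch.stack([hidden[sent_ids == m].max(dim=0).values
                        for m in range(len(sentences))])  # (M_i, 768)

reprs = encode_report_sentences(["No focal consolidation.", "Heart size is normal."])
print(reprs.shape)  # torch.Size([2, 768])
```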

attention pooling layer

The global image and report representations $\bar{y}^I_i$ and $\bar{y}^R_i$ are computed from the local representations using multi-head query-key-value attention pooling.
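No further details are given here, so the following is only a minimal sketch of multi-head query-key-value attention pooling, under the assumption (borrowed from CLIP-style pooling) that the mean of the local representations serves as the query:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool local representations (regions or sentences) into one global vector
    via multi-head query-key-value attention. Using the mean of the local
    representations as the query is an assumption for this sketch."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, local_reprs):
        # local_reprs: (B, N, dim) with N = K regions or M_i sentences
        query = local_reprs.mean(dim=1, keepdim=True)   # (B, 1, dim)
        pooled, _ = self.attn(query, local_reprs, local_reprs)
        return pooled.squeeze(1)                        # (B, dim)

global_image_repr = AttentionPooling(2048)(torch.randn(2, 49, 2048))
print(global_image_repr.shape)  # torch.Size([2, 2048])
```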

projection

We compute projected local representations $z^I_{i,k} \in \mathbb{R}^{d_Z}$ and $z^R_{i,m} \in \mathbb{R}^{d_Z}$ and projected global representations $\bar{z}^I_i \in \mathbb{R}^{\bar{d}_Z}$ and $\bar{z}^R_i \in \mathbb{R}^{\bar{d}_Z}$ from the representations $y^I_{i,k}$, $y^R_{i,m}$, $\bar{y}^I_i$ and $\bar{y}^R_i$, using the (non-shared) nonlinear transformations $f^I$, $f^R$, $\bar{f}^I$ and $\bar{f}^R$, respectively, where $d_Z$ and $\bar{d}_Z$ are the dimensionalities of the shared local and shared global representation spaces (we use 512 for both). Note that for the local representations, the projections are applied independently to each region $k$ and each sentence $m$.
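A sketch of these non-shared projection heads is given below; the layer sizes are assumptions, and the single-hidden-layer nonlinear structure follows ConVIRT, which LoVT extends:

```python
import torch.nn as nn

def projection_head(in_dim, out_dim=512, hidden_dim=512):
    """Nonlinear projection into the shared (local or global) space.
    Exact sizes and activation are assumptions for illustration."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# Non-shared heads: one per modality and per level (local vs. global).
f_I     = projection_head(2048)   # local image regions  y^I_{i,k} -> z^I_{i,k}
f_R     = projection_head(768)    # local sentences      y^R_{i,m} -> z^R_{i,m}
f_I_bar = projection_head(2048)   # global image         \bar{y}^I_i -> \bar{z}^I_i
f_R_bar = projection_head(768)    # global report        \bar{y}^R_i -> \bar{z}^R_i
```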

Alignment model

We compute the alignment of image regions and sentences, and compute cross-modal representations, using the single-head query-key-value attention-based alignment models $A^{I \to R}$ and $A^{R \to I}$ [82].

For each sentence $m$, the cross-modal representation $z^{I \to R}_{i,m}$ is computed by letting $z^R_{i,m}$ attend to all image-region representations $z^I_{i,k}$ (of the related image). To do so, we compute the probability $\alpha^{I \to R}_{i,m,k}$ that sentence $m$ is aligned with region $k$ from the scaled dot-product scores of their projected representations, where $Q$ is the learned matrix of the linear query-key projection.

The alignment model $A^{I \to R}$ then uses $\alpha^{I \to R}_{i,m,k}$ to compute $z^{I \to R}_{i,m}$ as a projected weighted sum of the image-region representations $z^I_{i,k}$, where the value projection $V$ and the output projection $O$ are learned matrices.

In a similar way, the alignment model $A^{R \to I}$ computes, for each region $k$, the cross-modal representation $z^{R \to I}_{i,k}$ by letting $z^I_{i,k}$ attend to all sentence representations $z^R_{i,m}$ with attention probabilities $\alpha^{R \to I}_{i,k,m}$.

Note that since $A^{R \to I}$ and $A^{I \to R}$ share the same matrices $Q$, $V$ and $O$, the only difference between $\alpha^{R \to I}_{i,k,m}$ and $\alpha^{I \to R}_{i,m,k}$ is the transposition of the indices and the axis over which the softmax is applied.
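The following PyTorch sketch summarizes my reading of the alignment model: plain single-head scaled dot-product attention with a shared query-key projection $Q$, a value projection $V$ and an output projection $O$, applied in both directions; it is an illustration of the description above, not the paper's exact implementation:

```python
import math
import torch
import torch.nn as nn

class AlignmentModel(nn.Module):
    """Single-head QKV attention shared between A^{I->R} and A^{R->I} (sketch)."""
    def __init__(self, d_z=512):
        super().__init__()
        self.Q = nn.Linear(d_z, d_z, bias=False)  # shared query-key projection
        self.V = nn.Linear(d_z, d_z, bias=False)  # value projection
        self.O = nn.Linear(d_z, d_z, bias=False)  # output projection
        self.d_z = d_z

    def cross_attend(self, queries, context):
        # queries: (B, N_q, d_z), context: (B, N_c, d_z)
        scores = self.Q(queries) @ self.Q(context).transpose(1, 2) / math.sqrt(self.d_z)
        alpha = scores.softmax(dim=-1)            # alignment probabilities over the context
        return self.O(alpha @ self.V(context)), alpha

    def forward(self, z_I, z_R):
        # z_I: (B, K, d_z) region reprs, z_R: (B, M, d_z) sentence reprs
        z_I_to_R, alpha_I_to_R = self.cross_attend(z_R, z_I)  # per-sentence cross-modal reprs
        z_R_to_I, alpha_R_to_I = self.cross_attend(z_I, z_R)  # per-region cross-modal reprs
        return z_I_to_R, z_R_to_I

z_I_to_R, z_R_to_I = AlignmentModel()(torch.randn(2, 49, 512), torch.randn(2, 8, 512))
print(z_I_to_R.shape, z_R_to_I.shape)  # torch.Size([2, 8, 512]) torch.Size([2, 49, 512])
```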

Loss functions

Global alignment

For global alignment, we follow ConVIRT [96], maximizing the cosine similarity between paired image and report representations while minimizing the similarity between unpaired (i.e., from different samples) representations.
The loss consists of an image-to-report part, in which all unpaired report representations are used as negative examples,

where τ is the similarity temperature (we use 0.1), and all logarithms are natural.

The report-to-image part is defined analogously, with the roles of images and reports swapped.

These two parts are combined using the hyperparameter λ ∈ [0,1] (we use 0.75):
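Since the formulas themselves appear only as screenshots in the original post, here is a reconstruction of the global alignment loss in the NT-Xent form used by ConVIRT, with $N$ denoting the batch size; it should match the description above but deserves checking against the paper:

```latex
\ell_i^{I \to R} = -\log
  \frac{\exp\!\big(\cos(\bar{z}^I_i, \bar{z}^R_i)/\tau\big)}
       {\sum_{j=1}^{N} \exp\!\big(\cos(\bar{z}^I_i, \bar{z}^R_j)/\tau\big)},
\qquad
\ell_i^{R \to I} = -\log
  \frac{\exp\!\big(\cos(\bar{z}^R_i, \bar{z}^I_i)/\tau\big)}
       {\sum_{j=1}^{N} \exp\!\big(\cos(\bar{z}^R_i, \bar{z}^I_j)/\tau\big)},
\qquad
\mathcal{L}^{\text{global}} = \frac{1}{N} \sum_{i=1}^{N}
  \Big( \lambda\, \ell_i^{I \to R} + (1 - \lambda)\, \ell_i^{R \to I} \Big)
```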

 

Local alignment

For now this part was only included as a screenshot; I found it a bit confusing and will explain it properly once I have worked through it.
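Until that explanation is filled in, the following LaTeX sketch only shows the general structure suggested by the contributions above: the cross-modal representation of each sentence (and, analogously, of each region) is contrasted against the unimodal local representations of the same sample, with a separate local temperature $\tau'$. The exact weighting of terms and the spatial smoothness regularization in the paper differ and should be checked against the original:

```latex
% Rough structural sketch only; not the paper's exact local loss.
\ell_{i,m}^{\text{local},R} = -\log
  \frac{\exp\!\big(\cos(z^{I \to R}_{i,m}, z^R_{i,m})/\tau'\big)}
       {\sum_{m'=1}^{M_i} \exp\!\big(\cos(z^{I \to R}_{i,m}, z^R_{i,m'})/\tau'\big)}
\qquad \text{(region side analogous over } k\text{)}
```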

 

experiment

We evaluate on 18 localized chest x-ray tasks from the downstream evaluation framework [58], which we briefly describe here.

Evaluation protocols

We only use the pretrained ResNet50 (i.e., the image encoder) in the downstream tasks.

semantic segmentation

(i) U-Net Finetune: Here, ResNet50 is used as the backbone of a U-Net [70] and fine-tuned jointly with all other layers;

(ii) U-Net Frozen: Here, ResNet50 is used as the frozen backbone of U-Net [70], and only the non-backbone layers are fine-tuned;

(iii) Linear: Here, a per-pixel (element-wise) linear layer is trained on the frozen final feature map of ResNet50 (before pooling), and its output is upsampled to the segmentation resolution (see the sketch below this list).
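As referenced above, a minimal sketch of the Linear protocol: the ResNet50 backbone is frozen, a per-pixel linear layer (a 1×1 convolution) is trained on its final feature map, and the prediction is bilinearly upsampled to the segmentation resolution; class count, input size, and layer names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class LinearSegmentationProbe(nn.Module):
    """Linear protocol sketch: frozen ResNet50 backbone, trainable 1x1 conv
    (per-pixel linear layer), bilinear upsampling to segmentation resolution."""
    def __init__(self, num_classes, image_size=224):
        super().__init__()
        backbone = resnet50(weights=None)  # load pre-trained (e.g., LoVT) weights in practice
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False       # frozen backbone
        self.linear = nn.Conv2d(2048, num_classes, kernel_size=1)  # element-wise linear layer
        self.image_size = image_size

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)                       # (B, 2048, 7, 7)
        logits = self.linear(feats)                        # (B, num_classes, 7, 7)
        return F.interpolate(logits, size=(self.image_size, self.image_size),
                             mode="bilinear", align_corners=False)

masks = LinearSegmentationProbe(num_classes=1)(torch.randn(2, 3, 224, 224))
print(masks.shape)  # torch.Size([2, 1, 224, 224])
```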

Object detection

(i) YOLOv3 Finetune: ResNet50 is used as the backbone of a YOLOv3 [69] model and fine-tuned jointly with the non-backbone layers;

(ii) YOLOv3 Frozen: ResNet50 is used as the frozen backbone of a YOLOv3 [69] model, and only the non-backbone layers are fine-tuned;

(iii) Linear: the object detection ground truth is converted into segmentation masks, which are then evaluated using the Linear segmentation protocol.

Downstream datasets

(i) RSNA Pneumonia Detection [86,74]: more than 26,000 frontal chest x-rays with detection targets for pneumonia opacities. We use the YOLOv3 Finetune, YOLOv3 Frozen and Linear protocols, each with 1%, 10% and 100% of the training samples;

(ii) COVID Rural [81,13]: over 200 frontal chest x-rays with segmentation masks for COVID-19 lung opacity regions. We use the U-Net Finetune, U-Net Frozen and Linear protocols;

(iii) SIIM-ACR Pneumothorax Segmentation [75]: more than 12,000 frontal chest x-rays with pneumothorax segmentation masks. We use the U-Net Finetune and U-Net Frozen protocols, but not Linear, because of the fine-grained nature of the segmentation masks;

(iv) Object CXR [38]: 9,000 frontal chest x-rays with annotations for foreign-object detection. We use the YOLOv3 Finetune, YOLOv3 Frozen and Linear protocols;

(v) NIH CXR [86]: nearly 1,000 frontal chest x-rays with detection targets for 8 pathologies (atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, and pneumothorax). Due to the limited data per class, we only use the Linear protocol.

U-Net Finetune and YOLOv3 Finetune evaluate how well the pre-trained image models can be fine-tuned, as would be done in practical applications.

Linear protocols directly evaluate the learned local representations (i.e., feature maps) while adding as few parameters as possible, thus mostly omitting the variance introduced by random initialization in downstream evaluation.

The U-Net Frozen and YOLOv3 Frozen protocols can be seen as a middle ground between these two extremes: the representations are frozen but evaluated in a more realistic setting (albeit with many randomly initialized layers). Together, these protocols allow analyzing many aspects of the pretrained representations.

Tuning and Evaluation Procedure

We tune all models on a single downstream task, RSNA YOLOv3 Frozen 10%. No other downstream tasks are evaluated during tuning, to ensure that the models are not biased towards the downstream tasks. After tuning, each model is evaluated on all downstream tasks.
For each task, the downstream learning rate was tuned for each model (using a single evaluation run), followed by five evaluations (all using the tuned learning rate). We report the mean results of these five runs with their 95% confidence intervals.

Pre-Training Dataset

We train our method on version 2 of MIMIC-CXR [40, 41, 42, 26] because, to our knowledge, it is the largest and most commonly used dataset of its kind. Since all downstream tasks contain only frontal views, we remove all lateral views, leaving approximately 210,000 training samples, each with one report and one or more frontal images.

Baselines for comparison

Random Init: Initialize ResNet50 using the default random initialization method

ImageNet[71] Init: ResNet50 is initialized with pre-trained weights on the ImageNet ILSVRC-2012 task[71];

CheXpert[37]:  ResNet50 pretrained on frontal chest x-rays of patients in MIMIC-CXR using CheXpert[37] labels for supervised multi-label binary classification

Global image pre-training methods: ResNet50 is pretrained on MIMIC-CXR frontal chest x-rays using the self-supervised pre-training methods SimCLR [9] or BYOL [30]. We include SimCLR because it uses a loss function similar to LoVT's, and BYOL because of its widespread use.

Local image pre-training methods: ResNet50 is pretrained on MIMIC-CXR frontal chest x-rays using the self-supervised pre-training method PixelPro [92]. We use PixelPro to study the effect of a local contrastive loss when using only images.

Global image-text pre-training methods: ResNet50 is pre-trained on frontal MIMIC-CXR using the image-text methods ConVIRT [96] or CLIP [67]. Note that for comparability we adapted CLIP to use the same image and text encoders as ConVIRT, such that the main difference from ConVIRT is CLIP's use of attention pooling to compute the image representation.

Experimental results

1. All results are averaged over five evaluation runs and shown with 95% confidence intervals. The best results for each task are underlined, the second-best results are marked with a dashed underline, and the best results within each pre-training category (general initialization, 30% pre-training, and 100% pre-training) are highlighted in bold. Note that the YOLOv3 Frozen 10% task (task 5) was used to tune all methods and may therefore not be representative, as methods may overfit to this task.

2. All results are averaged over five evaluation runs and shown with 95% confidence intervals. The best results for each task are underlined, the second-best results are marked with a dashed underline, and the best results within each pre-training category (general initialization, 30% pre-training, and 100% pre-training) are highlighted in bold.

 

Ablation experiments

There are too many to cover here; read the appendix of the paper yourself!


Original post: blog.csdn.net/Scabbards_/article/details/132039282