Original link: https://arxiv.org/abs/2307.07362
A review of medical multi-modal learning. I focused on the segmentation task and did not have time to cover the other tasks in depth; for each task I kept the model summary diagram from the paper. For more detail, see the paper linked above.
Dataset summary
Report generation
Report generation aims to automatically produce textual descriptions from EHRs and medical images.
It reduces clinicians' workload and improves the quality of the reports themselves. Since training a report generator typically requires medical images paired with text reports written by clinicians, the task is naturally a multimodal learning problem.
1) A CNN encoder with a hierarchical LSTM decoder
2) Transformer architectures
3) AlignTransformer
4) Self-supervised learning techniques, such as CLIP
5) Reward mechanisms to improve accuracy
Model summary
Evaluation criteria
1. Text quality
Refers to the readability, accuracy, and validity of the text.
Metrics: BLEU [19], METEOR [50], and ROUGE-L [51]
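As a sketch of how these n-gram overlap metrics work, here is a simplified sentence-level BLEU (single reference, no smoothing, bigrams by default). This is an illustration only, not the official implementation from [19]; real evaluations use corpus-level BLEU with smoothing:

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty (single reference)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)
```

A candidate identical to the reference scores 1.0; dropping a word lowers both the bigram precision and the brevity penalty.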
2. Medical correctness
AUC, precision, recall, F1, RadCliQ
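A minimal sketch of how precision, recall, and F1 are computed from binary finding labels (assuming 1 = finding present; AUC needs ranked scores and RadCliQ is a learned composite, so both are omitted here):

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = finding present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```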
3. Explainability
factENT, factENTNLI
Visual question answering
Model summary
Cross-modal retrieval
Model summary
Diagnostic classification
Model summary
Semantic segmentation
Semantic segmentation is used to assess the effectiveness of image-text contrastive learning: visual features extracted for segmentation are juxtaposed with textual features to test how well a model understands the relationship between an image and its corresponding textual description (Table 6). Local alignment in contrastive learning is likewise evaluated with semantic segmentation.
Image-text alignment and local representation learning are commonly used semantic segmentation methods in MDL. These techniques can improve model accuracy and help the model better understand the spatial relationships between different regions in the image and the correspondence between visual and textual information [119].
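As an illustration of the image-text contrastive objective these alignment methods build on, here is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of paired embeddings. The embedding shapes and temperature are assumptions for the sketch; this is not the exact objective of any specific cited method:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched
    image-report pair, and all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch); matches on the diagonal

    def xent_diag(l):
        # cross-entropy with the diagonal entry as the correct class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Perfectly matched pairs (each image embedding identical to its report embedding) drive the loss toward zero; shuffling the pairing raises it.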
Li et al. [120] proposed LViT, which utilizes medical text annotations to improve the quality of image data and guide the generation of pseudo-labels for better segmentation performance. Muller et al. [121] designed a novel pre-training method, LoVT, aimed specifically at localized medical imaging tasks; it outperforms commonly used pre-training techniques on 10 of 18 localization tasks.
Model summary
Datasets
SIIM
The dataset includes 12,047 chest radiographs with corresponding manual annotations.
RSNA
The dataset includes 29,700 frontal radiographs for evaluating evidence of pneumonia.
MS-CXR
It consists of 1,153 image-sentence pairs with annotated bounding boxes and corresponding radiologist-verified phrases, covering eight different cardiopulmonary radiology findings.
Evaluation criteria
1) Dice
2) mIoU (mean intersection over union)
3) CNR (contrast-to-noise ratio)
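A minimal NumPy sketch of these three metrics, assuming binary masks for Dice and CNR and integer class maps for mIoU. CNR definitions vary across papers; this sketch uses the absolute mean difference between the masked region and the background divided by the square root of the summed variances:

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient for binary masks: 2|P and T| / (|P| + |T|)."""
    inter = np.logical_and(pred, target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def miou(pred, target, num_classes, eps=1e-7):
    """Mean IoU: average over classes of |P and T| / |P or T|."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        ious.append((inter + eps) / (union + eps))
    return float(np.mean(ious))

def cnr(image, mask):
    """Contrast-to-noise ratio between the masked region and background."""
    m = mask.astype(bool)
    inside, outside = image[m], image[~m]
    return abs(inside.mean() - outside.mean()) / np.sqrt(inside.var() + outside.var())
```

Dice rewards overlap relative to mask sizes, mIoU penalizes both false positives and false negatives per class, and CNR measures how well the segmented region stands out from its background.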