VQA for medical vision - a review

Reading notes on a VQA paper

Medical Visual Question Answering: A Survey

  1. The data sources, data volumes, and task characteristics of the publicly available medical VQA datasets are collected and discussed.
  2. Methods (techniques, innovations, and performance improvements) used for the medical VQA task are reviewed.
  3. Medical-specific challenges in this field are analyzed and future research directions are discussed.


Datasets and Performance Metrics

Datasets (mainly related to radiology and pathology)

There are 8 publicly available medical VQA datasets so far, listed here in chronological order:
VQA-Med-2018 [31]
VQA-RAD [45]
VQA-Med-2019 [14]
RadVisDial [44]
PathVQA [33]
VQA-Med-2020 [13]
SLAKE [53]
VQA-Med-2021 [15]

VQA-Med-2018 (the first publicly available medical VQA dataset)

VQA-Med-2018 [31] is a dataset proposed in ImageCLEF 2018, and the first publicly available VQA dataset in the medical field. QA pairs are generated from image captions by a semi-automatic method. First, a rule-based question generation (QG) system automatically generates candidate question-answer pairs through sentence simplification, answer phrase identification, question generation, and candidate question ranking. Then, two expert human annotators (including a clinical medical expert) manually checked all generated QA pairs in two passes: one pass ensured semantic correctness, and the other ensured clinical relevance to the associated medical images.
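A minimal sketch of the template-matching idea behind such a rule-based caption-to-QA pipeline; the patterns and function names below are illustrative assumptions, not the authors' actual system:

```python
import re

# Illustrative patterns mapping a caption phrase to a question whose answer
# is the matched phrase. The real QG system in [31] also performs sentence
# simplification and candidate ranking; this only sketches the matching step.
PATTERNS = [
    (re.compile(r"^(CT|MRI|Ultrasound|X-ray)\b", re.I),
     "What imaging modality was used?"),
    (re.compile(r"\b(head|chest|abdomen|brain|liver)\b", re.I),
     "Which part of the body is shown?"),
]

def generate_qa(caption: str):
    """Yield candidate (question, answer) pairs from one image caption."""
    for pattern, question in PATTERNS:
        match = pattern.search(caption)
        if match:
            yield question, match.group(0)

for qa in generate_qa("CT scan of the abdomen showing a lesion"):
    print(qa)
# ('What imaging modality was used?', 'CT')
# ('Which part of the body is shown?', 'abdomen')
```

In the dataset itself, pairs produced this way would still go through the two manual review passes described above.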

VQA-RAD (Radiology Specific Dataset)

VQA-RAD [45] is a radiology-specific dataset proposed in 2018. The image set is balanced, containing head, chest, and abdomen samples from MedPix. To investigate questions asked in real-world scenarios, the authors presented the images to clinicians to collect unguided questions; clinicians were asked to pose questions both freely and in template form. The QA pairs were then manually validated and categorized to analyze their clinical focus.
Answers are either closed-ended or open-ended. Although the dataset is not large, VQA-RAD captures the basic information that a medical VQA system acting as an AI radiologist should be able to answer.

VQA-Med-2019 (four question categories: modality, plane, organ system, and abnormality; the first three are treated as classification tasks, while abnormality questions require generated answers)

VQA-Med-2019 [14] is the second version of VQA-Med, released during the ImageCLEF 2019 challenge. Inspired by VQA-RAD [45], VQA-Med-2019 addresses the four most common question categories: modality, plane, organ system, and abnormality. For each category, the questions follow the patterns of the hundreds of naturally posed and validated questions in VQA-RAD [45]. The first three categories (modality, plane, and organ system) can be approached as classification tasks, while the fourth (abnormality) is treated as an answer-generation problem.
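A hedged sketch of the resulting hybrid setup, with stand-in `classifier` and `generator` objects rather than real trained models:

```python
# Sketch of the task split in VQA-Med-2019: the first three categories are
# closed-set classification over a fixed answer vocabulary, while abnormality
# questions require free-form answer generation. The model objects passed in
# are stand-ins for real trained networks.
CLASSIFICATION_CATEGORIES = {"modality", "plane", "organ_system"}

def answer(category: str, image_feats, question_feats, classifier, generator):
    if category in CLASSIFICATION_CATEGORIES:
        # Closed-set: pick the most likely label from a fixed answer list.
        return classifier.predict(image_feats, question_feats)
    # Open-ended: decode an abnormality description token by token.
    return generator.decode(image_feats, question_feats)
```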

RadVisDial (the first publicly available dataset for visual dialogue in radiology)

RadVisDial [44] is the first publicly available dataset for visual dialogue in radiology. Visual dialogue, consisting of multiple QA pairs, is considered a more practical and more complex task for radiological AI systems than VQA. Images are selected from MIMIC-CXR [37]. For each image, MIMIC-CXR provides an associated structured report with annotations for 14 labels (13 abnormalities plus a No Findings label). RadVisDial consists of two parts: a silver-standard dataset and a gold-standard dataset.
In the silver-standard dataset, dialogues are synthesized from the plain-text report associated with each image. Each dialogue contains 5 questions randomly drawn from 13 possible questions; the corresponding answers are automatically extracted from the report and limited to four options (yes, no, probably, not mentioned in the report).
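A minimal sketch of how such a silver-standard dialogue could be synthesized, assuming CheXpert-style label values (1.0 positive, 0.0 negative, -1.0 uncertain, absent means not mentioned); the question wording and label names are illustrative:

```python
import random

# Abnormality label -> question text (illustrative wording, not the
# dataset's exact phrasing).
QUESTIONS = {
    "pneumonia": "Is there evidence of pneumonia?",
    "edema": "Is there evidence of edema?",
    # ... remaining 11 abnormality labels
}

# Assumed mapping from report-derived label values to the four answers.
ANSWER_MAP = {1.0: "yes", 0.0: "no", -1.0: "probably", None: "not mentioned"}

def make_dialogue(report_labels: dict, num_turns: int = 5):
    """Build one synthetic dialogue from the labels of a single report."""
    turns = []
    for label in random.sample(list(QUESTIONS),
                               k=min(num_turns, len(QUESTIONS))):
        answer = ANSWER_MAP[report_labels.get(label)]
        turns.append((QUESTIONS[label], answer))
    return turns

print(make_dialogue({"pneumonia": 1.0}))
# e.g. [('Is there evidence of edema?', 'not mentioned'),
#       ('Is there evidence of pneumonia?', 'yes')]
```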
In the gold-standard dataset, conversations were collected from two radiologists following detailed annotation guidelines to ensure consistency; only 100 randomly chosen images were annotated this way. The RadVisDial dataset thus explores a realistic task for AI in healthcare. In addition, the authors compared synthetic dialogues with real-world dialogues and ran experiments highlighting the importance of contextual information: providing the patient's medical history improved accuracy.

PathVQA (a dataset exploring VQA for pathology)

PathVQA [33] is a dataset exploring VQA in pathology. Images together with their captions were extracted from digital sources (electronic textbooks and online libraries). The authors developed a semi-automatic pipeline to convert captions into QA pairs, then manually checked and revised the generated pairs. Questions fall into seven categories: what, where, when, whose, how, how much/how many, and yes/no. Open-ended questions account for 50.2% of all questions; among the yes/no questions, the answers are balanced with 8,145 "yes" and 8,189 "no". The questions are modeled on the American Board of Pathology (ABP) certification exam, so the dataset effectively serves as an exam for validating an "AI pathologist" in decision support. PathVQA demonstrates that medical VQA can be applied in a variety of scenarios.
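A heuristic illustration of bucketing questions into these seven categories by their leading interrogative; the real category assignment in [33] was part of the semi-automatic pipeline, so this is only a sketch:

```python
# Longer prefixes must come before "how" so "how many"/"how much" are not
# swallowed by the plain "how" category.
CATEGORY_PREFIXES = [
    ("how much", "how much/how many"), ("how many", "how much/how many"),
    ("what", "what"), ("where", "where"), ("when", "when"),
    ("whose", "whose"), ("how", "how"),
]

def categorize(question: str) -> str:
    q = question.lower().strip()
    for prefix, category in CATEGORY_PREFIXES:
        if q.startswith(prefix):
            return category
    return "yes/no"  # everything else ("Is ...", "Does ...") is closed-form

print(categorize("Where is the lesion located?"))  # where
print(categorize("Is there necrosis present?"))    # yes/no
```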

VQA-Med-2020

VQA-Med-2020 [13] is the third version of VQA-Med, released in the ImageCLEF 2020 challenge. Image selection was restricted to images that allow a diagnosis based on the image content alone, and the questions focus on abnormalities: a list of 330 abnormality questions was selected, each required to appear at least 10 times in the dataset. QA pairs were generated from predefined question patterns.
In VQA-Med-2020, the visual question generation (VQG) task was introduced to the medical domain for the first time. The VQG task consists of generating natural language questions relevant to an image's content. The medical VQG set includes 1,001 radiology images and 2,400 associated questions. Ground-truth questions were generated from image captions with a rule-based approach and then manually revised.

SLAKE (a comprehensive dataset with semantic labels and a structured medical knowledge base)

SLAKE [53] is a comprehensive dataset with semantic labels and a structured medical knowledge base. The images are selected from three open-source datasets [75, 88, 40] and annotated by experienced physicians. Semantic labels provide masks (segmentation) and bounding boxes (detection) for visual objects in the images. The medical knowledge base takes the form of a knowledge graph, extracted from OwnThink and manually reviewed, stored as triplets (e.g., <heart, function, facilitates blood flow>). The dataset contains 2,603 English triplets and 2,629 Chinese triplets. The knowledge graph enables external knowledge-based questions, such as those about organ function and disease prevention. Questions were collected from experienced physicians, who either selected from pre-set questions or rephrased them; the questions were then categorized by type and balanced to avoid bias.
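A minimal sketch of answering such an external-knowledge question by looking up SLAKE-style triplets; the triplet store and the lung triplet are illustrative assumptions (only the heart triplet appears in the source):

```python
# <subject, relation, object> triplets in the style of the SLAKE knowledge
# graph. The heart triplet is the paper's example; the lung one is assumed.
TRIPLETS = [
    ("heart", "function", "facilitates blood flow"),
    ("lung", "function", "gas exchange"),
]

def query(subject: str, relation: str):
    """Return all objects linked to (subject, relation) in the graph."""
    return [o for s, r, o in TRIPLETS if s == subject and r == relation]

# A VQA model that has grounded "heart" in the image can then answer
# "What is the function of this organ?" with a graph lookup:
print(query("heart", "function"))  # ['facilitates blood flow']
```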

VQA-Med-2021

VQA-Med-2021 [15] was released in the ImageCLEF 2021 challenge. It was created following the principles of VQA-Med-2020: the training set is the same as in VQA-Med-2020, while the validation and test sets are new and were manually checked by doctors.

Methods

