Article Directory
- 1. Definition of Multimodal
- 2. Multimodal tasks
- 2.1 VQA (Visual Question Answering)
- 2.2 Image Captioning
- 2.3 Referring Expression Comprehension
- 2.4 Visual Dialogue
- 2.5 VCR (Visual Commonsense Reasoning)
- 2.6 NLVR (Natural Language for Visual Reasoning)
- 2.7 Visual Entailment
- 2.8 Image-Text Retrieval
- 3. Multimodal fusion methods
1. Definition of Multimodal
Multimodality refers to information in different modalities, including text, images, video, audio, and so on.
As the name suggests, multimodal research is about fusing these different types of data.
Most current work deals only with image and text data: video is converted into image frames and audio is transcribed into text, so the problem reduces to the image and text domains.
2. Multimodal tasks
Multimodal research centers on vision-and-language problems. Its tasks involve classification, question answering, matching, ranking, and localization over images and text.
For example, given a picture, the following tasks can be done:
2.1 VQA (Visual Question Answering)
- Input: a picture, a question described in natural language
- Output: Answer (word or phrase)
2.2 Image Captioning
- Input: a picture
- Output: Natural language description of the image (one sentence)
2.3 Referring Expression Comprehension
- Input: a picture and a natural-language referring expression
- Output: the image region (e.g. a bounding box) that the expression refers to
2.4 Visual Dialogue
- Input: a picture, a dialogue history, and a follow-up question
- Output: an answer that continues the multi-round dialogue about the image
2.5 VCR (Visual Commonsense Reasoning)
- Input: a picture, a question, 4 candidate answers, and 4 candidate rationales
- Output: the correct answer and the rationale that supports it
2.6 NLVR (Natural Language for Visual Reasoning)
- Input: 2 pictures and a natural-language statement
- Output: true or false (whether the statement holds for the image pair)
2.7 Visual Entailment
- Input: an image and a piece of text
- Output: probabilities over 3 labels: entailment, neutral, contradiction
2.8 Image-Text Retrieval
There are 3 variants:
1) Image-to-text retrieval: input an image, return matching text.
2) Text-to-image retrieval: input text, return matching images.
3) Image-to-image retrieval: input an image, return similar images.
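All three variants can be served by a shared embedding space in which matching images and texts lie close together. Below is a minimal NumPy sketch of the ranking step; the "encoders" are random projections standing in for real pretrained vision and text models, so only the retrieval logic is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # L2-normalize rows so that dot products equal cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy embeddings: 3 candidate captions and 1 query image, dim 8.
# In practice these would come from pretrained encoders.
text_embeddings = normalize(rng.standard_normal((3, 8)))
image_embedding = normalize(rng.standard_normal((1, 8)))

# Search text by picture: rank captions by cosine similarity to the image.
scores = (image_embedding @ text_embeddings.T).ravel()  # shape (3,)
ranking = np.argsort(-scores)                           # best caption first
```

Text-to-image retrieval is the same computation with the roles of the two embedding matrices swapped.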
3. Multimodal fusion methods
A pretrained NLP model yields an embedded representation of the text, and a pretrained vision model yields an embedded representation of the image. The question is then how to fuse the two to solve the tasks above.
There are two commonly used multimodal fusion methods:
3.1 Dot product or direct addition
In this method, the text and the image are embedded separately, and the resulting vectors are fused by element-wise multiplication, addition, or concatenation.
The advantage is that it is simple and the computational cost is low.
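A minimal NumPy sketch of this kind of fusion; the two vectors are toy stand-ins for real encoder outputs.

```python
import numpy as np

# Toy stand-ins for the outputs of a text encoder and an image encoder.
text_vec = np.array([0.2, 0.5, 0.1, 0.7])
image_vec = np.array([0.6, 0.1, 0.9, 0.3])

# Element-wise (Hadamard) product: dimensionality stays the same.
fused_product = text_vec * image_vec

# Direct addition: dimensionality also stays the same.
fused_sum = text_vec + image_vec

# Concatenation: dimensionality doubles.
fused_concat = np.concatenate([text_vec, image_vec])
```

Note that the product and sum require both encoders to emit vectors of the same dimension, which usually means adding a linear projection on one side.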
3.2 Transformer
In this method, image features and text features are fed into a Transformer, whose attention layers learn cross-modal interactions between the two.
The advantage is that the Transformer architecture produces richer joint representations of image and text features.
The disadvantage is that it takes up a lot of memory and the computational cost is high.
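As an illustration, a single head of cross-attention, the core operation behind Transformer-based fusion, can be sketched as follows. The learned Q/K/V projections of a real model are omitted, and all shapes are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_regions):
    # Text tokens act as queries; image regions act as keys and values.
    # A real Transformer layer would first apply learned Q/K/V projections.
    d_k = text_tokens.shape[-1]
    scores = text_tokens @ image_regions.T / np.sqrt(d_k)  # (n_text, n_regions)
    weights = softmax(scores, axis=-1)  # each text token attends over regions
    return weights @ image_regions      # image-informed text features

rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((5, 16))    # 5 text tokens, dim 16
image_regions = rng.standard_normal((9, 16))  # 9 image regions, dim 16
fused = cross_attention(text_tokens, image_regions)
```

Each row of `fused` is a text token enriched with a weighted mixture of image-region features, which is what makes this approach stronger, and more expensive, than simple addition or multiplication.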