[Computer Vision & Natural Language Processing] What is multimodality?

1. Definition of multimodality

Multimodality refers to information expressed in multiple modalities, such as text, images, video, and audio.

As the name suggests, multimodal research is about the fusion of these different types of data.

Most current work deals only with images and text: video is typically converted into image frames, and audio is transcribed into text. The problem thus reduces to the vision and language domains.

2. Multimodal tasks

Multimodal research largely addresses vision-and-language problems; typical tasks include classification, question answering, matching, ranking, and grounding over images and text.

For example, given an image, the following tasks can be performed:

2.1 VQA (Visual Question Answering)

  • Input: an image and a question posed in natural language
  • Output: an answer (a word or phrase)
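
As a concrete illustration, here is a minimal VQA inference sketch using the Hugging Face transformers library with the publicly available ViLT checkpoint dandelin/vilt-b32-finetuned-vqa; the image URL and question are just examples.

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# ViLT fine-tuned for VQA: classifies over a fixed vocabulary of answers.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# Encode the image-question pair and pick the highest-scoring answer.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])  # e.g. "2"
```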

2.2 Image Captioning

  • Input: an image
  • Output: a natural-language description of the image (one sentence)
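
A similarly minimal captioning sketch, assuming the transformers library and the public BLIP checkpoint Salesforce/blip-image-captioning-base (the image URL is again illustrative):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# BLIP generates a caption token by token, conditioned on the image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```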

2.3 Referring Expression Comprehension

  • Input: an image and a referring expression in natural language
  • Output: the image region (e.g., a bounding box) that the expression refers to

2.4 Visual Dialogue

  • Input: an image (plus the dialogue history)
  • Output: a multi-turn conversation between two agents, grounded in the image

2.5 VCR (Visual Commonsense Reasoning)

  • Input: an image, a question, 4 candidate answers, and 4 candidate rationales
  • Output: the correct answer and the rationale that justifies it

2.6 NLVR (Natural Language for Visual Reasoning)

  • Input: 2 images and a natural-language statement
  • Output: true or false (whether the statement holds for the image pair)

2.7 Visual Entailment

  • Input: an image (the premise) and a text hypothesis
  • Output: probabilities for 3 labels: entailment, neutral, contradiction

2.8 Image-Text Retrieval

There are three variants:

1) Image-to-text retrieval: input an image, return matching text.

2) Text-to-image retrieval: input text, return matching images.

3) Image-to-image retrieval: input an image, return similar images.
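
All three variants reduce to embedding queries and candidates in a shared space and ranking by similarity. A minimal sketch with CLIP (transformers library, public checkpoint openai/clip-vit-base-patch32; the image URL and captions are illustrative):

```python
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
captions = ["two cats lying on a couch",
            "a plate of food",
            "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image[0][j] scores the match between the image and caption j;
# the best-scoring caption is the image-to-text retrieval result.
probs = outputs.logits_per_image.softmax(dim=-1)
print(captions[probs.argmax().item()])
```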

3. Multimodal fusion methods

A pretrained NLP model yields an embedded representation of the text, and a pretrained vision model yields an embedded representation of the image. The question, then, is how to fuse the two to solve the tasks above.

Two fusion methods are commonly used:

3.1 Element-wise product or direct addition

In this method, the text and the image are embedded separately, and the resulting vectors are then added, multiplied element-wise, or concatenated.

The advantage is that this is simple to implement and computationally cheap.
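
A minimal PyTorch sketch of this kind of late fusion; the dimensions and the classification head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Project text/image embeddings to a shared size, then fuse element-wise."""
    def __init__(self, text_dim=768, image_dim=2048, fused_dim=512, num_labels=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_labels)

    def forward(self, text_emb, image_emb, mode="product"):
        t = self.text_proj(text_emb)
        v = self.image_proj(image_emb)
        fused = t * v if mode == "product" else t + v  # element-wise fusion
        return self.classifier(fused)

# Usage with dummy embeddings for a batch of 4 image-text pairs.
model = SimpleFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3])
```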

3.2 Transformer

The advantage is that a Transformer (e.g., with cross-attention between modalities) can learn much richer joint representations of image and text features.

The disadvantage is that it requires much more memory and compute.
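
A minimal sketch of cross-modal attention, the core operation in many Transformer-based fusion models; the token counts and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens attend over image region features (cross-attention)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # Queries come from text; keys/values come from the image regions.
        attended, _ = self.attn(text_tokens, image_tokens, image_tokens)
        return self.norm(text_tokens + attended)  # residual + layer norm

# Usage: 4 samples, 16 text tokens and 36 image regions, each 512-d.
fusion = CrossAttentionFusion()
out = fusion(torch.randn(4, 16, 512), torch.randn(4, 36, 512))
print(out.shape)  # torch.Size([4, 16, 512])
```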
