【Computer Vision】Visual grounding series

1. Task introduction

Visual grounding involves two modalities, bringing together computer vision and natural language processing.

Briefly, the input is an image together with a corresponding object description (a sentence, caption, or phrase), and the output is a bounding box localizing the described object.

This sounds very similar to object detection. The difference is the additional language input: to locate an object, the model must first understand the language modality, fuse it with information from the visual modality, and finally use the resulting feature representation to predict the location.
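To make this pipeline concrete, here is a minimal PyTorch sketch of the generic idea (encode the image, encode the expression, fuse the two features, regress a box). It is not any particular paper's architecture; the backbone choice, the LSTM encoder, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of the generic visual grounding pipeline: encode the image,
# encode the expression, fuse the two modalities, and regress a box.
# Module sizes and the fusion scheme are illustrative, not a specific method.
import torch
import torch.nn as nn
import torchvision.models as models


class SimpleGrounder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Visual encoder: a ResNet backbone pooled to a single feature vector.
        backbone = models.resnet18(weights=None)
        self.visual = nn.Sequential(*list(backbone.children())[:-1])  # (B, 512, 1, 1)
        # Language encoder: word embeddings + LSTM, last hidden state as sentence feature.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion + box head: concatenate the two features and regress (x, y, w, h).
        self.head = nn.Sequential(
            nn.Linear(512 + hidden_dim, 256), nn.ReLU(), nn.Linear(256, 4)
        )

    def forward(self, image, tokens):
        v = self.visual(image).flatten(1)           # (B, 512)
        _, (h, _) = self.lstm(self.embed(tokens))   # h: (1, B, hidden_dim)
        l = h[-1]                                   # (B, hidden_dim)
        return self.head(torch.cat([v, l], dim=1))  # (B, 4) predicted box


model = SimpleGrounder()
box = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(box.shape)  # torch.Size([2, 4])
```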

Visual grounding can be further divided into two tasks, depending on whether all objects mentioned in the language description need to be located:

  • Phrase Localization
  • Referring Expression Comprehension (REC)

[Figure: an example of Phrase Localization]

Phrase Localization is also called Phrase Grounding. As shown in the figure above, for a given sentence, all objects (phrases) mentioned in it need to be located. In the dataset, every phrase has a box annotation.

[Figure: an example of Referring Expression Comprehension]

Referring Expression Comprehension is also called referring expression grounding. As shown in the figure above, each language description (here called an expression) refers to only one object; even if the sentence mentions context objects, only the referred object has a box annotation.

2. Commonly used datasets and evaluation metrics for Visual grounding

2.1 Commonly used datasets

  • Phrase Localization

The commonly used dataset is Flickr30k Entities, which contains 31,783 images. Each image has 5 different captions, for a total of 158,915 captions and 244,035 phrase-box annotations. Each phrase is further assigned to one of eight categories: people, clothing, body parts, animals, vehicles, instruments, scene, and other.

In addition, much phrase localization work is also evaluated on the ReferItGame dataset (also known as RefCLEF), which strictly speaking belongs to the REC task. Its images come from the ImageCLEF dataset; it contains 130,525 expressions covering 238 different object categories, 96,654 objects, and 19,894 images. The data was annotated through a two-player game called ReferItGame, as shown below:

[Figure: the two-player ReferItGame annotation interface]

The person on the left writes the expression based on the region, and the person on the right chooses the region based on the expression.

  • Referring expression comprehension

There are three commonly used datasets: RefCOCO, RefCOCO+, and RefCOCOg. Their differences can be seen from the following examples: roughly, RefCOCO allows location words (e.g., "the man on the left"), RefCOCO+ forbids absolute location words so expressions focus on appearance, and RefCOCOg contains longer, more descriptive expressions.

[Figure: example expressions from RefCOCO, RefCOCO+, and RefCOCOg]

2.2 Evaluation metrics

  • Accuracy (Acc): a prediction is counted as a correct localization if the intersection over union (IoU) between the predicted box and the ground-truth box is greater than 0.5; accuracy is the fraction of such correct predictions.

Some recent work instead reports Recall@k: a sample counts as correct if any of the top-k predicted boxes has an IoU greater than 0.5 with the ground-truth box.

  • Pointing game: select the pixel with the largest weight in the final predicted attention mask; if that point falls within the ground-truth region, the localization counts as correct. This is more relaxed than the accuracy metric. A minimal sketch of these metrics follows below.
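A minimal sketch of the metrics above, assuming boxes in (x1, y1, x2, y2) format; the helper names are my own:

```python
# IoU-based accuracy and the pointing game, written out for a single image/phrase.
import numpy as np


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def accuracy_at_05(pred_boxes, gt_boxes):
    """Fraction of predictions whose IoU with the ground truth exceeds 0.5."""
    hits = [iou(p, g) > 0.5 for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))


def pointing_game_hit(attention_mask, gt_box):
    """Pointing game: the argmax pixel of the attention mask must lie inside the ground-truth box."""
    y, x = np.unravel_index(np.argmax(attention_mask), attention_mask.shape)
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2


print(accuracy_at_05([[0, 0, 10, 10]], [[1, 1, 11, 11]]))  # IoU ≈ 0.68 -> 1.0
```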

3. Mainstream approaches to Visual grounding

Currently, Visual grounding can be divided into three types: Fully-supervised, Weakly-supervised, and Unsupervised.

[Figure: taxonomy of fully-supervised, weakly-supervised, and unsupervised visual grounding]

  • Fully-supervised: as the name suggests, object-phrase box annotations are available.
  • Weakly-supervised: only image-sentence pairs are available, with no box annotations linking phrases in the sentence to objects.
  • Unsupervised: not even image-sentence pairing information is used. As far as I know, only WPT [5] from ICCV 2019 is unsupervised; it is very interesting and its results are valuable for comparison.

Under full supervision, current methods can be divided into two-stage and one-stage approaches.

In two-stage methods, the first stage extracts candidate proposals and their features with an RPN or traditional algorithms (EdgeBoxes, Selective Search, etc.), and the second stage performs the detailed reasoning. A common approach, for example, projects the visual features and the language features into a common vector space, computes their similarity, and selects the closest proposal as the prediction.
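A minimal sketch of that second-stage matching step, assuming proposal features and a phrase feature have already been extracted; the projection dimensions are illustrative:

```python
# Project proposal features and the phrase feature into a common space,
# score them by cosine similarity, and pick the best proposal.
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_proj = nn.Linear(2048, 512)   # projects first-stage proposal features
text_proj = nn.Linear(768, 512)      # projects the phrase/sentence feature

proposal_feats = torch.randn(100, 2048)  # 100 candidate proposals from stage one
phrase_feat = torch.randn(1, 768)        # one phrase embedding

v = F.normalize(visual_proj(proposal_feats), dim=-1)   # (100, 512)
t = F.normalize(text_proj(phrase_feat), dim=-1)        # (1, 512)
scores = v @ t.t()                                     # (100, 1) cosine similarities
best = scores.squeeze(1).argmax().item()               # index of the predicted proposal
print(best)
```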

One-stage methods are built on one-stage models from the object detection field, such as YOLO and RetinaNet.
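A minimal sketch of the one-stage idea as I understand it: broadcast the sentence feature over the visual feature map, fuse channel-wise, and densely predict a box plus a confidence score at every location, YOLO-style. Sizes and the fusion scheme are illustrative assumptions:

```python
# Fuse a sentence embedding into a dense feature map and predict boxes per cell.
import torch
import torch.nn as nn

visual_map = torch.randn(1, 256, 13, 13)   # backbone feature map
sent_feat = torch.randn(1, 512)            # sentence embedding

# Tile the sentence feature to every spatial cell and concatenate channel-wise.
tiled = sent_feat[:, :, None, None].expand(-1, -1, 13, 13)   # (1, 512, 13, 13)
fused = torch.cat([visual_map, tiled], dim=1)                # (1, 768, 13, 13)

head = nn.Conv2d(768, 5, kernel_size=1)   # per cell: (tx, ty, tw, th, confidence)
pred = head(fused)                        # (1, 5, 13, 13)

# The cell with the highest confidence gives the predicted box.
conf = pred[:, 4].flatten(1)
best_cell = conf.argmax(dim=1)
print(pred.shape, best_cell)
```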

Because the phrase-to-box mapping is unavailable, weakly-supervised methods design additional loss functions, for example losses based on reconstruction, on external knowledge, or on image-caption matching.
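As an example of the reconstruction idea, here is a minimal sketch of a weakly-supervised loss: attend over proposal features with the phrase feature, reconstruct the phrase embedding from the attended visual feature, and penalize the reconstruction error, so the attention weights act as the latent grounding. The module names and dimensions are illustrative assumptions, not a specific paper's formulation:

```python
# Reconstruction-style weakly-supervised grounding loss: no box labels are used.
import torch
import torch.nn as nn
import torch.nn.functional as F

proposal_feats = torch.randn(100, 512)   # candidate proposal features
phrase_feat = torch.randn(1, 512)        # phrase embedding

attn_score = nn.Linear(512, 512)
reconstruct = nn.Linear(512, 512)

# Attention of the phrase over the proposals (soft selection).
weights = F.softmax(phrase_feat @ attn_score(proposal_feats).t(), dim=-1)  # (1, 100)
attended = weights @ proposal_feats                                        # (1, 512)

# Reconstruct the phrase embedding from the attended visual feature.
loss = F.mse_loss(reconstruct(attended), phrase_feat)
loss.backward()

# At test time the proposal with the highest attention weight is the prediction.
print(weights.argmax(dim=-1))
```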

Origin blog.csdn.net/wzk4869/article/details/129386183