[AIGC] 15, Grounding DINO | Extending DINO to Open-Set Object Detection


Paper: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Code: https://github.com/IDEA-Research/GroundingDINO

Source: Tsinghua University, IDEA

Time: 2023.03.20

Contributions:

  • This paper proposes Grounding DINO, an open-set object detector that combines the Transformer-based detector DINO with grounded pre-training; it can output detection boxes for targets of any category according to any text input (such as category names or other descriptive phrases).
  • This paper proposes extending the evaluation of open-set object detection to REC (Referring Expression Comprehension, i.e. locating a target according to an input description) datasets, which helps evaluate model performance on free-form text input

1. Background

[Figure 1]

Understanding novel visual concepts is a basic ability that a vision model should have. Based on this, the authors propose a powerful open-set object detector that can detect any target that can be described in human language.

Moreover, such a detector can be combined with other models, which has great potential. As shown in Figure 1b, it can be combined with a generative model for image editing.

The key to open-set object detection is introducing language into a closed-set detector so that it generalizes to the open-set setting and can recognize targets it has never seen before.

For example, GLIP formulates object detection as a phrase grounding task and uses contrastive training between object regions and language phrases; it achieves good results on a variety of datasets, covering both closed-set and open-set detection.

Since closed-set and open-set detection are very similar, it should be possible to build on a strong closed-set detector to obtain a better open-set detector.

This paper proposes an open-set detector based on DINO [58] and achieves strong results on object detection.

Advantages of Grounding DINO over GLIP:

  • It is based on the Transformer architecture, so it can process image and language data in a unified way
  • Transformer-based models are better at exploiting large-scale datasets
  • DINO can be optimized end-to-end without hand-crafted post-processing such as NMS

How existing open-set detectors work:

  • Use language model information to extend closed-set detectors to open-set detectors

A closed-set detector has three important modules:

  • backbone: extract image features
  • neck: feature enhancement
  • head: regression and classification, etc.

How to extend closed-set detectors to open-set detectors using language models:

  • Learn language-aware region embeddings
  • In this way, each object region can be mapped into a semantic space aligned with the language features.
  • The key lies at the neck or head outputs: contrastive learning between region outputs and language features helps the model learn to align the two modalities (see the sketch after Figure 2)
  • Figure 2 shows examples of feature fusion at three different stages: neck (A), query initialization (B), and head (C)

[Figure 2]
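To make "contrastive learning between region outputs and language features" concrete, here is a minimal sketch; the plain dot product and the shapes are illustrative assumptions:

```python
import torch

def region_word_alignment_logits(region_embeds: torch.Tensor, word_embeds: torch.Tensor) -> torch.Tensor:
    """region_embeds: (num_regions, d) from the detector's neck or head.
    word_embeds: (num_words, d) from the text encoder.
    Returns per-region, per-word alignment logits that replace a fixed-category classifier."""
    return region_embeds @ word_embeds.t()  # (num_regions, num_words)
```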

When is it better to perform feature fusion:

  • It is generally believed that fusing features throughout the pipeline gives better results
  • For efficiency, retrieval-style models such as CLIP only need to compare the final features of the two modalities.
  • But for open-set detection the model takes both an image and text as input, so tight (and early) fusion works better, i.e. more fusion is better
  • However, it is hard for previous detectors (such as Faster R-CNN) to inject language features at these three stages, whereas the image Transformer structure is very similar to that of language models, so the authors design feature fusion modules at all three stages: the neck, query initialization, and the head

Neck structure:

  • stacked self-attention
  • text-to-image cross-attention
  • image-to-text cross-attention

Head:

  • Query initialization: a language-guided query selection method is used to initialize the decoder queries
  • Improving query representations: a cross-modality decoder performs cross-attention over image and text features to refine the queries

Many existing open-set object detectors test their performance on new categories, as shown in Figure 1b

But the authors argue that any target that can be described in language should be considered.

This kind of task is known as Referring Expression Comprehension (REC), i.e. locating a target according to an input description.

The authors show some examples of REC on the right side of Fig. 1b

The authors conducted experiments under the following three settings:

  • closed-set
  • open-set
  • referring

2. Method

Given an (image, text) pair as input, Grounding DINO outputs multiple pairs of (object box, noun phrase).

As shown in Figure 3, given the input image and the text descriptions 'cat' and 'table', the model boxes out the cat and the table in the image.

Both object detection and REC can be handled with this pipeline, similar to GLIP (a minimal sketch follows the list):

  • Object detection: use the names of all categories, concatenated, as the input text
  • REC: each input expression only needs one box, so the output target with the maximum score is taken as the REC result
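A minimal sketch of how the two tasks share one output interface; the prompt format, the `select_outputs` helper, and the 0.35 threshold are illustrative assumptions rather than the repository's actual API:

```python
import torch

# Object detection: all category names joined into one text prompt
detection_prompt = " . ".join(["cat", "dog", "table"]) + " ."   # "cat . dog . table ."

def select_outputs(boxes, scores, phrases, task="detection", box_threshold=0.35):
    """Turn raw per-query outputs into task-specific results.

    boxes: (num_query, 4); scores: (num_query,); phrases: one noun phrase per query.
    Detection keeps every box above the threshold; REC keeps only the top-scoring box.
    """
    if task == "rec":
        best = int(scores.argmax())
        return boxes[best:best + 1], [phrases[best]]
    keep = scores > box_threshold
    return boxes[keep], [p for p, k in zip(phrases, keep.tolist()) if k]
```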

Grounding DINO has a dual-encoder, single-decoder structure; the overall architecture is shown in Figure 3:

  • An image backbone to extract image information
  • A text backbone to extract text information
  • A feature enhancer to fuse image and text information
  • A language-guided query selection module for query initialization
  • A cross-modality decoder for box refinement

The processing of each (image, text) pair is as follows (a high-level sketch follows the list):

  • First, the image backbone and text backbone extract vanilla image features and text features
  • Then, the two sets of features are fed into the feature enhancer module for cross-modal fusion, producing fused cross-modal features
  • Next, the language-guided query selection module selects, according to the text, the image features to use as decoder queries; these queries are fed into the cross-modality decoder, which probes the desired information from both modalities and updates the queries
  • Finally, the output queries of the last decoder layer are used to predict object boxes and extract the corresponding phrases
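Putting the steps together, a high-level sketch of the forward pass; the module names are illustrative placeholders, not the repository's class names:

```python
import torch.nn as nn

class GroundingDINOSketch(nn.Module):
    """Skeleton of the dual-encoder single-decoder pipeline described above."""

    def __init__(self, image_backbone, text_backbone, feature_enhancer,
                 query_selector, cross_modality_decoder, box_head):
        super().__init__()
        self.image_backbone = image_backbone      # e.g. a Swin Transformer
        self.text_backbone = text_backbone        # e.g. BERT
        self.feature_enhancer = feature_enhancer  # stacked cross-modal fusion layers
        self.query_selector = query_selector      # language-guided query selection
        self.decoder = cross_modality_decoder     # updates queries layer by layer
        self.box_head = box_head                  # predicts a box from each final query

    def forward(self, image, text):
        img_feats = self.image_backbone(image)    # (bs, num_img_tokens, d)
        txt_feats = self.text_backbone(text)      # (bs, num_text_tokens, d)
        img_feats, txt_feats = self.feature_enhancer(img_feats, txt_feats)
        queries = self.query_selector(img_feats, txt_feats)   # initialized from top-k image tokens
        queries = self.decoder(queries, img_feats, txt_feats)
        boxes = self.box_head(queries)
        # phrase extraction: score each query against every text token
        logits = queries @ txt_feats.transpose(1, 2)           # (bs, num_query, num_text_tokens)
        return boxes, logits
```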

[Figure 3]

2.1 Feature extraction and enhancement

Given an (image, text) pair, multi-scale image features are extracted with a Swin Transformer-like image backbone, and text features with a BERT-like text backbone.

A DETR-like detector then produces the detection results on top of these features.

After extraction, the two sets of features are fed into the feature enhancer for cross-modal fusion. The enhancer stacks multiple enhancer layers; one layer is shown in Figure 3, block 2.

Deformable self-attention is used to enhance the image features, and vanilla self-attention to enhance the text features.

Like GLIP, this paper also uses two cross-attention modules for feature fusion (a minimal sketch of one enhancer layer is given after the list):

  • image-to-text
  • text-to-image
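A minimal sketch of one enhancer layer, assuming vanilla multi-head attention throughout (the paper uses deformable self-attention on the image branch) and an illustrative sub-layer ordering:

```python
import torch.nn as nn

class FeatureEnhancerLayer(nn.Module):
    """Fuse image and text features with bi-directional cross-attention,
    then refine each modality with its own self-attention.

    Simplification: plain attention stands in for the deformable
    self-attention applied to image features in the paper."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.txt2img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img, txt):
        # cross-modal fusion: image tokens attend to text tokens and vice versa
        img = img + self.txt2img_attn(img, txt, txt)[0]
        txt = txt + self.img2txt_attn(txt, img, img)[0]
        # per-modality refinement (deformable self-attention for images in the paper)
        img = img + self.img_self_attn(img, img, img)[0]
        txt = txt + self.txt_self_attn(txt, txt, txt)[0]
        return img, txt
```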

2.2 Language-Guided Query Selection

To better use the input text to guide object detection, Grounding DINO designs a language-guided query selection module that selects the features most relevant to the input text as decoder queries. The PyTorch-style pseudocode is shown in Algorithm 1, with the notation:

  • num_query: the number of queries in the decoder, set to 900 in actual use
  • bs: batch size
  • ndim: feature dimension
  • num_img_tokens: the number of image tokens
  • num_text_tokens: the number of text tokens

[Algorithm 1]

The output of the language-guided query selection module:

  • num_query indices; the decoder queries are initialized from the image features at these indices (a reconstruction sketch of Algorithm 1 follows)
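Since Algorithm 1 is reproduced as an image above, here is a minimal PyTorch sketch reconstructing its idea; variable names follow the notation listed earlier, and this should be read as a paraphrase rather than the authors' exact code:

```python
import torch

def language_guided_query_selection(image_features, text_features, num_query=900):
    """Select the num_query image tokens most relevant to the input text.

    image_features: (bs, num_img_tokens, ndim)
    text_features:  (bs, num_text_tokens, ndim)
    returns:        (bs, num_query) indices into the image tokens
    """
    # similarity between every image token and every text token
    logits = torch.einsum("bic,btc->bit", image_features, text_features)
    # for each image token, keep the score of its best-matching text token
    logits_per_img_token = logits.max(dim=-1).values             # (bs, num_img_tokens)
    # take the top num_query image tokens as decoder query positions
    topk_proposal_idx = torch.topk(logits_per_img_token, num_query, dim=1).indices
    return topk_proposal_idx
```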

2.3 Cross-Modality Decoder

As shown in Figure 3, block 3, a cross-modality decoder is used to combine the image and text features and refine the queries (a sketch of one decoder layer follows).

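According to the paper, each cross-modality decoder layer contains a self-attention layer, an image cross-attention layer, a text cross-attention layer, and an FFN, the text cross-attention being the extra sub-layer compared to DINO's decoder. A minimal sketch, using plain multi-head attention where the paper uses deformable attention for the image cross-attention:

```python
import torch.nn as nn

class CrossModalityDecoderLayer(nn.Module):
    """One decoder layer: self-attention -> image cross-attention -> text cross-attention -> FFN."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, queries, img_feats, txt_feats):
        # queries: (bs, num_query, d); img_feats: (bs, num_img_tokens, d); txt_feats: (bs, num_text_tokens, d)
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.img_cross_attn(q, img_feats, img_feats)[0])   # combine image features
        q = self.norms[2](q + self.txt_cross_attn(q, txt_feats, txt_feats)[0])   # combine text features
        q = self.norms[3](q + self.ffn(q))
        return q
```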

2.4 Sub-Sentence-Level Text Features

Previous work encodes the text prompt in one of two ways, as shown in Figure 4:

  • Sentence-level representation: as shown in Figure 4a, the whole sentence is encoded into a single feature. If a sentence contains multiple phrases, only those phrases are extracted and the other words are discarded, which loses fine-grained information.
  • Word-level representation: as shown in Figure 4b, all words in the sentence are encoded together, so attention is computed between unrelated words (e.g. different category names), which introduces unnecessary dependencies

Given the problems of these two encodings, the authors propose a sub-sentence-level representation: attention is only allowed between words within the same sub-sentence (phrase), so no unnecessary dependencies are introduced while per-word features are kept (a sketch of such an attention mask follows Figure 4)
[Figure 4]
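A minimal sketch of building such a block-wise attention mask; the helper and the assumption that phrase token spans are known are illustrative:

```python
import torch

def sub_sentence_attention_mask(phrase_spans, num_text_tokens):
    """Allow attention only between tokens belonging to the same phrase.

    phrase_spans: list of (start, end) token index pairs, one per phrase / category name
    returns: (num_text_tokens, num_text_tokens) boolean mask, True = attention allowed
    """
    mask = torch.zeros(num_text_tokens, num_text_tokens, dtype=torch.bool)
    for start, end in phrase_spans:
        mask[start:end, start:end] = True
    return mask

# e.g. the prompt "cat . dog . table ." with one token per word:
mask = sub_sentence_attention_mask([(0, 1), (2, 3), (4, 5)], num_text_tokens=6)
```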

2.5 Loss Function

  • Regression loss: L1 loss and GIoU loss on the predicted boxes
  • Classification loss: a contrastive loss between predicted queries and language tokens (each query's features are dotted with the text token features to produce per-token logits, which are then supervised); a sketch of both terms follows
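A minimal sketch of the two loss terms after the predictions have been matched to ground truth; the focal form of the contrastive term, the shapes, and the equal weighting are assumptions (matching and auxiliary losses are omitted):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss, sigmoid_focal_loss

def grounding_losses(pred_boxes, tgt_boxes, query_feats, text_feats, token_labels):
    """pred_boxes, tgt_boxes: (num_matched, 4) boxes in xyxy format
    query_feats: (num_matched, d) decoder query features
    text_feats: (num_text_tokens, d) text token features
    token_labels: (num_matched, num_text_tokens) float 0/1, which tokens each object grounds to
    """
    # regression: L1 + GIoU between matched predictions and ground-truth boxes
    loss_l1 = F.l1_loss(pred_boxes, tgt_boxes)
    loss_giou = generalized_box_iou_loss(pred_boxes, tgt_boxes, reduction="mean")
    # classification: contrastive alignment between queries and language tokens
    logits = query_feats @ text_feats.t()                 # (num_matched, num_text_tokens)
    loss_cls = sigmoid_focal_loss(logits, token_labels, reduction="mean")
    return loss_l1 + loss_giou + loss_cls
```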

3. Results

The authors conducted experiments on three different settings:

  • Closed-set: COCO detection dataset
  • Open-set: zero-shot COCO, LVIS, ODinW
  • Referring detection: RefCOCO/+/g

Setup details:

  • The authors trained two model variants:
    • Grounding-DINO-T (Swin-T image backbone)
    • Grounding-DINO-L (Swin-L image backbone)
  • The text backbone is BERT-base (from Hugging Face)

3.1 Zero-Shot Transfer of Grounding DINO

1. COCO Benchmark

[Results on COCO]

2. LVIS Benchmark

[Results on LVIS]

3. ODinW Benchmark

[Results on ODinW]

3.2 Referring Object Detection

[Results on RefCOCO/+/g]

3.3 Ablations

To verify whether the proposed tight fusion is actually useful, the authors removed some of the fusion blocks; the results are shown in Figure 6.

All models are trained on O365 with Swin-L, and the results show that tighter fusion improves the final performance.

[Ablation results]

3.4 From DINO to Grounding DINO

Training Grounding DINO directly from scratch would be time-consuming and costly, so the authors tried reusing trained DINO weights: the weights shared by the two models are frozen, and only the parameters of the other parts are fine-tuned. The results are shown in Table 7.

The results show that with DINO pre-trained weights, training only the text-related and fusion blocks achieves performance comparable to retraining the whole model.

[Table 7]



Origin blog.csdn.net/jiaoyangwm/article/details/130095817