[Multimodality] 23. RO-ViT | Transformer-based open-vocabulary object detection (CVPR 2023)


Paper: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Code: None

Source: CVPR2023

Contributions:

  • RO-ViT addresses the positional-embedding mismatch between image-text pretraining and open-vocabulary detection finetuning
  • It shows that focal loss works better than softmax CE loss for image-text pretraining
  • It improves open-vocabulary detection finetuning with novel-object proposals
  • It achieves a SOTA 32.4 APr on LVIS, surpassing the previous best method by 6.1 APr

1. Background

Recently, open-vocabulary detection (OVD) has received a lot of attention. It was proposed to overcome the limitations of traditional object detection. Its key feature is that categories are represented as text embeddings rather than discrete IDs, so the detector can more flexibly predict categories not seen during training.

Many existing methods pre-train on large numbers of image-text pairs to give the model rich semantic information. Most of them use CNN backbones, but with the growing demand for image understanding and the development of multi-modal tasks, it is also important to achieve this with a vision transformer.

Many existing methods take a pre-trained vision-language model and then fine-tune it to bridge the gap between image-level pre-training and object-level detection.

This paper proposes RO-ViT, which makes the image-text pretraining of vision transformers region-aware so that it transfers better to open-vocabulary object detection.

The biggest difference from previous methods is that the authors explore how to better pre-train vision-transformer VLMs so that they are more suitable for open-vocabulary detection.

The pre-trained weights are then used to initialize the detector backbone; the backbone is frozen, and detector-specific components such as the neck and head are trained.
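
As a rough illustration of this training setup (not the authors' code; `backbone` is a placeholder attribute name), freezing the pretrained backbone in PyTorch while optimizing only the remaining detector parameters could look like:

```python
import torch

def build_optimizer(detector: torch.nn.Module) -> torch.optim.Optimizer:
    """Freeze the pretrained ViT backbone and optimize only neck/head parameters."""
    for p in detector.backbone.parameters():
        p.requires_grad = False  # keep the image-text pretrained weights fixed
    trainable = [p for p in detector.parameters() if p.requires_grad]
    # SGD hyperparameters follow the LVIS finetuning details listed in Sec. 3.1
    return torch.optim.SGD(trainable, lr=0.36, momentum=0.9, weight_decay=1e-4)
```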

2. Method

(Figure: overview of the RO-ViT approach)

2.1 Preliminaries

1. Contrastive image-text pretraining

Contrastive pretraining generally uses a two-tower structure, consisting of an image encoder and a text encoder:

  • image encoder: can be a CNN or a ViT
  • text encoder: generally a transformer

The goal of contrastive learning is to pull matched image-text pairs closer together in the embedding space and push unmatched pairs apart.

The loss generally used is softmax CE loss
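
For reference, the standard CLIP-style image-to-text (I2T) softmax CE loss over a batch of $B$ pairs, with normalized image/text embeddings $v_i$, $l_j$ and temperature $\tau$, has the familiar form (the common formulation, not copied from the paper):

$$
\mathcal{L}_{\text{I2T}}^{\text{CE}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(v_i \cdot l_i / \tau)}{\sum_{j=1}^{B} \exp(v_i \cdot l_j / \tau)}
$$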

2. Open-vocabulary object detection

Training uses only the base categories, but at test time the detector must recognize both base and novel categories.

The general approach is to replace the original fixed-size fully connected classifier with text embeddings; because the text embeddings come from the pre-trained text encoder, the open-vocabulary semantic knowledge from pre-training is well preserved.

The authors use the word "background" as the class name for the background class.

During training, for each region $i$ the authors compute a detection score $p_i$ as the cosine similarity between the RoI-Align feature (region embedding) and the text embeddings of the base categories, followed by softmax normalization.

At test time, the text embeddings are extended to cover both the base and novel categories plus the background. RoI-Align on the output feature map of the ViT backbone gives region $i$'s VLM embedding, and its cosine similarity with the text embeddings yields a region score $z_i$. The detection score combines $p_i$ and $z_i$, where $\alpha, \beta \in [0, 1]$ control the weights of the base and novel categories (a hedged sketch of this combination is given below).
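
The paper's exact combination formula is not reproduced here; as a hedged sketch, a ViLD/F-VLM-style geometric ensemble of the detector score $p$ and the VLM score $z$ for region $i$ and category $c$ is one common instantiation:

$$
s_{i,c} =
\begin{cases}
z_{i,c}^{\alpha}\, p_{i,c}^{\,1-\alpha}, & c \in \text{base categories} \\
z_{i,c}^{\beta}\, p_{i,c}^{\,1-\beta}, & c \in \text{novel categories}
\end{cases}
$$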

The author uses the pre-trained ViT model to initialize the backbone of the detector

2.2 Region-aware Image-text Pretraining

Existing vision-language models basically match the whole image against the whole text.

However, this kind of pre-training does not consider the relationship between region-level features and text tokens, which is important for open-vocabulary object detection.

Therefore, the authors propose Cropped Positional Embedding (CPE) to bridge the gap between images and regions, and find it beneficial to use focal loss to mine hard examples.

CPE:

  • In a transformer, positional embeddings are important: they encode the position of each element, and this information matters for downstream recognition and localization tasks.
  • However, there is a misalignment between the positional embeddings of contrastive pretraining and those of open-vocabulary detection finetuning. Pretraining generally encodes positions for the whole image, but during detection finetuning this full-image positional encoding has to generalize to regions.

In order to solve this gap, the author proposed CPE, as shown in Figure 2:

  • First, during pretraining, upsample the positional embeddings from the pretraining image size (224) to the typical detection image size (e.g. 1024)
  • Then, randomly crop a region from the upsampled positional embeddings and resize it, using it as the image-level positional embedding during pre-training
  • In this way, the model treats the pretraining image as a region randomly cropped from a larger, unknown image rather than as a whole image, which adapts better to downstream detection tasks (a minimal PyTorch sketch is given after Figure 2 below)

(Figure 2: illustration of Cropped Positional Embedding)
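
A minimal PyTorch sketch of CPE, assuming the positional embeddings are kept as a 14×14×D grid (the tensor layout and the use of torchvision's `RandomResizedCrop.get_params` are my own choices; the crop parameters follow the implementation details in Sec. 3.1):

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import RandomResizedCrop

def cropped_positional_embedding(pos_embed: torch.Tensor,
                                 up_size: int = 64,
                                 out_size: int = 14) -> torch.Tensor:
    """Cropped Positional Embedding (CPE).

    pos_embed: (N, N, D) grid of positional embeddings (N=14 for ViT-B/16 at 224px).
    Upsamples the grid to up_size x up_size, takes a random crop
    (scale [0.1, 1.0], aspect ratio [0.5, 2.0]), and resizes the crop back
    to out_size x out_size so it can replace the usual positional embedding.
    """
    # (N, N, D) -> (1, D, N, N) so that F.interpolate treats D as channels
    grid = pos_embed.permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(up_size, up_size),
                         mode="bilinear", align_corners=False)
    # Crop parameters follow the pretraining details in Sec. 3.1
    i, j, h, w = RandomResizedCrop.get_params(grid, scale=(0.1, 1.0),
                                              ratio=(0.5, 2.0))
    crop = grid[:, :, i:i + h, j:j + w]
    crop = F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    return crop.squeeze(0).permute(1, 2, 0)  # back to (out_size, out_size, D)
```

The resulting grid is then flattened and added to the patch embeddings during pretraining, in place of the standard whole-image positional embedding.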

CPE visualization:

  • Each small grid shows the cosine similarity between one patch's positional embedding and those of all other patches (a short sketch of this computation is given below)
  • Patches that are close to each other have more similar positional encodings

(Figure: cosine-similarity visualization of the positional embeddings)
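
A rough sketch of how such a similarity map can be computed (my own illustration, not the paper's plotting code):

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of positional embeddings.

    pos_embed: (N, N, D) grid; returns an (N*N, N*N) matrix where row k is
    the similarity map of patch k against all patches.
    """
    flat = pos_embed.reshape(-1, pos_embed.shape[-1])  # (N*N, D)
    flat = F.normalize(flat, dim=-1)                   # unit-norm rows
    return flat @ flat.T                               # cosine similarities
```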

Focal loss:

The authors argue that finer control over the weighting of hard examples works better than plain CE loss.

Notation:

  • $v_i$ and $l_i$ are the normalized image embedding and text embedding

  • The image-to-text (I2T) contrastive loss is instantiated with both softmax CE loss and focal loss for comparison (a hedged reconstruction is given after this list)

  • The text-to-image (T2I) contrastive loss is symmetric to the I2T one

  • The total loss is the sum of the two losses
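
The paper's exact formulas are not reproduced here. As a hedged reconstruction, a sigmoid focal-loss version of the I2T objective would take the standard form below, with $\gamma$ the focusing parameter and $\sigma$ the sigmoid (the paper's version may differ in details such as normalization or temperature):

$$
\mathcal{L}_{\text{I2T}}^{\text{focal}} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{B} (1 - p_{ij})^{\gamma} \log p_{ij},
\qquad
p_{ij} =
\begin{cases}
\sigma(v_i \cdot l_j / \tau), & i = j \\
1 - \sigma(v_i \cdot l_j / \tau), & i \neq j
\end{cases}
$$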

2.3 Open-vocabulary Detector Finetuning

Although the backbone can be initialized with pre-trained weights, the neck and head of the detector are brand new

Existing methods generally fail to generate proposals for novel or unannotated classes.

However, this paper proposes a new way of generating proposals: the proposal score is measured with a localization-quality-based objectness (e.g. centerness) instead of the binary object-or-not classification score.

OVD score: $S_i^{OVD} = o_i^{\delta} \cdot s_i^{OVD}$, where $o_i$ is the predicted objectness score and the exponent $\delta$ controls its weight.
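
As a small illustration of this score combination (tensor names, shapes, and the default `delta` are illustrative, not taken from the paper):

```python
import torch

def combine_ovd_scores(det_scores: torch.Tensor,
                       objectness: torch.Tensor,
                       delta: float = 0.5) -> torch.Tensor:
    """Weight per-region detection scores by a localization-quality objectness.

    det_scores: (R, C) per-region, per-category detection scores s_i.
    objectness: (R,) centerness-style objectness o_i in [0, 1].
    Returns the (R, C) combined OVD scores o_i^delta * s_i.
    """
    o = objectness.clamp(min=1e-6).pow(delta).unsqueeze(1)  # (R, 1)
    return o * det_scores
```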

3. Effect

3.1 Details

Pre-training:

  • The pretraining in this paper is done by the authors from scratch, using ViT-B/16 and ViT-L/16 as the image encoder
  • The input image size is 224x224 with a patch size of 16x16, giving a 14x14 grid of positional embeddings
  • To generate the CPE, the positional embeddings are first interpolated to 64x64, a region is randomly cropped (scale ratio [0.1, 1.0], aspect ratio [0.5, 2.0]), and the crop is resized back to 14x14 and added to the patch embeddings
  • Global average pooling on the last ViT layer produces the image embedding
  • The text encoder is a 12-layer transformer with a maximum text length of 64
  • Dataset: LAION-2B [44]

Downstream detection details:

  • LVIS: 46.1k iterations, image size 1024, batch size 256, SGD with weight decay 1e-4, lr 0.36, momentum 0.9
  • COCO: 11.3k iterations, image size 1024, batch size 128, SGD with weight decay 1e-2, lr 0.02, momentum 0.9
  • The CLIP prompt templates are used, and the text embeddings are averaged for each category (see the sketch after this list)
  • OLN-RPN is used in the RPN stage with centerness as the objectness, one anchor per location, and IoU loss; the RPN NMS threshold is 0.7 during training and 1.0 during testing
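
A rough sketch of the prompt-ensembling step (the `text_encoder` callable and the tiny template list here are stand-ins; the actual CLIP prompt set is much larger):

```python
import torch
import torch.nn.functional as F

# A small stand-in for the CLIP prompt templates used to describe each category.
TEMPLATES = ["a photo of a {}.", "a photo of the {}.", "there is a {} in the scene."]

def category_text_embedding(category: str, text_encoder) -> torch.Tensor:
    """Average a category's text embeddings over prompt templates.

    text_encoder: callable mapping a prompt string to a (D,) embedding tensor.
    Returns a unit-norm (D,) embedding used as that category's classifier weight.
    """
    embeds = torch.stack([F.normalize(text_encoder(t.format(category)), dim=-1)
                          for t in TEMPLATES])
    return F.normalize(embeds.mean(dim=0), dim=-1)  # average, then re-normalize
```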

3.2 Open vocabulary object detection effect

LVIS:

  • Trained on the base categories, with the rare categories treated as novel categories for testing; results are averaged over 3 runs
  • APr reaches 32.4

(Table: open-vocabulary detection results on LVIS)

COCO:

  • Train on 48 base classes, test on 17 new classes

(Table: open-vocabulary detection results on COCO)

3.3 Image-text retrieval

Zero-shot image-text retrieval on COCO and Flickr30k.

(Table: zero-shot image-text retrieval results on COCO and Flickr30k)

3.4 Transfer object detection

(Table: transfer object detection results)

3.5 Ablation experiment

(Table: ablation study results)

Original post: blog.csdn.net/jiaoyangwm/article/details/132019168