Open-World Object Detection: Toward Multimodal AGI


Author | Wang Bin, Xie Chunyu, Leng Dawei 

Editor in charge | Xia Meng

Produced | 360 Artificial Intelligence Research Institute


Introduction

Object detection is a fundamental task in computer vision. Unlike ordinary image classification/recognition, object detection requires the model to output not only the category of each target but also its position and size. Among the three core CV tasks (recognition, detection, and segmentation), detection occupies a pivotal, connecting position. The currently popular multimodal GPT-4 offers only object recognition among its visual capabilities and cannot yet perform the harder task of object detection. Recognizing the category, position, and size of objects in images or videos is key to many real-world AI applications, such as pedestrian and vehicle detection in autonomous driving, face locking in security surveillance, and tumor localization in medical image analysis.

Existing object detection methods such as the YOLO and R-CNN families have, through sustained research effort, reached high detection accuracy and efficiency. However, because these methods require the set of objects to be detected (a closed set) to be defined before training, they cannot detect objects outside the training set: a model trained to detect faces, for example, cannot be used to detect vehicles. In addition, existing methods depend heavily on human-labeled data. Whenever target categories need to be added or modified, the training data must be relabeled and the model retrained, which is time-consuming and labor-intensive. One possible remedy is to collect massive numbers of images and manually annotate both box and semantic information, but this incurs extremely high labeling costs, and training detectors on such massive data poses its own severe challenges: the long-tail distribution of the data and the uneven quality of manual annotation both degrade detector performance.

The paper OVR-CNN [1], published at CVPR 2021, proposed a new object detection paradigm, Open-Vocabulary Detection (OVD, also known as open-world object detection), to address these problems, namely detecting unknown objects in open-world scenarios. Because OVD can identify and localize an arbitrary number of object categories without requiring ever more labeled data, it has attracted growing attention from both academia and industry since its introduction. It brings new vitality and new challenges to classic object detection and is expected to become a new paradigm for object detection in the future. Specifically, OVD does not rely on manually labeling large numbers of images to extend a detector to unknown categories. Instead, by combining the detector with a cross-modal model and aligning image-region features with textual descriptions of the targets to be detected, it expands the detector's ability to understand open-world objects. Cross-modal and multimodal large models such as CLIP [2], ALIGN [3], and R2D2 [4] (link: https://github.com/yuxie11/R2D2) have been developing rapidly, and their progress both spurred the birth of OVD and drives the rapid iteration and evolution of work in the field.

OVD involves solving two key problems: 1) how to improve the adaptation between region-level information and cross-modal large models; and 2) how to improve the generalization of a category-agnostic object detector to novel categories. Below we introduce some representative OVD work from these two perspectives.

[Figure: basic OVD pipeline [1]]

Basic concepts of OVD: OVD is mainly evaluated in two kinds of scenarios, few-shot and zero-shot. Few-shot refers to target categories with a small number of human-labeled training samples, while zero-shot refers to target categories with no human-labeled training samples at all. On the commonly used academic benchmarks COCO and LVIS, the categories are split into Base and Novel sets, where Base corresponds to the few-shot scenario and Novel corresponds to the zero-shot scenario. For example, in the COCO setting with 65 categories, the common evaluation protocol places 48 categories in the Base set, which are the only categories used during training, and 17 categories in the Novel set, which are entirely unseen during training. The main metric reported for comparison is AP50 on the Novel categories.
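For concreteness, this evaluation protocol can be reproduced with pycocotools by restricting scoring to the Novel category ids. The sketch below is a minimal illustration, assuming a COCO-format ground-truth file, a detection-results json, and the list of 17 Novel category ids from the OVR-CNN split (not reproduced here).

```python
# Minimal sketch of "AP50 on Novel categories" with pycocotools.
# Assumptions: COCO-format annotation/result files exist, and novel_cat_ids holds
# the 17 Novel category ids of the OVR-CNN split (not listed here).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def novel_ap50(ann_file: str, det_file: str, novel_cat_ids: list) -> float:
    coco_gt = COCO(ann_file)                # ground-truth boxes and categories
    coco_dt = coco_gt.loadRes(det_file)     # detector outputs in COCO json format
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.params.catIds = list(novel_cat_ids)  # score only the zero-shot (Novel) classes
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[1]                      # stats[1] is AP at IoU threshold 0.50
```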


Open-Vocabulary Object Detection Using Captions


Paper address: https://arxiv.org/pdf/2011.10678.pdf

Code address: https://github.com/alirezazareian/ovr-cnn

OVR-CNN was an oral paper at CVPR 2021 and is the pioneering work in the OVD field; its two-stage training paradigm has influenced much subsequent OVD work. As shown in the figure below, the first stage mainly uses image-caption pairs to pre-train the visual encoder. BERT (with fixed parameters) generates word masks and is paired with a ResNet50 initialized from ImageNet-pretrained weights for weakly supervised grounding. The authors argue that weak supervision alone can cause the matching to fall into a local optimum, so a multimodal Transformer is added for masked-word prediction to improve robustness.

[Figure: OVR-CNN stage-one pre-training pipeline [1]]
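As a rough illustration of this weakly supervised grounding objective, the sketch below scores each image-caption pair by letting every caption word softly attend to the image regions that best explain it, then contrasts matched pairs against mismatched ones within a batch. This is a simplified reading of the idea, not OVR-CNN's exact implementation; all shapes and names are assumptions.

```python
# Simplified sketch of a weakly supervised grounding loss over image-caption pairs.
# Assumptions: region_feats are grid/region embeddings already projected into the
# word-embedding space; word_feats come from a frozen BERT; shapes are invented.
import torch
import torch.nn.functional as F

def pair_score(region_feats, word_feats):
    # region_feats: (B, R, D) region embeddings in the language space
    # word_feats:   (B, W, D) caption token embeddings
    sim = torch.einsum("brd,bwd->brw",
                       F.normalize(region_feats, dim=-1),
                       F.normalize(word_feats, dim=-1))
    # each word softly attends to the regions that best explain it; no box-word
    # correspondence is ever labeled (weak supervision)
    per_word = (sim.softmax(dim=1) * sim).sum(dim=1)   # (B, W)
    return per_word.mean(dim=1)                        # one grounding score per pair

def grounding_loss(region_feats, word_feats, temperature=0.07):
    # contrast every image against every caption in the batch: the matched pair
    # (the diagonal) should score higher than all mismatched combinations
    B = region_feats.size(0)
    scores = torch.stack(
        [pair_score(region_feats, word_feats[j].unsqueeze(0).expand(B, -1, -1))
         for j in range(B)], dim=1) / temperature      # (B, B) image-vs-caption scores
    targets = torch.arange(B, device=scores.device)
    return 0.5 * (F.cross_entropy(scores, targets) +
                  F.cross_entropy(scores.t(), targets))
```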

The second-stage training process is similar to Faster R-CNN. The difference is that the backbone used for feature extraction consists of layers 1-3 of the ResNet50 pre-trained in the first stage; after the RPN, the fourth layer of ResNet50 is still used for feature processing, and the resulting features are used for box regression and classification prediction. Classification is the key point that distinguishes an OVD task from conventional detection: in OVR-CNN, region features are fed into the V2L module obtained in the first stage (a fixed-parameter module that maps image vectors into the word-vector space) to obtain region embeddings, which are then matched against the word vectors of category labels to predict the class. During this two-stage training, the Base classes are used to train the detector for box regression and category matching. Because the V2L module stays fixed while the localization ability of the detector transfers to new categories, the detector can recognize and localize objects of novel categories.
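A minimal sketch of this kind of open-vocabulary classification head is given below: region (RoI) features are projected into the word-embedding space by a V2L mapping and scored against class-name embeddings, so swapping in embeddings of new class names at test time extends the detector without retraining. Module and variable names are illustrative, not OVR-CNN's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabClsHead(nn.Module):
    """Score RoI features against class-name word vectors (V2L-style head)."""

    def __init__(self, v2l: nn.Module, class_embeddings: torch.Tensor):
        super().__init__()
        self.v2l = v2l                          # vision-to-language projection,
        for p in self.v2l.parameters():         # kept frozen in the second training
            p.requires_grad_(False)             # stage, as described above
        # (num_classes, D) word vectors of class names; stored as a buffer so it can
        # simply be replaced by embeddings of novel class names at inference time
        self.register_buffer("cls_emb", F.normalize(class_embeddings, dim=-1))

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, roi_dim) pooled region features from the detector head
        region_emb = F.normalize(self.v2l(roi_feats), dim=-1)
        return region_emb @ self.cls_emb.t()    # per-class similarity logits
```

During training these logits would feed a softmax cross-entropy over the Base classes; at test time the buffer is rebuilt from Base plus Novel class names.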


As shown in the table below, the performance of OVR-CNN on the COCO dataset far exceeds that of previous zero-shot object detection algorithms.

[Table: OVR-CNN results on COCO [1]]

RegionCLIP: Region-based Language-Image Pretraining


Paper address: https://arxiv.org/abs/2112.09106

Code address: https://github.com/microsoft/RegionCLIP

OVR-CNN uses BERT and a multimodal Transformer for image-text pre-training, but with the rise of cross-modal large-model research, researchers began to use more powerful cross-modal models such as CLIP and ALIGN for OVD training. A detector mainly classifies and recognizes proposals, i.e., region-level information. RegionCLIP [5], published at CVPR 2022, found that existing large models such as CLIP classify cropped regions much less accurately than whole images. To improve this, RegionCLIP proposes a new two-stage OVD scheme.


In the first stage, image-text matching datasets such as CC3M and COCO Caption are used for region-level distillation pre-training. Specifically:

  1. Extract the nouns and phrases that appear in the original long texts to form a concept pool, and use it to build a set of simple region-level descriptions for training.

  2. Use an RPN pre-trained on LVIS to extract proposal regions, and use the original CLIP to match and classify the extracted regions against the prepared descriptions, assembling the results into pseudo (region, text) labels (see the sketch after this list).

  3. Train a new CLIP model with region-text contrastive learning on the prepared proposal regions and pseudo labels, yielding a CLIP model specialized for region-level information.

  4. During pre-training, the new CLIP model also distills the classification ability of the original CLIP, and performs image-text contrastive learning at the whole-image level so that the new model retains its ability to represent complete images.
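The sketch below illustrates steps 2-3 in simplified form: a frozen teacher CLIP assigns each proposal its best-matching concept description as a pseudo label, and the new (student) model is trained with a region-text classification/contrastive loss against those pseudo labels. A CLIP-style interface with `encode_image`/`encode_text` and a tokenizer is assumed; the real RegionCLIP pipeline works on feature maps rather than image crops and also includes the distillation and whole-image losses of step 4.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label_regions(teacher_clip, tokenizer, region_crops, concept_prompts):
    # region_crops:    (N, 3, H, W) crops of proposals from the LVIS-pretrained RPN
    # concept_prompts: simple descriptions built from the concept pool,
    #                  e.g. "a photo of a {concept}"
    text_emb = F.normalize(teacher_clip.encode_text(tokenizer(concept_prompts)), dim=-1)
    img_emb = F.normalize(teacher_clip.encode_image(region_crops), dim=-1)
    sim = img_emb @ text_emb.t()          # (N, num_concepts) region-concept scores
    conf, idx = sim.max(dim=-1)           # best concept per region plus its confidence
    return idx, conf, text_emb            # low-confidence regions can be filtered out

def region_text_loss(student_region_emb, text_emb, pseudo_idx, tau=0.01):
    # pull each student region embedding towards the text embedding of its pseudo
    # label and away from the other concepts
    logits = F.normalize(student_region_emb, dim=-1) @ text_emb.t() / tau
    return F.cross_entropy(logits, pseudo_idx)
```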

In the second stage, the pre-trained model is transferred to the detection model via transfer learning.


RegionCLIP further extends the representational power of existing cross-modal large models to conventional detection models and achieves even better performance. As shown in the table below, RegionCLIP obtains a sizable improvement on the Novel categories compared with OVR-CNN. RegionCLIP effectively improves the fit between region-level information and multimodal large models through its first-stage pre-training, but CORA argues that when a cross-modal model of even larger parameter scale is used for this first-stage training, the training cost becomes very high.

[Table: RegionCLIP vs. OVR-CNN on COCO (Novel AP50) [5]]

CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching


Paper address: https://arxiv.org/abs/2303.13076

Code address: https://github.com/tgxs002/CORA

CORA [6] has been accepted to CVPR 2023. To overcome the two obstacles it identifies in current OVD work, it designs a DETR-style OVD model. As the paper title suggests, the model comprises two strategies: Region Prompting and Anchor Pre-Matching. The former uses prompt learning to refine the region features extracted by the CLIP-based region classifier, alleviating the distribution gap between whole images and regions; the latter uses an anchor pre-matching strategy from DETR-style detection to improve the model's generalization in localizing objects of novel categories.


CLIP's original visual encoder exhibits a distribution gap between whole-image features and region features, which in turn lowers the detector's classification accuracy (this is similar to RegionCLIP's starting point). CORA therefore proposes Region Prompting to adapt the CLIP image encoder and improve classification on region-level information. Specifically, the whole image is first encoded into a feature map by the first three layers of the CLIP image encoder; then anchor boxes or predicted boxes are pooled into region features via RoI Align, and these are encoded by the fourth layer of the CLIP image encoder. To alleviate the distribution gap between the whole-image feature map and the region features, learnable region prompts are combined with the features output by the fourth layer to produce the final region features, which are matched against text features. The matching loss is a plain cross-entropy loss, and all CLIP-related parameters are frozen during training.
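The sketch below follows the description above: the frozen first three CLIP blocks produce a whole-image feature map, RoI Align pools region features that the frozen fourth block encodes, and a learnable region prompt is combined with that output before matching against text features with a plain cross-entropy loss. All names and shapes are invented, and the exact point where the prompts are injected may differ in the official CORA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class RegionPrompting(nn.Module):
    def __init__(self, clip_stage1to3, clip_stage4, embed_dim, roi_size=14):
        super().__init__()
        self.stage1to3 = clip_stage1to3      # frozen: whole image -> feature map
        self.stage4 = clip_stage4            # frozen: pooled region -> embedding
        self.roi_size = roi_size
        # learnable region prompt, combined with the stage-4 output for each region
        self.region_prompt = nn.Parameter(torch.zeros(1, embed_dim))

    def forward(self, images, boxes_per_image, spatial_scale):
        # boxes_per_image: list of (num_boxes_i, 4) anchor/predicted boxes per image
        with torch.no_grad():                             # CLIP parameters stay frozen
            feat_map = self.stage1to3(images)             # (B, C, H, W)
            regions = roi_align(feat_map, boxes_per_image, output_size=self.roi_size,
                                spatial_scale=spatial_scale, aligned=True)
            region_emb = self.stage4(regions)             # (total_boxes, embed_dim)
        region_emb = region_emb + self.region_prompt      # prompt-adapted region features
        return F.normalize(region_emb, dim=-1)

def region_matching_loss(region_emb, class_text_emb, labels, tau=0.01):
    # plain cross-entropy between prompted region features and class-text features
    logits = region_emb @ F.normalize(class_text_emb, dim=-1).t() / tau
    return F.cross_entropy(logits, labels)
```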


CORA is a DETR-style detector and, like DETR, uses an anchor pre-matching strategy to generate candidate boxes in advance for box-regression training. Specifically, anchor pre-matching matches each label box with a set of closest anchor boxes to determine which anchors should be treated as positive samples and which as negatives. This matching is typically based on IoU (intersection over union): if the IoU between an anchor box and a label box exceeds a predefined threshold, the anchor is treated as a positive sample; otherwise it is treated as a negative. CORA shows that this strategy effectively improves generalization in localizing new categories.
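A minimal sketch of this IoU-based pre-matching rule, using torchvision's `box_iou`, is shown below; the thresholding is simplified relative to the full CORA/DETR matching procedure.

```python
import torch
from torchvision.ops import box_iou

def pre_match_anchors(anchors: torch.Tensor, gt_boxes: torch.Tensor,
                      iou_thresh: float = 0.5):
    # anchors:  (A, 4) candidate boxes in (x1, y1, x2, y2)
    # gt_boxes: (G, 4) ground-truth label boxes
    iou = box_iou(anchors, gt_boxes)                  # (A, G) pairwise IoU
    best_iou, best_gt = iou.max(dim=1)                # best label box for each anchor
    pos = best_iou >= iou_thresh                      # positives pass the IoU threshold
    matched_gt = torch.full((anchors.size(0),), -1,
                            dtype=torch.long, device=anchors.device)
    matched_gt[pos] = best_gt[pos]                    # -1 marks negative anchors
    # label boxes that no anchor reaches above the threshold are effectively ignored,
    # which is exactly the convergence issue discussed in the next paragraph
    ignored_gt = (iou.amax(dim=0) < iou_thresh).nonzero(as_tuple=True)[0]
    return matched_gt, ignored_gt
```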

However, the anchor pre-matching mechanism also introduces problems. Training on a label box can only proceed if at least one anchor box matches it; otherwise the label box is ignored, which hinders model convergence. Moreover, even when a label box does obtain a reasonably accurate anchor, it may still be discarded because of the limited recognition accuracy of the region classifier, i.e., the category of the label box is not aligned with what the CLIP-based region classifier predicts. CORA therefore uses a CLIP-Aligned technique that leverages the semantic recognition ability of CLIP and the localization ability of a pre-trained RoI head to relabel the images of the training set with little human effort; with these relabeled data, the model can match more label boxes during training.


Compared with RegionCLIP, CORA further improves AP50 on the COCO dataset by 2.4 points.


360 Artificial Intelligence Research Institute's Practice with OVD Technology

OVD technology is closely tied to the development of today's popular cross-modal and multimodal large models, and it also inherits the accumulated expertise of past research on object detection, making it a successful bridge between traditional AI techniques and research on general-purpose AI capabilities. OVD is a new, future-oriented object detection technology, and its ability to detect and localize arbitrary targets is in turn expected to advance multimodal large models further and to become an important cornerstone in the development of multimodal AGI.

The research priorities of the 360 Artificial Intelligence Research Institute in recent years have included the cross-modal direction in 2021, OVD and video analysis in 2022, and AIGC and multimodal large models in 2023. Backed by massive image-text data and long-term accumulation in the multimodal direction, the institute has developed its own large OVD model, which has been deployed in Internet, smart-hardware, and other businesses and can be widely applied to scenarios such as elderly care and equipment inspection. In the future, we plan to combine OVD with multimodal large language models (MLLMs), endowing the LLM with open-world object detection on top of its basic visual capabilities and bringing multimodal large models a step closer to general artificial intelligence.


Extra

To promote the popularization and development of OVD research in China, the 360 Artificial Intelligence Research Institute and the Chinese Society of Image and Graphics are jointly holding the 2023 Open World Object Detection Competition (link: https://360cvgroup.github.io/OVD_Contest/); registration is currently open. The competition is a chance to meet fellow researchers working on OVD, exchange ideas with them, work with data from real business scenarios, and experience the advantages and appeal of OVD technology in real production. You are welcome to register and share the news.

About the 360 Artificial Intelligence Research Institute: the institute is part of the 360 Technology Center. Since its establishment in 2015, it has accumulated a wide range of cutting-edge capabilities in artificial intelligence and machine learning, including but not limited to natural language understanding, machine vision and motion, and voice and semantic interaction; it has won championships or nominations in numerous competitions and published dozens of papers at top conferences and in top journals. On the business side, the institute supports 360 Group's full range of business scenarios, including intelligent and secure big data, Internet information distribution, enterprise digitalization, AIoT, and smart vehicles, serving tens of millions of hardware devices and hundreds of millions of users and handling data volumes in the hundreds of billions. In 2023, the institute is focusing on large language models, large CV models, and multimodal large models to provide the underlying technical support for the development and application of AIGC technology within 360 Group and across the industry.

About the Authors:

Wang Bin: Algorithm Engineer, Visual Engine Department, 360 AI Research Institute, focusing on OVD

Xie Chunyu: Visual Engine Department, 360 Artificial Intelligence Research Institute, leader of image-text multimodal technology, focusing on the cross-modal direction

Leng Dawei: Head of the 360 Visual Engine Department, leading the CV team in R&D on large models + zero/few-shot and multimodal + cross-modal directions

References

[1] Zareian A, Rosa K D, Hu D H, et al. Open-vocabulary object detection using captions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 14393-14402.

[2] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International conference on machine learning. PMLR, 2021: 8748-8763.

[3] Li J, Selvaraju R, Gotmare A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in neural information processing systems, 2021, 34: 9694-9705.

[4] Xie C, Cai H, Song J, et al. Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework[J]. arXiv preprint arXiv:2205.03860, 2022.

[5] Zhong Y, Yang J, Zhang P, et al. Regionclip: Region-based language-image pretraining[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 16793-16803.

[6] Wu X, Zhu F, Zhao R, et al. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching[J]. arXiv preprint arXiv:2303.13076, 2023.

[7] Kirillov A, Mintun E, Ravi N, et al. Segment anything[J]. arXiv preprint arXiv:2304.02643, 2023.

