sketch-detect

This paper proposes a new training paradigm that enables object detection without bounding box annotations or image-level class labels. 

Paper link: https://arxiv.org/pdf/2303.15149.pdf

Code link: http://www.pinakinathc.me/sketch-detect/

Sketches have been used by humans since prehistoric times to express and record objects, with an expressive power unmatched even by language: think of how naturally you reach for pen and paper (or a Zoom whiteboard) the moment an idea strikes. Sketch-centric research has also flourished in the past decade, spanning traditional tasks such as classification and synthesis, more sketch-specific tasks such as modeling visual abstraction, style transfer, and continuous stroke fitting, as well as some interesting applications such as turning a sketch into a classifier.

However, the expressiveness of sketches has so far been explored mainly in the form of sketch-based image retrieval (SBIR), especially its fine-grained variant. Tremendous progress has been made, and recent systems have reached a maturity ready for commercial adoption, a strong demonstration of the real impact that cultivating the expressiveness of sketches can have. In this article, the authors ask a question: what can human sketches do for the fundamental vision task of object detection? The result they envision is a sketch-based object detection framework that detects based on what you draw, that is, based on how you want to express it. For example, a sketch of "a zebra eating grass" should detect that particular zebra from a herd of zebras (instance-aware detection), while also giving you the freedom to specify parts (part-aware detection): if you only want the "head" of the "zebra", just draw a head.

Rather than designing a sketch-enabled object detection model from scratch, the authors show that a synergy between a foundation model such as CLIP and an off-the-shelf SBIR model can solve the problem quite elegantly: CLIP provides generalization, and SBIR bridges the (sketch → photo) gap. Specifically, CLIP is adapted to build the sketch and photo encoders (the two branches of a common SBIR model) by learning independent prompt vectors for each modality. During training, learnable prompt vectors are prepended to the input sequence of the first Transformer layer of CLIP's ViT backbone while the rest of the network is kept frozen, thereby injecting CLIP's generalization ability into the learned sketch and photo representations. Next, a training paradigm adapts the learned encoders for object detection such that the region embeddings of detected boxes align with the sketch and photo embeddings from SBIR. This allows the object detector to be trained without requiring additional training photos from an auxiliary dataset.
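As a concrete illustration of this prompt-learning step, here is a minimal PyTorch sketch (not the authors' code). The `patch_embed` and `transformer` handles, the prompt count, and the initialization scale are assumptions; `patch_embed` is assumed to return the CLS and patch tokens with positional encodings already added.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """One branch (sketch or photo) of the SBIR model: a frozen CLIP ViT
    with learnable prompt vectors prepended to the token sequence that
    enters the first Transformer layer."""

    def __init__(self, patch_embed, transformer, num_prompts=8, dim=768):
        super().__init__()
        self.patch_embed = patch_embed  # CLIP patch embedding (CLS + pos. enc.)
        self.transformer = transformer  # CLIP Transformer layers
        for p in self.parameters():     # freeze everything taken from CLIP
            p.requires_grad = False
        # the only trainable parameters: modality-specific prompt vectors
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, images):
        tokens = self.patch_embed(images)              # (B, 1 + N, dim)
        prompts = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        tokens = torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)
        return self.transformer(tokens)[:, 0]          # CLS embedding
```

Instantiating two such encoders, one for sketches and one for photos, gives independent prompt vectors per modality over shared frozen CLIP weights, matching the description above.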

To make the sketch-based object detector more interesting and general, the method further requires that it work in a zero-shot manner. To this end, object detection is extended from the predefined fixed-set setting to an open-vocabulary setting, following prior open-vocabulary detection work. Specifically, the classification head of the detector is replaced with prototype learning, which encodes query sketch features as a support set (prototypes). The detector is then trained with a cross-entropy (CE) loss in a weakly supervised object detection (WSOD) setting, covering all possible categories (prototypes) of instances. However, while SBIR is trained on object-level sketch/photo pairs, object detection operates at the image level (multiple objects). Therefore, training an object detector with SBIR also requires bridging the gap between object-level and image-level features. For this, the method uses a data augmentation technique that is very simple yet very effective for resisting dataset noise and generalizing to unseen categories: randomly select n ∈ {1, ..., 7} photos from the SBIR dataset and tile them at arbitrary positions on a blank canvas (similar in spirit to CutMix).
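To illustrate the prototype-learning head, here is a minimal PyTorch sketch (not the paper's code): region embeddings from the detector are scored against sketch-derived prototypes by cosine similarity, and an image-level loss stands in for the weak supervision. The temperature, the max-over-regions aggregation, and the binary cross-entropy formulation are all assumptions; the paper describes a CE loss over the prototypes.

```python
import torch.nn.functional as F

def prototype_logits(region_embs, prototypes, temperature=0.07):
    """Score each detected region against sketch-derived prototypes
    (the support set) by cosine similarity.

    region_embs: (R, D) embeddings of R proposed boxes
    prototypes:  (C, D) one (e.g. averaged) query-sketch embedding per category
    Returns (R, C) logits.
    """
    region_embs = F.normalize(region_embs, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    return region_embs @ prototypes.t() / temperature

def weak_image_loss(logits, image_labels):
    """Image-level loss for the WSOD setting: aggregate region scores per
    class (max over regions, a common WSOD choice) and compare them with
    multi-hot image-level targets."""
    image_scores = logits.max(dim=0).values            # (C,)
    return F.binary_cross_entropy_with_logits(image_scores, image_labels)
```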

Method

We propose a new training paradigm that enables object detection without bounding box annotations or image-level class labels, using sketch-based image retrieval for supervision instead. The paper therefore tackles three levels of the task: (i) fine-grained object detection, specifying regions of interest through fine-grained visual cues in the sketch; (ii) category-level object detection, specifying the categories of detected instances via sketches; and (iii) part-level object detection, detecting specified parts (e.g., the "head" or "legs" of a "horse").

Background

First, the paper introduces some background on object detection and sketch-based image retrieval:

  1. The framework consists of two modules - object detection and sketch-based image retrieval.

  2. Faster R-CNN is a state-of-the-art supervised object detection framework; this paper mainly uses its two-stage structure for training.

  3. A baseline sketch-based image retrieval framework is trained using a sketch/photo feature extractor and a triplet loss (a minimal sketch of this loss follows the list).

  4. Cross-category fine-grained sketch-based image retrieval extends this baseline with a hard-triplet loss and a category-discriminator loss.
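For reference, here is a minimal PyTorch sketch of the triplet objective mentioned in point 3 (the margin value is an assumption):

```python
import torch.nn.functional as F

def sbir_triplet_loss(sketch, photo_pos, photo_neg, margin=0.2):
    """Standard SBIR triplet objective: pull the matching photo's embedding
    toward the query sketch, push a non-matching photo's embedding away by
    at least `margin`.

    sketch, photo_pos, photo_neg: (B, D) embeddings from the sketch and
    photo branches of the feature extractor.
    """
    d_pos = F.pairwise_distance(sketch, photo_pos)  # sketch vs. match
    d_neg = F.pairwise_distance(sketch, photo_neg)  # sketch vs. non-match
    return F.relu(d_pos - d_neg + margin).mean()
```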

Weakly Supervised Object Detection

 

Localising Object Regions with Query Sketch

Prompt Learning for Generalised SBIR

Bridging Object-Level and Image-Level Features

While SBIR is trained on individual object-level sketch/photo pairs, object detection operates on image-level (multi-object) data. To train an object detector using SBIR, we need to bridge this gap between the object and image levels. Our solution is very simple: synthesize images of size (H × W), together with pseudo box annotations, by randomly tiling n ∈ {1, ..., 7} object-level photos from the SBIR dataset onto a blank canvas.
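A minimal sketch of this synthesis step using PIL (the canvas size, tile scale range, and overlap behavior are all assumptions; tiles are simply pasted at random, CutMix-style):

```python
import random
from PIL import Image

def synthesize_canvas(photos, canvas_size=(512, 512)):
    """Tile n in {1, ..., 7} object-level SBIR photos at random positions
    on a blank canvas, keeping each paste box as a pseudo annotation."""
    canvas = Image.new("RGB", canvas_size, (255, 255, 255))
    boxes = []
    n = random.randint(1, min(7, len(photos)))
    for photo in random.sample(photos, n):
        # random tile size between 1/4 and 1/2 of the canvas (an assumption)
        w = random.randint(canvas_size[0] // 4, canvas_size[0] // 2)
        h = random.randint(canvas_size[1] // 4, canvas_size[1] // 2)
        x = random.randint(0, canvas_size[0] - w)
        y = random.randint(0, canvas_size[1] - h)
        canvas.paste(photo.resize((w, h)), (x, y))
        boxes.append((x, y, x + w, y + h))  # (x1, y1, x2, y2)
    return canvas, boxes
```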

Experiment

 

The experiments use standard object detection datasets, PASCAL-VOC and MS-COCO. The proposed method uses sketch-based image retrieval for supervision, training object detection without bounding box annotations or image-level class labels.

The paper's retrieval results table reports the quantitative performance of two types of sketch-based image retrieval (SBIR) models: zero-shot category-level SBIR (CL-SBIR) and cross-category fine-grained SBIR (CC-FGSBIR). It lists mean average precision (mAP) and precision at 200 (P@200) for CL-SBIR, and top-1 and top-5 accuracy for CC-FGSBIR. The results are divided into three sections by the percentage of training data used: 100%, 70%, and 50%. Each section compares three models: GRL (Gradient Reversal Layer) and VKD (Visual Knowledge Distillation), two existing SBIR models, against Ours, the model proposed in this paper; the table also references CCD (Cross-Caption Decoder), another technique for improving SBIR performance. In all three settings the proposed model outperforms GRL and VKD, showing that it beats existing SBIR models in the zero-shot setting, where the model is not trained on the specific classes being tested.

This paper overcomes the limitations of fixed-set classifiers in both supervised and weakly supervised scenarios by introducing a sketch-based object detection framework. The model detects from drawn sketches without knowing which classes to expect at test time (zero-shot), and without the bounding boxes of full supervision or the class labels of weak supervision. Rather than designing a model from scratch, the paper demonstrates an intuitive synergy between foundation models such as CLIP and existing sketch-based models such as SBIR, which together solve the task elegantly: CLIP provides generalization, while SBIR bridges the (sketch → photo) gap. The model is trained by adapting the learned encoders for object detection such that the region embeddings of detected boxes align with the sketch and photo embeddings from SBIR.

A limitation of this paper is that it currently treats multiple query sketches as independent query embeddings, which may not suit complex scenes containing multiple objects with meaningful spatial relationships. Future work could extend fine-grained object detection to semantic segmentation using the scene sketches of the recently introduced FS-COCO dataset.

In summary, the sketch-based object detection framework proposed in this paper detects from drawn sketches, enabling fine-grained object detection by specifying regions of interest through fine-grained visual cues in the sketch. The model neither needs to know which classes to expect at test time (zero-shot) nor requires bounding box annotations or class labels. It is built by combining a foundation model (CLIP) with existing sketch models for sketch-based image retrieval (SBIR), and is trained by adapting the learned encoders for object detection so that the region embeddings of detected boxes align with the sketch and photo embeddings from SBIR. Evaluated in the zero-shot setting on standard object detection datasets such as PASCAL-VOC and MS-COCO, the method outperforms both supervised (SOD) and weakly supervised (WSOD) object detectors.

