【ECCV 2022】"FindIt: Generalized Localization with Natural Language Queries" Translation Notes

FindIt: Generalized Localization with Natural Language Queries

Abstract: This paper presents FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection. The key to the architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across these tasks. In addition, we find that a standard object detector is surprisingly effective at unifying these tasks, with no need for task-specific designs, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization, or detection queries involving zero, one, or multiple objects. Trained jointly on these tasks, FindIt outperforms the state of the art on referring expression comprehension and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories than strong single-task baselines. All of this is achieved with a single, unified, and efficient model.

1 Introduction

Natural language enables flexible descriptive queries about images. The interaction between text queries and images grounds linguistic meaning in the visual world, which aids the understanding of object relationships, human intentions toward objects, and interactions with the environment. The research community has studied visual grounding through tasks such as phrase grounding, object retrieval and localization, and language-driven instance segmentation [62_Flickr30k_Entities, 70_ReferItGame, 60_Revisiting_Image-Language_Networks, 68_Natural_Language_Object_Retrieval, 56_DMN, 80_Structured_Matching_for_Phrase_Localization, 25_Segmentation_from_Natural_Language_Expressions, 21_Contrastive_Learning_for_Weakly_Supervised_Phrase_Grounding].
Referring expression comprehension (REC) is one of the most popular visual grounding tasks: given a referring expression, it localizes the object that the text describes in the image [90_Modeling_Context_in_Referring_Expressions, 55_Generation_and_Comprehension_of_Unambiguous_Object_Descriptions, 70_ReferItGame].

Origin blog.csdn.net/songyuc/article/details/132459850