CVPR 2023 | FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Target task: Open-Vocabulary Image Segmentation

  • semantic segmentation
  • instance segmentation
  • panoptic segmentation

Main purpose

The Open Vocabulary learning paradigm extends segmentation systems to more general application scenarios. However, the existing practice of customizing a model for each task fragments the various segmentation tasks and hinders a unified segmentation model. This paper therefore proposes, via one-shot training, a general model with unified parameters and structure for handling Open Vocabulary segmentation tasks.
Prompts are introduced to unify the different task and category concepts so that the model adapts to different tasks and scenarios.

Open Vocabulary Segmentation aims to segment target categories that were unseen during training. Existing methods can be divided into two main directions:

  • Map visual features to semantic space.
  • Cross-modal alignment with pretrained models, which exploits the zero-shot capabilities of pretrained cross-modal models like CLIP. This paper is closely related to this type of work.

The main work

Method overview

Two-stage approach:

  • Extract general, class-agnostic mask proposals.
  • Use CLIP to perform zero-shot classification on the masks generated in the first stage.

Training phase:

  1. Training is performed on the seen categories and their corresponding labels.
  2. First, the Mask Proposal Extractor encodes the image to obtain the visual concept embeddings Fv (N×D) and the class-agnostic mask set M (N×H×W). The masks are supervised with Focal and Dice losses using task-specific labels.
  3. Each iteration randomly selects one of the three task labels (semantic, instance, or panoptic segmentation) for supervision, to avoid the gradient conflicts caused by cross-task training.
  4. To introduce tasks and categories, Adaptive Prompt Learning is designed to embed the task and the categories into a joint text embedding Ft (C×D, where C is the number of categories). The cosine similarity matching map between Fv and Ft gives the probability of each class-agnostic mask belonging to each predicted category, and is supervised with the class labels (see the sketch after this list).
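
A minimal sketch of how the cosine similarity matching map in step 4 can be computed, assuming PyTorch tensors with the shapes described above and a CLIP-style temperature (an assumption, not stated in the paper summary); this is an illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: N mask proposals, C categories, D embedding dimension.
N, C, D = 100, 20, 512
Fv = torch.randn(N, D)   # visual concept embeddings from the Mask Proposal Extractor
Ft = torch.randn(C, D)   # joint task-category text embeddings
temperature = 0.01       # assumed CLIP-style temperature, not specified above

# Cosine similarity = dot product of L2-normalized embeddings.
Fv_n = F.normalize(Fv, dim=-1)
Ft_n = F.normalize(Ft, dim=-1)
matching_map = Fv_n @ Ft_n.t() / temperature   # (N, C) similarity matching map
probs = matching_map.softmax(dim=-1)           # per-mask probability over the C categories

# During training these probabilities are supervised with the class labels
# of the ground-truth segments matched to each class-agnostic mask.
```
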

Questions that remain here:

  • How is the embedding of the image obtained from the complete spatial features?
  • What is the purpose and method of using category labels to supervise this strategy?

Testing phase:

  1. With the trained proposal extractor, a series of binary masks is obtained under text guidance.
  2. Mask-level encodings are obtained with the pre-trained CLIP image encoder.
  3. Calculate the similarity between each mask representation and the text embeddings (see the sketch after this list).
  4. According to the adaptive prompt learning, output task-specific segmentation results.
  5. The category set used in the test contains seen and unseen classes.
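
A minimal sketch of steps 2–3, assuming OpenAI's `clip` package, masks already at image resolution, and a fixed "a photo of a {c}" template standing in for the learned adaptive prompt (for brevity); an illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def classify_masks(image, masks, class_names):
    """image: (3, H, W) CLIP-normalized tensor; masks: (N, H, W) binary; class_names: seen + unseen classes."""
    # Mask out the background so each crop fed to CLIP shows a single proposal.
    masked = image.unsqueeze(0) * masks.unsqueeze(1)                  # (N, 3, H, W)
    masked = F.interpolate(masked, size=(224, 224), mode="bilinear")  # CLIP input resolution
    masked = masked.to(device, dtype=next(model.parameters()).dtype)
    with torch.no_grad():
        img_emb = model.encode_image(masked)                          # (N, D) mask-level encodings
        txt_emb = model.encode_text(
            clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device))  # (C, D)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.t()).softmax(dim=-1)                    # (N, C) per-mask class scores
```
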

Adaptive Prompt Learning

  • Design purpose: Encode any task and category into a textual representation.
  • Main content: a prompt-based learning strategy. Instead of fixedly putting all category and task names into the same template, it adaptively converts the task and category text into a set of learnable vectors, which are concatenated into the text embedding to facilitate model training.
  • Adaptive task prompt Pt: multiple learning tasks can be packed into the same framework, alleviating the conflicts of training on different tasks. Concretely, each of the three task names is combined with a set of learnable vectors to form a task prompt, which the CLIP text encoder encodes into Et.
  • Adaptive category prompt Pc: helps the model stay compatible with more categories, extend to unseen categories, and improve open-domain performance. Similarly, during training the seen categories are combined with learnable vectors to obtain the adaptive category prompt, which CLIP encodes into Ec. Et and Ec are concatenated to obtain the joint task-category embedding Ft. Since the actual input categories can be arbitrary, Ft transfers to the unseen categories of the open vocabulary (a sketch follows below).
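
A minimal, CoOp-style sketch of the idea, assuming a frozen CLIP text transformer and its token-embedding layer are available as `text_encoder` and `name_embeds` (names, shapes, and the number of context vectors are illustrative, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class AdaptivePrompt(nn.Module):
    """Learnable prompt vectors concatenated with the token embeddings of a task/category name."""
    def __init__(self, n_ctx: int = 8, embed_dim: int = 512):
        super().__init__()
        # Learnable context vectors, shared across names and optimized end-to-end.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, name_embeds: torch.Tensor, text_encoder) -> torch.Tensor:
        # name_embeds: (L, embed_dim) token embeddings of one task or category name.
        prompt = torch.cat([self.ctx, name_embeds], dim=0)  # [learnable ctx | name tokens]
        return text_encoder(prompt)                         # frozen CLIP text encoder -> Et or Ec

# The task prompt Pt and the category prompt Pc are encoded separately into Et and Ec,
# which are then concatenated to form the joint task-category embedding Ft.
```
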

Semantic Context Interaction Module

The general visual representation obtained by the Mask Proposal Extractor ignores category and task information, even though such information could provide more reliable cues for comprehensive reasoning.

For this reason, the authors place a semantic context interaction module between the text embedding Ft and the multi-scale visual features Fv(z) (z: layer index) in the extractor, using Ft to update Fv and thereby emphasize the visual features of the given text categories.

This module is based on cross-attention and a few linear layers (a rough sketch follows).
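
A rough sketch of such a module, assuming one feature scale flattened into tokens and batch-first tensors (the exact layer layout in the paper may differ):

```python
import torch
import torch.nn as nn

class SemanticContextInteraction(nn.Module):
    """Visual features attend to the text embedding Ft via cross-attention."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, Fv: torch.Tensor, Ft: torch.Tensor) -> torch.Tensor:
        # Fv: (B, H*W, dim) flattened visual features of one scale; Ft: (B, C, dim) text embeddings.
        attended, _ = self.cross_attn(query=Fv, key=Ft, value=Ft)
        return self.norm(Fv + self.proj(attended))  # residual update emphasizing the queried categories
```
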

Test Time Prompt Tuning

This is a test-time adaptation algorithm [Fully test-time adaptation by entropy minimization], used to **refine the adaptive class prompt** during testing, thereby improving cross-modal alignment for unseen categories.

  • Use the CLIP image encoder to encode the image after it has been masked with the predicted mask maps.
  • Calculate the cosine similarity scores with the Nu unseen classes.
  • Calculate the entropy of each query's class-probability distribution, i.e. the Shannon entropy of the softmax over its similarity scores: $H(p) = -\sum_{c=1}^{N_u} p_c \log p_c$.
  • Select the queries whose entropy is below the threshold τ, yielding K corresponding sample encodings.
  • Calculate the entropy loss on the screened score map and use it to optimize the parameters of the adaptive class prompt (see the sketch below).
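
A minimal sketch of one test-time tuning step, assuming the similarity scores are computed through the adaptive class prompt (so gradients reach its parameters) and an optimizer built over those parameters only; tensor names and the threshold value are illustrative:

```python
import torch

def entropy(probs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Shannon entropy of each query's class distribution, shape (N,).
    return -(probs * (probs + eps).log()).sum(dim=-1)

def ttpt_step(scores: torch.Tensor, optimizer: torch.optim.Optimizer, tau: float = 1.0) -> None:
    """scores: (N, Nu) similarities of N mask queries vs. the Nu unseen classes."""
    ent = entropy(scores.softmax(dim=-1))
    keep = ent < tau                       # keep only confident (low-entropy) queries
    if keep.any():
        loss = ent[keep].mean()            # entropy loss on the screened score map
        optimizer.zero_grad()
        loss.backward()                    # gradients flow only into the adaptive class prompt
        optimizer.step()
```
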

Experimental performance

A special variant, "CLIP", is introduced in the experiments; it directly uses the pre-trained CLIP text and visual encoders to perform the matching.

Ablation analysis

Single dataset

Across datasets

Visualization

Origin blog.csdn.net/P_LarT/article/details/130158346