[Image Segmentation] An interpretation of the SEEM (Segment Everything Everywhere All at Once) large vision model

Paper: https://arxiv.org/abs/2304.06718
Code: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once

From: Wisconsin-Madison, Microsoft, HKUST, etc.

1. Summary (Effect)

As the demand for interactive AI systems grows, and inspired by the development of general, prompt-based interfaces for LLMs, the comprehensive study of human-AI interaction in vision has gained momentum. This paper proposes SEEM, a promptable, interactive model for segmenting everything in an image, everywhere, all at once.

1. Text prompts. SEEM generates masks from user-input text for one-click segmentation (it adapts to many types of input images, from cartoons to movies and games).

[Figure: text-prompt segmentation examples across domains such as cartoons, movies, and games]
2. Image prompts. Given a reference image of an Optimus Prime truck, SEEM can segment Optimus Prime in any target image.

[Figure: segmenting Optimus Prime in target images from a referred image region]

3. Click & scribble prompts. SEEM segments objects with similar semantics in the target image when the user simply clicks or scribbles on the reference image.
[Figure: segmenting semantically similar objects from a click or scribble on a reference image]
4. In addition, SEEM is highly aware of spatial relationships: after the upper-left zebra in the reference image is scribbled on, the leftmost zebra in the target image is segmented as well.

[Figure: spatial-relationship example with zebras]
5. Video segmentation:

[Figure: video object segmentation examples]

SEEM is designed around four properties:

1) Versatility: a general prompt engine handles different prompt types, including points, boxes, scribbles, masks, text, and a referred region of another image;
2) Compositionality: a joint visual-semantic space is learned for visual and textual prompts, so prompts can be composed for dynamic query reasoning, as shown in the figure above;
3) Interactivity: conversation history is preserved by incorporating learnable memory prompts that condense previous mask information via mask-guided cross-attention;
4) Semantic awareness: a text encoder encodes text queries and mask labels into the same semantic space for open-vocabulary segmentation.

Once SEEM learns to compose different types of prompts in a unified representation space, it shows a strong ability to generalize to unseen user intents, and with its lightweight prompt decoder it can efficiently handle multiple rounds of interaction.

2. Introduction

The success of large language models (LLMs) such as ChatGPT shows how important it is for modern AI models to interact with humans. Interacting with humans requires a user-friendly interface that can take as many types of human input as possible and produce responses that humans can easily understand. In NLP, such a general interface has been emerging and evolving for some time, from early models such as GPT and T5 ("Exploring the limits of transfer learning with a unified text-to-text transformer") to more advanced techniques such as prompting and chain-of-thought. SAM also supports multiple prompts, but as shown in the figure below it supports only limited interaction types such as points and boxes, and it cannot perform high-level semantic tasks because it outputs no semantic labels (in the figure, SEEM covers a richer context along both axes: interaction types, e.g. the referred region of an example image, and semantic space).

[Figure: comparison of SAM and SEEM in interaction types and semantic awareness]

The paper advocates a universal interface that uses multimodal prompts to segment everything. Versatility: SEEM can handle any combination of input prompts (points, masks, text, boxes, or even a referred region of another image), all cast into the same joint visual-semantic space, which gives strong compositionality. Interactivity: memory prompts are introduced to condense previous segmentation information and communicate with the other prompts. Semantic awareness: the model provides open-vocabulary semantics for any output segmentation. All five prompt types are mapped into the joint visual-semantic space, enabling zero-shot adaptation to unseen user prompts. By training on diverse segmentation tasks, the model learns to handle a wide variety of prompts.

Besides strong generalization, SEEM runs efficiently. Prompts are fed to the decoder, so during multi-round interaction with a human the heavy feature extractor is run only once at the beginning, and each subsequent iteration re-runs only the lightweight decoder with the new prompts. In deployment, the heavy feature extractor can run on the server while the relatively lightweight decoder runs on the user's machine, reducing network latency across multiple remote calls.
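To make the "encode once, decode every round" pattern concrete, below is a minimal PyTorch sketch. It is not the official SEEM code: the module names (HeavyImageEncoder, LightPromptDecoder), dimensions, and the toy attention decoder are assumptions used only to illustrate where the heavy and the light computation sit.

```python
import torch
import torch.nn as nn

class HeavyImageEncoder(nn.Module):
    """Stand-in for the vision backbone; run only once per image."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify

    def forward(self, image):                      # image: (B, 3, H, W)
        return self.proj(image)                    # features Z: (B, C, H/16, W/16)

class LightPromptDecoder(nn.Module):
    """Stand-in for the lightweight decoder; re-run for every new prompt."""
    def __init__(self, dim=256, num_queries=20):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats, prompts):             # prompts: (B, P, C)
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)                 # (B, HW, C)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q = torch.cat([q, prompts], dim=1)                        # queries + prompts
        out, _ = self.attn(q, tokens, tokens)                     # cross-attend to Z
        mask_logits = torch.einsum("bqc,bhc->bqh", out, tokens)   # (B, Q+P, HW)
        return mask_logits.view(B, -1, H, W)

encoder, decoder = HeavyImageEncoder(), LightPromptDecoder()
image = torch.randn(1, 3, 512, 512)
feats = encoder(image)                             # heavy encoder runs once
for round_idx in range(3):                         # multi-round interaction
    prompts = torch.randn(1, 4, 256)               # new user prompts each round
    masks = decoder(feats, prompts)                # only the light decoder re-runs
    print(round_idx, tuple(masks.shape))
```

This mirrors the deployment split described above: the encoder can stay on the server while only the decoder is re-run per interaction.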

The main contributions of this paper are:

1. A unified prompting scheme that encodes various user intents into a joint visual-semantic space; it is versatile, compositional, interactive, and semantically aware, giving zero-shot generalization to unseen segmentation prompts.

2. The newly designed prompting mechanism is integrated into a lightweight decoder for all segmentation tasks, yielding SEEM, a universal interactive segmentation interface.

3. Experiments and visualizations on a range of segmentation tasks, including closed-set and open-set panoptic segmentation, interactive segmentation, referring segmentation, and segmentation with composed prompts, demonstrate strong performance.

3. Related work

Closed-Set Segmentation
General segmentation comprises several subtasks, including instance, semantic, and panoptic segmentation, each focusing on a different semantic level. For example, semantic segmentation aims to identify and label every pixel in an image according to its semantic class, while instance segmentation groups the pixels belonging to the same semantic class into separate object instances. In recent years, Transformer-based (DETR-style) models have made significant progress on segmentation tasks. However, these methods cannot recognize objects absent from the training set, which restricts the model to a fixed, limited vocabulary.

Open Set Segmentation
Referring segmentation models segment regions described in natural language and are therefore inherently open-vocabulary. However, because referring segmentation data is limited, trained models often perform well on the target datasets but are hard to transfer to practical applications. Recently, many open-vocabulary segmentation models have been proposed that transfer visual-semantic knowledge from large pre-trained vision-language models such as CLIP, either freezing or fine-tuning their weights. X-Decoder proposed a unified approach to various vision-language tasks, including segmentation and open-vocabulary segmentation. To enlarge the vocabulary, OpenSeeD improves segmentation with a large amount of detection data and a joint training method, and ODISE uses a text-to-image diffusion model as the backbone for open-vocabulary segmentation.

Interactive Segmentation
Interactive segmentation segments objects by iteratively taking user input. The interaction can take various forms such as clicks, boxes, polygons, and scribbles, with click-based models being the most prevalent. SAM is a promptable segmentation model trained on 11 million images that shows strong zero-shot performance, using user interactions as prompts for general segmentation. However, SAM's segmentations carry no semantic meaning, and its prompt types are limited to points, boxes, and text.

4. Method

SEEM employs a generic encoder-decoder architecture with sophisticated interaction between queries and prompts, as shown in figure (a) below. Given an input image I ∈ R^{H×W×3}, an image encoder is first used to extract image features Z. The SEEM decoder then predicts the masks M and semantics C based on the interaction between the query outputs O_h^m (mask embeddings) and O_h^c (class embeddings) and the textual, visual, and memory prompts P_t, P_v, P_m.
[Figure: SEEM architecture (a) and multi-round interaction (b); see caption below]
(a) Left: overview of the model. Features and prompts are first encoded into a joint visual-semantic space by their corresponding encoders or samplers, while the learnable queries are randomly initialized. The SEEM decoder takes the queries, features, and prompts as input and uses class and mask embeddings for mask and semantic prediction. Right: details of the SEEM decoder and the visual sampler. (b) Multi-round interaction. Each round consists of a human loop and a model loop. In the human loop, the human receives the mask output of the previous iteration and gives positive or negative feedback for the next round of decoding via visual prompts. In the model loop, the model receives and updates the memory prompts for future predictions.
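Following the notation above, here is a hedged sketch of how the predictions could be formed from the decoder outputs: mask logits as the dot product of the mask embeddings O_h^m with the image features Z, and class logits as the similarity of the class embeddings O_h^c with the vocabulary text embeddings. All shapes below are illustrative assumptions, not the paper's exact configuration.

```python
import torch

B, Q, C, H, W, K = 1, 10, 256, 32, 32, 80        # batch, queries, dim, feature size, vocab size

Z = torch.randn(B, C, H, W)                      # image features from the encoder
O_m = torch.randn(B, Q, C)                       # mask embeddings  O_h^m from the decoder
O_c = torch.randn(B, Q, C)                       # class embeddings O_h^c from the decoder
T = torch.randn(K, C)                            # text embeddings of the vocabulary

# Mask prediction M: dot product of mask embeddings with per-pixel features.
masks = torch.einsum("bqc,bchw->bqhw", O_m, Z)   # (B, Q, H, W) mask logits

# Semantic prediction C: similarity of class embeddings with vocabulary embeddings.
classes = torch.einsum("bqc,kc->bqk", O_c, T)    # (B, Q, K) class logits

print(tuple(masks.shape), tuple(classes.shape))
```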

4.1 Versatility

In SEEM, visual prompts P_v handle all non-textual inputs such as points, boxes, scribbles, and a referred region of another image. These non-textual prompts help disambiguate the user's intent when a text prompt alone cannot identify the correct segment. For interactive segmentation, previous work either converts spatial queries into masks and feeds them into the image backbone, or uses a separate prompt encoder for each input type (points, boxes). The first approach is too heavy in practice, since the image must pass through the feature extractor at every interaction; the second is hard to generalize to unseen prompts. To address these limitations, SEEM proposes a visual sampler (Fig. 3(a)) that converts all non-textual queries into visual prompts lying in the same visual embedding space:

P_v = VisualSampler(s, Ẑ),

where Ẑ is a feature map extracted either from the target image itself (i.e., Ẑ = Z) or from a referred image, and s is the user-specified sampling location (box, scribble, polygon). The corresponding region is first pooled from the image features by point sampling; for each visual prompt, at most 512 point feature vectors are uniformly sampled from the region specified by the prompt. A further advantage of this method is that the visual prompts are naturally well aligned with the textual prompts, because the model continuously learns a common visual-semantic space through panoptic segmentation and referring segmentation.
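A rough sketch of what such a visual sampler could look like under the description above: pool up to 512 point features, uniformly sampled from the region the user marked (a box, scribble, or polygon rasterized to a binary mask). This is an assumption-level illustration, not the official implementation.

```python
import torch

def visual_sampler(feature_map, region_mask, max_points=512):
    """feature_map: (C, H, W) from the target or referred image; region_mask: (H, W) bool."""
    C, H, W = feature_map.shape
    ys, xs = torch.nonzero(region_mask, as_tuple=True)        # pixels covered by the prompt
    if ys.numel() == 0:
        return torch.empty(0, C)
    # Uniformly subsample at most `max_points` locations from the region.
    idx = torch.linspace(0, ys.numel() - 1, steps=min(max_points, ys.numel())).long()
    return feature_map[:, ys[idx], xs[idx]].T                 # (num_points, C) visual prompts P_v

feats = torch.randn(256, 64, 64)                              # feature map Ẑ
scribble = torch.zeros(64, 64, dtype=torch.bool)
scribble[20:40, 10:30] = True                                 # pretend the user scribbled here
P_v = visual_sampler(feats, scribble)
print(tuple(P_v.shape))                                       # at most (512, 256)
```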

Panoptic segmentation: every pixel in the image must be assigned a semantic label and an instance id.
Referring segmentation: cross-modal segmentation; given a sentence description, segment the corresponding object region in the image.

4.2 Compositionality

In practice, a user may express an intent with a single input type or a combination of several, so compositional prompting is needed. Two issues arise in training such a model. First, the training data usually covers only one interaction type at a time (e.g., none, textual, or visual). Second, although visual prompts unify all non-textual inputs and are aligned with textual prompts, their embedding spaces remain intrinsically different. To solve this, different types of prompts are matched with different outputs: since visual prompts come from image features while textual prompts come from the text encoder, visual prompts are matched to the mask embeddings O_h^m and textual prompts to the class embeddings O_h^c when selecting the matched output index.

[Equation: prompt-to-output matching criterion for visual and textual prompts]

where IoU_mask is the intersection-over-union between the ground-truth mask and the predicted mask. The proposed split matching outperforms matching only against O_h^m or only against O_h^c for all prompt types.
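A toy sketch of the split-matching idea (an assumption about the mechanism rather than the paper's exact loss): a visual prompt is matched to the query whose predicted mask has the highest IoU with the ground truth, while a textual prompt is matched to the query whose class embedding is most similar to the text embedding.

```python
import torch

def iou(pred, gt):                                   # pred: (Q, H, W) bool, gt: (H, W) bool
    inter = (pred & gt).flatten(1).sum(-1).float()
    union = (pred | gt).flatten(1).sum(-1).float().clamp(min=1)
    return inter / union

Q, H, W, C = 10, 32, 32, 256
pred_masks = torch.rand(Q, H, W) > 0.5               # binarized predicted masks per query
gt_mask = torch.rand(H, W) > 0.5                     # ground-truth mask
class_emb = torch.randn(Q, C)                        # O_h^c per query
text_emb = torch.randn(C)                            # embedding of the text prompt

visual_match = iou(pred_masks, gt_mask).argmax()     # index matched to the visual prompt (IoU_mask)
text_match = (class_emb @ text_emb).argmax()         # index matched to the textual prompt
print(int(visual_match), int(text_match))
```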

After training, the model is familiar with all prompt types and supports various combinations, such as no prompt, a single prompt type, or both visual and textual prompts, with the same model and weights. In particular, visual and textual prompts can simply be concatenated and fed to the SEEM decoder, even though the model was never trained with that combination.

4.3 Interactivity

Interactive segmentation usually cannot be completed in one shot and needs multiple rounds of refinement, much like conversational agents such as ChatGPT. SEEM introduces a new type of prompt, the memory prompts P_m, which carry the mask knowledge of the previous iteration into the current one. No additional module is introduced; the memory prompts encode the history using a mask-guided cross-attention layer:

[Equation: mask-guided cross-attention update of the memory prompts P_m]

where M_p is the previous mask and Z is the image feature map, so the cross-attention only takes effect inside the region specified by the previous mask. The updated memory prompts P_m^l then interact with the other prompts via self-attention to convey the history to the current round. This design can easily be extended to support segmenting multiple objects simultaneously.
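A minimal sketch of mask-guided cross-attention for the memory prompt, assuming it can be expressed with a standard attention layer whose keys outside the previous mask M_p are ignored. Names and shapes are illustrative; the actual SEEM layer may differ.

```python
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

Z = torch.randn(1, 32 * 32, dim)                 # flattened image features (B, HW, C)
P_m = torch.randn(1, 1, dim)                     # memory prompt from the previous round
M_p = torch.rand(1, 32 * 32) > 0.5               # previous predicted mask, flattened to (B, HW)

# key_padding_mask marks positions to IGNORE, so pass the complement of M_p:
# the memory prompt attends only to features inside the previously predicted region.
P_m_updated, _ = attn(P_m, Z, Z, key_padding_mask=~M_p)
print(tuple(P_m_updated.shape))                  # (1, 1, 256): updated memory prompt P_m^l
```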

4.4 Semantic Awareness

SEEM provides semantic labels for masks produced by any prompt combination in a zero-shot manner, because the visual prompt features are aligned with the textual features in a joint visual-semantic space. As shown in the figure below, the semantic label is computed directly from O_h^c (the class embedding of the visual query output) and the text embeddings of the vocabulary. Although no semantic labels are used when training interactive segmentation, the computed logits are well aligned, benefiting from the joint visual-semantic space.
[Figure: zero-shot semantic label prediction via the joint visual-semantic space]
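A minimal sketch of the zero-shot labeling step: compare the class embedding O_h^c of a predicted mask with the text embeddings of an open vocabulary and take the most similar entry. The vocabulary and the random stand-in encoders below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

vocab = ["zebra", "truck", "person", "background"]
text_emb = F.normalize(torch.randn(len(vocab), 256), dim=-1)   # stand-in for text-encoder outputs
O_c = F.normalize(torch.randn(256), dim=-1)                    # class embedding of one predicted mask

logits = text_emb @ O_c                                        # cosine similarity per vocabulary entry
print(vocab[int(logits.argmax())])                             # predicted open-vocabulary label
```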

5. Experiments

Datasets and settings. SEEM is trained on three types of data: panoptic segmentation, referring segmentation, and interactive segmentation. COCO2017 is used for panoptic and interactive segmentation training, giving 107K segmented images in total. For referring segmentation, the RefCOCO, RefCOCOg, and RefCOCO+ annotations of COCO images are combined. All segmentation tasks are evaluated, including generic segmentation (instance/panoptic/semantic), referring segmentation, and interactive segmentation.
Implementation details and evaluation metrics. SEEM follows the X-Decoder framework for the vision backbone, language backbone, and encoder, and differs only in the decoder, which is replaced by the SEEM decoder. For the vision backbone, Focal-T [54] and DaViT-d3 (B) [9] are used; for the language encoder, a UniCL or Florence text encoder [55, 59]. The metrics are PQ (panoptic quality) for panoptic segmentation, AP for instance segmentation, and mIoU for semantic segmentation. For interactive segmentation, user clicks are simulated by automatically comparing the predicted segmentation with the ground truth: after each click produces a predicted mask, the next click is placed at the center of the region with the largest segmentation error. Interactive segmentation is evaluated with the Number of Clicks (NoC) metric, the number of clicks required to reach a given IoU (85% or 90%), denoted NoC@85 and NoC@90, respectively.
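The click-simulation protocol can be approximated with a short script. This is a rough sketch under stated assumptions (a toy model_step stand-in, and a distance transform to pick a point deep inside the error region as its "center"); it is not the official evaluation code.

```python
import numpy as np
from scipy import ndimage

def iou(pred, gt):
    return (pred & gt).sum() / max((pred | gt).sum(), 1)

def next_click(pred, gt):
    error = pred != gt                                    # false positives + false negatives
    dist = ndimage.distance_transform_edt(error)          # deepest point of the error region
    return np.unravel_index(dist.argmax(), dist.shape)

def noc(model_step, gt, target_iou=0.85, max_clicks=20):
    """Number of Clicks needed to reach `target_iou` (e.g. 0.85 for NoC@85)."""
    pred = np.zeros_like(gt, dtype=bool)
    clicks = []
    for n in range(1, max_clicks + 1):
        clicks.append(next_click(pred, gt))               # simulate the next user click
        pred = model_step(clicks)                         # re-run the (light) decoder
        if iou(pred, gt) >= target_iou:
            return n
    return max_clicks

# Toy usage: a stand-in "model" that only gets the mask right after three clicks.
gt = np.zeros((64, 64), dtype=bool); gt[16:48, 16:48] = True
dummy_model = lambda clicks: gt if len(clicks) >= 3 else np.zeros_like(gt)
print(noc(dummy_model, gt))                               # -> 3
```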

  1. Interactive Segmentation

Table 1 compares SEEM with state-of-the-art interactive segmentation models. SEEM achieves performance comparable to RITM, SimpleClick, etc., and comes very close to SAM, which is trained with 50× more segmentation data than SEEM.
[Table 1: comparison with state-of-the-art interactive segmentation models]

  2. Generic Segmentation
    A single set of parameters is pre-trained on all segmentation tasks and evaluated directly on generic segmentation datasets.

  3. Referring Segmentation

As shown in the table below, adding composed visual prompts improves referring segmentation by 5.7, 3.6, and 4.2 points under the cIoU, mIoU, and AP50 metrics for the tiny model, and by 2.5, 1.5, and 0.4 points for the base model, respectively. Specifically, these numbers are computed with the class embedding O_h^c (Output-Q-Text); when grounding is computed with the mask embedding O_h^m (Output-Q-Visual) instead, the gains are even larger, as shown in the table below. In addition, a naive combination (directly combining the output probabilities of the visual and textual masks) is also benchmarked.
[Table: referring segmentation results with and without composed visual prompts]

  4. Ablation Experiment

When iterations and negative visual prompts are added, generic segmentation performance drops slightly, and it drops further if the model is trained from scratch. As expected, referring segmentation performance also degrades when training from scratch, and it is reduced further when negative visual prompts are added. On the other hand, adding more iterations to the interactive segmentation task slightly improves grounding performance. With iterations and negative visual prompts, interactive segmentation performance improves progressively, while training from scratch surprisingly gives a slight improvement on the Pascal VOC dataset.
In the table below, "Iter" denotes multi-round iterative segmentation and "Negative" denotes adding negative points during interactive segmentation.
[Table: ablation of iterative training, negative prompts, and training from scratch]
  5. Qualitative Results

See summary.


Origin blog.csdn.net/qq_45752541/article/details/130403228