Visual Large Model Series | SEEM: a segmentation foundation model with stronger interactive capabilities than SAM and semantic awareness

Welcome to the official WeChat account of "CVHub"!

Title: Segment Everything Everywhere All at Once

Paper: https://arxiv.org/pdf/2304.06718.pdf

Code: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once

Introduction

Figure 1. SEEM can handle any segmentation task

Despite the growing need for interactive AI systems, comprehensive studies of AI-human interaction in the domain of visual understanding (e.g., segmentation) remain scarce.

Recently, SAM, a promptable segmentation model, was proposed for general segmentation tasks. Although SAM advances the progress of large CV models, its segmentation results lack semantic meaning, and its prompt types are limited to points, boxes, and text. In contrast, this study builds a more general segmentation system that can segment all objects with semantic meaning using a single pre-trained model, and that supports all types of interactions and prompt combinations.

Inspired by the development of prompt-based generic interfaces, this paper proposes SEEM, a promptable, interactive model for segmenting everything in an image at once. SEEM has four desired properties:

Versatility

Versatility is achieved by introducing a flexible prompting engine that accepts points, boxes, scribbles, masks, text, and referred regions of another image;

Compositionality

Compositionality is achieved by learning a joint visual-semantic space in which visual and textual prompts are composed on the fly for queries, as shown in Figure 1;

Interactivity

Interactivity is achieved by incorporating learnable memory prompts that retain dialog history through mask-guided cross-attention;

Semantic awareness

Semantic awareness for open-vocabulary segmentation is achieved by using a text encoder to encode text queries and mask labels.

Extensive experiments validate the effectiveness of SEEM. SEEM exhibits strong generalization, learning to compose different types of prompts in a unified representation space to handle unseen user intents. Furthermore, SEEM handles multiple rounds of interaction efficiently using a lightweight prompt decoder.

Background

Figure 2. Comparison with SAM

The success of large language models (LLMs) such as ChatGPT demonstrates the importance of modern AI models interacting with humans and offers a glimpse of artificial general intelligence (AGI). The ability to interact with humans requires a user-friendly interface that can accept as many types of human input as possible and generate responses that humans can easily understand. In natural language processing (NLP), such general-purpose interfaces have existed and evolved for some time, from early models such as GPT and T5 to more advanced techniques such as prompting and chain-of-thought. In image generation, recent work combines text prompts with other modalities, such as sketches or layouts, to capture user intent more accurately, generate new prompts, and support multiple rounds of AI interaction.

This paper proposes a general prompting scheme that can interact with users through multiple types of prompts (such as text, clicks, and images), and builds on it a general **"segment everything"** model, SEEM. The model adopts a Transformer encoder-decoder architecture, feeds all queries into the decoder as prompts, and uses image and text encoders as prompt encoders to encode all types of queries, so that visual and textual prompts are always aligned.

Furthermore, the model introduces memory prompts to retain previous segmentation information and lets them communicate with the other prompts to enhance interactivity. Different from other works such as SAM, this model supports multiple prompt types and has zero-shot generalization ability. Experimental results demonstrate that SEEM performs strongly on many segmentation tasks, including closed-set and open-set panoptic segmentation, interactive segmentation, grounded (referring) segmentation, and segmentation with multiple combined prompts.

Method

Figure 3. SEEM architecture

SEEM is a model with a generic encoder-decoder architecture but with sophisticated interaction between queries and prompts, as shown in Fig. 3(a). Given an input image I ∈ R^{H×W×3}, the image encoder first extracts image features Z; the SEEM decoder then lets its output queries Q_h interact with the visual, textual, and memory prompts ⟨P_t, P_v, P_m⟩ to predict the masks M and semantic concepts C.
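To make this data flow concrete, here is a minimal PyTorch-style sketch of a decoder with the interface described above. All names (`SEEMDecoderSketch`, `mask_head`), layer counts, and tensor shapes are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class SEEMDecoderSketch(nn.Module):
    """Illustrative sketch of the SEEM decoder interface (not the official code).

    Learnable queries Q_h interact with image features Z and with the
    visual / textual / memory prompts <P_v, P_t, P_m> to predict masks M;
    semantic concepts C are then read off the mask embeddings in the joint
    visual-semantic space (see the similarity sketch further below).
    """

    def __init__(self, dim=512, num_queries=100, num_layers=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))       # learnable Q_h
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(dim, dim)                             # mask embeddings O^m_h

    def forward(self, Z, P_v=None, P_t=None, P_m=None):
        # Z:   image features, shape (B, HW, dim)
        # P_*: optional prompt token tensors, each of shape (B, N_*, dim)
        B, num_q = Z.shape[0], self.queries.shape[0]
        prompts = [p for p in (P_v, P_t, P_m) if p is not None]
        tgt = self.queries.unsqueeze(0).expand(B, -1, -1)                # (B, Nq, dim)
        if prompts:
            tgt = torch.cat([tgt] + prompts, dim=1)                      # queries and prompts interact
        out = self.decoder(tgt=tgt, memory=Z)                            # cross-attend to image features
        mask_embed = self.mask_head(out[:, :num_q])                      # keep the query outputs Q_h
        masks = torch.einsum("bqc,bpc->bqp", mask_embed, Z)              # masks M as dot products with Z
        return masks, mask_embed
```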

During training, Q_h is replicated for panoptic segmentation, referring segmentation, and interactive segmentation.

At inference time, the learnable queries are initialized from the same set of weights, enabling zero-shot composition. This design is inspired by the successful practice of X-Decoder, but differs in that it yields a generic image segmentation model with the following properties:

Versatility

In addition to text input, SEEM introduces a visual prompt P_v to handle all non-text inputs such as points, boxes, scribbles, and referred regions of another image.

When text prompts cannot pinpoint the correct segmentation region, non-text prompts provide useful complementary information for accurate localization. Previous interactive segmentation methods usually convert spatial queries into masks and feed them into the image backbone, or use a separate prompt encoder for each input type (point, box). However, these approaches either make the model heavy or generalize poorly.

To address these issues, SEEM proposes visual prompts to unify all non-text inputs. These visual prompts are uniformly represented as tokens living in the same visual embedding space, so all non-text inputs can be processed in the same way. To extract features for these visual prompts, the model introduces a "visual sampler" that samples location-specific features from the feature maps of the input image or a reference image.
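As a rough illustration of what such a visual sampler could look like, the sketch below assumes the non-text prompt has already been rasterized into a binary region mask at the feature-map resolution; the function name, the 512-point budget, and the random sampling strategy are assumptions, not details taken from the released code:

```python
import torch
import torch.nn.functional as F

def visual_sampler_sketch(feature_map, region_mask, num_points=512):
    """Sample location-specific prompt features from an image feature map (sketch).

    feature_map: (C, H, W) features of the input or reference image.
    region_mask: (H, W) binary mask rasterized from a point/box/scribble/mask prompt,
                 at the same resolution as the feature map.
    Returns: (num_points, C) prompt tokens living in the visual embedding space.
    """
    ys, xs = torch.nonzero(region_mask, as_tuple=True)           # pixel coords inside the prompt
    if ys.numel() == 0:
        raise ValueError("empty prompt region")
    idx = torch.randint(0, ys.numel(), (num_points,))            # sample up to num_points (with replacement)
    ys, xs = ys[idx].float(), xs[idx].float()

    H, W = region_mask.shape
    # normalize coordinates to [-1, 1] in (x, y) order for grid_sample
    grid = torch.stack([2 * xs / (W - 1) - 1, 2 * ys / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, num_points, 2)                        # (N=1, Hout=1, Wout=P, 2)

    sampled = F.grid_sample(feature_map.unsqueeze(0), grid, align_corners=True)
    return sampled.squeeze(0).squeeze(1).t()                     # (num_points, C)
```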

In addition, SEEM continuously learns a joint visual-semantic space through panoptic and referring segmentation, so visual prompts align naturally with textual prompts and better guide the segmentation process. When learning semantic labels, the prompt features are mapped into the same space as the text prompts to compute a similarity matrix, allowing the two kinds of prompts to cooperate in completing the segmentation task.

Compositionality

Users may express their intent with different or combined input types, so composable prompting is crucial in practical applications.

However, two problems arise during training. First, the training data typically covers only one interaction type at a time (e.g., none, text, or visual). Second, although visual prompts unify all non-text inputs and are aligned with text prompts, their embedding spaces remain intrinsically different.

To address this, the paper matches different types of prompts to different outputs during training. Once trained, SEEM is familiar with all prompt types and supports arbitrary combinations: no prompt, a single prompt type, or visual and textual prompts used together. Remarkably, visual and textual prompts can simply be concatenated and fed into the SEEM decoder, as the sketch below illustrates.
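The remark that visual and textual prompts "can simply be concatenated" might look like the following; the helper name and tensor shapes are hypothetical:

```python
import torch

def compose_prompts(P_v=None, P_t=None):
    """Compose any combination of prompts for the SEEM decoder (illustrative sketch).

    P_v: visual prompt tokens from the visual sampler, shape (B, N_v, C) or None
    P_t: textual prompt tokens from the text encoder,  shape (B, N_t, C) or None
    Because both live in the joint visual-semantic space, composition is
    just concatenation along the token dimension.
    """
    prompts = [p for p in (P_v, P_t) if p is not None]
    if not prompts:
        return None                      # "no prompt": the decoder falls back to generic segmentation
    return torch.cat(prompts, dim=1)     # (B, N_v + N_t, C)

# Usage: the composed tokens are fed to the decoder together with the image features.
# prompt_tokens = compose_prompts(P_v=visual_tokens, P_t=text_tokens)
```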

Interactivity

SEEM introduces memory prompts P_m to perform multi-round interactive segmentation and further refine the results. Memory prompts carry the segmentation results from previous iterations, encoding the interaction history into the model for the current round.

Unlike previous work that uses an extra network to encode the mask, SEEM uses a mask-guided cross-attention mechanism to encode the history, which exploits the previous segmentation result more effectively for the next round of refinement. Notably, this approach also extends to interactive segmentation of multiple objects simultaneously.
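Below is a minimal sketch of what mask-guided cross-attention could look like, assuming the previous mask has been resized to the feature resolution; the boolean masking convention and all names are implementation assumptions, not the official code:

```python
import torch
import torch.nn as nn

class MaskGuidedCrossAttention(nn.Module):
    """Memory prompts attend to image features only inside the previous mask (sketch)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, P_m, Z, prev_mask):
        # P_m:       memory prompt tokens, (B, N_m, C)
        # Z:         image features flattened to (B, HW, C)
        # prev_mask: previous-round mask logits at feature resolution, (B, HW)
        attn_mask = prev_mask.sigmoid() < 0.5                      # True = position is ignored
        # expand to (B * num_heads, N_m, HW) as expected by MultiheadAttention
        attn_mask = attn_mask[:, None, :].expand(-1, P_m.shape[1], -1)
        attn_mask = attn_mask.repeat_interleave(self.attn.num_heads, dim=0)
        # if a row masks every position, unmask it to avoid NaNs
        attn_mask[attn_mask.all(dim=-1)] = False
        out, _ = self.attn(query=P_m, key=Z, value=Z, attn_mask=attn_mask)
        return P_m + out                                           # updated memory prompts carry the history
```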

Semantic awareness

Unlike previous class-agnostic interactive segmentation methods, SEEM assigns semantic labels to masks produced from any combination of prompts, because its visual prompt features are aligned with textual features in a joint visual-semantic space.

During training, no semantic labels are used for interactive segmentation; nevertheless, thanks to the joint visual-semantic space, a similarity matrix can be computed between the mask embeddings O^m_h and the visual sampler output Q_v, so the resulting logits are well aligned, as shown in Fig. 3(a).
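Because mask embeddings and text-encoded concept embeddings live in the same space, semantic labels can be read off such a similarity matrix. A minimal sketch under the usual cosine-similarity convention follows; the function name and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def semantic_logits(mask_embeddings, concept_embeddings, temperature=0.07):
    """Assign open-vocabulary labels to predicted masks (illustrative sketch).

    mask_embeddings:    O^m_h from the decoder, shape (B, Nq, C)
    concept_embeddings: text-encoded class names / prompts, shape (K, C)
    Returns logits of shape (B, Nq, K); argmax over K gives the semantic label.
    """
    m = F.normalize(mask_embeddings, dim=-1)
    t = F.normalize(concept_embeddings, dim=-1)
    return torch.einsum("bqc,kc->bqk", m, t) / temperature     # cosine-similarity matrix

# Example: label each mask with the most similar concept
# labels = semantic_logits(O_m_h, text_encoder(["cat", "dog", "sky"])).argmax(dim=-1)
```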

In this way, the query image can aggregate information from multiple examples during inference.

Experiments

Our method performs on par with models such as RITM and SimpleClick, and is comparable to SAM, which is trained with roughly 50× more segmentation data.

Visual prompts are more effective than text prompts, and the highest IoU is achieved when prompting with Visual + Text.

When adding iterations and negative visual prompts, the performance of generic segmentation drops slightly. Moreover, training the model from scratch degrades generic segmentation even more.

Figure 4. Click-based segmentation

SEEM supports user clicks or scribbles in any format. Moreover, it simultaneously produces semantic labels for the segmentation masks, which SAM cannot do.

Figure 5. Text-guided segmentation

The referring text is displayed on the mask. SEEM works well on various kinds of input images, including cartoons, movies, and games.

Figure 6. Visual referring segmentation

Given a reference image with simple spatial prompts, SEEM can segment semantically similar content in different target images.

Summary

This paper introduces SEEM, a model that can segment everything with semantics all at once and interact with users, accepting diverse visual prompts including clicks, boxes, polygons, scribbles, text, and referred regions of another image. These prompts are mapped into a joint visual-semantic space by a prompt encoder, making the model adaptable to various prompts and flexible in combining them. Extensive experiments demonstrate that the model performs competitively on several open-vocabulary and interactive segmentation benchmarks.

Of course, SEEM is not perfect. Its two main limitations are that the training data is limited in scale and that SEEM does not support part-based segmentation. Model performance could be further improved with more training data and supervision, and part-based segmentation could be learned seamlessly without changing the model. Finally, the segmentation dataset released with SAM deserves thanks; it is a very valuable resource and should be put to good use.


If you are also interested in the full-stack field of artificial intelligence and computer vision, you are strongly encouraged to follow the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, multi-domain, in-depth interpretations of cutting-edge papers and mature industrial solutions every day! Feel free to add the editor's WeChat: cv_huber, with the note "CSDN", to join the official CVHub academic & technical exchange group and discuss more interesting topics together!
