Title: Segment Everything Everywhere All at Once
Paper: https://arxiv.org/pdf/2304.06718.pdf
Code: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once
Introduction
Despite the growing need for interactive AI systems, comprehensive studies of AI-human interaction in visual understanding (e.g., segmentation) remain scarce. Recently, SAM, a promptable segmentation model, was proposed for general segmentation tasks. Although SAM advances progress toward large vision models, its segmentation results lack semantic meaning, and its prompt types are limited to points, boxes, and text. In contrast, this work builds a more general segmentation system that can segment all objects with semantics through a single pre-trained model and supports all types of interactions and prompt combinations.
In this paper, inspired by the development of prompt-based universal interfaces, we propose SEEM, a promptable, interactive model for segmenting everything everywhere in an image at once. SEEM has four desired properties:

1. **Versatility**: a flexible prompting engine handles points, boxes, scribbles, masks, free-form text, and referred regions of another image;
2. **Compositionality**: a joint visual-semantic space is learned so that visual and textual prompts can be composed on the fly, as shown in Figure 1;
3. **Interactivity**: dialogue history is retained by incorporating learnable memory prompts via mask-guided cross-attention;
4. **Semantic-awareness**: a text encoder encodes text queries and mask labels, enabling open-vocabulary segmentation.
Extensive experiments validate the effectiveness of SEEM. The model exhibits strong generalization, learning to compose different types of prompts in a unified representation space to handle unseen user intents. Furthermore, SEEM handles multiple rounds of interaction efficiently with a lightweight prompt decoder.
Background
The success of large language models (LLMs) such as ChatGPT demonstrates the importance of modern AI models in interacting with humans and offers a glimpse of artificial general intelligence (AGI). The ability to interact with humans requires a user-friendly interface that can accept as many types of human input as possible and generate responses that humans can easily understand. In natural language processing (NLP), such universal interfaces have existed and evolved for some time, from early models such as GPT and T5 to more advanced techniques such as prompting and chain-of-thought. In image generation, some recent works combine text prompts with other input types, such as sketches or layouts, to capture user intent more precisely, generate new prompts, and support multi-round AI interaction.
This paper proposes a universal prompting scheme that can interact with users through multiple prompt types (e.g., text, clicks, images), and builds on it a universal **"segment everything"** model, SEEM. The model employs a Transformer encoder-decoder architecture, feeds all queries into the decoder as prompts, and uses image and text encoders as prompt encoders to encode all query types, so that visual and textual prompts are always aligned. Furthermore, the model introduces memory prompts to condense previous segmentation information and lets them communicate with the other prompts to enhance interactivity. Different from works such as SAM, this model supports multiple prompt types and has zero-shot generalization ability. Experimental results demonstrate SEEM's strong performance on many segmentation tasks, including closed-set and open-set panoptic segmentation, interactive segmentation, referring segmentation, and segmentation with multiple prompts combined.
Method
SEEM uses a generic encoder-decoder architecture, but with sophisticated interaction between queries and prompts, as shown in Figure 3(a). Given an input image I ∈ R^{H×W×3}, an image encoder first extracts image features Z. The SEEM decoder then predicts masks M and semantic concepts C from learnable queries Q_h that interact with the visual, textual, and memory prompts ⟨P_t, P_v, P_m⟩.
During training, Q_h is replicated for panoptic segmentation, referring segmentation, and interactive segmentation. At inference time, the learnable queries are initialized from the same set of weights, enabling zero-shot composition. This design is inspired by the successful practice of X-Decoder, but differs in that it yields a universal image segmentation model with the following properties:
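To make the query-prompt interaction concrete, here is a minimal NumPy sketch of one decoder step. This is a stand-in for the real transformer decoder: the function name, shapes, and the single additive attention update are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def seem_decoder_step(Z, Q_h, prompts, text_emb):
    """One greatly simplified SEEM-style decoder step.

    Z        : (HW, d) flattened image features from the image encoder
    Q_h      : (q, d)  learnable object queries
    prompts  : (p, d)  concatenated prompt tokens <P_t, P_v, P_m> (may be empty)
    text_emb : (c, d)  text-encoder embeddings of candidate class names

    Returns mask logits M (q, HW) and class logits C (q, c).
    """
    if prompts.shape[0] > 0:
        # Queries attend to the prompt tokens (stand-in for cross-attention).
        attn = softmax(Q_h @ prompts.T)      # (q, p) attention weights
        Q_h = Q_h + attn @ prompts           # prompt-conditioned queries
    M = Q_h @ Z.T                            # mask logits: query-pixel similarity
    C = Q_h @ text_emb.T                     # class logits in the joint space
    return M, C
```

With an empty prompt set the same function degrades gracefully to plain panoptic-style prediction, which mirrors the "no prompt" mode described below.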
Versatility
In addition to text input, SEEM introduces visual prompts P_v to handle all non-text inputs such as points, boxes, scribbles, and a referred region of another image. When text prompts cannot precisely identify the correct segmentation region, non-text prompts provide useful supplementary information to localize it accurately. Previous interactive segmentation methods usually either convert spatial queries into masks and feed them into the image backbone, or use a different prompt encoder for each input type (point, box). However, these approaches are either too heavyweight or hard to generalize.
To address these issues, SEEM proposes visual prompts to unify all non-text inputs. These visual prompts are uniformly represented as tokens living in the same visual embedding space, so all non-text inputs can be processed in the same way. To extract features for these visual prompts, the model also introduces a "visual sampler" that extracts location-specific features from the feature map of the input image or a referred image.
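As a sketch of what such a visual sampler might do, one can pool point features from the feature map under the binary region rasterized from a click, box, or scribble. The function name, the fixed point budget, and the random subsampling strategy are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def visual_sampler(feat, region_mask, num_points=512, seed=0):
    """Sample up to `num_points` feature vectors under a prompt region.

    feat        : (H, W, d) feature map of the input (or referred) image
    region_mask : (H, W)    binary mask rasterized from a point/box/scribble
    Returns prompt tokens of shape (n, d), n <= num_points.
    """
    ys, xs = np.nonzero(region_mask)             # pixels covered by the prompt
    n = min(num_points, len(ys))
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(ys), size=n, replace=False)
    return feat[ys[idx], xs[idx]]                # location-specific prompt tokens
```

Because the resulting tokens live in the same visual embedding space regardless of whether the region came from a click, box, or scribble, all non-text inputs can be handled uniformly downstream.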
In addition, SEEM continuously learns a common visual-semantic space through panoptic and referring segmentation, so visual prompts are naturally aligned with textual prompts and can better guide the segmentation process. When learning semantic labels, prompt features are mapped into the same space as the text prompts to compute a similarity matrix, so the two prompt types cooperate to complete the segmentation task.
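That similarity computation can be sketched as cosine similarity in the joint space. This is a minimal sketch under stated assumptions: the function name and the temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def semantic_logits(prompt_feat, class_emb, temperature=0.07):
    """Class logits for prompt features mapped into the text space.

    prompt_feat : (n, d) prompt (or mask) features
    class_emb   : (c, d) text embeddings of class names / mask labels
    Returns an (n, c) similarity matrix scaled by a temperature.
    """
    p = prompt_feat / np.linalg.norm(prompt_feat, axis=-1, keepdims=True)
    t = class_emb / np.linalg.norm(class_emb, axis=-1, keepdims=True)
    return (p @ t.T) / temperature               # cosine-similarity logits
```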
Compositionality
Users may express their intent with different or combined input types, so supporting prompt composition is crucial in practice. However, two problems arise during training. First, the training data typically covers only one interaction type (e.g., none, text, or visual). Second, although visual prompts unify all non-text prompts and are aligned with text prompts, their embedding spaces remain intrinsically different. To address this, this paper matches different types of prompts to different outputs. After training, SEEM becomes familiar with all prompt types and supports various combinations: no prompt, a single prompt type, or visual and textual prompts used together. Remarkably, visual and textual prompts can simply be concatenated and fed into the SEEM decoder.
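A toy sketch of matching prompt types to different outputs: visual prompts are scored against per-query mask embeddings and text prompts against per-query class embeddings, and combined prompts simply add their scores. All names and the argmax selection rule are illustrative assumptions.

```python
import numpy as np

def select_query(mask_emb, class_emb, visual_prompt=None, text_prompt=None):
    """Pick the output query that best matches the given prompt(s).

    mask_emb  : (q, d) per-query mask embeddings  (matched by visual prompts)
    class_emb : (q, d) per-query class embeddings (matched by text prompts)
    visual_prompt, text_prompt : optional (d,) pooled prompt features
    Returns the index of the best-matching query.
    """
    score = np.zeros(mask_emb.shape[0])
    if visual_prompt is not None:
        score += mask_emb @ visual_prompt        # visual prompts score masks
    if text_prompt is not None:
        score += class_emb @ text_prompt         # text prompts score classes
    return int(np.argmax(score))                 # combined prompts simply add up
```

The design choice this illustrates: because each prompt type has its own matching target, a combination of prompts needs no new mechanism at inference time.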
Interactivity
SEEM performs multi-round interactive segmentation by introducing memory prompts P_m, which further refine the segmentation result. Memory prompts carry the segmentation results of previous iterations, encoding history into the model for the current round. Different from previous work that uses an extra network to encode the mask, SEEM uses a mask-guided cross-attention mechanism to encode the history, which exploits the segmentation history more effectively for the next round of refinement. Notably, this approach can also be extended to interactive segmentation of multiple objects simultaneously.
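A minimal sketch of mask-guided cross-attention for updating memory prompts. This is an assumption-laden simplification: the real decoder uses multi-head attention with learned projections, whereas here the previous-round mask simply hides background pixels from a single-head attention.

```python
import numpy as np

def mask_guided_cross_attention(P_m, Z, prev_mask):
    """Update memory prompts using only pixels inside the previous mask.

    P_m       : (m, d)  memory prompt tokens
    Z         : (HW, d) flattened image features
    prev_mask : (HW,)   binary mask from the previous interaction round
    Returns updated memory prompts of shape (m, d).
    """
    logits = P_m @ Z.T                               # (m, HW) attention logits
    logits = np.where(prev_mask > 0, logits, -1e9)   # mask-guided: hide background
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # softmax over foreground only
    return P_m + w @ Z                               # history-conditioned prompts
```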
Semantic-awareness
Unlike previous class-agnostic interactive segmentation methods, SEEM assigns semantic labels to masks produced from all types of prompt combinations, because its visual prompt features are aligned with textual features in the joint visual-semantic space. During training, no semantic labels are used for interactive segmentation; nevertheless, thanks to the joint visual-semantic space, the similarity matrix between the mask embeddings O^m_h and the visual sampler output Q_v can still be computed, so the resulting logits are well aligned, as shown in Figure 3(a). In this way, the query image can aggregate information from multiple examples during inference.
Experiments
For interactive segmentation, our method performs comparably to specialized models such as RITM and SimpleClick, and is close to SAM, which is trained with roughly 50× more segmentation data. Among single prompt types, Visual prompts are notably more effective than Text prompts, and the highest IoU is achieved when prompting with Visual + Text combined. Adding iterative interaction and negative visual prompts slightly hurts generic segmentation performance, and training the model from scratch degrades it even more.
SEEM supports user clicks or scribbles in any form, and it simultaneously assigns semantic labels to the segmentation masks, which is not possible with SAM. For referring segmentation, the referred text is displayed on the mask. SEEM also handles a wide variety of input images from cartoons, movies, and games. Given a reference image with a simple spatial prompt, SEEM can segment semantically similar content in different target images.
Conclusion
This paper introduces SEEM, a model that can segment everything with semantics all at once while interacting with users, accepting different types of prompts including clicks, boxes, polygons, scribbles, text, and referred regions of another image. These prompts are mapped into a joint visual-semantic space by a prompt encoder, making the model adaptable to various prompts and flexible in combining them. Extensive experiments demonstrate that the model performs competitively on several open-vocabulary and interactive segmentation benchmarks.
Of course, SEEM is not perfect. Its two main limitations are that the training data is limited in size, and that SEEM does not support part-based segmentation. Model performance can be further improved by leveraging more training data and supervision, and part-based segmentation can be learned seamlessly without changing the model. Finally, we sincerely thank SAM for the proposed segmentation dataset, a very valuable resource that the community should make good use of.