HKUST & Microsoft | Semantic-SAM: Multi-Granularity Semantic Universal Segmentation Model

Title: Semantic-SAM: Segment and Recognize Anything at Any Granularity
Paper: arxiv.org/pdf/2307.04…
Code: github.com/ux-decoder/…

Overview

This paper presents Semantic-SAM, a general image segmentation model that can segment and recognize images at any desired granularity. The model has two key strengths: semantic awareness and granularity richness. To achieve semantic awareness, the paper integrates multiple datasets of different granularities and trains with decoupled classification of objects and parts, which lets the model transfer knowledge across rich semantic information. To achieve multi-granularity capability, the paper proposes a multi-choice learning scheme in which each click point generates masks at multiple levels, matched to multiple ground-truth masks. Notably, this is the first attempt to jointly train a model on SA-1B together with generic and part segmentation datasets. Experimental results and visualizations show that the model achieves both semantic awareness and granularity richness. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, improves performance.

Contributions

The main contributions of this paper are as follows:

  • **Multi-granularity richness:** Through the multi-choice learning scheme, the model generates segmentation masks at multiple granularities, from whole objects down to fine-grained parts. This granularity richness allows the model to adapt to different segmentation tasks and application scenarios.

  • **Semantic awareness:** By integrating multiple datasets and decoupling object and part classification, the model transfers knowledge across semantic information and perceives rich semantics.

  • Comprehensive experiments on multiple datasets demonstrate the superior performance of Semantic-SAM in terms of semantic awareness and granularity completeness. The robustness and applicability of the model are further shown through experiments and analyses of several aspects of the model.

Method

Dataset integration

The paper uses seven datasets containing masks at different granularity levels: SA-1B, COCO panoptic, ADE20K panoptic, PASCAL Part, PACO, PartImageNet, and Objects365. These datasets provide rich semantic information, covering object-level and part-level masks as well as category labels. Details are listed in Table 1, and the data distribution of the different data types is shown in Figure 2.
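
To make the joint-training setup concrete, here is a minimal Python sketch of how batches might be drawn from these heterogeneous sources. The dataset names follow the paper, but the sampling weights and the `sample_batch_source` helper are illustrative assumptions, not the paper's actual recipe.

```python
import random

# Hypothetical registry of the seven training sources and the kind of
# supervision each provides. Dataset names follow the paper; the
# sampling weights are illustrative assumptions, not the paper's ratios.
DATASETS = {
    "SA-1B":        {"labels": None,          "weight": 4.0},  # masks only, no classes
    "COCO":         {"labels": "object",      "weight": 1.0},
    "ADE20K":       {"labels": "object",      "weight": 1.0},
    "Objects365":   {"labels": "object",      "weight": 1.0},
    "PASCAL Part":  {"labels": "object+part", "weight": 1.0},
    "PACO":         {"labels": "object+part", "weight": 1.0},
    "PartImageNet": {"labels": "object+part", "weight": 1.0},
}

def sample_batch_source() -> str:
    """Pick which dataset the next training batch is drawn from."""
    names = list(DATASETS)
    weights = [DATASETS[n]["weight"] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

# Which loss terms apply depends on the supervision available: SA-1B
# batches contribute only mask losses, while labeled batches also
# contribute object (and possibly part) classification losses.
```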

Semantic-SAM

The overall flow of the model is shown in Figure 3. In Semantic-SAM, clicks and boxes are represented in a unified anchor-box format $b = (x, y, w, h)$ and encoded into $K$ content embeddings and one positional embedding. The content embeddings are represented as a set of query vectors $Q = (q_1, \dots, q_K)$, where each query vector $q_i$ combines a granularity-level embedding $q_i^{\text{level}}$ and a query-type embedding $q_i^{\text{type}}$. The positional embedding is implemented with a sinusoidal encoding of the anchor box.

Using the image-encoder output feature $F$ as input, Semantic-SAM's mask decoder represents a click on the input image as

$$O = \text{DeformDec}(Q, b, F), \qquad O = (o_1, \dots, o_K),$$

where $\text{DeformDec}(\cdot, \cdot, \cdot)$ is a deformable decoder that takes query features, a reference box, and image features, and outputs refined query features. Each output query feature $o_i = (c_i, m_i)$ contains a predicted semantic class $c_i$ and a mask $m_i$, which are used to compute the concept recognition loss and the mask prediction loss.
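
As a rough illustration of the prompt encoding described above, the following PyTorch sketch builds the $K$ content embeddings from level and type embeddings and sinusoidally encodes the anchor box. The module names, hidden size, and frequency layout are our assumptions, not the official UX-Decoder implementation.

```python
import torch
import torch.nn as nn

def sinusoidal_encoding(box: torch.Tensor, d: int) -> torch.Tensor:
    """Sine/cosine encoding of the 4 anchor-box coordinates into d dims."""
    half = d // 8  # d/8 frequencies per coordinate: 4 coords * 2 * d/8 = d
    freqs = 10000 ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = box.unsqueeze(-1) * freqs                             # (B, 4, d/8)
    return torch.cat([angles.sin(), angles.cos()], -1).flatten(1)  # (B, d)

class PromptEncoder(nn.Module):
    """Builds Q = (q_1, ..., q_K) with q_i = q_i^level + q_i^type."""
    def __init__(self, d: int = 256, K: int = 6):
        super().__init__()
        self.level_embed = nn.Embedding(K, d)  # one embedding per granularity
        self.type_embed = nn.Embedding(2, d)   # 0 = click prompt, 1 = box prompt

    def forward(self, box: torch.Tensor, is_box: bool):
        # box: (B, 4) anchor in (x, y, w, h); a click is a tiny box
        # centered on the clicked point.
        q_level = self.level_embed.weight                  # (K, d)
        q_type = self.type_embed.weight[int(is_box)]       # (d,)
        content = q_level + q_type                         # content queries Q
        pos = sinusoidal_encoding(box, q_level.shape[-1])  # positional embedding
        return content, pos
```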

Training

Recognize Anything

The paper trains the model on different types of data: data with object-level annotations (e.g., the COCO dataset), data with both object-level and part-level annotations (e.g., the PASCAL Part dataset), and SA-1B, which has no semantic annotations but contains masks at various semantic levels.

To better transfer semantic information across granularities, the paper proposes decoupling object recognition from part recognition. A shared text encoder is used to encode both objects and parts, so that they can be segmented and classified separately. Note that all data types share the same format, but the loss design differs per data type.
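
A minimal sketch of this decoupled recognition, assuming a CLIP-style setup where the shared text encoder embeds both vocabularies and queries are scored by cosine similarity. Here `text_encoder`, the function signature, and the `has_part_labels` flag are hypothetical stand-ins rather than the paper's exact interface.

```python
import torch
import torch.nn.functional as F

# `text_encoder` is a stand-in for the shared language model; it should
# map a list of class names to a (num_classes, d) embedding matrix.
def classify(query_feat, object_names, part_names, text_encoder, has_part_labels):
    obj_emb = F.normalize(text_encoder(object_names), dim=-1)  # (No, d)
    part_emb = F.normalize(text_encoder(part_names), dim=-1)   # (Np, d)
    q = F.normalize(query_feat, dim=-1)                        # (B, d)

    # Objects and parts are scored against *separate* label spaces even
    # though the text encoder is shared -- this is the decoupling.
    obj_logits = q @ obj_emb.T
    # Part classification is only supervised on part-annotated data
    # (e.g., PASCAL Part, PACO); SA-1B contributes no classification loss.
    part_logits = q @ part_emb.T if has_part_labels else None
    return obj_logits, part_logits
```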

Segment at any granularity

To endow the model with multi-granularity segmentation ability, the paper adopts a many-to-many matching scheme: by reorganizing the data and applying the Hungarian algorithm, a single click can be matched to ground-truth masks at multiple levels. For box inputs and generic segmentation, the paper takes a similar approach, adding noised boxes and using learnable tokens as prompts to generate masks for box regions and to perform generic segmentation. This enables more accurate prediction and matching during training.
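
The following sketch illustrates the matching step for a single click, assuming a simple 1 − IoU matching cost. The paper's full cost also includes classification and mask loss terms, so treat this as a schematic rather than the exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_click(pred_masks: np.ndarray, gt_masks: np.ndarray, point):
    """pred_masks: (K, H, W) bool; gt_masks: (M, H, W) bool; point: (y, x)."""
    y, x = point
    # A click is matched to *every* ground-truth mask that contains it,
    # so one prompt yields multiple targets at different granularities.
    targets = gt_masks[gt_masks[:, y, x]]                    # (T, H, W)
    inter = (pred_masks[:, None] & targets[None]).sum((-2, -1))
    union = (pred_masks[:, None] | targets[None]).sum((-2, -1))
    cost = 1.0 - inter / np.maximum(union, 1)                # (K, T), 1 - IoU
    rows, cols = linear_sum_assignment(cost)                 # Hungarian matching
    return list(zip(rows.tolist(), cols.tolist()))           # (pred_idx, gt_idx)
```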

As shown in Figure 5, Semantic-SAM differs from previous interactive segmentation models in two respects. First, the model is trained to output all possible segmentation masks for a single click. Second, the output granularity is richer, yielding more diverse output masks.

Experiments

Segmentation Performance

For results, the model is benchmarked mainly on COCO Val2017 and a subset of SA-1B containing 1,000 images:

Semantic Segmentation

Table 5 reports the 1-click mIoU of SAM and Semantic-SAM on COCO Val2017. Under the same settings, Semantic-SAM outperforms SAM.
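
For reference, here is a rough sketch of what a 1-click mIoU protocol can look like, assuming one click at each ground-truth mask's center of mass and a `model.predict(image, click)` interface. Both assumptions are hypothetical; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def one_click_miou(model, images, gt_masks_per_image) -> float:
    """Average IoU over one center-of-mass click per ground-truth mask."""
    ious = []
    for img, gts in zip(images, gt_masks_per_image):
        for gt in gts:                                # gt: (H, W) bool
            ys, xs = np.nonzero(gt)
            click = (int(ys.mean()), int(xs.mean()))  # hypothetical click choice
            pred = model.predict(img, click)          # hypothetical interface
            ious.append(iou(pred, gt))
    return float(np.mean(ious))
```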

The paper also compares the granularities that SAM and Semantic-SAM output for a single click.

In Figure 6, Semantic-SAM is compared with SAM; Semantic-SAM outputs better segmentation quality and richer granularity.

The paper also finds that each of the K = 6 content-prompt embeddings learns to correspond to a fixed granularity. As shown in Figure 7, when the masks for each click are visualized in the order of their content embeddings, the masks in each row are always arranged in ascending order of granularity. This shows that each content embedding in the model represents a specific level of semantic granularity.
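
This observation suggests a simple sanity check: if each content-embedding slot encodes a fixed granularity, the mask areas indexed by slot should grow monotonically. A tiny hypothetical check:

```python
import numpy as np

def granularity_is_ordered(masks_per_click: np.ndarray) -> bool:
    """masks_per_click: (K, H, W) bool, indexed by content-embedding slot."""
    areas = masks_per_click.sum(axis=(1, 2))
    # If each slot encodes a fixed granularity, areas ascend with slot index.
    return bool(np.all(np.diff(areas) >= 0))
```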

Summary

This paper introduces Semantic-SAM, which can segment and recognize anything at any desired granularity. Beyond performing generic open-vocabulary segmentation, Semantic-SAM demonstrates the advantages of semantic awareness and granularity richness. To achieve these advantages, the paper makes improvements in data, modeling, and training: it leverages datasets at multiple granularity and semantic levels, a multi-choice learning scheme, and a unified modeling framework. Comprehensive experiments and visualizations validate the semantic awareness and granularity richness of the model. Furthermore, Semantic-SAM is the first successful attempt to jointly train on SA-1B and other classical segmentation datasets. Experimental results also show that training with SA-1B improves other tasks such as panoptic segmentation and part segmentation.


Source: juejin.im/post/7258526520167219237