Segment Anything: Paper Walkthrough

Paper link: https://arxiv.org/pdf/2304.02643.pdf

Abstract:

The paper introduces the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using an efficient model in a data-collection loop, the authors built the largest segmentation dataset to date, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. Its capabilities are evaluated on numerous tasks, and its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. The Segment Anything Model (SAM) and the corresponding dataset (SA-1B), with 1B masks and 11 million images, are released at https://segment-anything.com to foster research into foundation models for computer vision.

1. Introduction

  Large language models pre-trained on web-scale datasets are revolutionizing NLP with strong zero-shot and few-shot generalization [10]. These "foundation models" [8] can generalize beyond the tasks and data distributions seen during training. This capability is typically realized through prompt engineering, where hand-crafted text prompts a language model to produce a valid text response for the task at hand. When scaled and trained with abundant text corpora from the web, the zero- and few-shot performance of these models compares surprisingly well to (and in some cases even matches) fine-tuned models [10, 21]. Empirical trends show this behavior improving with model scale, dataset size, and total training compute [56, 10, 21, 51].
  Foundation models have also been explored in computer vision, albeit to a lesser extent. Perhaps the most prominent line of work aligns paired text and images from the web. For example, CLIP [82] and ALIGN [55] use contrastive learning to train text and image encoders that align the two modalities. Once trained, engineered text prompts enable zero-shot generalization to new visual concepts and data distributions. Such encoders also compose effectively with other modules to enable downstream tasks such as image generation (e.g., DALL·E [83]). While much progress has been made on vision and language encoders, computer vision covers a wide range of problems beyond this scope, and for many of them abundant training data does not exist.
  The goal of this work is to build a foundation model for image segmentation. That is, the paper seeks to develop a promptable model and pre-train it on a broad dataset using a task that enables strong generalization. With this model, the aim is to solve a range of downstream segmentation problems on new data distributions via prompt engineering.
  The success of this plan hinges on three components: task, model, and data. To develop them, the paper addresses the following questions about image segmentation.

  • What task will enable zero-shot generalization?
  • What is the corresponding model architecture?
  • What data can support this task and model?

  These questions are deeply intertwined and require a comprehensive solution. The paper first defines a promptable segmentation task that is general enough to provide a powerful pre-training objective and to enable a wide range of downstream applications. The task requires a model that supports flexible prompting and can output segmentation masks in real time when prompted, allowing interactive use. Training such a model requires a diverse, large-scale data source. Unfortunately, there is no web-scale data source for segmentation; to address this, the authors build a "data engine", that is, they iterate between using the efficient model to help collect data and using the newly collected data to improve the model. The following sections introduce each interconnected component, followed by the dataset that was created and the experiments that demonstrate the effectiveness of the approach.

  • Task (§2). In NLP and, more recently, computer vision, foundation models are a promising development that enable zero-shot and few-shot learning on new datasets and tasks through "prompting" techniques. Inspired by this line of work, the paper proposes the promptable segmentation task, whose goal is to return a valid segmentation mask given any segmentation prompt (see Figure 1a). A prompt simply specifies what to segment in an image; for example, it can include spatial or textual information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. The promptable segmentation task serves both as the pre-training objective and, via prompt engineering, as the means of solving general downstream segmentation tasks.
  • Model (§3). The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. In particular, the model must support flexible prompts, must compute masks in amortized real time to allow interactive use, and must be ambiguity-aware. A simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and the two sources of information are combined in a lightweight mask decoder that predicts segmentation masks. We call this model the Segment Anything Model, or SAM for short (see Figure 1b). By separating SAM into an image encoder and a prompt encoder/mask decoder, the same image embedding can be reused across different prompts (amortizing its cost). Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in about 50 ms in a web browser. The focus is on point, box, and mask prompts, and preliminary results are also presented for free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, allowing it to handle ambiguity naturally, as in the shirt-and-person example.
  • Data engine (§4). To achieve strong generalization to new data distributions, SAM must be trained on a large and diverse set of masks, beyond any existing segmentation dataset. While a typical approach for foundation models is to obtain data online [82], masks are not naturally abundant, so another strategy is needed. The solution is to build a "data engine", co-developing the model with model-in-the-loop dataset annotation (see Figure 1c). The data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM automatically generates masks for a subset of objects when prompted with likely object locations, and annotators focus on annotating the remaining objects, helping to increase mask diversity. In the final stage, SAM is prompted with a regular grid of foreground points, yielding on average about 100 high-quality masks per image.
  • Dataset (§5). The final dataset, SA-1B, includes more than 1B masks from 11M licensed and privacy-preserving images (see Figure 2). SA-1B, collected fully automatically with the final stage of the data engine, has 400x more masks than any existing segmentation dataset [66, 44, 117, 60], and extensive verification shows that the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, the authors hope SA-1B becomes a valuable resource for research aimed at building new foundation models.
  • Responsible AI (§6). Potential fairness concerns and biases in using SA-1B and SAM are studied and reported. Images in SA-1B span a geographically and economically diverse set of countries, and SAM is found to perform similarly across different groups of people. Together, the authors hope this makes the work more equitable for real-world use cases. Model and dataset cards are provided in the appendix.
  • Experiments (§7). SAM is evaluated extensively. First, using a diverse suite of 23 segmentation datasets, SAM produces high-quality masks from a single foreground point, often only slightly below the manually annotated ground truth. Under a zero-shot transfer protocol using prompt engineering, consistently strong quantitative and qualitative results are obtained on a variety of downstream tasks, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM can be used out of the box with prompt engineering to solve a variety of tasks involving object and image distributions beyond SAM's training data. Nevertheless, as discussed in §8, room for improvement remains.
    For research purposes, the SA-1B dataset is released, and SAM is made available under a permissive open license (Apache 2.0) at https://segment-anything.com. SAM's capabilities are also showcased through an online demo.

2. The Segment Anything Task

  Taking inspiration from NLP, where the next-token prediction task is used for foundation model pre-training and diverse downstream tasks are solved through prompt engineering [10], this paper aims to define a task with analogous capabilities in order to build a foundation model for segmentation.

  • Task. First, the idea of a prompt is translated from NLP to segmentation, where a prompt can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task is then to return a valid segmentation mask given any prompt. The requirement of a "valid" mask simply means that even when a prompt is ambiguous and could refer to multiple objects (recall the shirt-and-person example, see Figure 3), the output should be a reasonable mask for at least one of those objects. This requirement is analogous to expecting a language model to output a coherent response to an ambiguous prompt. This task is chosen because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting (a small code sketch after this list shows a single, possibly ambiguous, point prompt in practice).
  • Pre-training. The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model's mask predictions against the ground truth. This approach is adopted from interactive segmentation [109, 70], although unlike interactive segmentation, where the aim is to eventually predict a valid mask after enough user input, the goal here is to always predict a valid mask for any prompt, even when the prompt is ambiguous. This ensures that the pre-trained model is effective in use cases involving ambiguity, including the automatic annotation required by the data engine of §4. Performing well on this task is challenging and requires specialized choices of modeling and training loss, discussed in §3.
  • Transfer. The pre-training task gives the model the ability to respond appropriately to any prompt at inference time, so downstream tasks can be solved by engineering suitable prompts. For example, given a bounding-box detector for cats, cat instance segmentation can be solved by feeding the detector's box outputs as prompts to the model. In general, a wide range of practical segmentation tasks can be cast as prompting. In addition to automatic dataset annotation, five different example tasks are explored in the experiments in §7.
  • Related tasks. Segmentation is a broad field: there is interactive segmentation [57, 109], edge detection [3], super-pixelization [85], object proposal generation [2], foreground segmentation [94], semantic segmentation [90], instance segmentation [66], panoptic segmentation [59], and more. The goal of the promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks through prompt engineering. This capability is a form of task generalization [26]. Note that this differs from previous work on multi-task segmentation systems. In a multi-task system, a single model performs a fixed set of tasks, such as joint semantic, instance, and panoptic segmentation [114, 19, 54], but the training and test tasks are the same. An important distinction in this work is that a model trained for promptable segmentation can serve as a component in a larger system that performs new, different tasks at inference time; for example, to perform instance segmentation, a promptable segmentation model is composed with an existing object detector.
  • Discussion. Prompting and composition are powerful tools that allow a single model to be used in extensible ways, potentially accomplishing tasks unknown at the time the model was designed. This approach is similar to how other foundation models are used; for example, CLIP [82] is the text-image alignment component of the DALL·E [83] image generation system. We anticipate that composable system design, powered by techniques such as prompt engineering, will enable a wider variety of applications than systems trained specifically for a fixed set of tasks. It is also interesting to compare promptable and interactive segmentation through the lens of composition: while interactive segmentation models are designed with a human user in the loop, a model trained for promptable segmentation can also be composed into a larger algorithmic system, as will be demonstrated.
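  To make the prompting idea concrete, below is a minimal sketch, assuming the publicly released `segment_anything` Python package; the checkpoint filename, image, and click coordinates are illustrative stand-ins. A single, possibly ambiguous, foreground click returns several candidate masks together with predicted IoU scores, matching the "valid mask" requirement described above.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (variant and filename are illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Placeholder RGB image; in practice load a real photo (HxWx3, uint8).
image = np.zeros((768, 1024, 3), dtype=np.uint8)
predictor.set_image(image)  # runs the heavy image encoder once

# One foreground click (label 1) that might be ambiguous (shirt vs. person).
point = np.array([[512, 384]])
label = np.array([1])

# multimask_output=True asks for several candidate masks plus IoU estimates.
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # the model's most confident valid mask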

3. The Segment Anything Model

Next, we describe the Segment Anything Model (SAM) for promptable segmentation. SAM has three components, illustrated in Figure 4: an image encoder, a flexible prompt encoder, and a fast mask decoder. The model builds on Transformer vision models [14, 33, 20, 62] with specific trade-offs for (amortized) real-time performance. These components are described at a high level here; see §A for details.

  • Image encoder. Motivated by scalability and powerful pre-training methods, a Vision Transformer (ViT) [33] pre-trained with MAE [47] and minimally adapted to process high-resolution inputs [62] is used. The image encoder runs once per image and can be applied before the model is prompted.
  • Prompt encoder. Two sets of prompts are considered: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings [95] summed with learned embeddings for each prompt type, and free-form text is represented with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
  • Mask decoder. The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. The design is inspired by [14, 20] and employs a modified Transformer decoder block [103] followed by a dynamic mask prediction head. The modified decoder block updates all embeddings using prompt self-attention and cross-attention in both directions (prompt-to-image embedding and vice versa). After running two blocks, the image embedding is upsampled, an MLP maps the output token to a dynamic linear classifier, and the mask foreground probability at each image location is computed.
  • Resolving ambiguity. With a single output, the model would average multiple valid masks when given an ambiguous prompt. To address this, the model is modified to predict multiple output masks for a single prompt (see Figure 3). Three mask outputs were found to be sufficient for most common cases (nested masks are often at most three levels deep: whole, part, and subpart). During training, backpropagation is performed only for the mask with minimum loss [15, 45, 64]. To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask.
  • Efficiency. The overall model design is largely driven by efficiency. Given a pre-computed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in about 50 milliseconds. This runtime enables seamless, real-time interactive prompting (a schematic sketch of this amortized flow appears after this list).
  • Losses and training. Mask prediction is supervised with the linear combination of focal loss [65] and dice loss [73] used in [14]. The promptable segmentation task is trained with a mixture of geometric prompts (see §7.5 for text prompts). Following [92, 37], an interactive setting is simulated by randomly sampling prompts in 11 rounds per mask, allowing SAM to integrate seamlessly into the data engine.
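  To make the encoder/decoder split and its amortization concrete, here is a deliberately tiny, schematic sketch in PyTorch. Everything in it (module choices, dimensions, the pooling shortcut) is an assumption for illustration, not the paper's actual architecture; the point is only that the expensive image encoder runs once per image, while the light prompt encoder and mask decoder re-run per prompt and emit three candidate masks with predicted IoU scores.

```python
import torch
import torch.nn as nn

class TinySAMSketch(nn.Module):
    """Schematic only: heavy image encoder, light prompt encoder + mask decoder."""

    def __init__(self, embed_dim=256):
        super().__init__()
        # Stand-in for the ViT image encoder (the expensive, run-once part).
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Sparse prompt (x, y, label) -> embedding; stand-in for positional encodings.
        self.prompt_encoder = nn.Linear(3, embed_dim)
        # Lightweight head predicting 3 low-res masks + 3 IoU scores per prompt.
        self.mask_decoder = nn.Sequential(
            nn.Linear(2 * embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64 + 3),
        )

    def encode_image(self, image):              # run once per image
        return self.image_encoder(image)         # (B, C, H/16, W/16)

    def decode(self, image_embedding, prompt):   # run per prompt, cheap
        img_vec = image_embedding.mean(dim=(2, 3))        # (B, C), pooled (schematic)
        prm_vec = self.prompt_encoder(prompt)              # (B, C)
        out = self.mask_decoder(torch.cat([img_vec, prm_vec], dim=-1))
        masks = out[:, :3 * 64 * 64].reshape(-1, 3, 64, 64)  # 3 candidate masks
        iou_scores = out[:, -3:]                              # predicted IoU per mask
        return masks, iou_scores

model = TinySAMSketch()
image = torch.rand(1, 3, 1024, 1024)
embedding = model.encode_image(image)            # amortized: reuse for many prompts
for click in [torch.tensor([[0.3, 0.4, 1.0]]), torch.tensor([[0.7, 0.2, 1.0]])]:
    masks, ious = model.decode(embedding, click)  # fast per-prompt decoding
```

  The design choice being illustrated is the split itself: because the prompt-dependent part is tiny, many prompts (interactive clicks, or the grid of points used by the data engine) can be served from one image embedding.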

4. The Segment Anything Data Engine

  Since segmentation masks are not abundant on the internet, a data engine was built to collect the 1.1B-mask dataset SA-1B. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage mixing automatically predicted masks with model-assisted annotation, and (3) a fully automatic stage in which the model generates masks without annotator input. Each is described in detail next.

  • Assisted-manual stage. In the first stage, similar to classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground/background object points using a browser-based interactive segmentation tool backed by SAM. Masks could be refined with pixel-precise "brush" and "eraser" tools. Model-assisted annotation ran in real time directly in the browser (using pre-computed image embeddings), enabling a truly interactive experience. No semantic constraints were imposed on annotated objects, and annotators freely labeled both "stuff" and "things" [1]. Annotators were asked to label objects they could name or describe, but those names or descriptions were not collected. Annotators were asked to label objects in order of prominence and were encouraged to move on to the next image once a mask took more than 30 seconds to annotate.
  • At the beginning of this stage, SAM was trained on common public segmentation datasets. After sufficient data had been annotated, SAM was retrained using only the newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total, the model was retrained 6 times. As the model improved, the average annotation time per mask dropped from 34 seconds to 14 seconds. We note that 14 seconds is 6.5x faster than mask annotation for COCO [66] and only 2x slower than bounding-box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44. Overall, 4.3 million masks were collected from 120,000 images in this stage.
  • Semi-automatic stage. In this stage, the goal was to increase mask diversity in order to improve the model's ability to segment anything. To focus annotators on less salient objects, confident masks were first detected automatically. Annotators were then shown images pre-filled with these masks and asked to annotate any additional unannotated objects. To detect confident masks, a bounding-box detector [84] was trained on all first-stage masks using a generic "object" category. In this stage, an additional 5.9 million masks were collected across 180k images (10.2 million masks in total). As in the first stage, the model was periodically retrained on newly collected data (5 times). The average annotation time per mask went back up to 34 seconds (excluding automatic masks) because the remaining objects were more challenging to label. The average number of masks per image rose from 44 to 72 (including automatic masks).
  • Fully automatic stage. In the final stage, annotation is fully automatic. This was feasible because of two major enhancements to the model. First, by the start of this stage, enough masks had been collected, including the diverse masks of the previous stage, to significantly improve the model. Second, by this stage the ambiguity-aware model had been developed, which can predict valid masks even in the presence of ambiguity. Specifically, the model is prompted with a 32×32 regular grid of points, and for each point a set of masks that may correspond to valid objects is predicted. With the ambiguity-aware model, if a point lies on a part or subpart, the model returns the subpart, the part, and the whole object. The model's IoU prediction module is used to select confident masks; in addition, only stable masks are kept (a mask is considered stable if thresholding its probability map at 0.5−δ and at 0.5+δ yields similar masks). Finally, after selecting confident and stable masks, non-maximum suppression (NMS) is applied to filter duplicates (a small sketch of this filtering appears below). To further improve the quality of smaller masks, multiple overlapping zoomed-in crops of the image are also processed. See §B for details on this stage. Fully automatic mask generation was applied to all 11 million images in the dataset, producing a total of 1.1 billion high-quality masks. The resulting dataset, SA-1B, is described and analyzed next.
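  A sketch of the filtering logic in this stage is shown below. The 32×32 point grid and the 0.5±δ stability test follow the description above; the concrete thresholds, function names, and the use of probability maps rather than raw logits are assumptions for illustration (the released automatic mask generator in the `segment_anything` package exposes analogous thresholds).

```python
import numpy as np

def stability_score(prob_map, delta=0.05):
    """IoU between the mask thresholded at 0.5 - delta and at 0.5 + delta.
    A mask is 'stable' when small threshold changes barely alter it."""
    loose = prob_map > (0.5 - delta)
    tight = prob_map > (0.5 + delta)
    inter = np.logical_and(loose, tight).sum()
    union = np.logical_or(loose, tight).sum()
    return inter / max(union, 1)

def point_grid(n_per_side=32):
    """32x32 regular grid of foreground point prompts in normalized [0, 1] coords."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)   # (1024, 2)

def keep_mask(pred_iou, prob_map, iou_thresh=0.88, stab_thresh=0.95):
    """Keep a candidate only if the model's own IoU prediction is high AND the mask
    is stable; duplicates among the kept masks are then removed with NMS."""
    return pred_iou >= iou_thresh and stability_score(prob_map) >= stab_thresh

grid = point_grid()             # 1024 candidate prompts per image
demo = np.random.rand(64, 64)   # stand-in probability map
print(stability_score(demo), keep_mask(0.9, demo))
```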

5. Dataset

  The proposed dataset, SA-1B, consists of 11M diverse, high-resolution, licensed, privacy-protecting images and 1.1B high-quality segmentation masks collected with the data engine. SA-1B is compared to existing datasets, and mask quality and properties are analyzed. SA-1B is released to aid future development of foundation models for computer vision. Note that SA-1B is released under a favorable license agreement for certain research uses and with protections for researchers.

  • Images. A new set of 11 million images was licensed from a provider that works directly with photographers. The images are high resolution (3300×4950 pixels on average), and the resulting data size can pose accessibility and storage challenges. Therefore, downsampled images with the shortest side set to 1500 pixels are released. Even after downsampling, the images are of significantly higher resolution than those of many existing vision datasets (e.g., COCO [66] images are roughly 480×640 pixels). Note that most models today operate on much lower-resolution inputs. Faces and license plates are blurred in the released images.
  • Masks. The data engine produced 1.1 billion masks, 99.1% of which were generated fully automatically. The quality of the automatic masks is therefore crucial. They are compared directly to professional annotations, and various mask properties are examined against prominent segmentation datasets. The main conclusion (borne out in the analysis below and the experiments in §7) is that the automatic masks are of high quality and effective for training models. Motivated by these findings, SA-1B includes only automatically generated masks.
  • Mask quality. To estimate mask quality, 500 images (roughly 50k masks) were sampled at random, and professional annotators were asked to improve the quality of all masks in these images. Annotators did so using the model together with pixel-precise "brush" and "eraser" editing tools. This procedure yields pairs of automatically predicted and professionally corrected masks. The IoU between each pair was computed, and 94% of pairs have an IoU greater than 90% (97% have an IoU greater than 75%). For comparison, prior work estimates inter-annotator agreement at 85-91% IoU [44, 60]. The experiments in §7 confirm via human ratings that mask quality is high relative to a variety of datasets, and that training the model on automatic masks is almost as good as using all masks produced by the data engine.
  • Mask properties. In Figure 5, the spatial distribution of object centers in SA-1B is plotted and compared with the largest existing segmentation datasets. Common photographer biases are present in all datasets. SA-1B has greater coverage of image corners compared to LVIS v1 [44] and ADE20K [117], the two datasets with the most similar distributions, while COCO [66] and Open Images V5 [60] show a more prominent center bias. In Figure 6 (legend), these datasets are compared by size. SA-1B has 11x more images and 400x more masks than Open Images, the next largest dataset. On average, it has 36x more masks per image than Open Images. The closest dataset in this respect, ADE20K, still has 3.5x fewer masks per image. Figure 6 (left) plots the masks-per-image distribution. Next, Figure 6 (middle) looks at the size of masks relative to the image (the square root of the mask area divided by the image area). As expected, since the dataset has more masks per image, it also tends to include a greater proportion of small and medium relative-size masks. Finally, to analyze shape complexity, Figure 6 (right) looks at mask concavity (1 minus the ratio of mask area to the area of the mask's convex hull). Since shape complexity is correlated with mask size, the datasets' mask-size distributions are controlled for by first performing stratified sampling from binned mask sizes. The concavity distribution of SA-1B's masks is broadly similar to that of the other datasets (a small sketch of these two statistics follows).
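  For reference, the two shape statistics used above are easy to compute from a binary mask. Below is a small sketch: OpenCV is used for the convex hull, and the helper names and demo mask are illustrative.

```python
import numpy as np
import cv2

def relative_size(mask):
    """sqrt(mask area / image area): a normalized linear measure of mask size."""
    return float(np.sqrt(mask.sum() / mask.size))

def concavity(mask):
    """1 - mask area / convex-hull area: 0 for convex shapes, larger when concave."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    points = np.concatenate(contours, axis=0)
    hull_area = cv2.contourArea(cv2.convexHull(points))
    return 1.0 - mask.sum() / max(hull_area, 1.0)

demo = np.zeros((100, 100), dtype=bool)
demo[20:80, 20:80] = True       # a 60x60 square in a 100x100 image
demo[40:60, 40:80] = False      # carve a notch to make it concave
print(relative_size(demo), concavity(demo))
```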

6. RAI Analysis

  A Responsible AI (RAI) analysis of the work is conducted by investigating potential fairness concerns and biases when using SA-1B and SAM. The focus is on the geographic and income distribution of SA-1B and the fairness of SAM across protected attributes of people. Dataset, data annotation, and model cards are also provided in §F.

  • Geographic and income representation. The countries the images were photographed in are inferred using standard methods (see §C). In Figure 7, the number of images per country in SA-1B is visualized (left) along with the 50 countries with the most images (right). Notably, the top three countries are from different parts of the world. Next, in Table 1, the geographic and income representation of SA-1B, COCO [66], and Open Images [60] are compared. SA-1B has a substantially higher proportion of images from Europe, Asia, and Oceania, as well as from middle-income countries. Africa and low-income countries are under-represented in all datasets. We note that in SA-1B, all regions, including Africa, have at least 28 million masks, 10x more than the total number of masks in any previous dataset. The average number of masks per image (not shown) is fairly consistent across regions and income levels (94-108 per image).
  • Fairness in segmenting people. Potential fairness concerns across perceived gender presentation, perceived age group, and perceived skin tone were investigated by measuring differences in SAM's performance between groups. The More Inclusive Annotations for People (MIAP) [87] dataset is used for gender presentation and age, and a proprietary dataset is used for skin tone (see §C). The evaluation uses simulated interactive segmentation with randomly sampled 1 and 3 points (see §D). Table 2 (top left) shows results for perceived gender presentation. Although women have been shown to be under-represented in detection and segmentation datasets [115], SAM performs similarly across groups. The analysis is repeated for perceived age in Table 2 (bottom left), noting that those perceived as younger or older have been shown to be under-represented in large-scale datasets [110]. SAM performs best on those perceived as older (although the confidence interval is large). Finally, the analysis is repeated for perceived skin tone in Table 2 (right), noting that in large-scale datasets people with apparently lighter skin tones have been shown to be over-represented and those with darker skin tones under-represented [110]. As MIAP does not contain perceived skin tone annotations, a proprietary dataset with annotations of perceived Fitzpatrick skin type [36] is used, ranging from 1 (lightest skin tone) to 6 (darkest skin tone). While the means differ slightly, no significant difference is found across groups. The authors believe these findings stem from the nature of the task and acknowledge that biases may arise when SAM is used as a component of a larger system. Finally, in §C the analysis is extended to segmenting clothing, where signs of bias across perceived gender presentation are found.

7. Zero-Shot Transfer Experiments

  This section presents the zero-shot transfer experiments with SAM, the Segment Anything Model. Five tasks are considered, four of which differ significantly from the promptable segmentation task used to train SAM. These experiments evaluate SAM on datasets and tasks not seen during training (the usage of "zero-shot transfer" follows that of CLIP [82]). The datasets may include novel image distributions, such as underwater or ego-centric images (Figure 8), which to the best of our knowledge do not appear in SA-1B.
  The experiments begin by testing the core goal of promptable segmentation: producing a valid mask from any prompt. The challenging scenario of a single foreground point prompt is emphasized, since it is more likely to be ambiguous than other, more specific prompts. Next, a sequence of experiments is presented that traverses low-, mid-, and high-level image understanding and roughly parallels the historical development of the field: SAM is prompted to (1) perform edge detection, (2) segment everything, i.e., object proposal generation, (3) segment detected objects, i.e., instance segmentation, and (4), as a proof of concept, segment objects from free-form text. These four tasks differ significantly from the promptable segmentation task SAM was trained on and are implemented via prompt engineering. The experiments conclude with an ablation study.
  Implementation. Unless otherwise specified: (1) SAM uses an MAE [47] pre-trained ViT-H [33] image encoder, and (2) SAM is trained on SA-1B, noting that this dataset includes only the automatically generated masks from the final stage of the data engine. For all other model and training details, such as hyperparameters, see §A.

7.1. Zero-Shot Single-Point Valid Mask Evaluation

  • Task. Segmenting an object from a single foreground point is evaluated. This task is ill-posed because one point can refer to multiple objects. Ground-truth masks in most datasets do not enumerate all possible masks, which can make automatic metrics unreliable. Therefore, the standard mIoU metric (the mean of all IoUs between predicted and ground-truth masks) is supplemented with a human study in which annotators rate mask quality from 1 (nonsense) to 10 (pixel-perfect). See §D.1, with additional details in §E and §G.
  • By default, points are sampled from the "center" of the ground-truth mask (the maximum of the mask's interior distance transform), following the standard evaluation protocol in interactive segmentation [92] (a sketch of this center-point sampling appears at the end of this subsection). Since SAM can predict multiple masks, by default only the model's most confident mask is evaluated. The baselines are all single-mask methods. The main comparison is against RITM [92], a strong interactive segmenter that performs best on our benchmark compared with other strong baselines [67, 18].
  • Datasets. A newly compiled suite of 23 datasets with diverse image distributions is used. Figure 8 lists the datasets and shows a sample from each (see appendix Table 7 for more details). All 23 datasets are used for the mIoU evaluation. For the human study, the subset listed in Figure 9b is used (due to the resource demands of such studies). This subset includes both datasets on which SAM outperforms RITM according to automatic metrics and datasets on which it underperforms.
  • Results. First, automatic evaluation with mIoU on the full suite of 23 datasets is examined. Per-dataset results are compared against RITM in Figure 9a. SAM achieves higher results on 16 of the 23 datasets, by as much as roughly 47 IoU. An "oracle" result is also presented, in which the most relevant of SAM's 3 masks is selected by comparing against the ground truth, rather than picking the most confident mask. This reveals the impact of ambiguity on automatic evaluation. In particular, with the oracle performing ambiguity resolution, SAM outperforms RITM on all datasets.
  • Results of the human study are shown in Figure 9b. Error bars are 95% confidence intervals of the mean mask rating (all differences are significant; see §E for details). Annotators consistently rate SAM's mask quality substantially higher than the strongest baseline, RITM. An ablated version of SAM with a single output mask rates consistently lower, though still higher than RITM. SAM's mean ratings fall between 7 and 9, which corresponds to the qualitative rating guideline: "A high score (7-9): The object is identifiable and errors are small and rare (e.g., missing a small, heavily obscured disconnected component, ...)." These results indicate that SAM has learned to segment valid masks from a single point. Note that for datasets like DRAM and IBD, where SAM is worse according to automatic metrics, it consistently receives higher ratings in the human study.
    Figure 9c shows additional baselines, SimpleClick [67] and FocalClick [18], which obtain lower single-point performance than RITM and SAM. As the number of points increases from 1 to 9, the gap between methods shrinks. This is expected, as the task becomes easier; moreover, SAM is not optimized for the very high IoU regime. Finally, in Figure 9d, the default center-point sampling is replaced with random point sampling. We observe that the gap between SAM and the baselines grows, and that SAM achieves comparable results under either sampling method.
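  For concreteness, below is a minimal sketch of the center-point sampling used in this protocol (the click is placed at the interior point farthest from the mask boundary, i.e., the arg-max of the distance transform) together with the IoU that mIoU averages. The helper names and demo mask are illustrative.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def center_point(gt_mask):
    """Evaluation click: the interior point farthest from the mask boundary."""
    dist = distance_transform_edt(gt_mask)
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return x, y

def iou(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)

gt = np.zeros((100, 100), dtype=bool)
gt[30:70, 30:70] = True
print(center_point(gt))   # roughly the square's center
# mIoU over a dataset is the mean of iou(pred, gt) across all evaluated masks.
```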

7.2. Zero-Shot Edge Detection

  • Approach. SAM is evaluated on the classic low-level task of edge detection using BSDS500 [72, 3]. A simplified version of the automatic mask generation pipeline is used: SAM is prompted with a 16×16 regular grid of foreground points, yielding 768 predicted masks (3 per point). Redundant masks are removed by NMS. Edge maps are then computed with Sobel filtering of the unthresholded mask probability maps and standard lightweight post-processing, including edge NMS (see §D.2; a sketch of the Sobel step appears at the end of this subsection).
  • Results. Representative edge maps are visualized in Figure 10 (see Figure 15 for more). Qualitatively, even though SAM was not trained for edge detection, it produces reasonable edge maps. Compared with the ground truth, SAM predicts more edges, including sensible ones not annotated in BSDS500. This bias shows up quantitatively in Table 3: recall at 50% precision (R50) is high, at the cost of precision. SAM naturally lags behind state-of-the-art methods that learn the biases of BSDS500, i.e., which edges to suppress. Nevertheless, SAM performs well compared with pioneering deep-learning methods such as HED [108] (also trained on BSDS500) and significantly better than prior, though admittedly dated, zero-shot transfer methods.
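  To illustrate the Sobel step, here is a minimal sketch that turns unthresholded mask probability maps into a single edge map. Edge NMS and the other post-processing mentioned above are omitted, and combining per-mask gradients with a per-pixel maximum is a simplification of my own.

```python
import numpy as np
import cv2

def masks_to_edges(prob_maps):
    """prob_maps: (N, H, W) unthresholded mask probabilities from the point grid.
    Sobel-filter each map and take the per-pixel maximum as a simple edge map."""
    edges = np.zeros(prob_maps.shape[1:], dtype=np.float32)
    for p in prob_maps.astype(np.float32):
        gx = cv2.Sobel(p, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(p, cv2.CV_32F, 0, 1, ksize=3)
        edges = np.maximum(edges, np.hypot(gx, gy))
    return edges / max(edges.max(), 1e-6)   # normalize to [0, 1]

demo = np.zeros((1, 64, 64), dtype=np.float32)
demo[0, 16:48, 16:48] = 1.0                 # one crisp square "mask"
edge_map = masks_to_edges(demo)             # strong response along the square border
```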

7.3. Zero-Shot Object Proposals

  • Approach. Next, SAM is evaluated on the mid-level task of object proposal generation [2, 102]. This task has played an important role in object detection research, serving as an intermediate step in pioneering systems (e.g., [102, 41, 84]). To generate object proposals, a slightly modified version of the automatic mask generation pipeline is run, and the masks are output as proposals (see §D.3).
  • The standard average recall (AR) metric is computed on LVIS v1 [44]. LVIS is the focus because its large number of categories presents a challenging test. The comparison is against a strong baseline implemented as a ViTDet [62] detector (with cascade Mask R-CNN [48, 11] and ViT-H). We note that this "baseline" corresponds to the "detector masquerading as a proposal generator" (DMP) method [16] that was shown to game AR, making it a genuinely demanding comparison (a simplified AR computation is sketched after this list).
  • Results. In Table 4, unsurprisingly, using the detections of ViTDet-H as object proposals (i.e., the AR-gaming DMP method [16]) performs best overall. However, SAM does remarkably well on several metrics. Notably, it outperforms ViTDet-H on medium and large objects, as well as on rare and common objects. In fact, SAM only underperforms ViTDet-H on small and frequent objects, where ViTDet-H can easily learn LVIS-specific annotation biases since, unlike SAM, it was trained on LVIS. A comparison is also made against an ambiguity-unaware version of SAM ("single out."), which performs significantly worse than SAM on all AR metrics.
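  For readers unfamiliar with the metric, below is a simplified sketch of mask-level average recall (AR). The real evaluation uses the official LVIS tooling with caps on the number of proposals per image, so this only conveys the definition; the function names and the toy demo are illustrative.

```python
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / max(union, 1)

def average_recall(proposals, gt_masks, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Simplified mask AR: for each ground-truth mask take its best IoU over all
    proposals, then average recall over IoU thresholds 0.5:0.05:0.95."""
    best = np.array([max((mask_iou(p, g) for p in proposals), default=0.0)
                     for g in gt_masks])
    return float(np.mean([(best >= t).mean() for t in thresholds]))

gt = np.zeros((2, 8, 8), dtype=bool)
gt[0, :4, :4] = True
gt[1, 4:, 4:] = True
props = list(gt.copy())               # perfect proposals -> AR = 1.0
print(average_recall(props, list(gt)))
```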

7.4. Zero-Shot Instance Segmentation

  • Approach. Moving to higher-level vision, SAM is used as the segmentation module of an instance segmenter. The implementation is simple: an object detector (the ViTDet used previously) is run, and SAM is prompted with its output boxes. This illustrates how SAM can be composed into a larger system (a sketch of this composition appears at the end of this subsection).
  • Results. The masks predicted by SAM and ViTDet are compared on COCO and LVIS in Table 5. Looking at the mask AP metric, there is a gap on both datasets: SAM is reasonably close but clearly behind ViTDet. By visualizing the outputs, we observe that SAM's masks are often qualitatively better than ViTDet's, with crisper boundaries (see §D.4 and Figure 16). To investigate this observation, an additional human study was conducted, asking annotators to rate ViTDet masks and SAM masks on the 1-to-10 quality scale used previously. In Figure 11, SAM consistently outperforms ViTDet in the human study.
    We hypothesize that on COCO, where the mask AP gap is larger and the ground truth is of relatively low quality (as confirmed by the human study), ViTDet learns the specific biases of COCO masks. SAM, being a zero-shot method, cannot exploit these (generally undesirable) biases. The LVIS dataset has higher-quality ground truth, but it still has specific idiosyncrasies (e.g., masks do not contain holes and are simple polygons by construction) and a bias toward modal rather than amodal masks. Again, SAM is not trained to learn these biases, whereas ViTDet can exploit them.
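  A sketch of this detector-plus-SAM composition is shown below, assuming the released `segment_anything` package. The detector itself is abstracted away (the example boxes are hard-coded stand-ins, and the checkpoint path is illustrative); in the paper's setup the boxes would come from ViTDet.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # illustrative path
predictor = SamPredictor(sam)

def segment_detections(image, boxes_xyxy):
    """image: HxWx3 uint8 RGB; boxes_xyxy: (N, 4) boxes from any detector.
    Returns one binary mask per box: the SAM half of a detector+SAM instance segmenter."""
    predictor.set_image(image)          # image encoder runs once
    masks = []
    for box in boxes_xyxy:
        m, scores, _ = predictor.predict(box=box[None, :], multimask_output=False)
        masks.append(m[0])
    return np.stack(masks)

# Hypothetical stand-ins for a real photo and detector output:
image = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = np.array([[100, 100, 300, 400]], dtype=np.float32)
instance_masks = segment_detections(image, boxes)
```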

7.5. Zero-Shot Text-to-Mask

  • Approach. Finally, an even higher-level task is considered: segmenting objects from free-form text. This experiment is a proof of concept of SAM's ability to process text prompts. While the exact same SAM was used in all previous experiments, for this one SAM's training procedure is modified to make it text-aware, but in a way that does not require new text annotations. Specifically, for every manually collected mask with an area larger than 100^2, the CLIP image embedding of its crop is extracted. Then, during training, the extracted CLIP image embedding is used as SAM's first interaction. The key observation is that because CLIP's image embeddings are trained to align with its text embeddings, training can use image embeddings while inference uses text embeddings. That is, at inference time the text is run through CLIP's text encoder and the resulting text embedding is given to SAM as a prompt (see §D.5 for details; a small sketch of this trick follows the results below).
  • Results. Qualitative results are shown in Figure 12. SAM can segment objects based on simple text prompts such as "a wheel" as well as phrases such as "beaver tooth grille". When SAM fails to pick the right object from a text prompt alone, an additional point prompt often fixes the prediction, similar to [31].
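  The embedding swap at the heart of this experiment can be sketched with OpenAI's `clip` package. The CLIP variant and helper names are assumptions, and the final step of handing the embedding to SAM's prompt encoder is left as a comment, since this text-aware variant is exploratory and not part of the standard SAM usage shown earlier.

```python
import torch
import clip                      # OpenAI's CLIP package (assumed available)
from PIL import Image

model, preprocess = clip.load("ViT-L/14", device="cpu")  # CLIP variant is an assumption
model.eval()

def training_prompt(mask_crop: Image.Image) -> torch.Tensor:
    """Training time: embed the image crop around a mask with CLIP's image encoder
    and use it as the prompt embedding (no text annotations needed)."""
    with torch.no_grad():
        return model.encode_image(preprocess(mask_crop).unsqueeze(0))

def inference_prompt(text: str) -> torch.Tensor:
    """Inference time: swap in the CLIP text embedding, which lives in the same
    aligned space, and hand it to SAM's prompt encoder (schematic step)."""
    with torch.no_grad():
        return model.encode_text(clip.tokenize([text]))

text_embedding = inference_prompt("a wheel")
# text_embedding would then be fed to SAM in place of the CLIP image embedding prompt.
```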

7.6. Ablation Studies

  Several ablations are performed on the suite of 23 datasets with the single center point prompt protocol. Recall that a single point may be ambiguous and that the ambiguity may not be represented in the ground truth, which contains only one mask per point. Since SAM operates in a zero-shot transfer setting, there may be systematic biases between SAM's top-ranked mask and the masks produced by the data annotation guidelines. We therefore additionally report the best mask with respect to the ground truth ("oracle").
  Figure 13 (left) plots SAM's performance when trained on cumulative data from the data engine stages. We observe that mIoU increases with each stage. When training with all three stages, the automatic masks vastly outnumber the manual and semi-automatic masks. To address this, we found that oversampling the manual and semi-automatic masks by 10x during training gave the best results. This setup complicates training, so a fourth setup was tested that uses only the automatically generated masks. With this data, SAM performs only slightly worse (about 0.5 mIoU) than with all data. Therefore, by default only the automatically generated masks are used, to simplify the training setup.
  Figure 13 (middle) looks at the impact of data volume. The full SA-1B contains 11M images, which were uniformly subsampled to 1M and 0.1M for this ablation. With 0.1M images, a large drop in mIoU is observed under all settings. However, with 1M images, about 10% of the full dataset, results comparable to using the full dataset are observed. This data regime, which still includes roughly 100M masks, may be a practical setting for many use cases.
  Finally, Figure 13 (right) shows results with the ViT-B, ViT-L, and ViT-H image encoders. ViT-H improves substantially over ViT-B, but offers only marginal gains over ViT-L. Further image encoder scaling does not appear fruitful at this time.

8. Discussion

  • Foundation models. Pre-trained models have been adapted to downstream tasks since the early days of machine learning [99]. This paradigm has become increasingly important in recent years with a growing emphasis on scale, and such models have recently been (re-)branded "foundation models": that is, "models trained on broad data at scale and adaptable to a wide range of downstream tasks" [8]. This work correlates well with that definition, though we note that a foundation model for image segmentation is inherently limited in scope, since it covers an important yet small subset of computer vision. One aspect of the approach also contrasts with [8], which emphasizes the role of self-supervised learning in foundation models. While the model is initialized with a self-supervised technique (MAE [47]), the vast majority of its capabilities come from large-scale supervised training. In cases where a data engine can scale up the available annotations, as here, supervised training offers an effective solution.
  • Composition. Pre-trained models can power new capabilities even beyond those imagined at training time. A prominent example is how CLIP [82] is used as a component in larger systems such as DALL·E [83]. Our goal is to make this kind of composition straightforward with SAM. The paper aims to achieve this by requiring SAM to predict a valid mask for a wide range of segmentation prompts. The effect is to create a reliable interface between SAM and other components. For example, MCC [106] can readily use SAM to segment an object of interest and achieve strong generalization to unseen objects for 3D reconstruction from a single RGB-D image. In another example, SAM can be prompted with gaze points detected by a wearable device, enabling new applications. Thanks to SAM's ability to generalize to new domains, such as ego-centric images, such systems can work without additional training.
  • Limitations. While SAM performs well overall, it is not perfect. It can miss fine structures, sometimes hallucinates small disconnected components, and does not produce boundaries as crisp as more computation-intensive methods that "zoom in" (e.g., [18]). In general, dedicated interactive segmentation methods are expected to outperform SAM when many points are provided, e.g., [67]. Unlike these methods, SAM is designed for generality and breadth of use rather than for high-IoU interactive segmentation. Moreover, SAM can process prompts in real time, but its overall throughput is not real-time when a large image encoder is used. The foray into the text-to-mask task is exploratory and not entirely robust, although we believe it can be improved with more effort. While SAM can perform many tasks, it is unclear how to design simple prompts that implement semantic and panoptic segmentation. Finally, there are domain-specific tools, such as [7], that are expected to outperform SAM in their respective domains.
  • Conclusion. The Segment Anything project is an attempt to lift image segmentation into the era of foundation models. The principal contributions are a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that make this leap possible. Whether SAM achieves the status of a foundation model remains to be seen from how the community uses it, but regardless, we expect the perspective of this work, the release of over 1B masks, and the promptable segmentation model to help pave the path ahead.
