
【Paper translation】- Segment Anything / Model / SAM paper

Paper link:

Code link: https://github.com/facebookresearch/segment-anything

Paper translation:

Abstract

This paper presents the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluated its capabilities on numerous tasks and found its zero-shot performance to be impressive, often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and the corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

1 Introduction

Large language models pre-trained on web-scale datasets are revolutionizing NLP with powerful zero-shot and few-shot generalization [10]. These "foundation models" [8] can generalize beyond the tasks and data distributions seen during training. This capability is often achieved through prompt engineering, where hand-crafted text is used to prompt the language model to generate a valid textual response for the task at hand. When scaled and trained with abundant text corpora from the web, the zero- and few-shot performance of these models compares surprisingly well to (and in some cases even matches) fine-tuned models [10, 21]. Empirical trends show this behavior improving with model scale, dataset size, and total training compute [56, 10, 21, 51].

Foundation models have also been explored, albeit to a lesser extent, in computer vision. Perhaps the most prominent examples align pairs of text and images from the web. For example, CLIP [82] and ALIGN [55] use contrastive learning to train text and image encoders that align the two modalities. After training, engineered text prompts enable zero-shot generalization to new visual concepts and data distributions. Such encoders can also be composed effectively with other modules to enable downstream tasks such as image generation (e.g., DALL·E [83]). While much progress has been made on vision and language encoders, computer vision encompasses a wide range of problems beyond this scope, and for many of them abundant training data does not exist.

The goal of this paper is to build a foundation model for image segmentation. That is, we seek to develop a promptable model and pre-train it on a broad dataset using a task that enables powerful generalization. With this model, we aim to solve a range of downstream segmentation problems on new data distributions using prompt engineering.

The success of this plan hinges on three components: task, model, and data. To develop them, we address the following questions about image segmentation.

1. What task will enable zero-shot generalization?

2. What is the corresponding model architecture?

3. What data can support this task and model?

These questions are entangled and require a comprehensive solution. We start by defining a promptable segmentation task that is general enough to provide a powerful pre-training objective and to enable a wide range of downstream applications. This task requires a model that supports flexible prompting and can output segmentation masks in real time when prompted, to allow interactive use. To train our model, we need a diverse, large-scale source of data. Unfortunately, there is no web-scale source of segmentation data; to solve this problem, we build a "data engine": we iterate between using our efficient model to assist in data collection and using the newly collected data to improve the model. We introduce each interconnected component next, followed by the dataset we created and the experiments that demonstrate the effectiveness of our approach.

Task (§2). In NLP and, more recently, computer vision, foundation models are a promising development that enables zero-shot and few-shot learning on new datasets and tasks, often by using "prompting" techniques. Inspired by this line of work, we propose the promptable segmentation task, where the goal is to return a valid segmentation mask given any segmentation prompt (see Figure 1a). A prompt simply specifies what to segment in an image; for example, a prompt can include spatial or textual information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (e.g., a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. We use the promptable segmentation task both as a pre-training objective and to solve general downstream segmentation tasks via prompt engineering.

Model (§3). The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. In particular, the model must support flexible prompts, must compute masks in amortized real time to allow interactive use, and must be ambiguity-aware. A simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and the two sources of information are then combined in a lightweight mask decoder that predicts segmentation masks. We call this model the Segment Anything Model, or SAM for short (see Figure 1b). By separating SAM into an image encoder and a fast prompt encoder/mask decoder, the same image embedding can be reused (with its cost amortized) across different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in about 50 ms in a web browser. We focus on point, box, and mask prompts, and also present preliminary results with free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, allowing SAM to naturally handle ambiguity, such as the shirt-versus-person example.
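To make the encoder/decoder split concrete, here is a minimal usage sketch based on the publicly released segment-anything package; the checkpoint filename, image path, and click coordinates are placeholders, not values from the paper:

```python
# Hedged usage sketch of the released segment-anything package.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)               # the heavy image encoder runs once per image

point = np.array([[500, 375]])           # (x, y) pixel coordinate of a click (placeholder)
label = np.array([1])                    # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,               # return 3 candidate masks to handle ambiguity
)
best_mask = masks[np.argmax(scores)]     # pick the mask with the highest predicted IoU
```

Because the image embedding is computed once by `set_image`, further `predict` calls with different prompts only pay the lightweight decoder cost.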


Data Engine (§4). To achieve strong generalization to new data distributions, we need to train SAM on a large and diverse set of masks, beyond any segmentation dataset that already exists. While a typical approach for foundation models is to obtain data online [82], masks are not naturally abundant, so we need an alternative strategy. Our solution is to build a "data engine", i.e., we co-develop our model with model-in-the-loop dataset annotation (see Figure 1c). Our data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM automatically generates masks for a subset of objects by prompting it with likely object locations, and annotators focus on annotating the remaining objects, which helps increase mask diversity. In the final stage, we prompt SAM with a regular grid of foreground points, yielding on average about 100 high-quality masks per image.


Dataset (§5). Our final dataset, SA-1B, includes more than 1B masks from 11M licensed and privacy-preserving images (see Figure 2). SA-1B, collected fully automatically using the final stage of our data engine, has 400× more masks than any existing segmentation dataset [66, 44, 117, 60], and as we verify extensively, the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, we hope SA-1B becomes a valuable resource for research aimed at building new foundation models.

Responsible AI (§6). We study and report on potential fairness concerns and biases when using SA-1B and SAM. Images in SA-1B span a geographically and economically diverse set of countries, and we found that SAM performs similarly across different groups of people. Together, we hope this will make our work more equitable for real-world use cases. We provide model and dataset cards in the appendix.

Experiments (§7). We extensively evaluate SAM. First, using a diverse new suite of 23 segmentation datasets, we find that SAM produces high-quality masks from a single foreground point, often only slightly below that of the manually annotated ground truth. Second, using a zero-shot transfer protocol with prompt engineering, we find consistently strong quantitative and qualitative results on a variety of downstream tasks, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM can be used out of the box with prompt engineering to solve a variety of tasks involving object and image distributions beyond SAM's training data. Nevertheless, as we discuss in §8, there is still room for improvement.

Release. We release the SA-1B dataset for research purposes and make SAM available under a permissive open license (Apache 2.0) at https://segment-anything.com. We also showcase SAM's capabilities with an online demo.

2. Segment Anything Task

We take inspiration from NLP, where the next-token prediction task is used for foundation model pre-training and to solve diverse downstream tasks via prompt engineering [10]. To build a foundation model for segmentation, we aim to define a task with analogous capabilities.


Task. We begin by translating the idea of a prompt from NLP to segmentation, where a prompt can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task, then, is to return a valid segmentation mask given any prompt. The requirement of a "valid" mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, recall the shirt vs. person example; see Figure 3), the output should be a reasonable mask for at least one of those objects. This requirement is similar to expecting a language model to output a coherent response to an ambiguous prompt. We choose this task because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.

Pre-training. The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model's mask predictions against the ground truth. We adapt this method from interactive segmentation [109, 70], although unlike interactive segmentation, whose aim is to eventually predict a valid mask after enough user input, our aim is to always predict a valid mask for any prompt, even when the prompt is ambiguous. This ensures that a pre-trained model is effective in use cases that involve ambiguity, including the automatic annotation required by our data engine (§4). We note that performing well at this task is challenging and requires specialized modeling and training loss choices, which we discuss in §3.

Zero-shot transfer. The pre-training task endows the model with the ability to respond appropriately to any prompt at inference time, so downstream tasks can be solved by engineering appropriate prompts. For example, if we have a bounding box detector for cats, cat instance segmentation can be solved by providing the detector's box output as a prompt to our model. In general, a wide array of practical segmentation tasks can be cast as prompting. In addition to automatic dataset labeling, we explore five diverse example tasks in the experiments in §7.

Related tasks. Segmentation is a broad field: there is interactive segmentation [57, 109], edge detection [3], superpixelization [85], object proposal generation [2], foreground segmentation [94], semantic segmentation [90], instance segmentation [66], panoptic segmentation [59], and more. The goal of our promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks via prompt engineering. This capability is a form of task generalization [26]. Note that this differs from previous work on multi-task segmentation systems. In a multi-task system, a single model performs a fixed set of tasks, such as joint semantic, instance, and panoptic segmentation [114, 19, 54], but the training and test tasks are the same. An important distinction in our work is that a model trained for promptable segmentation can be used as a component in a larger system to perform new, different tasks at inference time; for example, instance segmentation can be performed by combining a promptable segmentation model with an existing object detector.

Discussion. Prompting and composition are powerful tools that enable a single model to be used in extensible ways, potentially accomplishing tasks unknown at the time of model design. This approach is analogous to how other foundation models are used, e.g., how CLIP [82] is the text-image alignment component of the DALL·E [83] image generation system. We anticipate that composable system design, powered by techniques such as prompt engineering, will enable a wider variety of applications than systems trained specifically for a fixed set of tasks. It is also interesting to compare promptable and interactive segmentation through the lens of composition: while interactive segmentation models are designed with human users in mind, a model trained for promptable segmentation can also be composed into a larger algorithmic system, as we will demonstrate.

3. Segment Anything Model

Next, we describe the Segment Anything Model (SAM) for promptable segmentation. SAM has three components, illustrated in Figure 4: an image encoder, a flexible prompt encoder, and a fast mask decoder. We build on Transformer vision models [14, 33, 20, 62] with specific trade-offs for (amortized) real-time performance. We describe these components at a high level here; see §A for details.


Image encoder. Motivated by scalability and powerful pre-training methods, we use an MAE [47] pre-trained Vision Transformer (ViT) [33] minimally adapted to process high-resolution inputs [62]. The image encoder runs once per image and can be applied prior to prompting the model.

Prompt encoder. We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type, and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
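As an illustration of how sparse point prompts can be embedded, the following is a simplified sketch assuming random Fourier-feature positional encodings in the spirit of [95] plus a learned embedding per point type; the module, dimensions, and frequencies are illustrative and not taken from the paper:

```python
import math
import torch
import torch.nn as nn

class SparsePromptEncoderSketch(nn.Module):
    """Illustrative sketch (not the paper's module): embed clicked points as
    random Fourier-feature positional encodings plus a learned embedding per
    point type (background / foreground)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Random frequencies projecting (x, y) coordinates, in the spirit of [95].
        self.register_buffer("freqs", torch.randn(2, embed_dim // 2))
        self.type_embed = nn.Embedding(2, embed_dim)  # 0: background, 1: foreground

    def forward(self, points: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # points: (N, 2) in [0, 1] normalized image coordinates; labels: (N,) in {0, 1}
        proj = 2 * math.pi * points @ self.freqs             # (N, embed_dim // 2)
        pos = torch.cat([proj.sin(), proj.cos()], dim=-1)    # (N, embed_dim)
        return pos + self.type_embed(labels)
```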

Mask decoder. The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design, inspired by [14, 20], employs a modification of a Transformer decoder block [103] followed by a dynamic mask prediction head. Our modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice versa) to update all embeddings. After running two blocks, we upsample the image embedding, and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.

Resolving ambiguity. With one output, the model will average multiple valid masks if given an ambiguous prompt. To address this, we modify the model to predict multiple output masks for a single prompt (see Figure 3). We found 3 mask outputs sufficient to address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, we backpropagate only the minimum loss over the masks [15, 45, 64]. To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask.
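A hedged sketch of this ambiguity-aware supervision is shown below: the mask loss is computed for each of the K predicted masks, only the minimum is backpropagated, and the IoU head is regressed toward each mask's actual IoU. The use of a mean-squared-error term and its unit weighting are assumptions here, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def multimask_loss_sketch(pred_masks, pred_ious, gt_mask, mask_loss_fn):
    """Illustrative sketch of ambiguity-aware supervision.
    pred_masks: (K, H, W) logits, pred_ious: (K,), gt_mask: (H, W) in {0, 1}."""
    # Mask loss for each candidate; backpropagate only the minimum.
    losses = torch.stack([mask_loss_fn(m, gt_mask) for m in pred_masks])  # (K,)
    min_loss = losses.min()

    # Supervise the predicted IoU scores against each mask's actual IoU.
    with torch.no_grad():
        bins = (pred_masks > 0).float()
        inter = (bins * gt_mask).flatten(1).sum(-1)
        union = (bins + gt_mask).clamp(max=1).flatten(1).sum(-1)
        actual_iou = inter / union.clamp(min=1)
    iou_loss = F.mse_loss(pred_ious, actual_iou)  # weighting is an assumption

    return min_loss + iou_loss
```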

Efficiency. The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on a CPU, in about 50 ms. This runtime performance enables seamless, real-time interactive prompting of our model.

Losses and training. We supervise mask prediction with the linear combination of focal loss [65] and dice loss [73] used in [14]. We train for the promptable segmentation task using a mixture of geometric prompts (for text prompts, see §7.5). Following [92, 37], we simulate an interactive setup by randomly sampling prompts in 11 rounds per mask, allowing SAM to integrate seamlessly into our data engine.
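For reference, a minimal sketch of such a focal-plus-dice mask loss is given below; the loss weights and hyperparameters are placeholders (the paper's appendix specifies the exact ratio), and the code is illustrative rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss [65] on per-pixel mask logits (illustrative defaults)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    """Dice loss [73] on per-pixel mask logits."""
    prob = torch.sigmoid(logits).flatten()
    target = target.flatten()
    inter = (prob * target).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

def mask_loss(logits, target, focal_weight=20.0, dice_weight=1.0):
    """Linear combination of focal and dice losses; the weights here are an
    assumption (see §A of the paper for the actual ratio)."""
    return focal_weight * focal_loss(logits, target) + dice_weight * dice_loss(logits, target)
```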

4. Segment Anything Data Engine

As segmentation masks are not abundant on the internet, we built a data engine to enable the collection of our 1.1B-mask dataset, SA-1B. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage mixing automatically predicted masks with model-assisted annotation, and (3) a fully automatic stage in which our model generates masks without annotator input. We describe each stage in detail next.

Assisted-manual stage. In the first stage, resembling classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground/background object points using a browser-based interactive segmentation tool powered by SAM. Masks could be refined using pixel-precise "brush" and "eraser" tools. Our model-assisted annotation runs in real time directly in the browser (using precomputed image embeddings), enabling a truly interactive experience. We imposed no semantic constraints on the annotated objects, and annotators freely labeled both "things" and "stuff" [1]. We suggested annotators label objects they could name or describe, but did not collect these names or descriptions. Annotators were asked to label objects in order of prominence and were encouraged to proceed to the next image once a mask took over 30 seconds to annotate.

At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data annotation, SAM was retrained using only the newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total we retrained our model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved. We note that 14 seconds is 6.5× faster than mask annotation for COCO [66] and only 2× slower than bounding-box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44. Overall, we collected 4.3M masks from 120k images in this stage.

Semi-automatic stage. In this stage, we aimed to increase the diversity of masks in order to improve our model's ability to segment anything. To focus annotators on less prominent objects, we first automatically detected confident masks. Then we presented annotators with images prefilled with these masks and asked them to annotate any additional unannotated objects. To detect confident masks, we trained a bounding box detector [84] on all first-stage masks using a generic "object" category. During this stage we collected an additional 5.9M masks in 180k images (for a total of 10.2M masks). As in the first stage, we periodically retrained our model on newly collected data (5 times). Average annotation time per mask went back up to 34 seconds (excluding the automatic masks), as these objects were more challenging to label. The average number of masks per image increased from 44 to 72 (including the automatic masks).

Fully automatic stage. In the final stage, annotation was fully automatic. This was feasible due to two major enhancements to our model. First, at the start of this stage, we had collected enough masks to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage we had developed the ambiguity-aware model, which allowed us to predict valid masks even in ambiguous cases. Specifically, we prompted the model with a 32×32 regular grid of points and for each point predicted a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, our model returns the subpart, part, and whole object. The model's IoU prediction module is used to select confident masks; moreover, we identified and selected only stable masks (we consider a mask stable if thresholding the probability map at 0.5 − δ and 0.5 + δ results in similar masks). Finally, after selecting the confident and stable masks, we applied non-maximum suppression (NMS) to filter duplicates. To further improve the quality of smaller masks, we also processed multiple overlapping zoomed-in image crops. For more details on this stage, see §B. We applied fully automatic mask generation to all 11M images in our dataset, producing a total of 1.1B high-quality masks. We describe and analyze the resulting dataset, SA-1B, next.
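Before moving on, here is a minimal sketch of this fully automatic stage using the automatic mask generator exposed by the released repository; the threshold values shown are the library's illustrative defaults, not necessarily the exact settings used to build SA-1B:

```python
# Hedged sketch of fully automatic mask generation with the released package.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # 32x32 regular grid of point prompts
    pred_iou_thresh=0.88,          # keep masks with confident predicted IoU
    stability_score_thresh=0.95,   # keep masks stable under threshold shifts
    box_nms_thresh=0.7,            # NMS to filter duplicate masks
    crop_n_layers=1,               # also process overlapping zoomed-in crops
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # list of dicts: 'segmentation', 'predicted_iou', ...
```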

5. Dataset

Our dataset, SA-1B, consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks collected with our data engine. We compare SA-1B with existing datasets and analyze mask quality and properties. We are releasing SA-1B to aid future development of foundation models for computer vision. We note that SA-1B will be released under a favorable license agreement for certain research uses and with protections for researchers.

Images. We licensed a new set of 11M images from a provider that works directly with photographers. These images are high resolution (3300×4950 pixels on average), and the resulting data size can present accessibility and storage challenges. Therefore, we are releasing downsampled images with their shortest side set to 1500 pixels. Even after downsampling, our images are significantly higher resolution than many existing vision datasets (e.g., COCO [66] images are roughly 480×640 pixels). Note that most models today operate on much lower resolution inputs. Faces and vehicle license plates have been blurred in the released images.

Masks. Our data engine produced 1.1B masks, 99.1% of which were generated fully automatically. Therefore, the quality of the automatic masks is centrally important. We compare them directly to professional annotations and look at how various mask properties compare to prominent segmentation datasets. Our main conclusion, as borne out in the analysis below and the experiments in §7, is that our automatic masks are high quality and effective for training models. Motivated by these findings, SA-1B only includes automatically generated masks.

Mask quality. To estimate mask quality, we randomly sampled 500 images (~50k masks) and asked our professional annotators to improve the quality of all masks in these images. Annotators did so using our model and the pixel-precise "brush" and "eraser" editing tools. This procedure resulted in pairs of automatically predicted and professionally corrected masks. We computed the IoU between each pair and found that 94% of pairs have greater than 90% IoU (and 97% of pairs have greater than 75% IoU). For comparison, prior work estimates inter-annotator consistency at 85-91% IoU [44, 60]. Our experiments in §7 confirm by human ratings that mask quality is high relative to a variety of datasets and that training our model on automatic masks is nearly as good as using all masks produced by the data engine.
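For clarity, the pairwise comparison reduces to a standard mask IoU computation, e.g. (an illustrative helper, not the authors' code):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks, as used above to compare automatic
    and professionally corrected masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 1.0
```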

Mask properties. In Figure 5, we plot the spatial distribution of object centers in SA-1B compared with the largest existing segmentation datasets. Common photographer biases are present in all datasets. We observe that SA-1B has greater coverage of image corners compared to LVIS v1 [44] and ADE20K [117], the two most similarly distributed datasets, while COCO [66] and Open Images V5 [60] have a more prominent center bias. In Figure 6 (legend), we compare these datasets by size. SA-1B has 11× more images and 400× more masks than the second largest, Open Images. On average, it has 36× more masks per image than Open Images. The closest dataset in this respect, ADE20K, still has 3.5× fewer masks per image. Figure 6 (left) plots the masks-per-image distribution. Next, we look at image-relative mask size (the square root of the mask area divided by the image area) in Figure 6 (middle). As expected, since our dataset has more masks per image, it also tends to include a greater percentage of small and medium relative-size masks. Finally, to analyze shape complexity, we look at mask concavity (1 minus the mask area divided by the area of the mask's convex hull) in Figure 6 (right). Since shape complexity is correlated with mask size, we control for the mask size distribution by first performing stratified sampling from binned mask sizes. We observe that the concavity distributions of our masks are broadly similar to those of other datasets.
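The two mask properties defined above can be computed directly from a binary mask; the following sketch uses scikit-image's convex hull as an assumed helper:

```python
import numpy as np
from skimage.morphology import convex_hull_image

def relative_size(mask: np.ndarray) -> float:
    """Square root of mask area divided by image area (definition from §5)."""
    return float(np.sqrt(mask.sum() / mask.size))

def concavity(mask: np.ndarray) -> float:
    """1 minus mask area divided by the area of the mask's convex hull
    (definition from §5)."""
    hull_area = convex_hull_image(mask.astype(bool)).sum()
    return float(1.0 - mask.sum() / hull_area) if hull_area > 0 else 0.0
```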


6. RAI Analysis

We conduct a Responsible AI (RAI) analysis of our work by investigating potential fairness concerns and biases when using SA-1B and SAM. We focus on the geographic and income-level distribution of SA-1B and the fairness of SAM across protected attributes of people. We also provide dataset, data annotation, and model cards in §F.

Geographic and income representation. We infer the countries the images were photographed in using standard methods (see §C). In Figure 7, we visualize the per-country image counts in SA-1B (left) and the 50 countries with the most images (right). We note that the top three countries come from different parts of the world. Next, in Table 1, we compare the geographic and income representation of SA-1B, COCO [66], and Open Images [60]. SA-1B has a substantially higher percentage of images from Europe, Asia and Oceania, and from middle-income countries. All datasets underrepresent Africa and low-income countries. We note that in SA-1B, all regions, including Africa, have at least 28 million masks, 10× more than the total number of masks of any previous dataset. Finally, the average number of masks per image (not shown) is fairly consistent across regions and incomes (94-108 per image).


Fairness in segmenting people. We investigate potential fairness concerns across perceived gender presentation, perceived age group, and perceived skin tone by measuring the performance discrepancy of SAM between groups. We use the More Inclusive Annotations for People (MIAP) [87] dataset for gender presentation and age, and a proprietary dataset for skin tone (see §C). Our evaluation simulates interactive segmentation with random sampling of 1 and 3 points (see §D). Table 2 (top left) shows results for perceived gender presentation. We note that females have been shown to be underrepresented in detection and segmentation datasets [115], but observe that SAM performs similarly across groups. We repeat the analysis for perceived age in Table 2 (bottom left), noting that those perceived to be younger and older have been shown to be underrepresented in large-scale datasets [110]. SAM performs best on those perceived to be older (although the confidence interval is large). Finally, we repeat the analysis for perceived skin tone in Table 2 (right), noting that those with apparently lighter skin tones have been shown to be overrepresented and those with darker skin tones underrepresented in large-scale datasets [110]. As MIAP does not contain perceived skin tone annotations, we use a proprietary dataset that contains annotations for the perceived Fitzpatrick skin type [36], which ranges from 1 (lightest skin tone) to 6 (darkest skin tone). While the means vary somewhat, we do not find a significant difference across groups. We believe our findings stem from the nature of the task and acknowledge that biases may arise when SAM is used as a component in larger systems. Finally, in §C we extend the analysis to segmenting clothing, where we find an indication of bias across perceived gender presentation.

7. Zero-Shot Transfer Experiments

In this section, we present zero-shot transfer experiments with SAM, the Segment Anything Model. We consider five tasks, four of which differ significantly from the promptable segmentation task used to train SAM. These experiments evaluate SAM on datasets and tasks that were not seen during training (our usage of "zero-shot transfer" follows its usage in CLIP [82]). The datasets may include novel image distributions, such as underwater or ego-centric images (Figure 8) that, to our knowledge, do not appear in SA-1B.

Our experiments begin by testing the core goal of promptable segmentation: producing a valid mask from any prompt. We emphasize the challenging scenario of a single foreground point prompt, since it is more likely to be ambiguous than other, more specific prompts. Next, we present a sequence of experiments that traverses low-, mid-, and high-level image understanding and roughly parallels the historical development of the field. Specifically, we prompt SAM to (1) perform edge detection, (2) segment everything, i.e., object proposal generation, (3) segment detected objects, i.e., instance segmentation, and (4), as a proof of concept, segment objects from free-form text. These four tasks differ significantly from the promptable segmentation task SAM was trained on and are implemented via prompt engineering. Our experiments conclude with an ablation study.

Implementation. Unless otherwise specified: (1) SAM uses an MAE [47] pre-trained ViT-H [33] image encoder and (2) SAM was trained on SA-1B, noting that this dataset includes only automatically generated masks from the final stage of our data engine. For all other model and training details, such as hyperparameters, refer to §A.

7.1. Zero-Shot Single Point Valid Mask Evaluation

Task. We evaluate segmenting an object from a single foreground point. This task is ill-posed, since one point can refer to multiple objects. Ground-truth masks in most datasets do not enumerate all possible masks, which can make automatic metrics unreliable. Therefore, we supplement the standard mIoU metric (i.e., the mean of all IoUs between predicted and ground-truth masks) with a human study in which annotators rate mask quality from 1 (nonsense) to 10 (pixel-perfect). See §D.1, §E, and §G for additional details.

By default, we sample points from the "center" of ground-truth masks (at the maximum value of a mask's interior distance transform), following the standard evaluation protocol in interactive segmentation [92]. Since SAM is capable of predicting multiple masks, by default we evaluate only the model's most confident mask. The baselines are all single-mask methods. We compare mainly to RITM [92], a strong interactive segmenter that performs best on our benchmark compared to other strong baselines [67, 18].
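A small sketch of this center-point sampling, using a Euclidean distance transform as an assumed implementation of the protocol, is shown below:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def center_point(mask: np.ndarray) -> tuple:
    """Sample the evaluation point as the 'center' of a ground-truth mask,
    i.e. the location of the maximum of the interior distance transform
    (a sketch of the protocol described above, following [92])."""
    dist = distance_transform_edt(mask.astype(bool))
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return int(x), int(y)   # (x, y), the order expected by point prompts
```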

Datasets. We use a newly compiled suite of 23 datasets with diverse image distributions. Figure 8 lists the datasets and shows a sample from each one (see Appendix Table 7 for more details). We use all 23 datasets for mIoU evaluation. For the human study, we use the subset listed in Figure 9b (due to the resource requirements of such studies). This subset includes both datasets for which SAM outperforms RITM and ones for which it underperforms RITM according to automatic metrics.

Results. First, we look at automatic evaluation using mIoU on the full suite of 23 datasets. We compare per-dataset results against RITM in Figure 9a. SAM yields higher results on 16 of the 23 datasets, by as much as ~47 IoU. We also present an "oracle" result, in which the most relevant of SAM's 3 masks is selected by comparing them to the ground truth, rather than selecting the most confident mask. This reveals the impact of ambiguity on automatic evaluation. In particular, with the oracle performing ambiguity resolution, SAM outperforms RITM on all datasets.

Results of the human study are presented in Figure 9b. Error bars are 95% confidence intervals for mean mask ratings (all differences are significant; see §E for details). We observe that annotators consistently rate the quality of SAM's masks substantially higher than the strongest baseline, RITM. The ablated version of SAM with a single output mask has consistently lower ratings, though still higher than RITM. SAM's mean ratings fall between 7 and 9, which corresponds to the qualitative rating guideline: "A high score (7-9): The object is identifiable and errors are small and rare (e.g., missing a small, heavily obscured disconnected component, ...)." These results indicate that SAM has learned to segment valid masks from a single point. Note that for datasets like DRAM and IBD, where SAM scores lower on automatic metrics, it receives consistently high ratings in the human study.


Figure 9c shows additional baselines, SimpleClick [67] and FocalClick [18], which obtain lower single-point performance than RITM and SAM. As the number of points increases from 1 to 9, the gap between methods decreases. This is expected, as the task becomes easier; also, SAM is not optimized for the very-high-IoU regime. Finally, in Figure 9d we replace the default center point sampling with random point sampling. We observe that the gap between SAM and the baselines grows, and that SAM achieves comparable results under either sampling method.

7.2. Zero-Shot Edge Detection

Approach. We evaluate SAM on the classic low-level task of edge detection using BSDS500 [72, 3]. We use a simplified version of our automatic mask generation pipeline. Specifically, we prompt SAM with a 16×16 regular grid of foreground points, resulting in 768 predicted masks (3 per point). Redundant masks are removed by NMS. Edge maps are then computed by applying a Sobel filter to the unthresholded mask probability maps, followed by standard lightweight post-processing, including edge NMS (see §D.2).
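The Sobel-filtering step can be sketched as follows; edge NMS and the other post-processing from §D.2 are omitted, so this is an illustration rather than the authors' pipeline:

```python
import numpy as np
from scipy.ndimage import sobel

def masks_to_edge_map(prob_maps: np.ndarray) -> np.ndarray:
    """Turn unthresholded mask probability maps (N, H, W) into a single edge
    map via Sobel filtering (simplified sketch of the pipeline above)."""
    edges = np.zeros(prob_maps.shape[1:], dtype=np.float32)
    for p in prob_maps:
        gx, gy = sobel(p, axis=1), sobel(p, axis=0)   # horizontal / vertical gradients
        edges = np.maximum(edges, np.hypot(gx, gy))   # keep the strongest response
    return edges / (edges.max() + 1e-8)               # normalize to [0, 1]
```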

Results. We visualize representative edge maps in Figure 10 (see Figure 15 for more). Qualitatively, we observe that even though SAM was not trained for edge detection, it produces reasonable edge maps. Compared to the ground truth, SAM predicts more edges, including sensible ones not annotated in BSDS500. This bias is reflected quantitatively in Table 3: recall at 50% precision (R50) is high, at the cost of precision. SAM naturally lags behind state-of-the-art methods that learn the biases of BSDS500, i.e., which edges to suppress. Nevertheless, SAM performs well compared to pioneering deep learning methods such as HED [108] (also trained on BSDS500) and significantly better than prior, though admittedly outdated, zero-shot transfer methods.


7.3. Zero-Shot Object Proposals

Approach. Next, we evaluate SAM on the mid-level task of object proposal generation [2, 102]. This task has played an important role in object detection research, serving as an intermediate step in pioneering systems (e.g., [102, 41, 84]). To generate object proposals, we run a slightly modified version of our automatic mask generation pipeline and output the masks as proposals (see §D.3).

We compute the standard average recall (AR) metric on LVIS v1 [44]. We focus on LVIS because its large number of categories presents a challenging test. We compare to a strong baseline implemented as a ViTDet [62] detector (with cascade Mask R-CNN [48, 11] and ViT-H). We note that this "baseline" corresponds to the "Detector Masquerading as Proposal generator" (DMP) method [16] that was shown to game AR, making it a truly demanding comparison.

Results. In Table 4 we see, unsurprisingly, that using the detections from ViTDet-H as object proposals (i.e., the DMP method [16] that games AR) performs best overall. However, SAM does remarkably well on several metrics. Notably, it outperforms ViTDet-H on medium and large objects, as well as on rare and common objects. In fact, SAM only underperforms ViTDet-H on small and frequent objects, where ViTDet-H can easily learn LVIS-specific annotation biases since, unlike SAM, it was trained on LVIS. We also compare against an ambiguity-unaware ("single out.") version of SAM, which performs significantly worse than SAM on all AR metrics.

7.4. Zero-Shot Instance Segmentation

Approach. Moving to higher-level vision, we use SAM as the segmentation module of an instance segmenter. The implementation is simple: we run an object detector (the ViTDet used before) and prompt SAM with its output boxes. This illustrates composing SAM into a larger system.
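Reusing the predictor from the earlier usage sketch, box prompting looks roughly like this; the box coordinates are placeholders and the detector itself is not shown:

```python
# Hedged sketch: prompt SAM with a detector's output box (placeholder values).
import numpy as np

box = np.array([120, 80, 430, 360])       # (x0, y0, x1, y1) from an object detector
masks, scores, _ = predictor.predict(     # 'predictor' as set up in the earlier sketch
    box=box,
    multimask_output=False,               # a box prompt is usually unambiguous
)
instance_mask = masks[0]                  # one binary mask per detected box
```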


Results. We compare the masks predicted by SAM and ViTDet on COCO and LVIS in Table 5. Looking at the mask AP metric, we observe gaps on both datasets, where SAM is reasonably close to, though certainly behind, ViTDet. By visualizing outputs, we observed that SAM masks are often qualitatively better than those of ViTDet, with crisper boundaries (see §D.4 and Figure 16). To investigate this observation, we conducted an additional human study asking annotators to rate the ViTDet masks and SAM masks on the 1-to-10 quality scale used before. In Figure 11, we observe that SAM consistently outperforms ViTDet in the human study.


We hypothesize that on COCO, where the mask AP gap is larger and the ground-truth quality is relatively low (as borne out by the human study), ViTDet learns the specific biases of COCO masks. SAM, being a zero-shot method, is unable to exploit these (generally undesirable) biases. The LVIS dataset has higher-quality ground truth, but there are still specific idiosyncrasies (e.g., masks do not contain holes; they are simple polygons by construction) and biases toward modal versus amodal masks. Again, SAM is not trained to learn these biases, whereas ViTDet can exploit them.

7.5. Zero-Shot Text-to-Mask

Approach. Finally, we consider an even higher-level task: segmenting objects from free-form text. This experiment is a proof of concept of SAM's ability to process text prompts. While we used the exact same SAM in all prior experiments, for this one SAM's training procedure is modified to make it text-aware, but in a way that does not require new text annotations. Specifically, for each manually collected mask with an area larger than 100², we extract the CLIP image embedding. Then, during training, we prompt SAM with the extracted CLIP image embedding as its first interaction. The key observation here is that because CLIP's image embeddings are trained to align with its text embeddings, we can train with image embeddings but use text embeddings for inference. That is, at inference time we run the text through CLIP's text encoder and then give the resulting text embedding as a prompt to SAM (see §D.5 for details).
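The released SAM does not ship this text-prompt variant, so the following only sketches how a CLIP text embedding could be obtained and normalized for use as a prompt embedding, per the alignment argument above; the CLIP calls use the openai clip package, and feeding the embedding into SAM is left abstract:

```python
# Hedged sketch: obtain a CLIP text embedding to use as a prompt embedding.
# How the embedding is fed into SAM is not shown, since that interface was not released.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)

tokens = clip.tokenize(["a wheel"]).to(device)
with torch.no_grad():
    text_embedding = clip_model.encode_text(tokens)          # (1, 768) for ViT-L/14
prompt_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
# prompt_embedding would stand in for the CLIP image embedding used at training time.
```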

Results. We show qualitative results in Figure 12. SAM can segment objects based on simple text prompts like "a wheel" as well as phrases like "beaver tooth grille". When SAM fails to pick the right object from a text prompt alone, an additional point often fixes the prediction, similar to [31].

7.6. Ablations

We perform several ablations on our suite of 23 datasets with the single center point prompt protocol. Recall that a single point may be ambiguous and that the ambiguity may not be represented in the ground truth, which includes only a single mask per point. Since SAM is operating in a zero-shot transfer setting, there can be systematic biases between SAM's top-ranked mask and the masks resulting from the data annotation guidelines. We therefore additionally report the best mask with respect to the ground truth ("oracle").


Figure 13 (left) plots SAM's performance when trained on cumulative data from the data engine stages. We observe that each stage increases mIoU. When training with all three stages, the automatic masks vastly outnumber the manual and semi-automatic masks. To address this, we found that oversampling the manual and semi-automatic masks by 10× during training gave the best results. This setup, however, complicates training. We therefore tested a fourth setup that uses only the automatically generated masks. With this data, SAM performs only marginally lower (~0.5 mIoU) than with all data. Therefore, by default we use only the automatically generated masks to simplify the training setup.

In Figure 13 (middle), we look at the impact of data volume. The full SA-1B contains 11M images, which we uniformly subsample to 1M and 0.1M for this ablation. At 0.1M images, we observe a large mIoU decline under all settings. However, with 1M images, about 10% of the full dataset, we observe results comparable to using the full dataset. This data regime, which still includes approximately 100M masks, may be a practical setting for many use cases.

Finally, Figure 13 (right) shows results with ViT-B, ViT-L, and ViT-H image encoders. ViT-H improves substantially over ViT-B, but has only marginal gains over ViT-L. Further image encoder scaling does not appear fruitful at this time.

8. Discussion

Foundation models. Pre-trained models have been adapted to downstream tasks since the early days of machine learning [99]. This paradigm has become increasingly important in recent years with a growing emphasis on scale, and such models have recently been (re-)branded as "foundation models": i.e., models that are "trained on broad data at scale and are adaptable to a wide range of downstream tasks" [8]. Our work correlates well with this definition, though we note that a foundation model for image segmentation is an inherently limited scope, since it represents an important, yet fractional, subset of computer vision. We also contrast one aspect of our approach with [8], which emphasizes the role of self-supervised learning in foundation models. While our model is initialized with a self-supervised technique (MAE [47]), the vast majority of its capabilities come from large-scale supervised training. In cases where a data engine can scale the available annotations, as in ours, supervised training provides an effective solution.

Compositionality. Pre-trained models can power new capabilities even beyond ones imagined at the moment of training. One prominent example is how CLIP [82] is used as a component in larger systems, such as DALL·E [83]. Our goal is to make this kind of composition straightforward with SAM. We aim to achieve this by requiring SAM to predict a valid mask for a wide range of segmentation prompts. The effect is to create a reliable interface between SAM and other components. For example, MCC [106] can easily use SAM to segment an object of interest and achieve strong generalization to unseen objects for 3D reconstruction from a single RGB-D image. In another example, SAM can be prompted with gaze points detected by a wearable device, enabling new applications. Thanks to SAM's ability to generalize to new domains, such as ego-centric images, such systems can work without additional training.

Limitations. While SAM performs well in general, it is not perfect. It can miss fine structures, sometimes hallucinates small disconnected components, and does not produce boundaries as crisply as more computationally intensive methods that "zoom in", e.g., [18]. In general, we expect dedicated interactive segmentation methods to outperform SAM when many points are provided, e.g., [67]. Unlike these methods, SAM is designed for generality and breadth of use rather than high-IoU interactive segmentation. Moreover, SAM can process prompts in real time, but SAM's overall performance is not real time when a heavy image encoder is used. Our foray into the text-to-mask task is exploratory and not entirely robust, although we believe it can be improved with more effort. While SAM can perform many tasks, it is unclear how to design simple prompts that implement semantic and panoptic segmentation. Finally, there are domain-specific tools, such as [7], that we expect to outperform SAM in their respective domains.

Conclusion. The Segment Anything project is an attempt to lift image segmentation into the era of foundation models. Our principal contributions are a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that make this leap possible. Whether SAM achieves the status of a foundation model remains to be seen by how it is used in the community, but regardless, we expect that the perspective of this work, the release of over 1B masks, and our promptable segmentation model will help pave the path ahead.



Source: blog.csdn.net/leiduifan6944/article/details/130080159