Segment Anything Model

 Paper translation:

 

Figure 1: We aim to build a foundation model for segmentation by introducing three interrelated components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

Abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive, often competitive with or even superior to prior fully supervised results. We release the model and dataset at https://segment-anything.com to foster research into foundation models for computer vision.

1. Introduction

Language models pre-trained on large-scale web datasets are revolutionizing natural language processing with strong zero-shot and few-shot generalization [10]. These "foundation models" [8] can generalize to tasks and data distributions far beyond those seen during training. This capability is often realized through prompt engineering, in which hand-crafted text prompts the language model to produce a valid textual response for the task at hand. When scaled and trained with abundant text corpora from the web, the zero-shot and few-shot performance of these models compares surprisingly well with (and in some cases even matches) fine-tuned models [10, 21]. Empirical trends show that this behavior improves with model scale, dataset size, and total training compute [56, 10, 21, 51].

Foundation models have also been explored in computer vision, albeit to a lesser extent. Perhaps the most prominent line of work aligns paired text and images from the web. For example, CLIP [82] and ALIGN [55] use contrastive learning to train text and image encoders that align the two modalities. Once trained, engineered text prompts enable zero-shot generalization to novel visual concepts and data distributions. Such encoders also compose effectively with other modules to enable downstream tasks such as image generation (e.g., DALL·E [83]). While much progress has been made on vision and language encoders, computer vision includes a wide range of problems beyond this scope, and for many of them abundant training data does not exist.

In this work, our goal is to build a foundation model for image segmentation. That is, we seek to develop a promptable model pre-trained on a broad dataset, with a training task that enables strong generalization. With this model, we aim to solve a range of downstream segmentation problems on new data distributions using prompt engineering.

The success of this plan hinges on three components: task, model, and data. To develop them, we address the following questions about image segmentation:

  1. What task will enable zero-shot generalization?
  2. What is the corresponding model architecture?
  3. What data can support this task and model?

These questions are entangled and require a comprehensive solution. We start by defining a promptable segmentation task that is general enough to provide a powerful pre-training objective and to enable a wide range of downstream applications. This task requires a model that supports flexible prompting and that can output segmentation masks in real time when prompted, to allow interactive use. To train our model, we need a diverse, large-scale source of data. Unfortunately, there is no web-scale data source for segmentation; to address this, we build a "data engine", i.e., we iterate between using our efficient model to assist data collection and using the newly collected data to improve the model. We introduce each interrelated component next, followed by the dataset we created and the experiments that demonstrate the effectiveness of our approach.

Task (Part II). In natural language processing, and more recently in computer vision, foundation models are a promising development that can perform zero-shot and few-shot learning on new datasets and tasks, often by using "prompting" techniques. Inspired by this line of work, we propose the promptable segmentation task, whose goal is to return a valid segmentation mask given any segmentation prompt (see Figure 1a). A prompt simply specifies what to segment in an image; for example, a prompt can include spatial or textual information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. We use the promptable segmentation task both as a pre-training objective and to solve general downstream segmentation tasks via prompt engineering.

Model (Part III). The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. In particular, the model must support flexible prompts, it must compute masks in amortized real time to allow interactive use, and it must be ambiguity-aware. Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks. We call this model the Segment Anything Model (SAM; see Figure 1b). By separating SAM into an image encoder and a fast prompt encoder/mask decoder, the same image embedding can be reused (and its cost amortized) across different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in ~50 ms in a web browser. We focus on point, box, and mask prompts, and also present initial results with free-form text prompts. To make SAM ambiguity-aware, we design it to predict multiple masks for a single prompt, allowing SAM to naturally handle ambiguity, such as the shirt vs. person example.
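
To make this decomposition concrete, below is a minimal, hypothetical sketch of how the three components could fit together so that the heavy image embedding is computed once and reused across prompts. The class and method names are placeholders and do not mirror the official implementation.

```python
import torch
import torch.nn as nn

class ToySAM(nn.Module):
    """Illustrative decomposition into encoder / prompt encoder / mask decoder.

    Not the released SAM code; the three sub-modules are passed in as generic
    nn.Modules to highlight the structure rather than the exact architecture.
    """

    def __init__(self, image_encoder: nn.Module, prompt_encoder: nn.Module,
                 mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # heavy ViT, run once per image
        self.prompt_encoder = prompt_encoder  # lightweight, run per prompt
        self.mask_decoder = mask_decoder      # lightweight, run per prompt

    @torch.no_grad()
    def embed_image(self, image: torch.Tensor) -> torch.Tensor:
        # Expensive step; its cost is amortized over all prompts for this image.
        return self.image_encoder(image)

    @torch.no_grad()
    def predict(self, image_embedding: torch.Tensor, prompt) -> torch.Tensor:
        # Cheap step; the paper reports ~50 ms per prompt in a web browser.
        sparse_emb, dense_emb = self.prompt_encoder(prompt)
        return self.mask_decoder(image_embedding, sparse_emb, dense_emb)
```

The key design point is that `embed_image` is called once per image, while `predict` can be called many times interactively against the cached embedding.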

Data engine (Part IV). To achieve strong generalization to new data distributions, we find it necessary to train SAM on a large and diverse set of masks, beyond any segmentation dataset that already exists. While a typical approach for foundation models is to obtain data online [83], masks are not naturally abundant, so we need an alternative strategy. Our solution is to build a "data engine", i.e., we co-develop our model with model-in-the-loop dataset annotation (see Figure 1c). Our data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM automatically generates masks for a subset of objects by being prompted with likely object locations, while annotators focus on annotating the remaining objects, which helps increase mask diversity. In the final stage, we prompt SAM with a regular grid of foreground points, yielding on average about 100 high-quality masks per image.

Dataset (Part V). Our final dataset, SA-1B, includes more than 1 billion masks on 11 million licensed and privacy-preserving images (see Figure 2). SA-1B was collected fully automatically using the final stage of our data engine; it has 400× more masks than any existing segmentation dataset [66, 44, 117, 60], and we extensively verify that the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, we hope SA-1B will become a valuable resource for research aiming to build new foundation models.

Responsible AI (Part VI). We study and report on potential fairness concerns and biases when using SA-1B and SAM. Images in SA-1B span a geographically and economically diverse set of countries, and we found that SAM performs similarly across different groups of people. Together, we hope this will make our work more equitable for real-world use cases. We provide model and dataset cards in the appendix.

Experiments (Part VII). We extensively evaluate SAM. First, using a diverse new suite of 23 segmentation datasets, we find that SAM produces high-quality masks from a single foreground point, often only slightly below those of the manually annotated ground truth. Second, using prompt engineering under a zero-shot transfer protocol, we find consistently strong quantitative and qualitative results on a variety of downstream tasks, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM, with the help of prompt engineering, can address a variety of tasks involving object and image distributions beyond SAM's training data. Nevertheless, room for improvement remains, as we discuss in Part VIII.

Release. We release the SA-1B dataset for research purposes and make SAM available under a permissive open license (Apache 2.0) at https://segment-anything.com. We also showcase SAM's capabilities in an online demo.

Figure 2: Example images with overlaid masks from our newly introduced dataset, SA-1B. SA-1B contains 11 million diverse, high-resolution, licensed, and privacy-protecting images and 1.1 billion high-quality segmentation masks. These masks were annotated fully automatically by SAM, and we verify that they are high quality and diverse through human ratings and extensive experiments. Images are grouped by number of masks per image for visualization (there are about 100 masks per image on average).

2. Segment Anything Task

We take inspiration from natural language processing, where the next-token prediction task is used to pre-train foundation models and to solve diverse downstream tasks via prompt engineering [10]. To build a foundation model for segmentation, we aim to define a task with analogous capabilities.

Task: We start by translating the idea of a prompt from NLP to segmentation, where a prompt can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task, then, is to return a valid segmentation mask given any prompt. The requirement of a "valid" mask simply means that even when a prompt is ambiguous and could refer to multiple objects (recall the shirt vs. person example; see Figure 3), the output should be a reasonable mask for at least one of those objects. This requirement is similar to expecting a language model to output a coherent response to an ambiguous prompt. We choose this task because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.

Pre-training: The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model's mask predictions against the ground truth. We adapt this approach from interactive segmentation [109, 70], although unlike interactive segmentation, whose aim is to eventually predict a valid mask after enough user input, our aim is to always predict a valid mask for any prompt, even when the prompt is ambiguous. This ensures that the pre-trained model is effective in use cases that involve ambiguity, including the automatic annotation required by our data engine (§4). We note that performing well at this task is challenging and requires specialized modeling and training loss choices, which we discuss in §3.
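
As a rough illustration of what simulating interactive prompts during training might look like, the sketch below samples an initial click from the ground-truth mask and then samples corrective clicks from the error region between the current prediction and the ground truth. This helper is a common heuristic from the interactive segmentation literature, not code from the paper.

```python
import numpy as np

def sample_corrective_click(pred_mask: np.ndarray, gt_mask: np.ndarray,
                            rng: np.random.Generator):
    """Pick the next simulated click from the prediction/ground-truth error region.

    pred_mask, gt_mask: boolean arrays of shape (H, W).
    Returns ((x, y), label), where label is 1 for a foreground click, 0 for background.
    """
    error = np.logical_xor(pred_mask, gt_mask)
    ys, xs = np.nonzero(error)
    if len(ys) == 0:              # prediction already matches GT: click inside the object
        ys, xs = np.nonzero(gt_mask)
    i = rng.integers(len(ys))
    y, x = int(ys[i]), int(xs[i])
    label = 1 if gt_mask[y, x] else 0
    return (x, y), label

# Usage sketch: the first click is sampled with an empty prediction, then the model
# is re-run after every simulated click, accumulating a loss at each round.
rng = np.random.default_rng(0)
gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
click, label = sample_corrective_click(np.zeros_like(gt), gt, rng)
```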

Related tasks: Segmentation is a broad field: there is interactive segmentation [57, 109], edge detection [3], super pixelization [85], object proposal generation [2], foreground segmentation [94], semantic segmentation [90], instance segmentation [66], panoptic segmentation [59], etc. The goal of our promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks via prompt engineering. This capability is a form of task generalization [26]. Note that this differs from previous work on multi-task segmentation systems. In a multi-task system, a single model performs a fixed set of tasks, e.g., joint semantic, instance, and panoptic segmentation [114, 19, 54], but the training and test tasks are the same. An important distinction in our work is that a model trained for promptable segmentation can perform a new, different task at inference time by acting as a component in a larger system; e.g., to perform instance segmentation, a promptable segmentation model is combined with an existing object detector.

Discussion: Prompting and composition are powerful tools that enable a single model to be used in extensible ways, potentially accomplishing tasks unknown at the time of model design. This approach is analogous to how other foundation models are used, e.g., how CLIP [82] serves as the text-image alignment component of the DALL·E [83] image generation system. We anticipate that composable system design, powered by techniques such as prompt engineering, will enable a wider variety of applications than systems trained specifically for a fixed set of tasks. It is also interesting to compare promptable and interactive segmentation through the lens of composition: while interactive segmentation models are designed with human users in mind, a model trained for promptable segmentation can also be composed into larger algorithmic systems, as we will demonstrate.

Figure 4: Overview of the Segment Anything Model (SAM). A heavyweight image encoder outputs an image embedding that can then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous prompts corresponding to multiple objects, SAM can output multiple valid masks with associated confidence scores.

3. Segment Anything Model

Next, we describe the Segment Anything Model (SAM) for promptable segmentation. SAM has three components, illustrated in Figure 4: an image encoder, a flexible prompt encoder, and a fast mask decoder. We build on Transformer vision models [14, 33, 20, 62] with specific trade-offs for (amortized) real-time performance. We describe these components at a high level here, with details in §A.

Image encoder: Motivated by scalability and powerful pre-training methods, we use an MAE [47] pre-trained Vision Transformer (ViT) [33], minimally adapted to process high-resolution inputs [62]. The image encoder runs once per image and can be applied prior to prompting the model.

Prompt encoder: We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type, and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
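
A minimal PyTorch sketch of this sparse/dense split is shown below; the random-Fourier positional encoding and the exact layer shapes are assumptions chosen for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    """Minimal sketch of sparse/dense prompt encoding; not the released SAM code."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Random Fourier features stand in for the positional encoding of [95].
        self.register_buffer("pe_matrix", torch.randn(2, embed_dim // 2))
        # Learned embedding per sparse prompt type, e.g. foreground point,
        # background point, box top-left corner, box bottom-right corner.
        self.type_embed = nn.Embedding(4, embed_dim)
        # Dense prompts (masks) are downscaled with convolutions so they can be
        # added element-wise to the image embedding.
        self.mask_downscale = nn.Sequential(
            nn.Conv2d(1, embed_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=2, stride=2),
        )

    def encode_points(self, coords: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) normalized to [0, 1]; types: (N,) integer prompt-type ids.
        proj = 2 * torch.pi * coords @ self.pe_matrix   # (N, embed_dim // 2)
        pos = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return pos + self.type_embed(types)             # (N, embed_dim)

    def encode_mask(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) low-resolution mask prompt -> (B, embed_dim, H/4, W/4)
        return self.mask_downscale(mask)
```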

Mask decoder: The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. Inspired by [14, 20], this design uses a modified Transformer decoder block [103] followed by a dynamic mask prediction head. Our modified decoder block updates all embeddings using prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice versa). After running two such blocks, we upsample the image embedding and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
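
The "dynamic linear classifier" idea can be sketched as follows: the output token is turned into a small weight vector that is dotted with the upsampled per-pixel features. This snippet only shows that head (the two-way attention blocks are omitted), and all layer sizes are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn

class ToyMaskHead(nn.Module):
    """Dynamic mask prediction head only; the two-way attention blocks are omitted."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Upsample the image embedding to a finer spatial grid.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(embed_dim // 4, embed_dim // 8, kernel_size=2, stride=2),
        )
        # MLP that maps the (attended) output token to per-pixel classifier weights.
        self.token_to_weights = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim // 8),
        )

    def forward(self, image_embedding: torch.Tensor, output_token: torch.Tensor) -> torch.Tensor:
        # image_embedding: (B, C, H, W); output_token: (B, C)
        feats = self.upsample(image_embedding)           # (B, C/8, 4H, 4W)
        weights = self.token_to_weights(output_token)    # (B, C/8)
        # Spatially shared dynamic linear classifier: dot product at every location.
        logits = torch.einsum("bchw,bc->bhw", feats, weights)
        return logits                                    # foreground logit per pixel
```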

Resolving ambiguity: With a single output, the model will average over multiple valid masks when given an ambiguous prompt. To address this, we modify the model to predict multiple output masks for a single prompt (see Figure 3). We found that 3 mask outputs suffice to cover most common cases (nested masks are usually at most three levels deep: whole, part, and sub-part). During training, we backpropagate only the minimum loss [15, 45, 64] over the masks (a sketch of this rule appears below). To rank masks, the model predicts a confidence score (i.e., an estimated IoU) for each mask.

Efficiency: The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ~50 ms. This runtime performance enables seamless, real-time interactive prompting of our model.
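
Below is a hedged sketch of the minimum-loss rule and IoU-score supervision mentioned above; `mask_loss_fn` is assumed to return a per-sample loss of shape (B,), and the details are illustrative rather than taken from the released training code.

```python
import torch
import torch.nn.functional as F

def min_over_masks_loss(pred_masks: torch.Tensor, pred_ious: torch.Tensor,
                        gt_mask: torch.Tensor, mask_loss_fn) -> torch.Tensor:
    """pred_masks: (B, K, H, W) logits, pred_ious: (B, K), gt_mask: (B, H, W) in {0, 1}.

    Only the lowest-loss mask per sample receives the mask-loss gradient, so the K
    outputs can specialize (e.g. whole / part / sub-part) without being penalized
    for valid-but-unmatched interpretations.
    """
    per_mask = torch.stack(
        [mask_loss_fn(pred_masks[:, k], gt_mask) for k in range(pred_masks.shape[1])],
        dim=1,
    )                                                  # (B, K)
    best = per_mask.argmin(dim=1)                      # best mask index per sample
    mask_loss = per_mask.gather(1, best[:, None]).mean()

    # The IoU head regresses the actual IoU of each predicted mask
    # (no gradient flows through the IoU targets themselves).
    with torch.no_grad():
        hard = (pred_masks.sigmoid() > 0.5).float()
        inter = (hard * gt_mask[:, None]).flatten(2).sum(-1)
        union = (hard + gt_mask[:, None]).clamp(max=1).flatten(2).sum(-1)
        actual_iou = inter / union.clamp(min=1)
    iou_loss = F.mse_loss(pred_ious, actual_iou)
    return mask_loss + iou_loss
```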

Losses and training: We supervise mask prediction with the linear combination of focal loss [65] and dice loss [73] used in [14]. We train for the promptable segmentation task using a mixture of geometric prompts (for text prompts see §7.5). Following [92, 37], we simulate an interactive setup by randomly sampling prompts in 11 rounds per mask, allowing SAM to integrate seamlessly into our data engine.
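
For reference, here is a common formulation of the focal + dice combination. The weights and hyper-parameters below are illustrative defaults, not necessarily those used in the paper, and the per-sample reduction is chosen so it plugs into the minimum-loss sketch above.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Sigmoid focal loss, averaged per sample. logits/target: (B, H, W)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).flatten(1).mean(-1)

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss, per sample."""
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    num = 2 * (p * t).sum(-1) + eps
    den = p.sum(-1) + t.sum(-1) + eps
    return 1 - num / den

def mask_loss(logits, target, focal_weight: float = 20.0, dice_weight: float = 1.0):
    # The relative weighting is an assumed placeholder; the paper specifies a
    # linear combination of the two losses.
    return focal_weight * focal_loss(logits, target) + dice_weight * dice_loss(logits, target)
```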

4. Segment Anything Data Engine

Since segmentation masks are not abundant on the internet, we built a data engine to enable collection of our 1.1B mask dataset, SA-1B. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage mixing automatically predicted masks with model-assisted annotation, and (3) a fully automatic stage in which our model generates masks without annotator input. We describe each stage in detail next.

Assisted-manual stage: In the first stage, resembling classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground/background object points using a browser-based interactive segmentation tool powered by SAM. Masks could be refined with pixel-precise "brush" and "eraser" tools. Our model-assisted annotation runs in real time directly in the browser (using precomputed image embeddings), enabling a truly interactive experience. We imposed no semantic constraints on labeled objects, and annotators freely labeled both "stuff" and "things" [1]. We suggested annotators label objects they could name or describe, but we did not collect these names or descriptions. Annotators were asked to label objects in order of prominence and were encouraged to proceed to the next image once a mask took over 30 seconds to annotate.

At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data had been annotated, SAM was retrained using only the newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total we retrained our model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved. We note that 14 seconds is 6.5× faster than mask annotation for COCO [66] and only 2× slower than bounding box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44. Overall, we collected 4.3M masks from 120k images in this stage.

Semi-automatic stage: In this stage, we aimed to increase the diversity of masks in order to improve our model's ability to segment anything. To focus annotators on less prominent objects, we first automatically detected confident masks. We then presented annotators with images pre-filled with these masks and asked them to annotate any additional unannotated objects. To detect confident masks, we trained a bounding box detector [84] on all first-stage masks using a generic "object" category. During this stage we collected an additional 5.9M masks in 180k images (for a total of 10.2M masks). As in the first stage, we periodically retrained our model on newly collected data (5 times). Average annotation time per mask went back up to 34 seconds (excluding the automatic masks), as these objects were harder to label. The average number of masks per image increased from 44 to 72 (including the automatic masks).

Fully automatic stage: In the final stage, annotation was fully automatic. This was feasible thanks to two major enhancements of our model. First, at the start of this stage we had collected enough masks to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage we had developed the ambiguity-aware model, which allows us to predict valid masks even in ambiguous cases. Specifically, we prompted the model with a 32×32 regular grid of points and for each point predicted a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or sub-part, the model returns the sub-part, part, and whole object. The IoU prediction module of our model is used to select confident masks; moreover, we identified and selected only stable masks (we consider a mask stable if thresholding its probability map at 0.5 − δ and 0.5 + δ yields similar masks). Finally, after selecting the confident and stable masks, we applied non-maximum suppression (NMS) to filter duplicates. To further improve the quality of smaller masks, we also processed multiple overlapping zoomed-in crops of the image. See §B for further details of this stage. We applied fully automatic mask generation to all 11M images in our dataset, producing a total of 1.1B high-quality masks. Next, we describe and analyze the resulting dataset, SA-1B.
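
The stability filter and the point grid described above are easy to illustrate; the sketch below is a plausible version of both, where the δ value and the grid spacing are placeholders rather than the paper's exact settings.

```python
import numpy as np

def stability_score(prob_map: np.ndarray, delta: float = 0.05) -> float:
    """IoU between the masks obtained at thresholds 0.5 - delta and 0.5 + delta.

    A mask whose binarization barely changes under threshold perturbation is
    considered stable (illustrative version of the filter described above).
    """
    loose = prob_map > (0.5 - delta)
    tight = prob_map > (0.5 + delta)
    inter = np.logical_and(loose, tight).sum()
    union = np.logical_or(loose, tight).sum()
    return float(inter) / max(int(union), 1)

def grid_points(n_per_side: int = 32) -> np.ndarray:
    """Regular n x n grid of normalized point prompts covering the image."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)   # (n*n, 2), values in (0, 1)
```

In a full pipeline, each grid point would be fed to the model as a prompt, the resulting masks filtered by predicted IoU and `stability_score`, and the survivors deduplicated with NMS.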

5. Segment Anything Dataset

Our SA-1B dataset consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks collected with our data engine. We compare SA-1B with existing datasets and analyze mask quality and properties. We are releasing SA-1B to aid future development of foundation models for computer vision. We note that SA-1B will be released under a favorable license agreement for certain research uses and with protections for researchers.

Images: We licensed a new set of 11 million images from a provider that works directly with photographers. These images are high resolution (3300×4950 pixels on average), and the resulting data size can present accessibility and storage challenges. Therefore, we release downsampled images with their shortest side set to 1500 pixels. Even after downsampling, our images are significantly higher resolution than many existing vision datasets (e.g., COCO [66] images are ~480×640 pixels). Note that most models today operate on much lower-resolution inputs. Faces and vehicle license plates have been blurred in the released images.

Mask quality: To estimate mask quality, we randomly sampled 500 images (~50k masks) and asked our professional annotators to improve the quality of all masks in these images. Annotators did so using our model and pixel-precise "brush" and "eraser" editing tools. This procedure yields pairs of automatically predicted and professionally corrected masks. We computed the IoU between each pair and found that 94% of pairs have greater than 90% IoU (and 97% of pairs have greater than 75% IoU). For comparison, prior work estimates inter-annotator consistency at 85-91% IoU [44, 60]. Our experiments in §7 confirm by human ratings that mask quality is high relative to a variety of datasets, and that training our model on the automatic masks is nearly as good as using all masks produced by the data engine.

Mask properties: In Figure 5 we plot the spatial distribution of object centers in SA-1B compared with the largest existing segmentation datasets. Common photographer biases are present in all datasets. We observe that SA-1B has greater coverage of image corners, while COCO [66] and Open Images V5 [60] have a more pronounced center bias. In Figure 6 (legend) we compare these datasets by size. SA-1B has 11× more images and 400× more masks than the second largest dataset, Open Images. On average, it has 36× more masks per image than Open Images. The closest dataset in this respect, ADE20K, still has 3.5× fewer masks per image. Figure 6 (left) plots the masks-per-image distribution. Next, we look at image-relative mask size (the square root of the mask area divided by the image area) in Figure 6 (middle). As expected, since our dataset has more masks per image, it also tends to include a greater percentage of small and medium-sized masks. Finally, to analyze shape complexity, we look at mask concavity (1 minus mask area divided by the area of the mask's convex hull) in Figure 6 (right). Since shape complexity is correlated with mask size, we control for the datasets' mask size distributions by first performing stratified sampling from binned mask sizes. We observe that the concavity distribution of our masks is broadly similar to that of the other datasets.
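
The two per-mask statistics used in Figure 6 can be computed roughly as follows. This is a hypothetical helper built on SciPy; the handling of degenerate masks and the pixel-level convex hull approximation are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull

def relative_size(mask: np.ndarray) -> float:
    """Image-relative mask size: sqrt(mask area / image area)."""
    return float(np.sqrt(mask.sum() / mask.size))

def concavity(mask: np.ndarray) -> float:
    """Mask concavity: 1 - mask area / convex hull area (clamped to [0, 1])."""
    ys, xs = np.nonzero(mask)
    if len(xs) < 3:
        return 0.0                           # degenerate mask: treat as fully convex
    points = np.stack([xs, ys], axis=-1).astype(float)
    hull_area = ConvexHull(points).volume    # in 2-D, .volume is the enclosed area
    if hull_area <= 0:
        return 0.0
    return float(min(max(1.0 - mask.sum() / hull_area, 0.0), 1.0))
```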

6. Segment Anything RAI Analysis

Next, we perform a Responsible AI (RAI) analysis of our work by investigating potential fairness concerns and biases when using SA-1B and SAM. We focus on the geographic and income distribution of SA-1B and the fairness of SAM across protected attributes of people. We also provide dataset, data annotation, and model cards in §F.

Geographic and income representation
We infer the countries the images were photographed in using standard methods. In Figure 7 we visualize the per-country image counts in SA-1B (left) and the 50 countries with the most images (right). We note that the top three countries are from different parts of the world. SA-1B has a substantially higher percentage of images from Europe, Asia & Oceania, and middle-income countries; all regions, including Africa, have at least 28 million masks; and the average number of masks per image is fairly consistent across regions and income levels (94-108 per image).

Fairness in segmenting people

We investigated potential fairness concerns across perceived gender presentation, perceived age group, and perceived skin tone by measuring differences in SAM's performance between groups. We use the More Inclusive Annotations for People (MIAP) [87] dataset for gender presentation and age, and a proprietary dataset for skin tone. SAM performs similarly across groups, with its best performance on those perceived to be older (albeit with a large confidence interval).

7. Zero-Shot Transfer Experiments

In this section we present zero-shot transfer experiments with SAM. We consider five tasks, four of which differ significantly from the promptable segmentation task used to train SAM. These experiments evaluate SAM on datasets and tasks not seen during training (following the usage of "zero-shot transfer" in CLIP). The datasets may include novel image distributions, such as underwater or egocentric images, that do not appear in SA-1B.

Our experiments begin by testing the core goal of promptable segmentation: producing a valid mask from any prompt. We emphasize the challenging scenario of a single foreground point prompt, since it is more likely to be ambiguous than other, more specific prompts. Next, we prompt SAM to (1) perform edge detection, (2) segment everything, i.e., object proposal generation, (3) segment detected objects, i.e., instance segmentation, and (4), as a proof of concept, segment objects from free-form text. These four tasks differ significantly from the promptable segmentation task SAM was trained on and are implemented via prompt engineering. Our experiments conclude with an ablation study.

(1) SAM uses an MAE [47] pre-trained ViT-H [33] image encoder.
(2) SAM is trained on SA-1B, noting that this dataset includes only masks generated automatically by the final stage of the data engine.

1. Zero-Shot Single Point Valid Mask Evaluation

Task: We evaluate segmenting an object from a single foreground point. This task is ill-posed, since one point can refer to multiple objects. Ground truth masks in most datasets do not enumerate all possible masks, which can make automatic metrics unreliable. Therefore, we supplement the standard mIoU metric (i.e., the mean of all IoUs between predicted and ground truth masks) with a human study in which annotators rate mask quality from 1 (nonsense) to 10 (pixel-perfect).

By default, we sample points from the "center" of the ground truth mask (at a maximal value of the mask's interior distance transform), following the standard evaluation protocol in interactive segmentation, and by default we evaluate only the model's most confident mask. The baselines are all single-mask methods. We compare mainly against RITM [92], a strong interactive segmenter.
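
For concreteness, the "center" point used as the single-click prompt can be computed from the distance transform as in the hypothetical helper below.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def mask_center_point(mask: np.ndarray) -> tuple[int, int]:
    """Return (x, y) of a point with maximal distance to the mask boundary."""
    # Pad with background so pixels touching the image border still see a boundary.
    dist = distance_transform_edt(np.pad(mask, 1))[1:-1, 1:-1]
    y, x = np.unravel_index(int(np.argmax(dist)), dist.shape)
    return int(x), int(y)
```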

Datasets: We use a newly compiled suite of 23 datasets with diverse image distributions for the mIoU evaluation.

The figure above shows point-to-mask evaluation on 23 datasets. (a) Mean IoU of SAM and RITM, the strongest single-point segmenter. Due to ambiguity, a single mask may not match the ground truth; circles show the "oracle" result of the most relevant of SAM's 3 predictions. (b) Per-dataset comparison of annotators' mask quality ratings, from 1 (worst) to 10 (best). All methods use the center of the ground truth mask as the prompt. (c, d) mIoU with varying numbers of points. SAM significantly outperforms prior interactive segmenters with 1 point and is on par with more points. The low absolute mIoU at 1 point is a result of ambiguity.


2. Zero-Shot Object Proposal

Next, we evaluate SAM on the mid-level task of object proposal generation. This task has played an important role in object detection research, serving as an intermediate step in pioneering systems. To generate object proposals, we run a slightly modified version of our automatic mask generation pipeline and output the masks as proposals.
We compute the standard Average Recall (AR) metric on LVIS v1. We focus on LVIS because its large number of categories makes it a challenging test. We compare against a strong baseline implemented as a ViTDet-H detector (i.e., a cascade Mask R-CNN with a ViT-H backbone).

Results: In Table 4 we see, unsurprisingly, that using the detections from ViTDet-H as object proposals (i.e., a method for gaming AR known as DMP [16]) performs best overall. However, SAM does remarkably well on several metrics. Notably, it outperforms ViTDet-H on medium and large objects as well as on rare and common objects. In fact, SAM only underperforms ViTDet-H on small and frequent objects, where ViTDet-H can easily learn LVIS-specific annotation biases since, unlike SAM, it was trained on LVIS. We also compare against an ablated ambiguity-unaware version of SAM, which performs significantly worse than SAM on all AR metrics.

3. Zero-Shot Text-to-Mask

Finally, we consider an even higher-level task: segmenting objects from free-form text. This experiment is a proof of concept of SAM's ability to process text prompts. While we used the exact same SAM in all prior experiments, for this one SAM's training procedure is modified to make it text-aware, but in a way that does not require new text annotations. Specifically, for each manually collected mask with area larger than 100², we extract the CLIP image embedding of its crop. Then, during training, we prompt SAM with the extracted CLIP image embedding as its first interaction. The key observation is that because CLIP's image embeddings are trained to align with its text embeddings, we can train with image embeddings but use text embeddings at inference. That is, at inference time we run the text through CLIP's text encoder and give the resulting text embedding to SAM as a prompt.
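
A hedged sketch of the train-with-image-embedding / infer-with-text-embedding trick using the OpenAI `clip` package is shown below. Feeding the embedding to SAM as a prompt token is only indicated by a comment, since that wiring is specific to the paper's modified model.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-L/14", device=device)

def training_prompt_embedding(mask_crop_pil) -> torch.Tensor:
    """During training: embed a crop around the mask with CLIP's image encoder."""
    image = preprocess(mask_crop_pil).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = clip_model.encode_image(image)
    return emb / emb.norm(dim=-1, keepdim=True)

def inference_prompt_embedding(text: str) -> torch.Tensor:
    """At inference: embed free-form text; it lives in the same space as image embeddings."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        emb = clip_model.encode_text(tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

# Either embedding would then be passed to SAM as an (assumed) extra prompt token;
# that plumbing belongs to the modified model described above and is not shown here.
```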

Results: We show qualitative results in the figure above. SAM can segment objects based on simple text prompts such as "wheels" as well as phrases such as "beaver tooth grille". When SAM fails to pick the right object from a text prompt alone, an additional point prompt often resolves it, similar to [PhraseClick].

4. Zero-Shot Edge Detection

Similar in spirit to the above; not covered in this post.

5. Zero-Shot Instance Segmentation

Similar in spirit to the above; not covered in this post.

8. Discussion

1. Foundation models

Pretrained models have been adapted to downstream tasks since the early days of machine learning. In recent years, with a growing emphasis on scale, this paradigm has become increasingly important, and such models have recently been (re)branded "foundation models": models that are "trained on broad data at scale and adaptable to a wide range of downstream tasks" [8]. Our work aligns well with this definition, although we note that a foundation model for image segmentation is inherently limited in scope, since it covers an important, yet fractional, subset of computer vision. We also contrast one aspect of our approach with [8] ("On the Opportunities and Risks of Foundation Models"), which emphasizes the role of self-supervised learning in foundation models. While our model is initialized with a self-supervised technique (MAE), the vast majority of its capability comes from large-scale supervised training. In cases where a data engine can scale the available annotations, as in ours, supervised training provides an effective solution.

2. Composition

Pretrained models can power new capabilities, even ones unimagined at training time. A prominent example is how CLIP is used as a component in larger systems such as DALL·E. Our goal is to make this kind of composition straightforward with SAM, which we pursue by requiring SAM to predict a valid mask for a wide range of segmentation prompts. The effect is to create a reliable interface between SAM and other components. For example, MCC can readily use SAM to segment an object of interest and achieve strong generalization for 3D reconstruction from a single RGB-D image. In another example, SAM can be prompted with gaze points detected by a wearable device, enabling new applications. Because SAM generalizes to new domains such as egocentric images, such systems can work without additional training.

3. Limitations

While SAM performs well in general, it is not perfect. It can miss fine structures, sometimes hallucinates small disconnected components, and does not produce boundaries as crisply as more computationally intensive methods that "zoom in". In general, we expect dedicated interactive segmentation methods to outperform SAM when many points are provided. Unlike those methods, SAM is designed for generality and breadth of use rather than high-IoU interactive segmentation. Moreover, SAM can process prompts in real time, but its overall performance is not real-time when a heavy image encoder is used. Our foray into the text-to-mask task is exploratory and not entirely robust, although we believe it can be improved with more effort. While SAM can perform many tasks, it is unclear how to design simple prompts that implement semantic and panoptic segmentation. Finally, there are domain-specific tools, such as [ilastik: interactive machine learning for (bio)image analysis], that we expect to outperform SAM in their respective domains.

Summary

The Segment Anything project is an attempt to lift image segmentation into the era of foundation models. Our principal contributions are a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that make this leap possible. Whether SAM achieves the status of a foundation model remains to be seen from how it is used in the community, but regardless, we expect the perspective of this work, the release of over 1B masks, and our promptable segmentation model to help pave the way.
 

There is a lot here to digest, so this is just a first pass... to be continued.

Origin blog.csdn.net/qq_38915354/article/details/130068960