Segment Anything reading notes

Segment Anything

Summary

Demo link
Paper link
Code link
The work has become very popular recently; its official repository already has more than 10,000 stars on GitHub.

  • Large-scale dataset (more than 1 billion masks)
  • Supports zero-shot transfer to new segmentation tasks

Introduction

The success of image segmentation depends on three components: task, model, and data. The authors ask three questions:

  1. What tasks can achieve zero-shot generalization?
  2. What is the corresponding model architecture?
  3. What kind of data can support this task and model?
[Figure 1: the three components of the project — (a) the promptable segmentation task, (b) the Segment Anything Model (SAM), (c) the data engine and resulting dataset]

Task

In NLP, and more recently in computer vision, foundation models are a promising development: they can perform zero-shot and few-shot learning on new datasets and tasks by using "prompting" techniques. Inspired by this line of work, the authors propose the promptable segmentation task, whose goal is to return a valid segmentation mask given any segmentation prompt (Figure 1a). A prompt simply specifies what to segment in the image; for example, it can include spatial or textual information identifying an object. The requirement of a valid output mask means that even when the prompt is ambiguous and could refer to multiple objects (e.g., a point on a shirt may indicate the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. The promptable segmentation task is used as a pre-training objective, and general downstream segmentation tasks are solved via prompt engineering.
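As a concrete illustration, the released segment_anything package exposes exactly this prompt-then-mask interface. The sketch below is a minimal example, assuming the official SamPredictor API and a downloaded ViT-H checkpoint; the image path and point coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (checkpoint path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy image encoder once

# One ambiguous foreground point (e.g. on a shirt): ask for multiple masks.
point = np.array([[500, 375]])
label = np.array([1])  # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # return several valid candidate masks plus scores
)
best_mask = masks[np.argmax(scores)]  # pick the mask with highest predicted IoU
```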

Model

The promptable segmentation task and the goal of real-world use place constraints on the model architecture. In particular, the model must support flexible prompts, must compute masks in amortized real time to allow interactive use, and must be able to handle ambiguity. Surprisingly, a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds the prompts, and the two sources of information are combined in a lightweight mask decoder that predicts segmentation masks. This model is called the Segment Anything Model, or SAM (Figure 1b). By separating SAM into an image encoder and a fast prompt encoder/mask decoder, the same image embedding can be reused (and its cost amortized) across different prompts.
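This decoupling is visible in how the predictor is used in practice: the expensive image encoder runs once per image, and each subsequent prompt only touches the lightweight prompt encoder and mask decoder. A minimal sketch, continuing the predictor and image from the earlier example (the coordinates are made up):

```python
import numpy as np

predictor.set_image(image)  # expensive: ViT image encoder, run once per image
prompts = [
    {"point_coords": np.array([[120,  80]]), "point_labels": np.array([1])},
    {"point_coords": np.array([[400, 300]]), "point_labels": np.array([1])},
    {"box": np.array([50, 60, 320, 240])},   # a box prompt in XYXY pixel coords
]
for p in prompts:
    masks, scores, _ = predictor.predict(multimask_output=True, **p)
    # each call only runs the light prompt encoder + mask decoder
```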

Given a precomputed image embedding, the prompt encoder and mask decoder predict a mask from a prompt in about 50 ms in a web browser. The focus is on point, box, and mask prompts, with initial results also shown for free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, which lets it handle ambiguous cases such as the shirt vs. person example naturally.
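Mask prompts are typically used for iterative refinement: the low-resolution logits returned by one prediction can be fed back as the mask prompt of the next call, together with additional points. A sketch of that pattern, following the official predictor example notebook (coordinates are placeholders, and `predictor` is the one set up earlier):

```python
import numpy as np

# First pass: a single ambiguous point, ask for multiple candidate masks.
point = np.array([[500, 375]])
label = np.array([1])
masks, scores, logits = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)

# Second pass: feed the best low-res mask back as a dense (mask) prompt,
# plus one extra background click to disambiguate.
mask_input = logits[np.argmax(scores), :, :][None, :, :]  # shape (1, 256, 256)
points = np.array([[500, 375], [520, 600]])
labels = np.array([1, 0])                 # 1 = foreground, 0 = background
masks, _, _ = predictor.predict(
    point_coords=points, point_labels=labels,
    mask_input=mask_input, multimask_output=False
)
```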

Data engine

To achieve strong generalization to new data distributions, it was necessary to train SAM on a large and diverse set of masks, beyond any pre-existing segmentation dataset. While a typical approach for foundation models is to scrape data online, masks are not naturally abundant, so an alternative strategy is needed. The solution is to build a "data engine", i.e., to co-develop the model with model-in-the-loop dataset annotation. The data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM automatically generates masks for a subset of objects by being prompted with likely object locations, while annotators focus on annotating the remaining objects, which increases mask diversity. In the final stage, SAM is prompted with a regular grid of foreground points, yielding on average about 100 high-quality masks per image.

Segmentation task

[Figure 3: valid masks produced by SAM from a single ambiguous point prompt]
The authors take inspiration from NLP, where the next-token prediction task is used for foundation model pre-training and diverse downstream tasks are solved via prompt engineering. To build a foundation model for segmentation, the goal is to define a task with analogous capabilities.

Task
The idea of a prompt is first translated from NLP to segmentation: a prompt can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task, then, is to return a valid segmentation mask given any prompt. The requirement of a "valid" mask simply means that, even when the prompt is ambiguous and could refer to multiple objects (recall the shirt vs. person example, see Figure 3), the output should be a reasonable mask for at least one of those objects. This requirement is similar to expecting a language model to output a coherent response to an ambiguous prompt. This task is chosen because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.
Pre-training
The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model's mask predictions against the ground truth. This approach is adapted from interactive segmentation, although unlike interactive segmentation, whose goal is to eventually predict a valid mask after enough user input, the goal here is to always predict a valid mask for any prompt, even when the prompt is ambiguous. This ensures the pre-trained model is effective in use cases that involve ambiguity, including the automatic annotation required by the data engine. The authors note that performing well on this task is challenging and requires specialized choices of model and training loss.
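A rough sketch of the prompt-simulation idea (not the authors' actual training code): during pre-training, a prompt such as a random foreground point can be sampled from each ground-truth mask, the model predicts masks for that prompt, and a segmentation loss is computed against the ground truth.

```python
import numpy as np

def sample_point_prompt(gt_mask: np.ndarray, rng: np.random.Generator):
    """Sample a random foreground pixel of a binary GT mask as a point prompt."""
    ys, xs = np.nonzero(gt_mask)
    i = rng.integers(len(ys))
    return np.array([[xs[i], ys[i]]]), np.array([1])  # (x, y) coords, label 1 = fg

# In a training loop one would predict masks for this simulated prompt and
# backpropagate the mask loss (see the loss sketch later in these notes)
# against gt_mask; subsequent prompts in the sequence could be sampled from
# the error region between the prediction and the ground truth.
```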
Zero-shot transfer
Intuitively, the pre-training task endows the model with the ability to respond appropriately to any prompt at inference time, so downstream tasks can be solved by engineering appropriate prompts. For example, given a bounding box detector for cats, cat instance segmentation can be solved by feeding the detector's box outputs as prompts to the model. In general, a wide range of practical segmentation tasks can be cast as prompting. In addition to automatic dataset labeling, five different example tasks are explored in the experiments.
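The box-prompt composition mentioned above can be written directly against the released predictor: the boxes below stand in for the output of any off-the-shelf detector (their values are made up), and each one is turned into an instance mask. This continues the `predictor` and `image` from the earlier sketch.

```python
import numpy as np

# Hypothetical detector output: two detections in XYXY pixel coordinates.
boxes = np.array([[ 75,  50, 410, 330],
                  [430, 120, 620, 400]])

predictor.set_image(image)
instance_masks = []
for box in boxes:
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    instance_masks.append(masks[0])   # one mask per detected box
```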
Related tasks
Segmentation is a broad field: there is interactive segmentation, edge detection, superpixelization, object proposal generation, foreground segmentation, semantic segmentation, instance segmentation, panoptic segmentation, and more.
The goal of the promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks through prompt engineering. This ability is a form of task generalization. Note that this differs from previous multi-task segmentation systems: in a multi-task system, a single model performs a fixed set of tasks, such as joint semantic, instance, and panoptic segmentation, but the training and test tasks are the same. In this work, an important distinction is that a model trained for promptable segmentation can perform a new, different task at inference time as a component of a larger system; for example, to perform instance segmentation, a promptable segmentation model is combined with an existing object detector.
Discussion

Prompting and composition are powerful tools that allow a single model to be used in extensible ways, potentially accomplishing tasks unknown at the time the model was designed. This approach is similar to how other foundation models are used, for example CLIP as the text-image alignment component of the DALL·E image generation system. The authors anticipate that composable system design, driven by techniques such as prompt engineering, will enable broader applications than systems trained exclusively for a fixed set of tasks. It is also interesting to compare promptable segmentation and interactive segmentation from a compositional perspective: while interactive segmentation models are designed with human users in mind, a model trained for promptable segmentation can also be composed into a larger algorithmic system, as demonstrated in this work.

Segment Anything Model

[Figure 4: Segment Anything Model (SAM) overview — image encoder, prompt encoder, and mask decoder]
SAM has three components, as shown in Figure 4: an image encoder, a flexible prompt encoder, and a fast mask decoder. The model builds on Transformer vision models, with specific trade-offs made for real-time performance.

  • Image encoder: an MAE pre-trained Vision Transformer (ViT).
  • Prompt encoder: handles sparse prompts (points, boxes, text) and dense prompts (masks). Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type; free-form text uses an off-the-shelf text encoder from CLIP. Dense prompts (i.e., masks) are embedded with convolutions and summed element-wise with the image embedding.
  • Mask decoder: a modified Transformer decoder block followed by a dynamic mask prediction head. The decoder block uses prompt self-attention and cross-attention in both directions (prompt-to-image embedding and vice versa) to update all embeddings. After running two blocks, the image embedding is upsampled, and an MLP maps the output token to a dynamic linear classifier that computes the mask foreground probability at each image location.
  • Resolving ambiguity: for a single prompt the model predicts multiple output masks, along with a confidence score (i.e., estimated IoU) for each mask.
  • Loss function: a linear combination of focal loss and dice loss (see the sketch after this list).
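A minimal sketch of that mask loss in PyTorch, assuming mask logits of shape (N, H, W) and binary targets of the same shape. The paper describes a linear combination of focal and dice loss; the 20:1 focal-to-dice weighting below follows its training details, but treat the exact hyperparameters as an assumption.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss on sigmoid probabilities, averaged over the batch."""
    probs = torch.sigmoid(logits).flatten(1)
    targets = targets.flatten(1)
    inter = (probs * targets).sum(-1)
    union = probs.sum(-1) + targets.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss computed from per-pixel BCE."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def mask_loss(logits, targets):
    # 20:1 focal:dice weighting, as described in the paper's training details.
    return 20.0 * focal_loss(logits, targets) + dice_loss(logits, targets)
```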

Segment Anything Data Engine

Assisted-manual phase
In the first stage, similar to classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground/background object points in a browser-based interactive segmentation tool powered by SAM. Masks could be refined with pixel-precise brush and eraser tools. Model-assisted annotation ran in real time directly in the browser (using precomputed image embeddings), enabling a truly interactive experience. No semantic constraints were imposed on labeled objects: annotators freely labeled both "stuff" and "things". Annotators were asked to label objects they could name or describe, but these names and descriptions were not collected. They were asked to label objects in order of prominence and encouraged to move on to the next image once a mask took more than 30 seconds to annotate.
At the beginning of this stage, SAM was trained with public segmentation datasets. After enough data had been annotated, SAM was retrained using only the newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total the model was retrained 6 times. As the model improved, the average annotation time per mask dropped from 34 seconds to 14 seconds. Note that 14 seconds is 6.5× faster than mask annotation for COCO and only 2× slower than bounding-box labeling with extreme points. As SAM improved, the average number of masks per image increased from 20 to 44. Overall, 4.3 million masks were collected from 120,000 images in this stage.
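The precomputed-embedding trick that makes browser annotation real-time can be reproduced with the released predictor: SamPredictor.get_image_embedding() returns the encoder output, which can be cached so later prompt queries never rerun the ViT. A small sketch, continuing the earlier `predictor` and `image` (file names are placeholders):

```python
import numpy as np

predictor.set_image(image)                                  # heavy ViT forward pass
embedding = predictor.get_image_embedding().cpu().numpy()   # typically shape (1, 256, 64, 64)
np.save("example_embedding.npy", embedding)                 # cache for later prompt queries
```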
Semi-automatic stage
In this stage, the goal was to increase mask diversity to improve the model's ability to segment anything. To focus annotators on less prominent objects, confident masks were first detected automatically. Annotators were then shown images pre-filled with these masks and asked to annotate any additional unannotated objects. To detect confident masks, a bounding box detector was trained on all first-stage masks using a generic "object" category. In this stage an additional 5.9 million masks were collected across 180,000 images (10.2 million masks in total). As in the first stage, the model was periodically retrained on newly collected data (5 times). The average annotation time per mask went back up to 34 seconds (excluding the automatic masks), as these objects were more challenging to label. The average number of masks per image increased from 44 to 72 (including the automatic masks).
Fully automatic stage
In the final stage, annotation was fully automatic. This was feasible because of two major improvements to the model. First, by the start of this stage, enough masks had been collected to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage an ambiguity-aware model had been developed that can predict valid masks even in ambiguous cases. Specifically, the model is prompted with a 32×32 regular grid of points and, for each point, predicts a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, the model returns the subpart, the part, and the whole object. The model's IoU prediction module is used to select confident masks; moreover, only stable masks are kept (a mask is considered stable if thresholding its probability map at 0.5 − δ and at 0.5 + δ produces similar masks). Finally, after selecting confident and stable masks, non-maximum suppression (NMS) is applied to filter duplicates. To further improve the quality of smaller masks, multiple overlapping zoomed-in image crops were also processed.
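This fully automatic pipeline is what the released SamAutomaticMaskGenerator implements. The sketch below uses its real parameter names; the specific values mirror the description above (32×32 point grid, predicted-IoU and stability filtering, NMS, zoomed-in crops) but should be treated as illustrative rather than the exact settings used to build the dataset. It reuses the `sam` model and `image` loaded in the first sketch.

```python
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # 32x32 regular grid of point prompts
    pred_iou_thresh=0.88,          # keep only confident masks (predicted IoU)
    stability_score_thresh=0.95,   # keep only stable masks (0.5 +/- delta test)
    box_nms_thresh=0.7,            # NMS to filter duplicate masks
    crop_n_layers=1,               # also run on overlapping zoomed-in crops
    min_mask_region_area=100,      # drop tiny disconnected regions (needs opencv)
)

# Each result is a dict with 'segmentation', 'area', 'bbox', 'predicted_iou',
# 'stability_score', and related fields.
masks = mask_generator.generate(image)
```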

Ablation experiment

[Figure: ablation study results from the paper]

Source: blog.csdn.net/qq_45745941/article/details/130003619