An Interpretation of the Segment Anything Model (SAM) Paper

[Figure 1: the Segment Anything project — (a) promptable segmentation task, (b) Segment Anything Model, (c) data engine and dataset]

I. Introduction

In this work, the authors aim to build a foundation model for image segmentation. That is, they seek to develop a promptable model and pre-train it on a broad dataset using a task that enables strong generalization. With this model, prompt engineering is used to solve a range of downstream segmentation problems on new data distributions.
The success of this plan hinges on three components: tasks, models, and data. To develop them, the authors address the following questions about image segmentation:
1. What tasks can achieve zero-shot generalization?
2. What is the corresponding model architecture?
3. What data can support this task and model?
These questions are intertwined and require a comprehensive solution. The authors first define a promptable segmentation task that is general enough to provide a powerful pre-training objective and to support a wide range of downstream applications. This task requires a model that supports flexible prompting and can output segmentation masks in real time when prompted, to allow interactive use. To train the model, a diverse, large-scale data source is required. Unfortunately, there is no web-scale data source for segmentation; to solve this problem, the authors built a "data engine", i.e., they iterate between using their efficient model to assist data collection and using the newly collected data to improve the model.
**Task:** Foundation models are a promising development in NLP and, more recently, computer vision; they can perform zero-shot and few-shot learning on new datasets and tasks through "prompting" techniques. Inspired by this line of work, the authors propose the promptable segmentation task, whose goal is to return a valid segmentation mask given any segmentation prompt (see Figure 1a). A prompt simply specifies what is to be segmented in the image; for example, a prompt can include spatial or textual information identifying an object. Even if the prompt is ambiguous and may refer to multiple objects (e.g., a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. The promptable segmentation task is used as a pre-training objective, and general downstream segmentation tasks are solved through prompt engineering.
**Model:** The promptable segmentation task and the goal of practical use impose constraints on the model architecture. In particular, the model must support flexible prompts, must compute masks in real time to enable interactive use, and must be ambiguity-aware. Surprisingly, the authors found that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and the two information sources are then combined in a lightweight mask decoder that predicts segmentation masks. The authors call this model the Segment Anything Model (SAM) (see Figure 1b). By splitting SAM into an image encoder and a fast prompt encoder/mask decoder, the same image embedding can be reused (and its cost amortized) across different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in about 50 ms in a web browser. The focus is on point, box, and mask prompts, with initial results also presented for free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, allowing SAM to naturally handle ambiguity such as the shirt vs. person example.
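As a concrete illustration of this amortization, the sketch below assumes the publicly released `segment_anything` Python package and a downloaded ViT-H checkpoint; the image path and click coordinates are purely illustrative.

```python
# A minimal sketch, assuming the public `segment_anything` package and a
# downloaded ViT-H checkpoint; the image path and click are illustrative.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                 # the heavy image encoder runs once here

# Any number of prompts can now reuse the cached embedding cheaply.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),   # one foreground click (x, y)
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # return 3 candidate masks for ambiguity
)
```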

**Data Engine:** To achieve strong generalization to new data distributions, SAM must be trained on a large and diverse set of masks, beyond whatever segmentation datasets already exist. While the typical approach for foundation models is to obtain data online, masks are not naturally abundant, so an alternative strategy is needed. The authors' solution was to build a "data engine", i.e., to co-develop the model with model-in-the-loop dataset annotation (see Figure 1c). The data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM helps annotators annotate masks, similar to a classic interactive segmentation setup. In the second stage, SAM automatically generates masks for a subset of objects when prompted with likely object locations, while annotators focus on annotating the remaining objects, which helps increase mask diversity. In the final stage, SAM is prompted with a regular grid of foreground points, producing an average of roughly 100 high-quality masks per image.
[Figure: example images and masks from the SA-1B dataset]

**Dataset:** The final dataset, SA-1B, includes over 1B masks from 11M licensed and privacy-preserving images (see image above). SA-1B was collected fully automatically using the final stage of the data engine; it has 400× more masks than any existing segmentation dataset, and the masks are verified to be of high quality and diversity. Beyond its use for training SAM to be robust and general, the authors hope SA-1B will become a valuable resource for research aimed at building new foundation models.
**Experiments:** The authors evaluate SAM extensively. First, on a suite of 23 diverse new segmentation datasets, SAM produces high-quality masks from a single foreground point, often only slightly below the manually annotated ground truth. Second, under a zero-shot transfer protocol with prompt engineering, consistently strong quantitative and qualitative results are found on a variety of downstream tasks, including edge detection, object proposal generation, instance segmentation, and an initial exploration of text-to-mask prediction. These results demonstrate that SAM, with out-of-the-box prompt engineering, can solve a variety of tasks involving object and image distributions beyond SAM's training data.

II. Segment Anything Task

The authors draw inspiration from NLP, where next-token prediction is used to pre-train foundation models and various downstream tasks are solved through prompt engineering. To build a foundation model for segmentation, the authors aim to define a task with analogous capabilities.
The task begins by translating the idea of a prompt from NLP to segmentation: a prompt can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task is therefore to return a valid segmentation mask given any prompt. The requirement of a "valid" mask simply means that, even when a prompt is ambiguous and could refer to multiple objects (e.g., recall the shirt vs. person example, see image below), the output should be a reasonable mask for at least one of those objects. This requirement is analogous to expecting a language model to output a coherent response to an ambiguous prompt. This task was chosen because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.
[Figure: a single ambiguous point prompt can correspond to multiple valid masks]
**Pre-training** The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model's mask predictions against the ground truth. This approach is adapted from interactive segmentation, although unlike interactive segmentation, where the aim is to eventually predict a valid mask after enough user input, the aim here is to always predict a valid mask for any prompt, even when the prompt is ambiguous. This ensures that the pre-trained model is effective in use cases involving ambiguity, including the automatic annotation required by the data engine. Performing well on this task is challenging and requires specialized choices of model and training loss.
**Zero-shot transfer** Intuitively, the pre-training task gives the model the ability to respond appropriately to any prompt at inference time, so downstream tasks can be solved by designing suitable prompts. For example, given a bounding-box detector for cats, cat instance segmentation can be solved by feeding the detector's box outputs as prompts to the model. In general, a wide range of practical segmentation tasks can be cast as prompting.
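For example, a hypothetical detector-to-SAM pipeline might look like the following, continuing the `predictor` sketch from the introduction; the box coordinates stand in for any detector's output.

```python
# Continuing the earlier predictor sketch: a hypothetical detector box for a
# cat, passed to SAM as a prompt; the coordinates are illustrative.
import numpy as np

cat_box = np.array([120, 80, 640, 540])    # (x1, y1, x2, y2) from any detector
masks, _, _ = predictor.predict(           # `predictor` from the earlier sketch
    box=cat_box,
    multimask_output=False,                # a box prompt is rarely ambiguous
)
cat_mask = masks[0]                        # boolean H×W instance mask
```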
**Related tasks** Segmentation is a very broad field that includes interactive segmentation, edge detection, superpixelization, object proposal generation, foreground segmentation, semantic segmentation, instance segmentation, panoptic segmentation, and more. The goal of the promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks via prompt engineering. This capability is a form of task generalization. Note that this is different from previous work on multi-task segmentation systems: in a multi-task system, a single model performs a fixed set of tasks, such as joint semantic, instance, and panoptic segmentation, and the training and test tasks are the same. The point of this paper is that a model trained for promptable segmentation can be used as a component in a larger system to perform a new, different task at inference time; for example, instance segmentation is performed by combining the promptable segmentation model with an existing object detector.

III. Segment Anything Model

SAM has three components, shown in the figure below: an image encoder, a flexible prompt encoder, and a fast mask decoder. It is built on Transformer vision models with specific trade-offs for (amortized) real-time performance.
[Figure: SAM architecture — image encoder, prompt encoder, and lightweight mask decoder]
**Image encoder** Motivated by scalability and powerful pre-training methods, the image encoder is an MAE pre-trained Vision Transformer (ViT) minimally adapted to handle high-resolution inputs. The image encoder runs once per image and can be applied before prompting the model.
**Prompt encoder** Two sets of prompts are considered: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type, and free-form text is represented with an off-the-shelf text encoder from CLIP. Dense prompts (i.e., masks) are embedded with convolutions and summed element-wise with the image embedding.
**Mask decoder** The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. The design employs a modified Transformer decoder block followed by a dynamic mask prediction head. The modified decoder block updates all embeddings using prompt self-attention and cross-attention in both directions (prompt-to-image and image-to-prompt). After running two such blocks, the image embedding is upsampled and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
**Resolving ambiguity** With a single output, the model would average multiple valid masks when given an ambiguous prompt. To address this, the model is modified to predict multiple output masks for a single prompt (see image below). Three mask outputs are sufficient to cover most common cases (nested masks are usually at most three levels deep: whole, part, and subpart). During training, and so that masks can be ranked, the model predicts a confidence score (i.e., an estimated IoU) for each mask.

[Figure: three valid masks (whole, part, subpart) predicted for a single ambiguous point prompt]
**Efficiency** The entire model is designed largely with efficiency in mind. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in about 50 ms. This runtime enables seamless, real-time interactive prompting of the model.
**Loss and training** Mask prediction is supervised with a linear combination of focal loss and dice loss. The promptable segmentation task is trained using a mixture of geometric prompts.
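A minimal sketch of such a combined mask loss is shown below; the exact weighting used in the paper is not reproduced here, so `w_focal` and `w_dice` are placeholder values.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Per-pixel sigmoid focal loss over mask logits (illustrative defaults)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss over whole masks; logits and target are (B, H, W)."""
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def mask_loss(logits, target, w_focal=20.0, w_dice=1.0):
    # A fixed linear combination; these weights are placeholders, not
    # necessarily the paper's exact setting.
    return w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target)
```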

IV. Segment Anything Data Engine

Since segmentation masks are not abundant on the Internet, the authors built a data engine to collect the 1.1B-mask dataset SA-1B. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage mixing automatically predicted masks with model-assisted annotation, and (3) a fully automatic stage in which the model generates masks without annotator input.
**Assisted-manual stage** In the first stage, similar to classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground/background object points in a browser-based interactive segmentation tool powered by SAM. Masks could be refined with pixel-precise "brush" and "eraser" tools. Model-assisted annotation runs in real time directly in the browser (using precomputed image embeddings), enabling a truly interactive experience. The authors imposed no semantic constraints on labeled objects, and annotators were free to label both "stuff" and "things". Annotators were asked to label objects they could name or describe, but these names and descriptions were not collected. Annotators were asked to label objects in order of prominence and were encouraged to move on to the next image once a mask took more than 30 seconds to annotate.
At the beginning of this stage, SAM was trained on public segmentation datasets. After sufficient data had been annotated, SAM was retrained using only the newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H, and other architectural details evolved; in total the model was retrained 6 times. As the model improved, the average annotation time per mask decreased from 34 seconds to 14 seconds; 14 seconds is 6.5× faster than mask annotation for COCO and only 2× slower than bounding-box labeling with extreme points. As SAM improved, the average number of masks per image increased from 20 to 44. In total, 4.3M masks were collected from 120k images in this stage.
**Semi-automatic stage** In this stage, the goal is to increase mask diversity in order to improve the model's ability to segment anything. To focus annotators on less salient objects, confident masks are first detected automatically. Annotators were then shown images pre-populated with these masks and asked to annotate any additional unannotated objects. To detect confident masks, a bounding-box detector was trained on all first-stage masks using a generic "object" category. In this stage an additional 5.9M masks were collected across 180k images (10.2M masks in total). As in the first stage, the model was periodically retrained (5 times) on newly collected data. The average annotation time per mask went back up to 34 seconds (excluding automatic masks) because these objects are harder to label. The average number of masks per image increased from 44 to 72 (including automatic masks).
**Fully automatic stage** In the final stage, annotation is fully automatic. This became feasible thanks to two major enhancements to the model. First, by the start of this stage, enough masks had been collected (including the diverse masks from the previous stage) to greatly improve the model. Second, by this stage the ambiguity-aware model had been developed, which can predict valid masks even when a prompt is ambiguous. Specifically, the model is prompted with a 32×32 regular grid of points and, for each point, predicts a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, the model returns the subpart, the part, and the whole object. The model's IoU prediction module is used to select confident masks; in addition, only stable masks are kept (a mask is considered stable if thresholding its probability map at 0.5−δ and 0.5+δ yields similar masks). Finally, after selecting the confident and stable masks, non-maximum suppression (NMS) is applied to filter duplicates. To further improve the quality of smaller masks, multiple overlapping zoomed-in image crops are also processed.
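The stability criterion can be sketched as follows, working in logit space where the 0.5 probability cutoff corresponds to logit 0; the function name and the default δ are illustrative rather than taken verbatim from the released code, where a similar test is part of the automatic mask generation pipeline.

```python
import torch

def stability_score(mask_logits: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Stability of each mask under a small threshold shift.

    `mask_logits` is (N, H, W); logit 0 corresponds to probability 0.5. Because
    the mask binarized at the higher threshold is a subset of the one binarized
    at the lower threshold, their IoU is simply the ratio of their areas.
    """
    high = (mask_logits > delta).flatten(1).sum(-1).float()   # area at 0.5 + δ
    low = (mask_logits > -delta).flatten(1).sum(-1).float()   # area at 0.5 − δ
    return high / low.clamp(min=1)

# Keep only masks whose stability exceeds some cutoff (value illustrative):
# keep = stability_score(mask_logits) > 0.95
```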

V. Network Structure Details

**Image encoder** In general, the image encoder can be any network that outputs a C×H×W image embedding. Motivated by scalability and strong pre-training, the authors use an MAE pre-trained Vision Transformer (ViT) with minimal adaptation to handle high-resolution inputs, specifically a ViT-H/16 with 14×14 windowed attention and four equally spaced global attention blocks. The image encoder's output is a 16× downscaled embedding of the input image. Since the runtime goal is to process each prompt in real time, a large FLOP budget for the image encoder is affordable, because its FLOPs are spent only once per image rather than once per prompt.
Following standard practice, an input resolution of 1024×1024 is used, obtained by rescaling the image and padding the shorter side. The image embedding is therefore 64×64. To reduce the channel dimension, a 1×1 convolution is used to obtain 256 channels, followed by a 3×3 convolution that keeps 256 channels. Each convolution is followed by layer normalization.
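A sketch of this channel-reduction "neck" in PyTorch might look as follows, assuming a ViT-H output dimension of 1280; the class names and the `LayerNorm2d` helper are illustrative.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """Channel-wise layer norm for (B, C, H, W) tensors."""
    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(1, keepdim=True)
        var = (x - mu).pow(2).mean(1, keepdim=True)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

class EncoderNeck(nn.Module):
    """Reduces the ViT output (B, C_vit, 64, 64) to a (B, 256, 64, 64) embedding:
    a 1×1 conv, then a 3×3 conv, each followed by layer normalization."""
    def __init__(self, in_channels: int = 1280, out_channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.norm1 = LayerNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.norm2 = LayerNorm2d(out_channels)

    def forward(self, x):
        return self.norm2(self.conv2(self.norm1(self.conv1(x))))
```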

**Prompt encoder** Sparse prompts are mapped to 256-dimensional embedding vectors. A point is represented as the sum of a positional encoding of the point's location and one of two learned embeddings indicating whether the point is foreground or background. A box is represented by an embedding pair: (1) the positional encoding of its top-left corner summed with a learned embedding representing "top-left corner", and (2) the same structure but with a learned embedding representing "bottom-right corner". Finally, to represent free-form text, the text encoder from CLIP is used (in general, any text encoder could be used). Dense prompts (i.e., masks) have a spatial correspondence with the image. The mask is input at a resolution 4× lower than the input image and then downscaled by an additional 4× using two 2×2, stride-2 convolutions with output channels 4 and 16, respectively. A final 1×1 convolution maps the channel dimension to 256. Each layer is separated by GELU activations and layer normalization. The mask embedding is then added element-wise to the image embedding. If there is no mask prompt, a learned embedding representing "no mask" is added at every image embedding position.
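The dense (mask) prompt path could be sketched as below, reusing the `LayerNorm2d` helper from the encoder-neck sketch above; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class MaskPromptEncoder(nn.Module):
    """Downscales a 256×256 mask prompt to 64×64 with two stride-2 convs
    (4 then 16 channels), projects it to 256 channels, and adds it to the
    image embedding; a learned "no mask" embedding is used otherwise."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=2, stride=2),
            LayerNorm2d(4), nn.GELU(),        # LayerNorm2d from the earlier sketch
            nn.Conv2d(4, 16, kernel_size=2, stride=2),
            LayerNorm2d(16), nn.GELU(),
            nn.Conv2d(16, embed_dim, kernel_size=1),
        )
        self.no_mask = nn.Parameter(torch.zeros(1, embed_dim, 1, 1))

    def forward(self, image_embedding, mask_prompt=None):
        # image_embedding: (B, 256, 64, 64); mask_prompt: (B, 1, 256, 256) or None
        dense = self.down(mask_prompt) if mask_prompt is not None else self.no_mask
        return image_embedding + dense
```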

**Lightweight mask decoder** This module efficiently maps the image embedding and a set of prompt embeddings to an output mask. To combine these inputs, inspiration is taken from Transformer segmentation models, and a standard Transformer decoder is modified. Before applying the decoder, a learned output token embedding is first inserted into the set of prompt embeddings; it is read out at the decoder's output. For simplicity, these embeddings (excluding the image embedding) are collectively referred to as "tokens".

[Figure: details of the lightweight mask decoder]

The decoder design is shown above. Each decoder layer performs four steps: (1) self-attention on the tokens, (2) cross-attention from the tokens (as queries) to the image embedding, (3) a point-wise MLP update of each token, and (4) cross-attention from the image embedding (as queries) to the tokens. The last step updates the image embedding with the prompt information. During cross-attention, the image embedding is treated as a collection of 64² 256-dimensional vectors. Each self/cross-attention layer and MLP has a residual connection, layer normalization, and dropout during training. The next decoder layer takes the updated tokens and the updated image embedding from the previous layer. A two-layer decoder is used.
To ensure that the decoder has access to critical geometric information, positional encodings are added to the image embedding whenever it participates in an attention layer. In addition, the entire set of original prompt tokens (including their positional encodings) is re-added to the updated tokens whenever they participate in an attention layer. This allows a strong dependence on both the geometric locations and the types of the prompt tokens.
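A simplified sketch of one such decoder layer is given below; it omits the positional-encoding re-injection, dropout, and the attention-channel downscaling described here, and only illustrates the four-step structure.

```python
import torch.nn as nn

class TwoWayDecoderLayer(nn.Module):
    """One decoder layer with the four steps described above (simplified)."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_dim: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2i_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, tokens, image_emb):
        # tokens: (B, N_tokens, 256); image_emb: (B, 64*64, 256)
        tokens = self.norms[0](tokens + self.self_attn(tokens, tokens, tokens)[0])          # (1) token self-attention
        tokens = self.norms[1](tokens + self.t2i_attn(tokens, image_emb, image_emb)[0])     # (2) tokens -> image
        tokens = self.norms[2](tokens + self.mlp(tokens))                                   # (3) point-wise MLP
        image_emb = self.norms[3](image_emb + self.i2t_attn(image_emb, tokens, tokens)[0])  # (4) image -> tokens
        return tokens, image_emb
```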
After running the decoder, the updated image embedding is upsampled 4× with two transposed convolutions (it is then 4× smaller than the input image). The tokens then attend to the image embedding once more, and the updated output token embedding is passed to a small 3-layer MLP that outputs a vector matching the channel dimension of the upscaled image embedding. Finally, a mask is predicted as the spatially point-wise product between the upscaled image embedding and the MLP's output. The Transformer uses an embedding dimension of 256. The Transformer MLP blocks have a large inner dimension of 2048, but the MLP is applied only to the relatively few (rarely more than 20) prompt tokens. However, in the cross-attention layers with the 64×64 image embedding, the channel dimension of the queries, keys, and values is halved to 128 for computational efficiency. All attention layers use 8 heads. The transposed convolutions that upscale the output image embedding are 2×2 with stride 2, have output channel sizes of 64 and 32, and use GELU activations; they are separated by layer normalization.
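The upsampling and dot-product mask prediction can be sketched as follows (layer normalization between the transposed convolutions is omitted for brevity; shapes assume a 64×64, 256-channel embedding, and the random tensors stand in for real decoder outputs).

```python
import torch
import torch.nn as nn

B, n_masks, dim = 1, 3, 256
image_emb = torch.randn(B, dim, 64, 64)     # decoder-updated image embedding
mask_tokens = torch.randn(B, n_masks, dim)  # decoder-updated output tokens

# Two stride-2 transposed convs upscale the embedding 4x (channels 256 -> 64 -> 32).
upscale = nn.Sequential(
    nn.ConvTranspose2d(dim, 64, kernel_size=2, stride=2), nn.GELU(),
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
)
# A small 3-layer MLP maps each output token to per-mask classifier weights.
token_mlp = nn.Sequential(
    nn.Linear(dim, dim), nn.GELU(),
    nn.Linear(dim, dim), nn.GELU(),
    nn.Linear(dim, 32),
)

upscaled = upscale(image_emb)                                  # (B, 32, 256, 256)
hyper = token_mlp(mask_tokens)                                 # (B, n_masks, 32)
mask_logits = torch.einsum("bnc,bchw->bnhw", hyper, upscaled)  # (B, n_masks, 256, 256)
```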
**Making the model ambiguity-aware** As mentioned above, a single input prompt can be ambiguous in that it corresponds to multiple valid masks, and the model would learn to average over these masks. The authors eliminate this problem with a simple modification: instead of predicting a single mask, a small number of output tokens is used to predict multiple masks simultaneously. By default three masks are predicted, since the authors observed that three levels (whole, part, and subpart) are often sufficient to describe nested masks. During training, the loss is computed between the ground truth and each of the predicted masks, but backpropagation is performed only from the lowest loss, a common technique for models with multiple outputs. For use in applications, it is desirable to rank the predicted masks, so a small head (operating on an additional output token) is added that estimates the IoU between each predicted mask and the object it covers.
Ambiguity is much rarer with multiple prompts, and the three output masks usually become similar. To minimize computation of degenerate losses during training and to ensure that the single unambiguous mask receives a regular gradient signal, only a single mask is predicted when more than one prompt is given. This is achieved by adding a fourth output token for an additional mask prediction. This fourth mask is never returned for a single prompt and is the only mask returned for multiple prompts.
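A sketch of this ambiguity-aware training rule for the single-prompt case is shown below, with plain per-pixel binary cross-entropy standing in for the focal+dice combination described earlier; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def ambiguity_aware_loss(pred_logits, iou_pred, gt_mask):
    """pred_logits: (B, 3, H, W) candidate masks; iou_pred: (B, 3); gt_mask: (B, H, W)."""
    target = gt_mask.unsqueeze(1).expand_as(pred_logits)

    # Per-example, per-candidate mask loss; only the best candidate gets gradient.
    per_mask = F.binary_cross_entropy_with_logits(
        pred_logits, target, reduction="none").flatten(2).mean(-1)      # (B, 3)
    best = per_mask.min(dim=1).values.mean()

    # Supervise the IoU head toward each candidate's actual IoU with the ground truth.
    with torch.no_grad():
        bin_pred = (pred_logits > 0).float()
        inter = (bin_pred * target).flatten(2).sum(-1)
        union = ((bin_pred + target) > 0).float().flatten(2).sum(-1)
        actual_iou = inter / union.clamp(min=1)
    return best + F.mse_loss(iou_pred, actual_iou)
```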


Origin: blog.csdn.net/qq_52302919/article/details/132822918