Segment Anything: An analysis of the boundary-breaking new image segmentation technology

Segment Anything paper address: https://arxiv.org/pdf/2304.02643.pdf

Paper background

In natural language processing, large-scale language models have made significant progress in zero-shot and few-shot learning. In computer vision, models such as CLIP and ALIGN achieve zero-shot generalization to new visual concepts through engineered text prompts. This paper proposes the promptable segmentation task, which returns a valid segmentation mask in response to a segmentation prompt. Prompts can contain spatial or textual information and are used to identify objects in an image. A valid output mask means that even if a prompt is ambiguous and could refer to multiple objects (e.g. a point on clothing could indicate the shirt or the person wearing it), the output should be a plausible segmentation of at least one of those objects. The promptable segmentation task serves both as a pre-training objective and as the basis for solving various downstream segmentation tasks through prompt engineering.

Segment Anything Model (SAM)
The promptable segmentation task and its practical applications impose constraints on the model architecture. In particular, the model must support flexible prompts, must be able to compute segmentation masks in real time for interactive use, and must be ambiguity-aware. Surprisingly, the authors found that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds the prompts, and a lightweight mask decoder combines the two sources of information to predict segmentation masks. The authors call this model the Segment Anything Model, or SAM for short. By decomposing SAM into an image encoder and a fast prompt encoder/mask decoder, the same image embedding can be reused (amortizing its cost) across different prompts. Given an image embedding, the prompt encoder and mask decoder can predict a mask from a prompt in about 50 ms in a web browser. The paper focuses on point, box and mask prompts and shows preliminary results with free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, naturally handling ambiguous cases such as the shirt vs. person example.
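
As a concrete illustration of this encoder/decoder split, the following sketch uses the official segment-anything Python package; the ViT-H variant and checkpoint filename are assumptions, and the image path is a placeholder. The heavy image encoder runs once, and every subsequent prompt reuses the cached embedding:

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (checkpoint path/variant assumed; download from the official repo).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once here and caches the image embedding.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Each prompt below reuses the cached embedding, so prediction is fast.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),   # one foreground click (x, y)
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # return 3 candidate masks for ambiguity
)
print(masks.shape, scores)  # (3, H, W) masks and a confidence score per mask
```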

The Segment Anything data engine includes three stages, assisted-manual, semi-automatic and fully automatic, to construct large-scale and diverse mask data. SA-1B is the dataset generated by the fully automatic stage of the data engine; it contains over 1 billion masks, 400 times more than existing datasets, with high quality and diversity. SA-1B is used to train SAM for robustness and generalization, and is intended to be a valuable resource for building new foundation models.

Segment Anything Task and Method

To train the model, the researchers propose a pre-training algorithm that simulates a sequence of prompts and compares the model's predictions with the ground-truth masks. Unlike interactive segmentation, the goal of this approach is to always produce a valid segmentation mask for any prompt. This ensures the effectiveness of the pre-trained model in a variety of application scenarios, including automatic annotation.

Another advantage of this technique is zero-shot transfer. Thanks to the pre-training task, the model can adapt to different prompts at inference time, so various practical segmentation tasks can be solved by designing appropriate prompts. For example, if you already have a bounding-box detector for cats, you can perform cat instance segmentation by feeding the detector's output boxes to SAM as prompts. This allows a wide range of segmentation tasks to be treated as prompt engineering.
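
A minimal sketch of that detector-to-SAM composition, again using the segment-anything package; `cat_detector` is a hypothetical stand-in for any model that returns xyxy boxes, and `predictor` is the SamPredictor from the previous snippet:

```python
import numpy as np

# Assume `predictor` already has predictor.set_image(image) called, and
# `cat_detector` is a hypothetical detector returning [x0, y0, x1, y1] boxes.
boxes = cat_detector(image)  # e.g. [[120, 80, 640, 520], ...]

instance_masks = []
for box in boxes:
    masks, scores, _ = predictor.predict(
        box=np.array(box),          # the detector's box becomes the prompt
        multimask_output=False,     # boxes are rarely ambiguous; take one mask
    )
    instance_masks.append(masks[0])
```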

The emergence of Segment Anything brings new ideas and possibilities to the field of image segmentation. Through prompt engineering and composition, a single model can be applied to a variety of tasks in a scalable way and adapted to previously unseen tasks. This lays the foundation for building more flexible and adaptable segmentation systems.

In the future, the potential of prompt engineering and composition methods can be explored further to advance the development and application of image segmentation techniques.

Segment Anything Model Architecture

(Figure: overview of the SAM architecture: image encoder, prompt encoder, and mask decoder)

The Segment Anything Model (SAM) for the promptable segmentation task consists of three components: an image encoder, a flexible prompt encoder, and a fast mask decoder. SAM is built on Transformer vision models (such as ViT), with specific trade-offs for real-time performance. The components are described in detail below.

Image Encoder

The image encoder is a Vision Transformer (ViT) pre-trained with MAE and minimally adapted to handle high-resolution input. The image encoder runs only once per image and can be applied before prompting the model.

Prompt Encoder

The prompt encoder handles two types of prompts: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type, while free-form text is represented with an off-the-shelf text encoder from CLIP. Dense prompts (i.e. masks) are embedded using convolutions and summed element-wise with the image embedding.
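
To make the sparse/dense split concrete, here is a deliberately simplified PyTorch sketch; this is my own illustrative module, not SAM's actual prompt encoder, and the dimensions and random-Fourier positional encoding are assumptions:

```python
import math
import torch
import torch.nn as nn

class TinyPromptEncoder(nn.Module):
    """Simplified sketch: sparse prompts -> positional encoding + learned type
    embedding; dense mask prompts -> small conv stack matched to the image
    embedding shape. Dimensions are illustrative, not SAM's actual ones."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # random Fourier features for (x, y) coordinates
        self.register_buffer("pe_matrix", torch.randn(2, embed_dim // 2))
        # one learned embedding per prompt type (e.g. fg point, bg point, box corners)
        self.type_embed = nn.Embedding(4, embed_dim)
        # downscale a low-resolution mask prompt into the embedding space
        self.mask_conv = nn.Sequential(
            nn.Conv2d(1, embed_dim // 4, 2, stride=2), nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, 2, stride=2),
        )

    def encode_points(self, coords, type_ids):
        # coords: (N, 2) normalized to [0, 1]; type_ids: (N,)
        proj = 2 * math.pi * coords @ self.pe_matrix
        pe = torch.cat([proj.sin(), proj.cos()], dim=-1)   # (N, embed_dim)
        return pe + self.type_embed(type_ids)              # sparse prompt embeddings

    def encode_mask(self, mask):
        # mask: (1, 1, H, W); output is summed element-wise with the image embedding
        return self.mask_conv(mask)
```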

Mask Decoder

The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. Inspired by prior work, the design uses a modified Transformer decoder block followed by a dynamic mask prediction head. The modified decoder block updates all embeddings using self-attention and cross-attention in both directions (prompt-to-image and image-to-prompt). After running two such blocks, the image embedding is upsampled and an MLP maps the output token to a dynamic linear classifier, which computes the mask foreground probability at each image location.
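
The "dynamic linear classifier" idea can be illustrated with a few lines of PyTorch; this is a conceptual sketch rather than SAM's code, with a single linear layer standing in for the small MLP and random tensors standing in for decoder outputs:

```python
import torch
import torch.nn as nn

# An MLP (here a single Linear) turns each output token into per-pixel classifier
# weights, which are dotted with the upsampled image embedding to give mask logits.
embed_dim, num_masks, H, W = 256, 3, 64, 64
token_to_weights = nn.Linear(embed_dim, embed_dim)     # stand-in for SAM's small MLP

output_tokens = torch.randn(num_masks, embed_dim)      # as updated by the decoder blocks
upsampled_emb = torch.randn(embed_dim, H, W)           # upsampled image embedding

weights = token_to_weights(output_tokens)              # (num_masks, embed_dim)
mask_logits = torch.einsum("me,ehw->mhw", weights, upsampled_emb)
mask_probs = mask_logits.sigmoid()                     # foreground probability per pixel
```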


The prompt encoder and mask decoder can be loosely compared to the masking mechanism in BERT: sparse (point, box, text) or dense (mask) prompts give the model only partial information about a region of interest, and the model learns to complete that partial specification into a full mask, which improves its ability to segment the objects of interest in an image.


Resolving Ambiguity

When given an ambiguous prompt, a model that predicts a single mask will effectively average over multiple valid masks. To address this, the authors modified the model so that it predicts multiple output masks for a single prompt, as shown in the figure below. They found that outputting 3 masks is enough to cover most common cases (nested masks are usually at most three levels deep: whole, part and subpart). During training, only the mask with the smallest loss is backpropagated. To rank the masks, the model also predicts a confidence score (an estimated IoU) for each mask.

(Figure: SAM predicting multiple valid masks for a single ambiguous point prompt)
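
A minimal sketch of that training rule, written as illustrative code rather than the authors' implementation; plain BCE stands in for SAM's focal + Dice combination, and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

# pred_masks: (B, 3, H, W) logits, pred_iou: (B, 3), gt_mask: (B, H, W) binary.
def multimask_loss(pred_masks, pred_iou, gt_mask):
    B, K, H, W = pred_masks.shape
    gt = gt_mask.unsqueeze(1).expand(-1, K, -1, -1).float()

    # per-mask segmentation loss (plain BCE here; SAM uses focal + Dice)
    seg_loss = F.binary_cross_entropy_with_logits(
        pred_masks, gt, reduction="none").mean(dim=(2, 3))        # (B, K)

    # backpropagate only the best of the K candidate masks per example
    best_loss, _ = seg_loss.min(dim=1)                            # (B,)

    # supervise the IoU head against the actual IoU of each candidate
    with torch.no_grad():
        bin_pred = (pred_masks.sigmoid() > 0.5).float()
        inter = (bin_pred * gt).sum(dim=(2, 3))
        union = bin_pred.sum(dim=(2, 3)) + gt.sum(dim=(2, 3)) - inter
        true_iou = inter / union.clamp(min=1)
    iou_loss = F.mse_loss(pred_iou, true_iou)

    return best_loss.mean() + iou_loss
```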

Efficiency

The overall model design emphasizes efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in about 50 ms on CPU in a web browser.

Loss and Training

During training, a linear combination of focal loss and Dice loss is used as the loss function. These two losses combine characteristics of detection and segmentation objectives and effectively guide the model toward correct segmentation results.

Focal Loss:

Focal loss is a loss function for the class-imbalance problem: it down-weights easy examples so the model focuses more on hard-to-classify samples.

The focal loss is defined as:

$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$

where:
  • $p_t$ is the model's predicted probability for the true class;
  • $\alpha_t$ is a balancing factor that weights easy versus hard samples;
  • $\gamma$ is a tunable focusing exponent that controls how quickly the contribution of easy samples decays.

From this definition, when a sample is hard to classify ($p_t$ is small), the loss value is larger, which increases the model's attention to such samples and helps it learn hard examples better.
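
A minimal per-pixel implementation of this formula, as a sketch; the default $\alpha$ and $\gamma$ values follow common practice (e.g. RetinaNet) and are an assumption, not a SAM-specific detail:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss over mask logits and 0/1 targets of the same shape."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)    # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()          # ce = -log(p_t)
```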

Dice Loss

Dice loss is based on the Dice coefficient, a commonly used measure of segmentation accuracy that quantifies the similarity between the model's prediction and the ground-truth segmentation.

The Dice coefficient is computed as:

$Dice(p, g) = \frac{2\,|p \cap g|}{|p| + |g|}$

where:
  • $p$ is the model's prediction, usually a binarized segmentation mask;
  • $g$ is the ground-truth segmentation, also a binarized mask;
  • $|\cdot|$ denotes the number of pixels in the corresponding region;
  • $\cap$ denotes the intersection operation.

The Dice coefficient ranges from 0 to 1, with larger values indicating higher similarity between the prediction and the ground truth. The Dice loss is therefore typically defined as $1 - Dice(p, g)$, so minimizing it drives the prediction closer to the true segmentation.

Using focal loss and Dice loss together lets training account for both sample difficulty and region overlap, improving the model's performance and generalization ability.
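
Putting the two together, a sketch of the combined objective, reusing the `focal_loss` helper above; the 20:1 focal-to-Dice weighting follows the paper's training details, but treat the exact numbers as configurable rather than authoritative:

```python
def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss sketch: 1 - Dice coefficient, computed on probabilities."""
    prob = logits.sigmoid()
    inter = (prob * targets).sum(dim=(-2, -1))
    denom = prob.sum(dim=(-2, -1)) + targets.sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def segmentation_loss(logits, targets, focal_weight=20.0, dice_weight=1.0):
    # Linear combination of focal and Dice losses.
    return focal_weight * focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)
```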

For training the promptable segmentation task, a mixture of geometric prompts (points, boxes, masks) is used. Diverse geometric prompts let the model learn the shape and structure of different objects and scenes, improving the generalization ability and robustness of the segmentation task.
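
An illustrative (and hypothetical) prompt sampler conveying the idea of mixing geometric prompt types during training; this is not the paper's exact sampling procedure:

```python
import random
import numpy as np

def sample_prompt(gt_mask, rng=random):
    """Pick a geometric prompt type at random and derive it from the GT mask."""
    ys, xs = np.nonzero(gt_mask)
    kind = rng.choice(["point", "box"])
    if kind == "point":
        i = rng.randrange(len(xs))                 # random foreground click
        return {"point_coords": np.array([[xs[i], ys[i]]]),
                "point_labels": np.array([1])}
    # noisy bounding box around the object, mimicking an imperfect detector
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    jitter = lambda v: v + rng.randint(-5, 5)
    return {"box": np.array([jitter(x0), jitter(y0), jitter(x1), jitter(y1)])}
```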

Segment Anything Data Preparation

To collect the 1.1B mask dataset SA-1B, a data engine was built. The data engine consists of three stages:

  • Model-assisted manual labeling stage
  • Semi-automatic stage, combining automatically predicted masks and model-assisted annotation
  • Fully automatic stage

Model-assisted manual labeling stage (first stage)

At this stage, a team of professional annotators labeled masks by clicking foreground/background object points in a browser-based interactive segmentation tool powered by SAM. Annotators labeled objects they could name or describe.

At the beginning of this stage, we trained SAM on common public segmentation datasets. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other details were adjusted; the model was retrained 6 times in total. 4.3M masks were collected from 120k images in this stage.

Semi-automatic stage (second stage)

To improve mask diversity, we focused on annotating less prominent objects at this stage. SAM first automatically detects confident masks; these masks are pre-filled into the image, and annotators are asked to label any remaining unannotated objects.

At this stage, we collected an additional 5.9M masks from 180k images (10.2M masks in total).

Fully automatic stage (third stage)

In the final stage, annotation is fully automatic. This is enabled by two major improvements to the model. First, at the beginning of this stage we had collected enough masks, including the diverse masks from the previous stage, to greatly improve the model. Second, by this stage we had developed the ambiguity-aware model, which can predict valid masks even when a prompt is ambiguous.

Specifically, the model is prompted with a 32×32 regular grid of points, and for each point it predicts a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, the model returns masks for the subpart, the part, and the whole object. The model's IoU prediction module is used to select confident masks; in addition, only stable masks are kept (a mask is considered stable if thresholding its probability map at 0.5−δ and at 0.5+δ gives similar results). After selecting confident and stable masks, non-maximum suppression (NMS) is applied to filter duplicates. To further improve the quality of smaller masks, multiple overlapping zoomed-in image crops are also processed.
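
The released segment-anything package exposes this pipeline through SamAutomaticMaskGenerator. The sketch below shows how its parameters map onto the steps above; the checkpoint filename is assumed, the threshold values are the package's illustrative defaults rather than the data engine's exact settings, and `image` is the same RGB array as in the earlier snippet:

```python
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # path assumed
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # the 32x32 regular grid of point prompts
    pred_iou_thresh=0.88,          # keep masks the IoU head is confident about
    stability_score_thresh=0.95,   # keep masks stable under threshold shifts
    box_nms_thresh=0.7,            # NMS to remove duplicate masks
    crop_n_layers=1,               # also process overlapping zoomed-in crops
)

masks = mask_generator.generate(image)  # list of dicts with 'segmentation',
                                        # 'predicted_iou', 'stability_score', ...
```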

Finally, we constructed the SA-1B dataset, which contains 11M images and a total of 1.1B high-quality masks.


Zero-shot transfer experiment results

Performance on 23 downstream datasets

(a) Compared with RITM, SAM achieves higher results on 16 of 23 datasets.

An "oracle" result is also reported, in which the most relevant of SAM's 3 masks is chosen by comparison with the ground truth instead of picking the most confident mask. This reveals the impact of ambiguity on automatic evaluation. In particular, when the oracle performs ambiguity resolution, SAM outperforms RITM on all datasets.

(b) SAM scores lower on automatic metrics but receives consistently high ratings in human studies.

Zero-shot transfer edge prediction

insert image description here

SAM is prompted with a 16×16 regular grid of foreground points, yielding 768 predicted masks (3 per point). Redundant masks are removed by non-maximum suppression (NMS). Edge maps are then computed using Sobel filtering.
Although SAM was not trained for edge detection, it produces reasonable, high-quality edge maps.
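
A rough sketch of that post-processing, as my own approximation rather than the paper's exact pipeline: Sobel-filter each mask's probability map and take the per-pixel maximum across masks.

```python
import cv2
import numpy as np

def masks_to_edges(prob_maps):          # prob_maps: (N, H, W) values in [0, 1]
    edges = np.zeros(prob_maps.shape[1:], dtype=np.float32)
    for p in prob_maps:
        gx = cv2.Sobel(p.astype(np.float32), cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(p.astype(np.float32), cv2.CV_32F, 0, 1)
        edges = np.maximum(edges, np.sqrt(gx ** 2 + gy ** 2))
    return edges / (edges.max() + 1e-6)  # normalized edge map
```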

Zero-shot object proposal generation

ViTDet-H performs best. In the zero-shot transfer setting, SAM nevertheless performs well on several metrics.

Zero-shot transfer instance segmentation

SAM's performance is comparable to that of ViTDet, although slightly lower.

Mask Quality
Visualizing the outputs shows that SAM's masks tend to be of higher quality than ViTDet's, and SAM consistently outperforms ViTDet in human studies.

Zero-shot transfer of text to masks


CLIP image embeddings of the object are extracted first, and during training SAM is prompted with these CLIP image embeddings.
Because CLIP's image embeddings are trained to be aligned with its text embeddings, text embeddings can be used at inference time, i.e. the embedding of a text prompt is fed to SAM.
SAM can then segment objects from simple text prompts such as "a wheel" as well as phrases such as "beaver tooth grille".
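
A conceptual sketch only: the publicly released SAM checkpoints do not accept text prompts, so `text_promptable_sam` below is a hypothetical model trained as the paper describes; the CLIP calls use the real OpenAI clip package to obtain aligned embeddings.

```python
import torch
import clip  # OpenAI CLIP package, used only to obtain text/image-aligned embeddings

model, _ = clip.load("ViT-L/14", device="cpu")
with torch.no_grad():
    tokens = clip.tokenize(["a wheel"])
    text_embedding = model.encode_text(tokens)   # aligned with CLIP image embeddings

# masks = text_promptable_sam(image, prompt_embedding=text_embedding)  # hypothetical
```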



Original post: blog.csdn.net/weixin_42010722/article/details/131554088