Paper Translation: Segment Anything

Paper address: https://arxiv.org/abs/2304.02643
Code address: https://github.com/facebookresearch/segment-anything
Dataset address: https://ai.facebook.com/datasets/segment-anything/

The "Segment Anything" project aims to democratize image segmentation in computer vision by introducing new tasks, datasets, and models. The project includes the generic "Segment Anything Model" (SAM) and the largest segmentation dataset "Segment Anything 1-Billion mask" (SA-1B). The model is promptable and enables zero-shot transfer on new tasks and image distributions. Furthermore, it exhibits impressive performance, often surpassing fully supervised methods. The SA-1B dataset is available for research purposes, while SAM is open sourced under the Apache 2.0 license. This initiative aims to enable broad applications and encourage further research on fundamental models for computer vision.

Task


Prompt-based foundation models in natural language processing (NLP) and computer vision enable zero-shot and few-shot learning on new datasets and tasks. The promptable segmentation task aims to return a valid segmentation mask given any prompt, where a prompt can carry spatial or textual information specifying what to segment in an image.

A prompt can be a set of points, a box, a mask, free-form text, or any other information indicating what to segment in the image. The promptable segmentation task requires generating a valid segmentation mask for any given prompt. Even when a prompt is ambiguous and could refer to multiple objects, the output should be a reasonable mask for at least one of those objects, much like a language model producing a coherent answer to an ambiguous prompt. This task leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.
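
For concreteness, here is a minimal sketch of prompting the released SAM model with a single point using the segment-anything package; the checkpoint path, image file, and click coordinates are placeholders for illustration.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (placeholder path) and wrap it in the predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image; OpenCV loads BGR, so convert.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click (label 1 = foreground, 0 = background).
point_coords = np.array([[500, 375]])
point_labels = np.array([1])

# multimask_output=True returns three candidate masks for an ambiguous prompt,
# each with a predicted quality score.
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
```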

The promptable segmentation task suggests a natural pre-training algorithm: simulate a sequence of prompts for each training sample and compare the model's mask predictions against the ground truth. The task takes inspiration from interactive segmentation, but the goal is to predict a valid mask even when a prompt is ambiguous. This ensures the pre-trained model is effective in use cases involving ambiguity, such as automatic annotation.
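
As a rough illustration of how one simulated interactive round might pick the next prompt, the sketch below samples a corrective click from the error region between the current prediction and the ground truth. The sampling rule is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np

def sample_next_click(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Return (y, x, label) for the next simulated click, or None if the
    prediction already matches the ground truth. Both inputs are boolean arrays."""
    false_neg = np.logical_and(gt_mask, np.logical_not(pred_mask))   # missed foreground
    false_pos = np.logical_and(pred_mask, np.logical_not(gt_mask))   # spurious foreground
    # Click in the larger error region, mimicking what a human corrector would do.
    if false_neg.sum() >= false_pos.sum() and false_neg.any():
        ys, xs = np.nonzero(false_neg)
        label = 1  # foreground click
    elif false_pos.any():
        ys, xs = np.nonzero(false_pos)
        label = 0  # background click
    else:
        return None
    i = np.random.randint(len(ys))
    return ys[i], xs[i], label
```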

Model

SAM has three components: an image encoder (a ViT pre-trained with MAE), a flexible prompt encoder, and a fast, lightweight mask decoder.

The authors consider two sets of prompts: sparse prompts (points, boxes, text) and dense prompts (masks). Points and boxes are represented by positional encodings combined with learned embeddings for each prompt type, while free-form text uses a text encoder from CLIP. Dense prompts (masks) are embedded with convolutions and summed element-wise with the image embedding.
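
As an illustration of this design, the following toy PyTorch module (a sketch under assumptions, not the released code) embeds sparse prompts with a positional projection plus learned per-type embeddings, and embeds a dense mask prompt with convolutions summed onto the image embedding.

```python
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Learned embeddings for prompt types: foreground point, background
        # point, box top-left corner, box bottom-right corner.
        self.type_embed = nn.Embedding(4, embed_dim)
        self.pos_proj = nn.Linear(2, embed_dim)            # stand-in for a Fourier positional encoding
        self.mask_downscale = nn.Sequential(               # dense prompt (mask) branch
            nn.Conv2d(1, embed_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, coords, type_ids, mask, image_embedding):
        # Sparse prompts: positional encoding of (x, y) plus a learned type embedding.
        sparse = self.pos_proj(coords) + self.type_embed(type_ids)    # (B, N, C)
        # Dense prompt: convolve the low-res mask and sum with the image embedding.
        dense = image_embedding + self.mask_downscale(mask)           # (B, C, H, W)
        return sparse, dense
```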

The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. Inspired by prior work, it uses a modified Transformer decoder block followed by a dynamic mask prediction head. The decoder block updates all embeddings using prompt self-attention and cross-attention in both directions (prompt-to-image and image-to-prompt). After two such blocks, the image embedding is upsampled, and an MLP maps the output token to a dynamic linear classifier that computes the mask foreground probability at each image location.
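
To make the two-way attention concrete, here is a simplified PyTorch sketch of one such decoder block; it is an illustration under assumptions, not the released implementation, and the dimensions and layer choices are placeholders.

```python
import torch
import torch.nn as nn

class TwoWayBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, tokens, image_tokens):
        # tokens: (B, N, C) prompt + output tokens; image_tokens: (B, H*W, C) flattened image embedding.
        # 1) self-attention over prompt + output tokens
        tokens = self.norm[0](tokens + self.self_attn(tokens, tokens, tokens)[0])
        # 2) cross-attention: tokens query the image embedding
        tokens = self.norm[1](tokens + self.token_to_image(tokens, image_tokens, image_tokens)[0])
        # 3) MLP on tokens
        tokens = self.norm[2](tokens + self.mlp(tokens))
        # 4) cross-attention in the other direction: image queries the tokens
        image_tokens = self.norm[3](image_tokens + self.image_to_token(image_tokens, tokens, tokens)[0])
        return tokens, image_tokens
```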

The model is designed to predict multiple output masks for a single ambiguous prompt; three mask outputs were found to be sufficient for most common cases. During training, only the minimum loss over the predicted masks is backpropagated, while a confidence score is predicted for each mask. Mask prediction is supervised with a linear combination of focal loss and Dice loss. Training uses a mixture of geometric prompts in an interactive setting with 11 rounds per mask, which allows SAM to integrate seamlessly with the data engine.
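A hedged sketch of this loss is shown below: focal plus Dice loss per predicted mask, with only the lowest-loss candidate contributing to the gradient. The loss weights and reductions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # Standard sigmoid focal loss, averaged over pixels.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    # Soft Dice loss on sigmoid probabilities.
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def min_over_masks_loss(pred_logits, gt_mask, focal_weight=20.0, dice_weight=1.0):
    """pred_logits: (3, H, W) candidate mask logits; gt_mask: (H, W) binary target.
    Only the best (minimum-loss) candidate is backpropagated."""
    losses = torch.stack([
        focal_weight * focal_loss(m.unsqueeze(0), gt_mask.unsqueeze(0))
        + dice_weight * dice_loss(m.unsqueeze(0), gt_mask.unsqueeze(0))
        for m in pred_logits
    ])
    return losses.min()
```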

Data Engine

Because large quantities of segmentation masks are not readily available on the Internet, the authors built a data engine to collect the SA-1B dataset of 1.1 billion masks. The data engine has three stages:

Assisted-manual stage. In the first stage, professional annotators use an interactive, browser-based segmentation tool (powered by SAM) to label masks by clicking foreground and background points. Model-assisted annotation runs in real time, providing an interactive experience. Annotators label objects without semantic constraints and prioritize salient objects. SAM is initially trained on publicly available segmentation datasets and is then retrained with the newly labeled masks, for a total of six retraining rounds. As SAM improves, the average annotation time per mask drops from 34 to 14 seconds, and the average number of masks per image rises from 20 to 44. In this stage, 4.3 million masks were collected from 120,000 images.

Semi-automatic stage. The aim of this stage is to increase mask diversity and improve the model's segmentation ability. Confident masks are detected automatically and pre-filled into the images shown to annotators, who then label the remaining unlabeled objects. To detect confident masks, a bounding-box detector was trained on the first-stage masks using a generic "object" category. This stage collected 5.9 million additional masks from 180,000 images, for a total of 10.2 million masks, and the model was periodically retrained (5 times) on the newly collected data. Because the remaining objects are more challenging, the average annotation time per mask went back up to 34 seconds, while the average number of masks per image rose from 44 to 72, including the automatically generated masks.

Fully automatic stage. In the final stage, annotation is fully automated, enabled by the larger pool of collected masks and an ambiguity-aware model. The model is prompted with a 32×32 grid of points and predicts a set of masks that may correspond to valid objects. An IoU prediction module selects confident masks, and only stable masks are kept. Non-maximum suppression (NMS) removes duplicate masks, and multiple overlapping zoomed-in image crops are processed to improve the quality of smaller masks. Fully automatic mask generation was applied to all 11 million images, producing a total of 1.1 billion high-quality masks.
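
The released repository exposes this pipeline through SamAutomaticMaskGenerator; below is a minimal usage sketch. The threshold values shown are plausible settings rather than prescribed ones, and the checkpoint and image paths are placeholders.

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # the 32x32 point grid
    pred_iou_thresh=0.88,          # keep masks the IoU prediction module is confident about
    stability_score_thresh=0.95,   # keep only stable masks
    box_nms_thresh=0.7,            # NMS over duplicate masks
    crop_n_layers=1,               # also process overlapping zoomed-in crops for small objects
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # list of dicts: 'segmentation', 'area', 'predicted_iou', ...
```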

Dataset

SA-1B is a dataset of 11 million diverse, high-resolution, licensed, privacy-preserving images and 1.1 billion high-quality segmentation masks. The images are higher resolution than those in existing datasets (3300×4950 pixels on average; the released images are downsampled so the shorter side is 1500 pixels), and faces and license plates have been blurred. Since 99.1% of the masks were generated fully automatically, their quality is a key concern. Comparing automatically predicted masks with professionally corrected ones shows that 94% of the pairs have an IoU (intersection over union) greater than 90%, and human evaluation confirms the high mask quality. The spatial distribution of object centers in SA-1B covers the image corners more broadly than in other datasets, and SA-1B has more images, more masks, and more masks per image than other datasets, with a higher proportion of masks that are small or medium-sized relative to the image. The shape complexity of SA-1B masks is roughly similar to that of other datasets.
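
For readers who want to inspect SA-1B locally, the sketch below assumes the per-image annotation JSON stores each mask in COCO run-length (RLE) encoding; the file name and field names are illustrative and should be checked against the dataset documentation.

```python
import json
from pycocotools import mask as mask_utils

# Placeholder path to one per-image annotation file from the SA-1B download.
with open("sa_000000/sa_1.json") as f:
    record = json.load(f)

for ann in record["annotations"]:
    rle = ann["segmentation"]              # assumed {"size": [H, W], "counts": ...} in COCO RLE format
    binary_mask = mask_utils.decode(rle)   # (H, W) uint8 array with 1 inside the mask
    print(ann.get("area"), binary_mask.shape)
```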

Responsible AI Analysis

SA-1B has a higher proportion of images from Europe, Asia & Oceania, and middle-income countries, while Africa and low-income countries are underrepresented. However, every region in SA-1B contains at least 28 million masks. The model was analyzed for fairness with respect to perceived gender presentation, perceived age group, and perceived skin tone. SAM performs similarly across gender-presentation and age groups and does not differ significantly across perceived skin-tone groups. Bias may still arise, however, when SAM is used as a component of a larger system, and an indication of bias with respect to perceived gender presentation was found when segmenting clothing.

