SAM (Segment Anything Model) for image segmentation

Paper: Segment Anything

GitHub: https://github.com/facebookresearch/segment-anything

Starting from the idea of a backbone with zero-shot capability, the paper proposes SAM (Segment Anything Model). SAM differs from traditional segmentation models: a traditional model takes only the original image and outputs a fixed segmentation result, while SAM takes the original image together with a prompt (points, boxes, masks, or text) and outputs different segmentation results for different prompts, which also makes interactive segmentation possible. SAM can be applied to many segmentation scenarios, including interactive segmentation, edge detection, super-resolution, object proposal generation, foreground segmentation, semantic segmentation, instance segmentation, and panoptic segmentation. In addition, to train a promptable model like SAM, the paper also invested heavily in data: it contributes the large-scale segmentation dataset SA-1B, containing 1.1 billion masks and 11 million images.

To give the model zero-shot transfer capability, effort is needed in three respects: model capacity, dataset size, and the overall training recipe.

Therefore, the paper made improvements in three aspects: task, model, and data.

Task

In the paper's interactive segmentation task, some prompts, such as a single point, are ambiguous: the point could refer to the whole object, one of its parts, or a subpart. To resolve this, the SAM model outputs three segmentation results at the same time, corresponding to the whole, a part, and a subpart.
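As an illustration, here is a minimal sketch of how this ambiguity handling is exposed through the predictor API of the GitHub repository linked above; the checkpoint and image paths are placeholders, and `multimask_output=True` asks SAM to return the three candidate masks together with its predicted quality scores.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (filename is a placeholder) and build the predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# The predictor expects an RGB uint8 image of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the image encoder once for this image

# A single foreground point (label 1) is inherently ambiguous, so request
# multiple masks: SAM returns three candidates (whole / part / subpart).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,
)
best = masks[np.argmax(scores)]  # keep the candidate with the highest predicted score
print(masks.shape, scores)       # (3, H, W) boolean masks and their scores
```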

Model

The paper designs SAM as a segmentation model that fuses image and prompt information; given a prompt, the model predicts a segmentation mask in roughly 50 ms.

The SAM model consists of three parts: an image encoder, a prompt encoder, and a mask decoder.

The image encoder is built on a Vision Transformer (ViT) backbone.

The prompt encoder supports sparse prompts (points, boxes, text) and dense prompts (masks).

Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type, and the two are added to obtain the final embedding. Text is encoded with the text encoder from CLIP.

Mask prompts are encoded with convolutions, and the result is added element-wise to the image embedding.
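As a rough, illustrative sketch of this encoding scheme (this is not the repository's code; the class, the Fourier-feature positional encoding details, and all dimensions are assumptions for exposition):

```python
import math
import torch
import torch.nn as nn

class SketchPromptEncoder(nn.Module):
    """Schematic prompt encoder: Fourier positional encoding for points,
    a learned type embedding added on top, and a small conv stack for masks."""
    def __init__(self, embed_dim=256, image_size=1024):
        super().__init__()
        self.image_size = image_size
        # Random spatial frequencies for a Fourier-feature positional encoding.
        self.register_buffer("freqs", torch.randn(2, embed_dim // 2))
        # Learned embeddings distinguishing prompt types (e.g. fg point vs. bg point).
        self.type_embed = nn.Embedding(2, embed_dim)
        # Dense (mask) prompts are downscaled by convolutions to the embedding grid.
        self.mask_conv = nn.Sequential(
            nn.Conv2d(1, embed_dim // 4, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=2, stride=2),
        )

    def encode_points(self, coords, labels):
        # coords: (N, 2) pixel coordinates; labels: (N,) 0/1 point type.
        normed = coords / self.image_size                 # map into [0, 1]
        proj = 2 * math.pi * normed @ self.freqs          # (N, embed_dim // 2)
        pos = torch.cat([proj.sin(), proj.cos()], dim=-1) # positional encoding
        return pos + self.type_embed(labels)              # sparse prompt embedding

    def encode_mask(self, mask):
        # mask: (1, 1, H, W) dense prompt; the output is added to the image embedding.
        return self.mask_conv(mask)
```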

The mask decoder is implemented with a Transformer decoder followed by a dynamic prediction head. During decoding, prompt self-attention and cross-attention in both directions (prompt-to-image and image-to-prompt) update all embeddings. Finally, the image embedding is upsampled, an MLP maps the output token to the weights of a dynamic linear classifier, and that classifier produces the per-pixel foreground probability map.
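A schematic of that final prediction head might look as follows; again, this is an illustrative outline under assumed shapes, not the repository's implementation:

```python
import torch
import torch.nn as nn

class SketchMaskHead(nn.Module):
    """Schematic of the dynamic prediction head after the two-way Transformer."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Transposed convolutions upsample the image embedding (e.g. 64x64 -> 256x256).
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim // 4, kernel_size=2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(embed_dim // 4, embed_dim // 8, kernel_size=2, stride=2),
        )
        # MLP that turns the decoder's output token into dynamic classifier weights.
        self.token_mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim // 8),
        )

    def forward(self, image_embedding, mask_token):
        # image_embedding: (B, C, H, W) features updated by cross-attention with the prompts.
        # mask_token:      (B, C) output token from the Transformer decoder.
        feats = self.upsample(image_embedding)   # (B, C/8, 4H, 4W) upsampled image embedding
        weights = self.token_mlp(mask_token)     # (B, C/8) dynamic linear classifier weights
        # Per-pixel dot product acts as a dynamic linear classifier over locations.
        logits = torch.einsum("bchw,bc->bhw", feats, weights)
        return logits.sigmoid()                  # foreground probability map
```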

Data engine

To build a large-scale segmentation dataset, the paper builds a data engine. The process has three stages: an assisted-manual stage, a semi-automatic stage, and a fully automatic stage.

Assisted-manual stage:

In this stage, annotation and model training proceed in parallel. Masks are labeled manually in an interactive annotation tool, and if labeling a single mask takes more than 30 s the annotator moves on to the next image. As labeled images accumulate and training continues, the image encoder backbone evolves from ViT-B to ViT-H; this annotate-and-retrain cycle was repeated 6 times. By the end, the average annotation time per mask dropped from 34 s to 14 s, and the average number of masks per image grew from 20 to 44. In total, 4.3M masks over 120k images were collected in this stage.

Semi-automatic stage:

The main purpose of this stage is to increase mask diversity and thereby improve the model's segmentation ability. Confident masks are first filled in automatically by the model, so annotators concentrate on the remaining, harder objects; because labeling now focuses on these more diverse masks, the average annotation time goes back up to 34 s per mask. The average number of masks per image increases from 44 to 72. An additional 5.9M masks over 180k images were collected in this stage.

Fully automatic stage:

This stage uses the model for fully automatic annotation. Each image is prompted with a regular 32×32 grid of points to cover the objects in the image. Reliable masks are selected by filtering on the model's predicted IoU, and an NMS step removes duplicates to produce the final masks.

A total of 1.1B masks and 11M images were collected at this stage.
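This fully automatic procedure corresponds to the automatic mask generator provided in the repository; a minimal usage sketch follows (checkpoint and image paths are placeholders, and the threshold values shown are illustrative):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,           # the 32x32 grid of point prompts
    pred_iou_thresh=0.88,         # keep only masks the model itself rates as reliable
    stability_score_thresh=0.95,  # additional stability filter
    box_nms_thresh=0.7,           # NMS to remove duplicate masks
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with 'segmentation', 'predicted_iou', ...
print(len(masks), "masks generated")
```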

Losses and training

The training process uses a linear combination of focal loss and dice loss as the final mask loss.
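A minimal sketch of such a combined mask loss is given below; the 20:1 focal-to-dice weighting is an assumption about the paper's setting and should be treated as configurable.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss over per-pixel mask logits."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss; logits and targets are (B, H, W)."""
    prob = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    inter = (prob * targets).sum(-1)
    union = prob.sum(-1) + targets.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def mask_loss(logits, targets, focal_weight=20.0, dice_weight=1.0):
    """Linear combination of focal and dice loss used to supervise mask prediction."""
    return focal_weight * focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)
```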

During training, prompts are randomly sampled for each ground-truth mask, simulating interactive use.

Dataset

The large-scale segmentation dataset SA-1B contains 1.1 billion masks and 11 million images, roughly 400 times more masks than comparable segmentation datasets.

Images

The original images have a resolution of about 3300×4950 pixels. For storage and distribution reasons, the released images are downsampled so that the shortest side is 1500 pixels. Even so, this is far larger than, for example, the roughly 480×640 images of the COCO dataset.

Masks

The masks produced by the data engine are of high annotation quality.

Mask quality

From the full dataset, 500 images (about 50,000 masks) were randomly sampled, and experts were asked to refine these masks with professional annotation tools; the refined masks were then compared against the data-engine masks by IoU. The result: 94% of mask pairs have an IoU above 90%, and 97% have an IoU above 75%. For reference, inter-annotator IoU consistency reported in prior work is roughly 85–91%.
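The comparison metric here is the standard mask IoU; for reference, a minimal implementation over binary masks:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0
```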

Mask properties

SA-1B covers a broader range of images, with 11 times more images and 400 times more masks than the next largest segmentation dataset. SA-1B also contains a larger share of small and medium masks. Mask shape diversity is analyzed through the concavity of the mask contours, and SA-1B's concavity distribution turns out to be broadly similar to that of other segmentation datasets.
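Here, concavity can be computed as one minus the mask area divided by the area of the mask's convex hull; a small sketch using scikit-image (assuming binary masks):

```python
import numpy as np
from skimage.morphology import convex_hull_image

def mask_concavity(mask: np.ndarray) -> float:
    """Concavity = 1 - mask_area / convex_hull_area, in [0, 1]; 0 means fully convex."""
    mask = mask.astype(bool)
    hull = convex_hull_image(mask)
    hull_area = hull.sum()
    return 1.0 - mask.sum() / hull_area if hull_area else 0.0
```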

RAI Analysis

A Responsible AI (RAI) analysis shows that SA-1B's images are drawn from regions across the world, and that the model behaves similarly across perceived age, gender presentation, and skin tone rather than favoring any group. Fairness with respect to people was evaluated using the More Inclusive Annotations for People (MIAP) dataset.

 

Experimental results:

On many datasets, the SAM method outperforms the RITM method.

As the number of prompt points increases, SAM's segmentation quality keeps improving, but as the points grow from 1 to 9 the gap between SAM and other interactive segmentation methods shrinks. With 9 prompt points, SAM falls slightly below some dedicated methods, because SAM was not designed to chase the highest possible IoU under many prompts.

SAM's segmentation quality on medium and large objects, and on rare as well as common objects, is better than that of the other segmentation methods compared.

 

SAM can also perform text-based segmentation (text-to-mask). When the text prompt alone produces an inaccurate segmentation, adding a point prompt can improve the result.

Limitations

SAM performs relatively poorly on fine-grained structures and can miss or hallucinate small disconnected components; in addition, its segmentation boundaries are not especially crisp.

The text-to-mask task is still at an exploratory stage; it is not yet robust, and there is considerable room for improvement.

Conclusion:

SAM is the first to bring the idea of foundation models with zero-shot transfer to the field of image segmentation: the model can perform segmentation inference directly in new usage scenarios without any additional training. The paper contributes the SAM segmentation model and the SA-1B segmentation dataset.
