[Image Segmentation] Comparison of the large models Segment Anything, FastSAM and MobileSAM

Insert image description here



Foreword: Segment Anything

Meta released the image segmentation model Segment Anything Model (SAM) this year. SAM has learned a general concept of objects and can generate masks for any object in any image or video, even object categories and image types not encountered during training. SAM is general enough to cover a wide range of use cases and has strong zero-shot transfer capabilities.

SAM's "Split Everything" function implements object and area segmentation based on points, boxes, text and other types of instructions. The training adoptsend-to-end Transformer structure and is trained on tens of millions of supervision samples, which shows strong generalization in downstream tasks such as edge detection, object detection, salient object recognition, and industrial anomaly detection. As shown in the figure, SAM has three components: image encoder, flexible prompt encoder and fast mask decoder
Insert image description here
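
To make the prompt-based interface concrete, below is a minimal usage sketch based on Meta's open-source segment_anything package; the checkpoint filename, image path, and prompt coordinates are illustrative assumptions, not values taken from this post.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H SAM model (checkpoint filename assumed from the official release).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# SAM expects an RGB HxWxC uint8 image.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

predictor = SamPredictor(sam)
predictor.set_image(image)                 # image encoder runs once per image

# Point prompt: one foreground click (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,                 # several candidate masks to resolve ambiguity
)

# Box prompt: segment the object inside an (x1, y1, x2, y2) box.
box_masks, _, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
```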

Thanks to its impressive zero-shot transfer performance, its versatility, and its compatibility with other models, SAM can be used for advanced vision applications such as image editing with fine-grained control. However, many segmentation models need to run on resource-constrained edge devices, such as mobile applications, which calls for lightweight improvements to SAM.

For the detailed principles and code, see the blogger's post "[Image Segmentation] Segment Anything (Meta AI) Paper Explained".


1. FastSAM

On June 22, 2023, a research team at the Institute of Automation, Chinese Academy of Sciences proposed FastSAM for the "segment anything" task. It rethinks SAM's design paradigm and uses a two-stage algorithm: "full instance segmentation + prompt-based mask output". By introducing hand-crafted structural priors, FastSAM greatly reduces the computational redundancy of the original Transformer architecture on this general perception task, achieving a 50x speedup that facilitates industrial deployment.

"Full instance segmentation + instruction-based mask output" two-stage algorithm, the method structure is shown in the figure.
Insert image description here

1. Basic principles

FastSAM co-designs the task and the method at each stage:

(1) In the first stage, taking advantage of the fact that most objects occupy only local regions of an image, a CNN backbone built from convolution operators, which naturally have local connectivity, is used to construct a full instance segmentation network. This structure is more compact than a Transformer and has lower computational cost, while still retaining the ability to represent and discriminate objects and image regions.

(2) In the second stage, prompt-based mask output is performed by matching in physical space and in an image-text alignment space, based on the full instance segmentation masks from the first stage: for a point prompt, the region associated with the point is output, with support for multi-point mode and background-point suppression; for a box prompt, the segmentation mask with maximum IoU against the box is output; for a text prompt, the image-text alignment network CLIP maps the mask regions and the text instruction into the same space, computes their similarity, and outputs the most similar region (a minimal sketch of this matching logic follows).
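
The sketch below illustrates the second-stage matching logic described above; it is not the FastSAM repository's actual code. It assumes the first stage has already produced a list of binary instance masks (boolean H×W NumPy arrays) and uses OpenAI's clip package for the text prompt; all helper names are hypothetical.

```python
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


def mask_from_point(masks, point, background_points=()):
    """Point prompt: keep masks containing the foreground click, drop those that
    also contain a background click, and return the most specific (smallest) one."""
    x, y = point
    cands = [m for m in masks if m[y, x]]
    cands = [m for m in cands
             if not any(m[by, bx] for (bx, by) in background_points)]
    return min(cands, key=lambda m: m.sum()) if cands else None


def mask_from_box(masks, box):
    """Box prompt: return the mask whose bounding box has maximum IoU with `box`."""
    x1, y1, x2, y2 = box

    def iou(m):
        ys, xs = np.nonzero(m)
        if xs.size == 0:
            return 0.0
        mx1, my1, mx2, my2 = xs.min(), ys.min(), xs.max(), ys.max()
        iw = max(0, min(x2, mx2) - max(x1, mx1))
        ih = max(0, min(y2, my2) - max(y1, my1))
        inter = iw * ih
        union = (x2 - x1) * (y2 - y1) + (mx2 - mx1) * (my2 - my1) - inter
        return inter / (union + 1e-6)

    return max(masks, key=iou)


def mask_from_text(image, masks, text, device="cpu"):
    """Text prompt: embed each mask's crop and the text with CLIP and
    return the mask whose crop is most similar to the text. `image` is a PIL.Image."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        t = model.encode_text(clip.tokenize([text]).to(device))
        t = t / t.norm(dim=-1, keepdim=True)
        sims = []
        for m in masks:
            ys, xs = np.nonzero(m)
            crop = image.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
            v = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            v = v / v.norm(dim=-1, keepdim=True)
            sims.append((v @ t.T).item())
    return masks[int(np.argmax(sims))]
```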

With this structure, FastSAM was trained on a randomly selected 2% of the images in the SAM team's open-source SA-1B dataset. It achieves results on par with SAM while running about 50x faster than the most commonly used 32×32 point-prompt version of SAM, enabling real-time "segment anything". The comparison is as follows:

Insert image description here

2. Comparison of experimental results

Insert image description here


2. MobileSAM

Paper address: https://arxiv.org/pdf/2306.14289.pdf

Code address: https://github.com/ChaoningZhang/MobileSAM

Paper title:

Insert image description here
The paper finds that training a new lightweight SAM directly from the original SAM leads to unsatisfactory performance, mainly because the image encoder and mask decoder are optimized in a coupled way, so decoupled distillation is proposed. Specifically, the knowledge of the ViT-H image encoder in the original SAM is distilled into a lightweight image encoder, which is then automatically compatible with the mask decoder of the original SAM.

Insert image description here
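
As an illustration of decoupled distillation, here is a minimal PyTorch training-loop sketch that trains the lightweight student encoder to reproduce the frozen teacher encoder's image embeddings, without involving the mask decoder. `teacher_encoder`, `student_encoder`, and `dataloader` are placeholder names, and the hyperparameters are assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

# Placeholders: teacher_encoder = frozen SAM ViT-H image encoder,
# student_encoder = lightweight encoder (e.g. a TinyViT-style model),
# dataloader yields preprocessed image batches of shape (B, 3, 1024, 1024).
teacher_encoder.eval()
for p in teacher_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student_encoder.parameters(), lr=1e-3)

for images in dataloader:
    with torch.no_grad():
        target = teacher_encoder(images)   # teacher image embeddings, e.g. (B, 256, 64, 64)
    pred = student_encoder(images)         # student embeddings, same shape
    loss = F.mse_loss(pred, target)        # match embeddings only -- no mask decoder involved
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```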
Comparison of SAM model parameters with different image encoders:

Insert image description here

Training is completed on a single GPU in less than a day, and the resulting lightweight SAM is called MobileSAM. It is over 60x smaller than the original SAM, yet its performance is comparable. In terms of inference speed, MobileSAM takes about 10 ms per image: 8 ms for the image encoder and 2 ms for the mask decoder. With superior performance and greater versatility, MobileSAM is also 7x smaller and 4x faster than the concurrent FastSAM, making it better suited to mobile applications.

1. Framework

SAM consists of a ViT-based image encoder and a prompt-guided mask decoder. The image encoder takes an image as input and generates an embedding, which is then fed to the mask decoder. The mask decoder generates a mask that cuts the prompted object out of the background based on prompts such as points (or boxes). Additionally, SAM can generate multiple masks for the same prompt to resolve ambiguity, which provides valuable flexibility. With this in mind, MobileSAM keeps the SAM pipeline: it first employs a ViT-based encoder to generate the image embedding and then a prompt-guided decoder to generate the required mask. This pipeline is optimized for "segment anything" and can be used for SAM's downstream tasks.

Coupled knowledge distillation of SAM. The left image represents fully coupled distillation, and the right image represents semi-coupled distillation.
Insert image description here

The goal of this work is to obtain a mobile-friendly SAM (MobileSAM) by making SAM lightweight. The prompt-guided mask decoder in the original SAM has fewer than 4M parameters and is already lightweight. However, the default image encoder in the original SAM is based on ViT-H with over 600M parameters, which is very heavyweight. The key to a mobile-friendly SAM is therefore to replace the heavyweight image encoder with a lightweight one. The figure below compares coupled and decoupled distillation of SAM with ViT-B as the image encoder: decoupled distillation performs better while requiring less than 1% of the computing resources.
Insert image description here

2. Experiment

  1. The figures below compare the results of MobileSAM and the original SAM under point and box prompts:

Insert image description here
Insert image description here

  2. The following figures compare the segmentation results of SAM, FastSAM and MobileSAM. You can see:

MobileSAM aligns surprisingly well with the original SAM's results, while FastSAM produces some unsatisfactory results.
FastSAM often produces non-smooth mask edges, whereas SAM and MobileSAM do not have this problem.
Insert image description here
Insert image description here
Insert image description here

3. Code

The forward code is very simple, and the pre-trained model (only 14M) is included in the original repository:

Insert image description here
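
For reference, here is a minimal forward-pass sketch in the style of the MobileSAM repository, whose interface mirrors the segment_anything package; the checkpoint and image paths are assumptions.

```python
import cv2
import torch
from mobile_sam import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# "vit_t" is MobileSAM's lightweight TinyViT image encoder;
# mobile_sam.pt is the released checkpoint (path assumed).
mobile_sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
mobile_sam.to(device=device)
mobile_sam.eval()

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# "Segment everything" mode: generate masks for all objects in the image.
mask_generator = SamAutomaticMaskGenerator(mobile_sam)
masks = mask_generator.generate(image)  # list of dicts with 'segmentation', 'area', 'bbox', ...
print(f"{len(masks)} masks generated")
```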


Origin blog.csdn.net/qq_45752541/article/details/133708751