[AIGC] 18. MobileSAM | The first faster SAM designed specifically for mobile devices


Paper: FASTER SEGMENT ANYTHING: TOWARDS LIGHTWEIGHT SAM FOR MOBILE APPLICATIONS

Code: https://github.com/ChaoningZhang/MobileSAM

Source: Kyung Hee University, South Korea

Time: 2023.06.27

1. Background

The Segment Anything Model (SAM) proposed by Meta has attracted widespread attention for its ability to segment any target of interest. As shown in Figure 1, SAM consists of two parts:

  • ViT-based image encoder
  • prompt-guided mask decoder

SAM predicts class-agnostic (label-free) masks and can be combined with other models for downstream tasks such as text-guided segmentation and image editing.

[Figure 1: SAM architecture, a ViT-based image encoder followed by a prompt-guided mask decoder]
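To make this two-part design concrete, below is a minimal usage sketch in the style of the MobileSAM repository README; the "vit_t" model key, the checkpoint path, and the prompt values are assumptions for illustration. The image is encoded once, and the prompt-guided decoder can then be queried cheaply with different prompts.

```python
import numpy as np
import torch
from mobile_sam import sam_model_registry, SamPredictor   # API mirrors segment-anything

device = "cuda" if torch.cuda.is_available() else "cpu"

# "vit_t" is the lightweight TinyViT image encoder shipped with MobileSAM
sam = sam_model_registry["vit_t"](checkpoint="./weights/mobile_sam.pt")
sam.to(device).eval()

predictor = SamPredictor(sam)
image = np.zeros((480, 640, 3), dtype=np.uint8)    # placeholder HxWx3 RGB image
predictor.set_image(image)                         # runs the image encoder once

# Prompt-guided mask decoder: the cached embedding can be reused for many prompts
point = np.array([[320, 240]])                     # one foreground click (x, y)
label = np.array([1])
masks, scores, _ = predictor.predict(point_coords=point, point_labels=label)
```

The original SAM exposes the same two-stage interface; MobileSAM swaps only the image encoder.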

With the popularity of mobile devices, much image editing now happens on the mobile side, but SAM's image encoder is very heavy, so a mobile-friendly SAM is urgently needed.

This paper therefore proposes MobileSAM, a lightweight SAM designed for mobile devices.


2. Method

The straightforward idea: since the image encoder is too large, replace it with a smaller one.

For example, replace ViT-H with ViT-B; the parameter counts of image encoders of different sizes are shown in Table 3:

[Table 3: parameter counts of SAM with image encoders of different sizes]

Training a SAM from scratch with ViT-L or ViT-B as the image encoder still requires 128 GPUs for several days, so the cost of retraining is also very high.
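For reference, the size gap behind Table 3 comes down to counting encoder parameters. The sketch below is a generic helper; the commented constructor calls assume the segment-anything and MobileSAM packages and are purely illustrative.

```python
import torch

def count_params(module: torch.nn.Module) -> float:
    """Number of parameters in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# Illustrative usage, assuming the segment-anything and MobileSAM packages are installed:
# from segment_anything import build_sam_vit_h, build_sam_vit_b
# from mobile_sam import sam_model_registry
# print(count_params(build_sam_vit_h().image_encoder))              # ViT-H image encoder
# print(count_params(build_sam_vit_b().image_encoder))              # ViT-B image encoder
# print(count_params(sam_model_registry["vit_t"]().image_encoder))  # TinyViT image encoder
```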

The authors argue that the optimization difficulty mainly comes from the image encoder and mask decoder being coupled together.

Therefore, the authors decouple the image encoder and the mask decoder:

  • First, distill knowledge from the ViT-H image encoder into a tiny ViT (TinyViT)
  • Then, fine-tune the mask decoder to align it with the distilled small image encoder

With this, the task of designing a lightweight SAM reduces to decoupled distillation, which is simple and efficient.

MobileSAM reduces the encoder parameters by about 100x and the overall parameters by about 60x.

MobileSAM inference speed:

  • Single-image inference takes about 10 ms in total (about 8 ms for the image encoder and 2 ms for the mask decoder)

Speed comparison between MobileSAM and FastSAM:

  • MobileSAM is 7x smaller and 4x faster than FastSAM
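These latency claims can be sanity-checked with a rough benchmark like the sketch below. It reuses a `predictor` built as in the earlier snippet, assumes a CUDA device, and the exact numbers will of course vary with hardware and image size.

```python
import time
import numpy as np
import torch

def time_ms(fn, n: int = 50) -> float:
    """Average wall-clock time of fn() in milliseconds."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n * 1e3

image = np.zeros((1024, 1024, 3), dtype=np.uint8)      # placeholder image
point, label = np.array([[512, 512]]), np.array([1])

# Image encoder cost is dominated by set_image()
enc_ms = time_ms(lambda: predictor.set_image(image))
# Mask decoder cost: prompt-guided prediction on the cached embedding
dec_ms = time_ms(lambda: predictor.predict(point_coords=point, point_labels=label))
print(f"encoder ~{enc_ms:.1f} ms, decoder ~{dec_ms:.1f} ms")
```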

2.1 Coupled distillation

An intuitive way to obtain a mobile-friendly SAM is to retrain SAM with a small image encoder, but the training cost is too high, so distillation can be considered instead: as shown on the left side of Figure 2, the masks produced by the large model are used to guide the masks of the small model.

[Figure 2: fully-coupled distillation (left) vs. semi-coupled distillation (right)]
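In rough pseudocode, coupled distillation looks like the sketch below. Everything here is a stand-in (the `DummySAM` module only mimics input/output shapes), so it illustrates the training signal, the teacher's masks guiding the student's masks, rather than the authors' actual code; the paper's coupled setup would use focal and dice losses where BCE is used here for brevity.

```python
import torch

class DummySAM(torch.nn.Module):
    """Stand-in for a full SAM (image encoder + prompt-guided mask decoder).

    Purely illustrative: a real SAM embeds the image with a ViT and decodes masks
    from (embedding, prompt); this tiny conv net only mimics the output shape.
    """
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, images, prompts):
        # prompts are ignored here; a real mask decoder conditions on them
        return self.net(images)                  # (B, 1, H, W) mask logits

teacher_sam = DummySAM().eval()                  # stands in for the original ViT-H SAM
student_sam = DummySAM()                         # stands in for SAM with a small image encoder
optimizer = torch.optim.AdamW(student_sam.parameters(), lr=1e-4)

images = torch.randn(2, 3, 256, 256)             # placeholder image batch
prompts = torch.tensor([[128.0, 128.0]])         # placeholder point prompt

with torch.no_grad():
    teacher_masks = torch.sigmoid(teacher_sam(images, prompts))   # soft mask targets
student_logits = student_sam(images, prompts)    # small encoder and decoder trained jointly

# Mask-level guidance; the paper's coupled setup uses focal + dice, BCE is a simplification here
loss = torch.nn.functional.binary_cross_entropy_with_logits(student_logits, teacher_masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```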

2.2 From semi-coupled distillation to decoupled distillation

When the teacher's masks are used directly to guide the distillation, the difficulty is that the image encoder and mask decoder are coupled and interdependent, so a compromise is:

  • image encoder: distilled
  • mask decoder: fine-tuned (SAM's mask decoder is already small, so its structure is kept unchanged)

As shown on the right side of Figure 2, this is called semi-coupled distillation: the image encoder is distilled while the mask decoder's parameters are frozen. Freezing keeps the mask decoder's behaviour unchanged, so it is not degraded by a still-imperfect image encoder during training.
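Concretely, "distill the encoder while the decoder stays frozen" amounts to a fragment like the following. It assumes a SAM-style model object with `image_encoder`, `prompt_encoder`, and `mask_decoder` sub-modules (as in the segment-anything codebase) and is a sketch, not the authors' training code.

```python
import torch

# student_sam: a SAM-style model (small image encoder + SAM's prompt encoder and mask decoder).
# Freeze everything except the image encoder, so only the encoder is optimized.
for p in student_sam.mask_decoder.parameters():
    p.requires_grad_(False)
for p in student_sam.prompt_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student_sam.image_encoder.parameters(), lr=1e-4)
# Training then proceeds as in the coupled loop above: the loss is still computed on the
# decoder's mask outputs, so randomly sampled prompts still enter the optimization.
```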

However, this approach still has a problem: the prompts are random, which makes the mask decoder's output variable, so optimizing the image encoder through it remains difficult.

Therefore, the distillation method in this paper:

  • Fully decoupled distillation
  • Distill the image embedding directly from the ViT-H encoder into the small image encoder
  • Completely decouple the image encoder from the mask decoder
  • In this way, a plain MSE loss can be used directly, without combining focal and dice losses (see the sketch below)

[Figure: decoupled distillation, distilling the image embedding directly from the ViT-H encoder to the small encoder]
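A minimal sketch of this decoupled, embedding-level distillation is shown below. The teacher and student encoders and the data loader are dummy placeholders chosen only to match SAM's embedding shape (256 channels on a 64x64 grid for a 1024x1024 input); in the real setup the teacher would be SAM's pretrained ViT-H image encoder and the student a TinyViT with a matching output neck.

```python
import torch
import torch.nn.functional as F

# Placeholders for illustration only: in practice teacher_encoder is SAM's pretrained
# ViT-H image encoder (frozen) and student_encoder is a TinyViT whose neck outputs the
# same embedding shape, here (B, 256, 64, 64) for a 1024x1024 input.
teacher_encoder = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)
student_encoder = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)
image_loader = [torch.randn(2, 3, 1024, 1024)]   # stands in for a loader of unlabeled images

teacher_encoder.eval()
for p in teacher_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student_encoder.parameters(), lr=1e-3)

for images in image_loader:
    with torch.no_grad():
        target = teacher_encoder(images)         # teacher image embedding
    pred = student_encoder(images)               # student image embedding, same shape
    loss = F.mse_loss(pred, target)              # plain MSE replaces focal + dice losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After this stage, the distilled encoder is plugged in front of SAM's original prompt encoder and mask decoder, which can then be fine-tuned to align with the new embeddings (the second step described above).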

Decoupled distillation requires only about 1% of the computing resources of coupled distillation, yet reaches 0.75 mIoU:

[Table: coupled vs. decoupled distillation, computation and mIoU comparison]

3. Results

[Figures: qualitative segmentation results of MobileSAM]
