Paper: FASTER SEGMENT ANYTHING: TOWARDS LIGHTWEIGHT SAM FOR MOBILE APPLICATIONS
Code: https://github.com/ChaoningZhang/MobileSAM
Source: Kyung Hee University, South Korea
Time: 2023.06.27
1. Background
The Segment Anything Model (SAM) proposed by Meta has attracted widespread attention for its ability to segment any target of interest. As shown in Figure 1, SAM consists of two parts:
- ViT-based image encoder
- prompt-guided mask decoder
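This two-part design can be sketched with toy stand-ins (hypothetical modules, not the real SAM architecture): the heavy image encoder computes a dense embedding once per image, and the light prompt-guided mask decoder reuses that embedding for each prompt.

```python
import torch
import torch.nn as nn

# Toy stand-ins for SAM's two parts (hypothetical modules, NOT the real
# architecture): a heavy image encoder computes a dense embedding once per
# image; a light prompt-guided mask decoder combines it with a prompt.

class ToyImageEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # One strided conv plays the role of ViT patch embedding.
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, image):              # image: (B, 3, H, W)
        return self.patchify(image)        # embedding: (B, C, H/16, W/16)

class ToyMaskDecoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.prompt_proj = nn.Linear(2, embed_dim)  # encode an (x, y) point prompt
        self.head = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, image_embedding, point):      # point: (B, 2)
        p = self.prompt_proj(point)[:, :, None, None]
        return self.head(image_embedding + p)       # mask logits: (B, 1, h, w)

encoder, decoder = ToyImageEncoder(), ToyMaskDecoder()
image = torch.randn(1, 3, 64, 64)
embedding = encoder(image)                          # computed once per image
mask_logits = decoder(embedding, torch.tensor([[0.5, 0.5]]))
print(mask_logits.shape)  # torch.Size([1, 1, 4, 4])
```

The asymmetry matters for what follows: the encoder dominates the cost, while the decoder is cheap enough to run per prompt.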
SAM is a class-agnostic, promptable segmentation model that can be combined with other models for further downstream tasks, such as text-guided segmentation and image editing.
With the popularity of mobile devices, much image editing now happens on the mobile side, but SAM's image encoder is far too heavy for such hardware, so a mobile-friendly SAM is urgently needed.
This paper therefore proposes MobileSAM, a lightweight SAM designed for mobile devices.
2. Method
The most direct idea: since the image encoder is too large, replace it with a smaller one.
For example, replace ViT-H with ViT-B; the parameter counts of the different image encoder sizes are shown in Table 3:
However, training a SAM from scratch (with ViT-L or ViT-B as the image encoder) requires 128 GPUs for several days, so the cost of retraining is very high.
The authors argue that the optimization difficulty comes from the image encoder and mask decoder being coupled together.
They therefore decouple the two:
- First, distill knowledge from the ViT-H image encoder into a tiny ViT
- Then, fine-tune the mask decoder to align it with the distilled small image encoder
This turns the task of designing a lightweight SAM into a decoupled-distillation problem, which is simple and efficient.
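The two-step recipe can be sketched as follows, with small convolutions standing in for the frozen ViT-H teacher, the tiny-ViT student, and SAM's mask decoder (all module shapes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Toy sketch of the two-step decoupled recipe; small convs stand in for the
# frozen ViT-H teacher, the tiny-ViT student, and SAM's mask decoder.
teacher = nn.Conv2d(3, 8, kernel_size=4, stride=4)
student = nn.Conv2d(3, 8, kernel_size=4, stride=4)
decoder = nn.Conv2d(8, 1, kernel_size=1)
for p in teacher.parameters():
    p.requires_grad_(False)                     # teacher stays frozen

images = torch.randn(4, 3, 32, 32)

# Step 1: distill the image encoder alone -- match the teacher's embeddings.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
distill_loss = nn.functional.mse_loss(student(images), teacher(images))
opt.zero_grad(); distill_loss.backward(); opt.step()

# Step 2 (optional per the paper): freeze the distilled student and
# fine-tune the mask decoder on top of it so the two stay aligned.
for p in student.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(decoder.parameters(), lr=1e-4)
dummy_target = torch.zeros(4, 1, 8, 8)          # placeholder mask target
finetune_loss = nn.functional.mse_loss(decoder(student(images)), dummy_target)
opt2.zero_grad(); finetune_loss.backward(); opt2.step()
```

Note that neither phase ever backpropagates through both networks at once, which is exactly what "decoupled" buys.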
MobileSAM reduces the encoder parameters by about 100x and the overall parameters by about 60x.
MobileSAM inference speed:
- Single-image inference takes about 10 ms (8 ms for the image encoder, 2 ms for the mask decoder)
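The per-component split can be measured along these lines; the modules below are toy stand-ins for MobileSAM's real components, so the absolute numbers are meaningless and only the measurement pattern is the point.

```python
import time
import torch

# Hypothetical timing sketch of the encoder/decoder split; toy modules
# stand in for MobileSAM's components, so these numbers mean nothing.
encoder = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)
decoder = torch.nn.Conv2d(256, 1, kernel_size=1)
image = torch.randn(1, 3, 1024, 1024)           # SAM-style 1024x1024 input

with torch.no_grad():
    t0 = time.perf_counter()
    embedding = encoder(image)
    t1 = time.perf_counter()
    mask = decoder(embedding)
    t2 = time.perf_counter()

print(f"encoder: {(t1 - t0) * 1e3:.1f} ms, decoder: {(t2 - t1) * 1e3:.1f} ms")
```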
Speed comparison between MobileSAM and FastSAM:
- MobileSAM is 7x smaller and 4x faster than FastSAM
2.1 Coupled distillation
An intuitive way to obtain a mobile-friendly SAM is to retrain a SAM with a small image encoder, but the training cost is too high, so distillation can be considered instead. As shown on the left side of Figure 2, the mask produced by the large model is used to guide the mask of the small model (coupled distillation).
2.2 From semi-coupled distillation to decoupled distillation
When the mask is used directly to guide distillation, the difficulty is that the image encoder and mask decoder are interdependent. A compromise:
- image encoder: distilled
- mask decoder: frozen (and optionally fine-tuned later); the mask decoder in SAM is small, so its structure is kept unchanged
As shown on the right side of Figure 2, this is called semi-coupled distillation: the image encoder is distilled while the mask decoder's parameters are frozen. Freezing keeps the mask decoder's behavior fixed, so the distillation signal is not confounded by changes in the decoder.
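Semi-coupled distillation can be sketched as follows (toy modules; plain MSE on masks stands in here for SAM's actual focal + dice mask losses): the frozen decoder is shared by teacher and student, and gradients flow back through it into the student encoder.

```python
import torch
import torch.nn as nn

# Toy sketch of semi-coupled distillation: the mask decoder is copied and
# frozen, the loss is taken on the decoder's *mask* output, and gradients
# flow back through the frozen decoder into the student encoder.
teacher_enc = nn.Conv2d(3, 8, kernel_size=4, stride=4)
student_enc = nn.Conv2d(3, 8, kernel_size=4, stride=4)
decoder = nn.Conv2d(8, 1, kernel_size=1)
for module in (teacher_enc, decoder):
    for p in module.parameters():
        p.requires_grad_(False)                  # freeze teacher and decoder

images = torch.randn(2, 3, 32, 32)
with torch.no_grad():
    target_mask = decoder(teacher_enc(images))   # teacher-guided mask

student_mask = decoder(student_enc(images))      # same frozen decoder
loss = nn.functional.mse_loss(student_mask, target_mask)
loss.backward()

# Gradients reach only the student encoder; the frozen decoder gets none.
assert all(p.grad is None for p in decoder.parameters())
assert all(p.grad is not None for p in student_enc.parameters())
```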
However, this method still has a problem: because the prompts are random, the mask decoder's output keeps varying, which makes optimization difficult.
Therefore, this paper uses fully decoupled distillation:
- Distill the image embedding directly
- Completely decouple the image encoder from the mask decoder
- This allows a plain MSE loss to be used, without combining focal and dice losses
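To see why this simplifies training, contrast the mask-level losses the coupled route needs with the single embedding-level MSE (the focal/dice formulas below are generic textbook versions, not SAM's exact hyperparameters):

```python
import torch

# Generic focal and dice losses (illustrative formulas, not SAM's exact
# hyperparameters) versus the single embedding-level MSE that fully
# decoupled distillation needs.

def dice_loss(logits, target, eps=1e-6):
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

def focal_loss(logits, target, gamma=2.0):
    prob = torch.sigmoid(logits)
    pt = torch.where(target == 1, prob, 1 - prob)       # prob of the true class
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, target, reduction="none")               # equals -log(pt)
    return ((1 - pt) ** gamma * bce).mean()

# Coupled / semi-coupled routes supervise masks:
mask_logits = torch.randn(1, 1, 16, 16)
gt_mask = (torch.rand(1, 1, 16, 16) > 0.5).float()
mask_loss = focal_loss(mask_logits, gt_mask) + dice_loss(mask_logits, gt_mask)

# Fully decoupled distillation only needs MSE between image embeddings:
student_emb = torch.randn(1, 256, 4, 4)
teacher_emb = torch.randn(1, 256, 4, 4)
embed_loss = torch.nn.functional.mse_loss(student_emb, teacher_emb)
```

The embedding target is also deterministic for a given image, sidestepping the prompt-randomness problem entirely.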
Decoupled distillation requires only about 1% of the computing resources of coupled distillation, yet still reaches 0.75 mIoU:
3. Results