Interpretation of the paper: MobileSAM | FASTER SEGMENT ANYTHING: TOWARDS LIGHTWEIGHT SAM FOR MOBILE APPLICATIONS

Date of publication: 2023.06.27
Paper address: https://arxiv.org/pdf/2306.14289.pdf
Project code: https://github.com/ChaoningZhang/MobileSAM

Segment Anything Model (SAM) is a prompt-guided vision foundation model for cutting out the object of interest from its background. Since the Meta research team released the SA project, SAM has attracted widespread attention due to its impressive zero-shot performance and its compatibility with other models for advanced vision applications, such as image editing with fine-grained control. Many of these use cases need to run on resource-constrained edge devices, such as mobile applications. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when training resources are limited. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, and we therefore propose a decoupled distillation method. Specifically, we distill the knowledge from the image encoder ViT-H of the original SAM into a lightweight image encoder, which is automatically compatible with the mask decoder of the original SAM. Training can be completed on a single GPU in less than a day, and the resulting lightweight SAM, dubbed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. Regarding inference speed, MobileSAM runs in about 10 ms per image: 8 ms on the image encoder and 2 ms on the mask decoder. With superior performance and higher versatility, our MobileSAM is 7 times smaller and 4 times faster than the concurrent FastSAM, making it more suitable for mobile applications.

Key interpretation

Basic background

SAM is to the vision world what ChatGPT is to NLP: a foundation model that can be applied to a wide range of image processing tasks. With the image embedding computed on the server side, the lightweight mask decoder of SAM can run on the web/edge side, but its image encoder is far too heavy for that setting.

Highlights

1. A decoupled distillation method (only the image encoder is distilled) adapts a lightweight backbone to the original mask decoder. The entire training takes less than a day on a single GPU, reducing the encoder parameters by a factor of 100 and the total parameters by a factor of 60.
2. The distilled image encoder runs in 8 ms and the mask decoder in 2 ms, for an overall running time of about 10 ms per image, 4 times faster than FastSAM.
3. A lightweight image encoder combining convolution and transformer blocks is designed; at the same time, to speed up training, the image embeddings predicted by the teacher model are pre-computed and saved, which removes the cost of the teacher's forward pass during knowledge distillation.
4. Only 1% of the SA-1B dataset is used for training, while FastSAM uses 2% of the data.
5. The ablation study in Section 4.2 shows that, with the same number of iterations, a larger batch size improves performance on this massive dataset; at the same time, the number of training epochs on such massive data is typically only in the single digits.

In essence, MobileSAM is a knowledge distillation of the ViT image encoder in SAM: it reuses the original SAM's mask decoder and trains on only 1% of the original data, which is why the model can be trained within a single day.

1 Introduction

ChatGPT has revolutionized the field of NLP, marking a breakthrough for generative artificial intelligence (AIGC, AI-Generated Content). Behind this success are foundation models (Bommasani et al. [2021]) trained on web-scale text datasets, i.e., the GPT family of models. Following the success of foundation models in NLP, many works have trained image and text encoders jointly with contrastive learning. Recently, the Meta research team released the "Segment Anything" project, which proposes a prompt-guided vision foundation model named SAM that is considered the GPT moment of vision. SAM consists of two components: a ViT-based image encoder and a prompt-guided mask decoder, which work in sequence (see Figure 1).

Since its appearance, SAM has attracted widespread attention for several reasons. First, it is the first to show that vision, like NLP, can pursue a path that combines foundation models with prompt engineering. Second, it is the first to perform label-free segmentation, a fundamental vision task parallel to label prediction. Furthermore, this fundamental task makes SAM compatible with other models, enabling advanced vision applications such as text-guided segmentation and image editing with fine-grained control. However, many of these use cases need to run on resource-constrained edge devices, such as mobile applications. As shown in the official demo, with the image embedding processed on the server side, SAM can work on resource-constrained devices because the mask decoder is lightweight. What makes the SAM pipeline computationally expensive is the huge image encoder. In this work, we investigate how to obtain a lightweight SAM suitable for resource-constrained mobile devices, which we therefore call MobileSAM.

Given that the default image encoder is based on ViT-H, a straightforward way to obtain MobileSAM is to follow the official pipeline of Kirillov et al. [2023] and retrain a new SAM with a smaller image encoder, e.g., replacing ViT-H with the smaller ViT-L or the even smaller ViT-B. Table 1 summarizes the parameters of SAM with image encoders of different scales.

As described by Kirillov et al. [2023], training a new SAM using ViT-L or ViT-B as the image encoder requires 128 GPUs. Such resource-intensive retraining is a significant burden for anyone wishing to reproduce or improve the results. The optimization difficulty mainly comes from the coupled optimization of the image encoder and mask decoder. Based on this understanding, we propose to decouple the optimization of the image encoder and the mask decoder. Specifically, we first distill the knowledge from the default image encoder ViT-H into a tiny ViT. We can then fine-tune the mask decoder of the original SAM to better align it with the distilled image encoder. It is worth emphasizing that this alignment step is optional, since the lightweight image encoder is distilled from the default image encoder, which guarantees its inherent alignment with the default mask decoder.

By transforming the problem of seeking a new SAM pipeline into a decoupled distillation problem, our method has the advantages of simplicity and effectiveness, while being reproducible at low cost (less than a day on a single GPU). The resulting MobileSAM reduces the encoder parameters by a factor of 100 and the total parameters by a factor of 60. Surprisingly, the performance of this lightweight MobileSAM is comparable to that of the original heavyweight SAM, which is an important step towards SAM for mobile applications. For MobileSAM inference, a single image takes only about 10 ms: 8 ms on the image encoder and 2 ms on the mask decoder. It is worth emphasizing that our MobileSAM is 7x smaller and 4x faster than the concurrent FastSAM (Zhao et al. [2023]) while achieving superior performance.

2 Related work

SAM: generalization and versatility. Since its appearance in early April this year, many projects and papers have investigated SAM from different perspectives. Given that SAM claims to be able to segment anything, a series of works report its performance in real-world settings, including medical images, camouflaged objects, and transparent objects. The findings consistently show that SAM works well in general settings but is less effective in the aforementioned challenging tasks. Another important research direction focuses on improving SAM to increase its practicality. Attack-SAM has shown that the output masks of SAM can be easily manipulated by maliciously generated adversarial perturbations. Qiao et al. [2023b] further performed a comprehensive robustness evaluation of SAM, covering style transfer, common corruptions, local occlusion, and adversarial perturbations. They found that SAM is highly robust to common corruptions but not to adversarial perturbations, which is consistent with the findings of Zhang et al. [2023e]. Another line of work focuses on demonstrating the versatility of SAM. Grounded SAM IDEA [2023] is a seminal work that combines Grounding DINO Liu et al. [2023a] with SAM to segment anything given a text input. Specifically, it relies on Grounding DINO to generate a bounding box from text, and the generated bounding box is then used as a prompt for mask segmentation. The masks predicted by SAM carry no category information, and multiple works, Chen et al. [2023], Park [2023], combine SAM with CLIP-like models for semantic segmentation of anything. Beyond object segmentation, several works have also shown its versatility in other fields, including image editing Rombach et al. [2022], Yu et al. [2023], and video object tracking Yang et al. [2023], Zxyang [2023]. Beyond 2D vision, SAM research has also been extended to 3D object reconstruction Shen et al. [2023], Kang et al. [2022], demonstrating its ability to assist in generating a 3D model from a single image. For a complete survey of SAM, interested readers are referred to Zhang et al. [2023c].
There are many studies on SAM, which shows that it truly deserves to be called a new generation of vision foundation model.

ViT: lightweight and efficient. Early mobile vision applications were mainly driven by lightweight CNNs, such as MobileNet Howard et al. [2017] and its improved variants Sandler et al. [2018], Howard et al. [2019]. The core idea of MobileNet is to split a regular convolution block into a depthwise convolution and a pointwise convolution, which greatly reduces model parameters and computation time. Since the appearance of ViT, many researchers have tried to make it lightweight and efficient. Keeping the original ViT design largely intact, Touvron et al. [2020] introduced minor changes and proposed DeiT-Small (DeiT-S) and DeiT-Tiny (DeiT-T), which can be regarded as ViT-Small and ViT-Tiny. MobileViT [2021] is a seminal work that combines ViT with standard convolutions to improve performance and outperforms MobileNet v2. Its main motivation is to exploit the local representation capability of CNNs, a practice followed by multiple subsequent works aimed at improving model speed. Recent advances in lightweight and faster ViTs are complementary to our proposed decoupled distillation for making the next generation of SAM suitable for resource-constrained mobile devices.
Summary: this paragraph surveys prior work on making the ViT model lightweight, which MobileSAM builds on.
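
To make the MobileNet building block mentioned above concrete, here is a minimal PyTorch sketch (not taken from any of the cited papers) of a depthwise-separable convolution: a 3x3 depthwise convolution (one filter per channel) followed by a 1x1 pointwise convolution, which together replace a standard convolution at a fraction of the parameters and FLOPs.

```python
# Minimal sketch of MobileNet's core idea: depthwise conv + pointwise conv.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: groups=in_ch means one 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        # Pointwise: 1x1 conv mixes channels and changes the channel count.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 56, 56])
```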

3 Mobile-Friendly SAM

3.1 Background and Project Goal

Background on SAM. Here we first summarize the structure of SAM and how it works. SAM consists of a ViT-based image encoder and a prompt-guided mask decoder. The image encoder takes an image as input and generates an embedding, which is then fed into the mask decoder. The mask decoder generates a mask that cuts the object of interest out of the background, based on a prompt such as a point (or box). Furthermore, SAM allows generating multiple masks for the same prompt to resolve ambiguity, which provides valuable flexibility. With this in mind, this work keeps the SAM pipeline, which first employs a ViT-based encoder to generate the image embedding and then employs a prompt-guided decoder to generate the desired mask. This pipeline is optimized for "segment anything", which can in turn be used for the downstream task of "segment everything" (see Section 4.3 for more discussion).

Project goal. The goal of this project is to generate a mobile-friendly SAM (MobileSAM) that achieves satisfactory performance in a lightweight manner and is much faster than the original SAM. The prompt-guided mask decoder in the original SAM has fewer than 4M parameters and is therefore considered lightweight. Given an image embedding processed by the encoder, as shown in their public demo, SAM can work on resource-constrained devices because the mask decoder is lightweight. However, the default image encoder in the original SAM is based on ViT-H with more than 600M parameters, which is very heavyweight and makes the whole SAM pipeline incompatible with mobile devices. Therefore, the key to obtaining a mobile-friendly SAM is to replace the heavyweight image encoder with a lightweight one, which also automatically retains all the functions and features of the original SAM. In the following, we elaborate on our proposed approach to achieving this project goal.
Discusses why a lightweight image encoder is necessary for SAM.

3.2 Proposed Method

Coupled distillation. A straightforward way to achieve our project goal is to follow the official pipeline of Kirillov et al. [2023] and retrain a new SAM with a smaller image encoder. As described by Kirillov et al. [2023], training a SAM with the ViT-H image encoder takes 68 hours on 256 A100 GPUs. Replacing ViT-H with ViT-L or ViT-B reduces the required GPUs to 128, which is still a significant burden for many researchers in the community to reproduce or improve the results. Following their approach, one can adopt an even smaller image encoder and retrain a new SAM on the provided SA-1B segmentation dataset. Note that the masks in the provided dataset are given by a pre-trained SAM (with the ViT-H image encoder). In essence, this retraining process is also a form of knowledge distillation, which transfers knowledge from a ViT-H-based SAM to a SAM with a smaller image encoder (see Figure 2, left).

From semi-coupled to decoupled distillation. When performing KD from the original SAM to a SAM with a smaller image encoder, the difficulty mainly lies in the coupled optimization of the image encoder and the mask decoder. Intuitively, the optimization of the image encoder depends on the quality of the mask decoder, and vice versa. When both modules are in a bad state, it is harder to train them both into a good state. Inspired by Zhang et al. [2022c], we propose to divide the KD task into two subtasks: image encoder distillation and mask decoder fine-tuning. Specifically, we first perform KD on the image encoder by transferring knowledge from the default encoder to a smaller one. Since the mask decoder in the original SAM is already lightweight, we keep its architecture, which brings the benefit of a readily usable mask decoder for fine-tuning instead of training one from scratch. To alleviate the optimization problem of coupled distillation, a simple approach is to optimize the image encoder with a copied and frozen mask decoder (see Figure 2, right).
Freezing the decoder helps prevent its quality from being degraded by a poor image encoder. We call this distillation semi-coupled, because the optimization of the image encoder is still not fully decoupled from the mask decoder. Empirically, we find that this optimization remains challenging, because the prompts are chosen randomly, which makes the mask decoder a source of variance that increases the difficulty of optimization. Therefore, we propose to distill the small image encoder directly from ViT-H in the original SAM, without involving the mask decoder, which we call decoupled distillation (see Figure 3). Another advantage of performing distillation on the image embedding is that we can adopt a simple MSE loss instead of the combination of focal loss Lin et al. [2017] and dice loss Milletari et al. [2016] used for mask prediction.
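
To illustrate how simple the decoupled objective is, below is a minimal PyTorch sketch of one distillation step. It is not the authors' training code: student_encoder stands for any module whose output has the same shape as SAM's image embedding (B, 256, 64, 64), and the teacher embedding is assumed to be available (in practice it is pre-computed, see Section 4.1).

```python
# One decoupled-distillation step: regress the teacher's image embedding with MSE.
# The mask decoder is not involved at all.
import torch.nn.functional as F

def distill_step(student_encoder, images, teacher_embedding, optimizer):
    """teacher_embedding is the ViT-H output for `images`, typically loaded from disk."""
    student_embedding = student_encoder(images)          # same shape as SAM's embedding
    loss = F.mse_loss(student_embedding, teacher_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```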

On the necessity of mask decoder fine-tuning. Unlike semi-coupled distillation, decoupled distillation produces a lightweight image encoder that might not align well with the original frozen mask decoder. Empirically, we find that this concern is unnecessary, since the image embeddings generated by the student encoder are close enough to those of the original teacher encoder, making fine-tuning of the mask decoder in a second stage optional. Nevertheless, fine-tuning the mask decoder on top of the frozen lightweight image encoder, or jointly fine-tuning both, is expected to further improve performance.

Preliminary evaluation. Here we conduct a preliminary study comparing coupled and decoupled distillation. For performance evaluation, we compute the mIoU between the two masks generated by the teacher SAM and the student SAM for the same prompt point. Intuitively, assuming the mask generated by ViT-H is ground truth, a higher mIoU indicates better mask prediction. For coupled distillation, we adopt the SAM with ViT-B provided in Kirillov et al. [2023], trained on SA-1B (11M images) with 128 GPUs (1 sample per GPU) for 180k iterations. In contrast, in our decoupled distillation setup, we train the model on 2 GPUs (2 samples per GPU to save computational resources) on 0.1% of the SA-1B dataset (11k images) for 55k iterations. Overall, decoupled distillation requires less than 1% of the computational resources of coupled distillation, while achieving a higher mIoU: 0.75 vs 0.72 (averaged over 200 samples). Since ViT-B is still a significant burden for mobile devices, in the following we experiment with TinyViT (Wu et al. [2022]) based on our proposed decoupled distillation.

This preliminary study demonstrates the advantage of decoupled distillation, which greatly saves training time.
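
For reference, the mIoU metric used in this preliminary evaluation can be sketched as follows; this is a generic implementation, not the authors' evaluation script.

```python
# IoU between the binary mask predicted by the teacher SAM and by the student SAM for
# the same prompt point, averaged over a set of samples.
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0                        # both masks empty: count as a perfect match
    return float(np.logical_and(a, b).sum()) / float(union)

def mean_iou(mask_pairs):
    """mask_pairs: iterable of (teacher_mask, student_mask) for the same prompt."""
    ious = [mask_iou(t, s) for t, s in mask_pairs]
    return sum(ious) / len(ious)
```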

4 Experiments

4.1 Experimental Setup

Lightweight image encoder. The goal of our project is to obtain an efficient SAM by replacing the default ViT-H with a lightweight image encoder suitable for mobile devices. As a ViT-based backbone, ViT-Tiny has a similar number of parameters to DeiT-Tiny but performs better. For example, on ImageNet-1K, DeiT-Tiny achieves 72.2% accuracy, while ViT-Tiny achieves 79.1%. Therefore, we adopt ViT-Tiny as a proof of concept to demonstrate the effectiveness of our proposed decoupled distillation for training a lightweight MobileSAM that can be much faster than the original SAM. The adopted lightweight image encoder consists of four stages that gradually reduce the resolution. The first stage consists of convolution and inverted residual blocks, while the remaining three stages consist of transformer blocks. At the beginning of the model, there are two convolutions with a stride of 2 for downsampling the resolution. The downsampling between stages is handled by convolution blocks with a stride of 2. Unlike Wu et al. [2022], we adjust the stride of the final downsampling convolution so that the final resolution matches that of the ViT-H image encoder of the original SAM. Note that other efficient image encoders discussed in Section 2 could also be adopted as the image encoder.
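
As a rough illustration only, the sketch below shows the four-stage layout described above: a two-convolution stride-2 stem, a convolutional first stage with inverted residual blocks, three transformer stages, and a final projection to 256 channels at 1/16 resolution so that the output grid matches SAM's ViT-H encoder. The channel widths, block counts, and plain full attention are hypothetical choices for readability; this is not the authors' (or TinyViT's) actual configuration.

```python
# Hypothetical sketch of a four-stage hybrid conv/transformer encoder.
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, stride):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.GELU())

class InvertedResidual(nn.Module):
    """MobileNet-style block used in the convolutional first stage."""
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, 1, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

class TransformerStage(nn.Module):
    """Transformer blocks on flattened feature-map tokens (stages 2-4). A real
    implementation such as TinyViT uses windowed attention at high resolution."""
    def __init__(self, ch, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(ch, heads, dim_feedforward=ch * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                      # (B, C, H, W) -> tokens -> (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)
        return self.encoder(tokens).transpose(1, 2).reshape(b, c, h, w)

class TinyHybridEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(conv_bn_act(3, 64, 2), conv_bn_act(64, 64, 2))  # two stride-2 convs: /4
        self.stage1 = nn.Sequential(InvertedResidual(64), InvertedResidual(64))
        self.down12 = conv_bn_act(64, 128, 2)    # /8
        self.stage2 = TransformerStage(128)
        self.down23 = conv_bn_act(128, 160, 2)   # /16
        self.stage3 = TransformerStage(160)
        self.down34 = conv_bn_act(160, 320, 1)   # keep /16 so the output matches ViT-H's 64x64 grid
        self.stage4 = TransformerStage(320)
        self.neck = nn.Conv2d(320, 256, 1, bias=False)  # project to SAM's 256-dim embedding

    def forward(self, x):
        x = self.stage1(self.stem(x))
        x = self.stage2(self.down12(x))
        x = self.stage3(self.down23(x))
        x = self.stage4(self.down34(x))
        return self.neck(x)

x = torch.randn(1, 3, 256, 256)              # small input for a quick CPU check
print(TinyHybridEncoder()(x).shape)          # (1, 256, 16, 16); a 1024x1024 input gives (1, 256, 64, 64)
```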

Training and evaluation details. For decoupled KD of the image encoder, we use 1% of the SA-1B dataset Kirillov et al. [2023] to train the lightweight encoder on a single GPU (RTX 3090). We observe that most of the computation is spent on the forward pass of the teacher image encoder, given that it is significantly heavier than the student encoder we employ (see above). To speed up distillation, we follow the practice of Wu et al. [2022] and pre-save the image embeddings, so that the teacher's forward pass only needs to be run once. With a single GPU, we obtain MobileSAM in less than a day. Training MobileSAM with more GPUs for a longer time is expected to yield better performance. A preliminary investigation shows that fine-tuning the mask decoder further improves the performance of MobileSAM; however, we omit this step in this version of the paper for simplicity. To quantitatively evaluate the distilled SAM, we compute the mIoU between the masks predicted by the original SAM and by our MobileSAM.
A lightweight image encoder combining convolution and transformer blocks is designed; at the same time, to speed up training, the image embeddings predicted by the teacher model are pre-computed and saved.
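
A minimal sketch of this embedding pre-saving step is shown below. The dataloader format, paths, and function names are placeholders, not the authors' script: the heavy ViT-H teacher is run once over the training images and its embeddings are stored, so that distillation later only needs the student's forward and backward passes.

```python
# Pre-compute and cache teacher embeddings so the teacher forward pass runs only once.
import os
import torch

@torch.no_grad()
def cache_teacher_embeddings(teacher_encoder, dataloader, out_dir="teacher_embeds",
                             device="cuda"):
    os.makedirs(out_dir, exist_ok=True)
    teacher_encoder.eval().to(device)
    for image_ids, images in dataloader:             # images already resized/normalized for SAM
        embeds = teacher_encoder(images.to(device))  # (B, 256, 64, 64) for SAM's ViT-H encoder
        for img_id, emb in zip(image_ids, embeds):
            torch.save(emb.cpu(), os.path.join(out_dir, f"{img_id}.pt"))

# During distillation, load the cached tensor instead of re-running the teacher:
# teacher_embedding = torch.load(f"teacher_embeds/{img_id}.pt")
```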

4.2 MobileSAM performs on par with the original SAM

For the main results, we report the predicted masks with two types of prompts: points and boxes. We do not report results with text prompts, because the official GitHub project of SAM does not provide a pre-trained model for the text-guided mask decoder. Figure 4 shows the results with points as prompts, and Figure 5 shows the results with boxes as prompts. We observe that MobileSAM makes satisfactory mask predictions similar to those of the original SAM.
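
For readers who want to try the two prompt types themselves, here is a sketch of prompting MobileSAM with a point and with a box. It assumes the mobile_sam package from the project repository mirrors the segment_anything predictor interface; the checkpoint path, image path, and coordinates are placeholders.

```python
# Point and box prompts via the SamPredictor-style interface.
import cv2
import numpy as np
from mobile_sam import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt").cuda().eval()
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                    # image encoder runs once per image

# Point prompt: one foreground click (label 1 = foreground).
masks_pt, scores_pt, _ = predictor.predict(
    point_coords=np.array([[400, 300]]), point_labels=np.array([1]), multimask_output=True)

# Box prompt: (x1, y1, x2, y2) in pixel coordinates.
masks_box, scores_box, _ = predictor.predict(
    box=np.array([200, 150, 600, 500]), multimask_output=False)
```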

Ablation study. Here we conduct an ablation study on the impact of training computation on SAM performance. The results in Table 4 show that, with the same number of iterations, increasing the batch size improves the performance of the model. Moreover, under the same batch size, performance also improves when the number of training epochs (and hence update iterations) increases. Note that all experiments are performed on a single GPU. We expect that increasing the number of GPUs to allow larger batch sizes, or further increasing the number of iterations, can further improve performance.

4.3 MobileSAM outperforms FastSAM in All Aspects

Segment anything vs segment everything. Note that the original SAM paper Kirillov et al. [2023] is titled "Segment Anything", not "Segment Everything". As highlighted by Kirillov et al. [2023], SAM performs the promptable segmentation task, "returning a valid segmentation mask given any segmentation prompt". The role of the prompt is to specify what to segment in the image. In principle, any object can be segmented as long as the prompt is set correctly, hence "segment anything". By contrast, "segment everything" is essentially object proposal generation, for which no prompt is needed.

In summary, "segment anything" solves the fundamental task of fast segmentation of any object, while "segment everything" solves the downstream task of mask proposal generation for all objects. Since "splitting everything" does not necessarily require prompting for input, FastSAM will generate mask proposals directly with YOLO v8 in a silent manner. To achieve promptable segmentation, a mapping algorithm is designed to select a mask from the set of proposed masks. It is worth emphasizing that follow-up work evaluating its generalization/robustness or studying its versatility mainly focuses on any modality, not all modality, since the former solves the underlying task. Thus, the comparison with FastSAM focuses primarily on "anything that splits", but for completeness we also provide a comparison on "everything that splits".

MobileSAM is faster and smaller. FastSAM consists of a YOLOv8-based detection branch and a YOLACT-based segmentation branch to perform prompt-free mask proposal generation. It has 68M parameters and takes 40 ms to process one image. In comparison, MobileSAM has fewer than 10M parameters, which is significantly smaller. In terms of inference speed on a single GPU, FastSAM takes 40 ms per image while our MobileSAM only needs 10 ms, which makes it 4 times faster than FastSAM.
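
A rough way to reproduce such a timing breakdown is sketched below, under the assumption that set_image is dominated by the image encoder and predict by the mask decoder; the package name and checkpoint path are placeholders, and exact numbers will depend on the hardware.

```python
# Rough per-image latency sketch for the encoder/decoder split.
import time

import numpy as np
import torch
from mobile_sam import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt").to(device).eval()
predictor = SamPredictor(sam)

image = np.random.randint(0, 255, (1024, 1024, 3), dtype=np.uint8)  # stand-in image
point, label = np.array([[512, 512]]), np.array([1])

def timed_ms(fn):
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000

enc_ms = timed_ms(lambda: predictor.set_image(image))                              # image encoder
dec_ms = timed_ms(lambda: predictor.predict(point_coords=point, point_labels=label))  # mask decoder
print(f"encoder ~{enc_ms:.1f} ms, decoder ~{dec_ms:.1f} ms")
```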

mIoU comparison under segment anything mode. We further compare the mIoU between the predicted mask and that of the original SAM. Note that FastSAM cannot predict a mask from a single point prompt as the original SAM can. Instead, it requires at least two prompt points: one for the foreground and one for the background. The results in Table 6 show that the mIoU of FastSAM is much smaller than that of MobileSAM, suggesting that the mask predictions of FastSAM differ considerably from those of the original SAM. Moreover, the mIoU of FastSAM decreases very fast as the distance between the two prompt points increases. This is mainly because FastSAM often fails to predict the object when the foreground prompt point is set too close to the background prompt point.

Results for segment everything.
The result of "segment everything" is shown in Figure 6. For completeness, we also report the results of the original SAM, which generated a delightful object proposal. We have two main observations. First, the results of our MobileSAM agree surprisingly well with those of the original SAM. In contrast, the results of FastSAM are often not satisfactory. For example, FastSAM often fails to predict some objects, such as the roof in the first image. Furthermore, the mask proposal is sometimes difficult to interpret (see the mask of the stage in the first image and the sky in the second image). Second, FastSAM often generates masks with non-smooth boundaries, for which we recommend readers to zoom in on to see details in Figure 6. For example, the pillars in the third image have non-smooth boundaries, while the original SAM and our MobileSAM do not have this problem.

5 Conclusion

In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. We find that the naive way of training such a new SAM, as in the original SAM paper, leads to unsatisfactory performance, especially in settings where training resources are limited. The coupled optimization of the image encoder and mask decoder is the reason, and we therefore propose decoupled distillation, which distills the knowledge from the image encoder ViT-H of the original SAM into a lightweight image encoder. We show that the resulting lightweight image encoder is automatically compatible with the original mask decoder. Our MobileSAM is more than 60 times smaller yet performs on par with the original SAM. We also compare it with the concurrent FastSAM, and the results show that MobileSAM achieves superior performance while being 4x faster and 7x smaller, making it more suitable for mobile applications. Since our MobileSAM keeps the entire pipeline of the original SAM and only replaces the image encoder, existing SAM-based projects can switch from the heavyweight SAM to the lightweight MobileSAM with almost no effort.
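
Because the pipeline is unchanged, switching an existing SAM-based project to MobileSAM is essentially a matter of swapping the import and the checkpoint, roughly as sketched below; this assumes the mobile_sam package keeps the segment_anything interface, and the checkpoint file names are placeholders.

```python
# Before (original SAM):
#   from segment_anything import sam_model_registry, SamPredictor
#   sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
#
# After (MobileSAM): only the import, the registry key, and the checkpoint change.
from mobile_sam import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")
predictor = SamPredictor(sam)   # everything downstream stays the same
```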

Origin blog.csdn.net/a486259/article/details/131463023