FastSAM released by the Institute of Automation, Chinese Academy of Sciences: accuracy on par with SAM, 50x faster!

Title: Fast Segment Anything

PDF: https://arxiv.org/pdf/2306.12156v1.pdf

Code: https://github.com/casia-iva-lab/fastsam

Overview

SAM has become a foundation for many downstream tasks such as image segmentation, image captioning, and image editing. However, its huge computational overhead limits its use in industrial scenarios. This overhead mainly comes from the Transformer architecture operating on high-resolution inputs. This paper therefore proposes an accelerated alternative with comparable performance. By reformulating the task as segment generation followed by prompt-based selection, the authors find that a regular CNN detector with an instance segmentation branch can also perform well on this task. Specifically, the task is converted into the well-studied instance segmentation problem, and an existing instance segmentation method is trained on only 1/50 of the SA-1B dataset released by the SAM authors. With this approach, the authors achieve performance comparable to SAM at a 50x faster runtime. The paper provides extensive experimental results to demonstrate its effectiveness.

Introduction

SAM is regarded as a landmark vision foundation model that can segment any object in an image guided by various user-interaction prompts. SAM uses a Transformer model trained on the extensive SA-1B dataset, enabling it to handle a wide variety of scenes and objects. SAM opened up an exciting new task: Segment Anything. Thanks to its generality and potential, this task has all the ingredients to become a cornerstone of a wide range of future vision tasks. However, although SAM and its successor models have shown promising results on segment-anything tasks, their practical application remains challenging. The obvious problem is the massive computational cost of the Vision Transformer (ViT), the major component of the SAM architecture. Compared with convolutional models, ViT stands out for its heavy computational resource requirements, which is an obstacle to practical deployment, especially in real-time applications. This limitation hinders the progress and potential of the segment-anything task.

Given the high demand for segment-anything models in industrial applications, this paper designs a real-time solution, called FastSAM, for the segment-anything task. The task is decomposed into two consecutive stages: all-instance segmentation and prompt-guided selection. The first stage relies on a convolutional neural network (CNN) based detector, which generates segmentation masks for all instances in an image. The second stage then outputs the regions of interest corresponding to the prompts. By exploiting the computational efficiency of CNNs, this paper demonstrates that a real-time segment-anything model can be achieved with little loss of quality. The authors hope the proposed method can facilitate industrial application of the fundamental segment-anything task.

::: block-1
Figure 1. Performance comparison analysis of FastSAM and SAM

(a) Speed comparison of FastSAM and SAM on a single NVIDIA GeForce RTX 3090. (b) Comparison on edge detection on the BSDS500 dataset [1, 28]. (c) Box AR@1000 comparison of FastSAM and SAM for object proposals on the COCO dataset [25]. Both SAM and FastSAM use PyTorch for inference; only FastSAM(TRT) uses TensorRT.
:::

The FastSAM proposed in this paper is based on YOLOv8-seg, an object detector whose instance segmentation branch follows the YOLACT method. The extensive SA-1B dataset released with SAM is also adopted: by training this CNN detector directly on only 2% (1/50) of SA-1B, it achieves performance comparable to SAM while greatly reducing computational and resource demands, enabling real-time application. The authors also apply it to several downstream segmentation tasks to demonstrate its generalization. On the object proposal task on MS COCO, the method achieves 63.7 AR@1000, 1.2 points higher than SAM with 32×32 point-prompt inputs, while running 50 times faster on a single NVIDIA RTX 3090.

A real-time segment-anything model is very valuable for industrial applications and can be deployed in many scenarios. The proposed method not only provides a new practical solution for a large number of vision tasks, but does so very fast, tens or hundreds of times faster than current methods. It also provides new perspectives on large-model architectures for general vision tasks: for a specific task, a task-specific model can still take advantage of its structure to achieve a better efficiency-accuracy trade-off.

From the perspective of model compression, the method shows a feasible path to significantly reduce computation by introducing an artificial prior into the structure. The contributions of this paper can be summarized as follows:

  • A novel real-time CNN-based solution to the Segment Anything task is introduced, which significantly reduces computational requirements while maintaining competitive performance.
  • This study presents for the first time the application of CNN detectors to segment anything tasks and provides insights into the potential of lightweight CNN models in complex vision tasks.
  • Through the comparative evaluation of the proposed method and SAM on multiple benchmarks, the advantages and disadvantages of the proposed method in the field of segment anything are revealed.

Method

Figure 2 below shows the FastSAM network architecture. The method consists of two stages: all-instance segmentation and prompt-guided selection. The former is the base stage, and the latter is essentially task-oriented post-processing. Unlike end-to-end Transformer methods, the proposed approach introduces many human priors that match the visual segmentation task, such as the local connectivity of convolutions and receptive-field-dependent object assignment strategies. This makes it tailored to visual segmentation and lets it converge faster with fewer parameters.

::: block-1
Figure 2. FastSAM network architecture diagram

FastSAM consists of two stages: all-instance segmentation (AIS) and prompt-guided selection (PGS). YOLOv8-seg is first used to segment all objects or regions in the image. Various prompts are then used to identify the specific objects of interest, mainly point prompts, box prompts, and text prompts.
:::

All-instance Segmentation

The architecture of YOLOv8 builds on its predecessor YOLOv5, incorporating key designs from recent algorithms such as YOLOX, YOLOv6, and YOLOv7. The backbone and feature-fusion (neck) module of YOLOv8 replace YOLOv5's C3 module with the C2f module. The updated head module adopts a decoupled structure that separates classification from detection and shifts from an anchor-based to an anchor-free approach.
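For readers unfamiliar with the C2f block, below is a minimal, self-contained PyTorch sketch of the idea: split the channels, run a chain of bottleneck blocks, keep every intermediate output, and fuse them with a 1×1 convolution. This is a simplified illustration rather than the Ultralytics implementation; layer names, channel counts, and the bottleneck definition are assumptions.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv -> BatchNorm -> SiLU, the basic building block used throughout YOLOv8-style models."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convs with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    """Simplified C2f: split features, run n bottlenecks, concatenate all intermediate outputs."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))       # two halves of the channels
        y.extend(m(y[-1]) for m in self.blocks)      # each block feeds the next; outputs are kept
        return self.cv2(torch.cat(y, dim=1))         # fuse everything with a 1x1 conv

if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(C2f(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```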

YOLOv8-seg applies the instance segmentation principles of YOLACT. It extracts features from the image through the backbone and a Feature Pyramid Network (FPN), integrating features of different scales. The output consists of a detection branch and a segmentation branch. The detection branch outputs the class and bounding box of each object, while the segmentation branch outputs k prototypes (32 by default in FastSAM) together with k mask coefficients per instance. The segmentation and detection tasks are computed in parallel. The segmentation branch takes a high-resolution feature map that preserves spatial detail and also contains semantic information. This map is processed by a convolutional layer, upsampled, and then passed through two more convolutional layers to produce the prototype masks. Similar to the classification branch of the detection head, the mask coefficients range from -1 to 1. Instance segmentation results are obtained by multiplying the mask coefficients with the prototypes and summing them.
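The prototype/coefficient combination described above can be summarized in a short PyTorch sketch. This is a hedged illustration of the YOLACT-style mask assembly, not FastSAM's actual post-processing code; tensor shapes, the box cropping, and the 0.5 threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def assemble_masks(prototypes: torch.Tensor,
                   coeffs: torch.Tensor,
                   boxes: torch.Tensor,
                   img_size: tuple[int, int],
                   threshold: float = 0.5) -> torch.Tensor:
    """YOLACT-style mask assembly.

    prototypes: (k, Hp, Wp)   k prototype masks from the segmentation branch (k=32 in FastSAM)
    coeffs:     (n, k)        per-instance mask coefficients in [-1, 1]
    boxes:      (n, 4)        per-instance boxes (x1, y1, x2, y2) in image coordinates
    img_size:   (H, W)        output resolution
    returns:    (n, H, W)     boolean instance masks
    """
    k, hp, wp = prototypes.shape
    # Linear combination of prototypes: one mask logit map per instance.
    logits = coeffs @ prototypes.reshape(k, -1)                  # (n, Hp*Wp)
    masks = torch.sigmoid(logits.reshape(-1, hp, wp))

    # Upsample to image resolution.
    masks = F.interpolate(masks[None], size=img_size,
                          mode="bilinear", align_corners=False)[0]

    # Crop each mask to its predicted box, as in YOLACT.
    out = torch.zeros_like(masks)
    for i, (x1, y1, x2, y2) in enumerate(boxes.round().long()):
        out[i, y1:y2, x1:x2] = masks[i, y1:y2, x1:x2]
    return out > threshold
```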

YOLOv8 can be used for a variety of object detection tasks, and through its instance segmentation branch, YOLOv8-seg is well suited to the segment-anything task, which aims to accurately detect and segment every object or region in an image regardless of category. The prototypes and mask coefficients offer considerable extensibility for prompt guidance. For example, a simple prompt encoder and decoder could additionally be trained, taking various prompt and image feature embeddings as input and producing mask coefficients as output. In FastSAM, the YOLOv8-seg method is used directly for the all-instance segmentation stage.

Prompt-guided Selection

After all objects or regions in an image have been segmented with YOLOv8, the second stage of the segment-anything task is to use various prompts to identify the specific objects of interest. This mainly involves point prompts, box prompts, and text prompts.

Point prompt

The goal of point prompting is to match a selected point against the masks obtained in the first stage and determine which mask the point lies in, similar to how SAM uses foreground/background points as prompts. When a foreground point lands in multiple masks, background points can be used to filter out masks that are not relevant to the current task. By using a set of foreground/background points, multiple masks within the region of interest can be selected; these masks are then merged into a single mask that fully covers the object of interest. Morphological operations can also be applied to improve the quality of the merged mask.
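A minimal NumPy sketch of this point matching is shown below. The union-based merging and the function signature are illustrative assumptions, not FastSAM's exact implementation (which, per the text above, may also apply morphological post-processing).

```python
from typing import Sequence
import numpy as np

def select_by_points(masks: np.ndarray,
                     fg_points: Sequence[tuple[int, int]],
                     bg_points: Sequence[tuple[int, int]] = ()) -> np.ndarray:
    """Point-prompt selection over the all-instance segmentation output.

    masks:     (n, H, W) boolean masks from the first stage
    fg_points: (x, y) foreground clicks that should lie inside the object
    bg_points: (x, y) background clicks used to discard irrelevant masks
    returns:   (H, W) merged boolean mask of the selected object
    """
    selected = []
    for mask in masks:
        hits_fg = any(mask[y, x] for x, y in fg_points)
        hits_bg = any(mask[y, x] for x, y in bg_points)
        # Keep masks containing a foreground point, unless a background point also falls in them.
        if hits_fg and not hits_bg:
            selected.append(mask)

    if not selected:
        return np.zeros(masks.shape[1:], dtype=bool)
    # Merge all selected masks into a single mask covering the object of interest.
    return np.logical_or.reduce(selected)
```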

Box prompt

Box prompting matches a user-selected box against the bounding boxes predicted in the first stage via IoU (Intersection over Union). The mask whose box has the highest IoU with the selected box is chosen as the object of interest.
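A short sketch of this matching, assuming axis-aligned (x1, y1, x2, y2) boxes; it is a minimal illustration, not the exact FastSAM code.

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU between two (x1, y1, x2, y2) boxes."""
    x1, y1 = np.maximum(box_a[:2], box_b[:2])
    x2, y2 = np.minimum(box_a[2:], box_b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_by_box(masks: np.ndarray, pred_boxes: np.ndarray, prompt_box: np.ndarray) -> np.ndarray:
    """Return the first-stage mask whose predicted box best overlaps the prompt box."""
    scores = [iou(b, prompt_box) for b in pred_boxes]
    return masks[int(np.argmax(scores))]
```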

Text prompt

For text prompts, the corresponding text embedding is extracted with the CLIP model. The image embedding of each mask region is then compared with the text embedding using a similarity measure, and the mask with the highest similarity score to the text prompt is chosen.
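The sketch below illustrates one plausible way to do this with the OpenAI CLIP package: crop each predicted box, encode the crops and the text, and pick the most similar mask. The crop-then-encode strategy and all names are assumptions for illustration; the paper does not spell out these details.

```python
import clip                      # pip install git+https://github.com/openai/CLIP.git
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_by_text(image: Image.Image, masks: np.ndarray,
                   boxes: np.ndarray, text: str) -> np.ndarray:
    """Pick the first-stage mask whose cropped region best matches the text prompt."""
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([text]).to(device))
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

        crops = torch.stack([preprocess(image.crop(tuple(int(v) for v in b)))
                             for b in boxes]).to(device)
        img_feats = model.encode_image(crops)
        img_feats /= img_feats.norm(dim=-1, keepdim=True)

        sims = (img_feats @ text_feat.T).squeeze(-1)  # cosine similarity per mask region
    return masks[int(sims.argmax())]
```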

By carefully implementing these prompt-based selection techniques, FastSAM can reliably select specific objects of interest from the segmented image. This provides an efficient way to complete the segment-anything task in real time, greatly enhancing the practicality of the YOLOv8 model for complex image segmentation tasks. More efficient prompt-based selection techniques are left for future exploration.

Experimental results

::: block-1

Speed comparison of SAM and FastSAM on a single NVIDIA GeForce RTX 3090 GPU. FastSAM outperforms SAM at every number of prompt points. Moreover, FastSAM's runtime is independent of the number of prompts, making it a better choice for the "everything" mode.
:::

FastSAM segmentation results

Zero-shot edge detection evaluation: quantitative metrics

Zero-shot edge detection evaluation: visualization results

::: block-1

Comparison with learning-free methods on all categories of COCO. We report average recall (AR) and AUC for the learning-free methods, the deep-learning methods trained on VOC, our method, and the SAM method.
:::

Comparison with OLN and SAM-H

::: block-1

Application to anomaly detection, where SAM-point/box/everything denote using the point prompt, box prompt, and everything modes, respectively.
:::

::: block-1

Application to salient object segmentation, where SAM-point/box/everything denote using the point prompt, box prompt, and everything modes, respectively.
:::

::: block-1

Application to building extraction, where SAM-point/box/everything denote using the point prompt, box prompt, and everything modes, respectively.
:::

::: block-1

Compared with SAM, FastSAM can generate finer segmentation masks in the narrow regions of large objects.
:::

Limitations

Overall, FastSAM matches SAM's performance while being 50 times faster than SAM (32×32) and 170 times faster than SAM (64×64). Its speed makes it a good choice for industrial applications such as road obstacle detection, video instance tracking, and image processing. On some images, FastSAM even generates better masks for large objects.

Figure 11

However, as the experiments show, while FastSAM has a clear advantage in box generation, its mask generation quality is lower than SAM's, as shown in Figure 11 above. FastSAM exhibits the following weaknesses:

  • Low-quality, small segmentation masks receive high confidence scores. The authors believe this is because the confidence score is defined as YOLOv8's bounding-box score, which has little to do with mask quality. Modifying the network to predict mask IoU or another quality metric is one way to improve this.
  • Masks of some tiny objects tend to be close to square. In addition, masks of large objects may have artifacts at the edges of the bounding box, a weakness of the YOLACT method. This could be addressed by increasing the capacity of the mask prototypes or by redesigning the mask generator.

Conclusion

In this paper, the authors rethink the task and model-architecture choices for Segment Anything and propose an alternative that runs 50 times faster than SAM-ViT-H (32×32). Experiments show that FastSAM handles multiple downstream tasks well. Nevertheless, FastSAM still has weaknesses that can be improved, such as the scoring mechanism and the instance mask generation paradigm. These issues are left for future research.


Origin blog.csdn.net/CVHub/article/details/131487868