Application of SAM in Remote Sensing Image Segmentation

Title: RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model
Paper: https://arxiv.org/abs/2306.16269
Code: https://github.com/KyanChen/RSPrompter

Introduction

By training on a massive dataset (SA-1B), the "Segment Anything Model" (SAM) foundation model proposed by Meta AI Research has shown remarkable generalization and zero-shot capabilities. Nonetheless, SAM is a class-agnostic instance segmentation method that relies heavily on manual prior guidance, including points, boxes, and coarse masks. Furthermore, the performance of SAM on remote sensing image segmentation tasks has not been fully explored or demonstrated.

This paper considers the design of an automatic instance segmentation method, based on the SAM foundation model, that incorporates semantic category information for remote sensing images. Inspired by prompt learning, this paper learns to generate suitable prompts as input to SAM, enabling SAM to produce semantically discriminative segmentation results for remote sensing images; the method is called RSPrompter. This paper also proposes several SAM-based instance segmentation derivatives based on recent developments in the SAM community and compares their performance with RSPrompter. Extensive experimental results on the WHU Building, NWPU VHR-10 and SSDD datasets validate the effectiveness of the proposed method.

Thanks to training on more than a billion masks, SAM can segment any object in any image without additional training, demonstrating remarkable generalization across a wide variety of images and objects. This creates new possibilities and avenues for intelligent image analysis and understanding. However, due to its interactive framework, SAM requires prior prompts, such as points, boxes or masks, and behaves as a class-agnostic segmentation method, as shown in Figure (a) below. Clearly, these limitations make SAM unsuitable for fully automatic interpretation of remote sensing images.

[Figure: (a) SAM's prompt-driven interactive segmentation framework; (b) SAM results on remote sensing images under different manual prompts]
Furthermore, we observe that complex background interference and the lack of well-defined object edges in remote sensing scenes pose significant challenges to SAM's segmentation capabilities. SAM struggles to completely segment remote sensing targets, and the results depend heavily on the type, location and number of prompts. In most cases, fine manual prompts are crucial to achieve the desired effect, as shown in (b) above. This shows that SAM has considerable limitations when applied to instance segmentation of remote sensing images.

In order to enhance the remote sensing instance segmentation ability of the foundation model, this paper proposes RSPrompter, which learns how to generate prompts that extend the capabilities of the SAM framework. The motivation lies in the SAM framework itself: each prompt group can yield an instance mask via the mask decoder. Imagine if we could automatically generate multiple category-related prompts; the SAM decoder would then produce multiple instance-level masks with class labels. However, there are two main challenges in this process: (i) Where do the category-related prompts come from? (ii) Which type of prompt should be chosen as the input of the mask decoder?

Since SAM is a category-agnostic segmentation model, the deep feature maps of its encoder lack rich semantic category information. To overcome this obstacle, we extract mid-layer features of the encoder to form the input of the prompter, which generates prompts containing semantic category information. Second, SAM prompts include points (foreground/background points), boxes or masks. Considering that generating point coordinates requires searching in the original SAM prompt manifold, which severely limits the optimization space of the prompter, we further relax the representation of the prompt and directly generate the prompt embedding, which can be understood as the embedding of points or boxes rather than raw coordinates. This design also avoids the barrier to gradient flow from high-dimensional to low-dimensional and back to high-dimensional features, that is, from high-dimensional image features to point coordinates and then to positional encodings.

This paper also provides a comprehensive survey and summary of current advances and derived approaches in the SAM modeling community. These mainly include methods based on SAM backbone networks, methods integrating SAM with classifiers, and techniques combining SAM with detectors.

Method

SAM model

SAM is an interactive segmentation framework that generates segmentation results from given prompts, such as foreground/background points, bounding boxes, or masks. It consists of three main components: an image encoder, a prompt encoder and a mask decoder. SAM uses a Vision Transformer (ViT) image encoder, pre-trained as a masked autoencoder (MAE), to process images into intermediate features, and encodes the given prompts into embedding tokens. Subsequently, the cross-attention mechanism in the mask decoder lets image features and prompt embeddings interact to finally produce mask outputs. The process can be expressed as:

[Formula: mask = MaskDecoder(ImageEncoder(image), PromptEncoder(prompts))]
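The three-component pipeline can be sketched in PyTorch; all module sizes below are illustrative stand-ins, not SAM's real architecture:

```python
import torch
import torch.nn as nn

class MiniSAM(nn.Module):
    """Toy sketch of SAM's image encoder / prompt encoder / mask decoder pipeline."""
    def __init__(self, d=64):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, d, kernel_size=16, stride=16)  # stand-in for the ViT
        self.prompt_encoder = nn.Linear(2, d)   # embeds (x, y) point prompts
        self.mask_decoder = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mask_head = nn.Linear(d, 1)

    def forward(self, image, points):
        feats = self.image_encoder(image)                    # (B, d, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)            # (B, N, d) image tokens
        prompts = self.prompt_encoder(points)                # (B, P, d) prompt tokens
        # cross-attention: prompt tokens query the image features
        out, _ = self.mask_decoder(prompts, tokens, tokens)  # (B, P, d)
        return self.mask_head(out)                           # one mask logit per prompt (toy)

sam = MiniSAM()
logits = sam(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 2))
print(logits.shape)  # torch.Size([1, 2, 1])
```

Each prompt group yields its own output, which is exactly the property RSPrompter later exploits to get one instance mask per generated prompt.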
Besides the RSPrompter proposed in this paper, three other SAM-based instance segmentation methods are introduced for comparison, as shown in (a), (b) and (c) below. This paper evaluates their effectiveness on remote sensing instance segmentation tasks to inspire future research. These methods are: attaching an external instance segmentation head (SAM-seg), classifying the masks SAM produces (SAM-cls), and using an external detector (SAM-det).

Instance Segmentation Extensions for SAM

[Figure: (a) SAM-seg, (b) SAM-cls, (c) SAM-det, and (d) RSPrompter architectures]
SAM-seg

SAM-seg exploits the knowledge contained in the SAM image encoder while keeping the encoder frozen. It extracts mid-layer features from the encoder, uses convolutional blocks for feature fusion, and then uses existing instance segmentation heads (Mask R-CNN and Mask2Former) to perform the instance segmentation task. This process can be expressed as:

[Formula: the SAM-seg pipeline]
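A minimal sketch of such a fusion neck, assuming four tapped encoder layers and illustrative channel sizes (the downstream Mask R-CNN / Mask2Former head is omitted):

```python
import torch
import torch.nn as nn

class SAMSegNeck(nn.Module):
    """Hypothetical SAM-seg fusion neck: merges frozen-encoder mid-layer
    features before an off-the-shelf instance segmentation head."""
    def __init__(self, in_dim=256, out_dim=128, n_layers=4):
        super().__init__()
        # one 1x1 projection per tapped encoder layer
        self.proj = nn.ModuleList(nn.Conv2d(in_dim, out_dim, 1) for _ in range(n_layers))
        self.fuse = nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, padding=1), nn.ReLU())

    def forward(self, mid_feats):          # list of (B, in_dim, H, W) frozen features
        fused = sum(p(f) for p, f in zip(self.proj, mid_feats))
        return self.fuse(fused)            # (B, out_dim, H, W), fed to the seg head

neck = SAMSegNeck()
feats = [torch.randn(1, 256, 16, 16) for _ in range(4)]
print(neck(feats).shape)  # torch.Size([1, 128, 16, 16])
```

Only the neck and head are trained; the SAM encoder stays frozen, which keeps the trainable parameter count small.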
SAM-cls

In SAM-cls, SAM's "segment everything" mode is first used to segment all potential instance objects in the image. This is achieved by distributing points uniformly across the image and treating each point as the prompt for one instance. After obtaining all instance masks in an image, a classifier assigns a label to each mask. This process can be described as follows:

[Formula: the SAM-cls pipeline]
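The uniform point grid that drives this mode can be sketched in a few lines (the grid density and image size here are arbitrary choices, not the paper's settings):

```python
def point_grid(n_per_side, w, h):
    """Evenly spaced point prompts for SAM's "segment everything" mode:
    n_per_side x n_per_side points, each at the center of its grid cell."""
    xs = [(i + 0.5) * w / n_per_side for i in range(n_per_side)]
    ys = [(j + 0.5) * h / n_per_side for j in range(n_per_side)]
    return [(x, y) for y in ys for x in xs]

pts = point_grid(4, 1024, 1024)
print(len(pts), pts[0])  # 16 (128.0, 128.0)
```

Each point is fed to SAM as a single-instance prompt; the resulting masks are then cropped and passed to the classifier (ResNet18 or CLIP, as described below).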

For convenience, this paper directly uses a lightweight ResNet18 to label the masks. Alternatively, a pre-trained CLIP model can be used, enabling SAM-cls to run without additional training and achieve zero-shot performance.

SAM-det

The SAM-det method is simpler and more straightforward, and has been widely adopted by the community. An object detector is first trained to identify the desired object in the image, and then the detected bounding box is fed into the SAM as a prompt. The whole process can be described as:

[Formula: the SAM-det pipeline]
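The pipeline can be sketched with stand-in callables for the detector and SAM (these are toy stubs to show the data flow, not real library APIs):

```python
def sam_det(image, detector, sam):
    """SAM-det sketch: a trained detector proposes boxes; each box is fed to
    SAM as a prompt to obtain a class-agnostic mask for that detection."""
    results = []
    for box, label, score in detector(image):
        mask = sam(image, box_prompt=box)   # mask prompted by the detected box
        results.append({"label": label, "score": score, "box": box, "mask": mask})
    return results

# toy stand-ins: a detector returning one "ship" box and a fake SAM
fake_detector = lambda img: [((0, 0, 10, 10), "ship", 0.9)]
fake_sam = lambda img, box_prompt: [[1]]    # pretend binary mask
out = sam_det(None, fake_detector, fake_sam)
print(out[0]["label"])  # ship
```

The class label comes entirely from the detector; SAM only converts each box into a mask.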

RSPrompter

Overview

Figure (d) above shows the structure of the proposed RSPrompter. Our goal is to train a SAM-oriented prompter that can process any image in the test set, locating objects and inferring their semantic categories and instance masks, which can be expressed by the following formula:

[Formula: the RSPrompter pipeline]
Images are processed by the frozen SAM image encoder, producing the final feature map F_img and multiple intermediate feature maps F_i. F_img is used by the SAM mask decoder to obtain prompt-guided masks, while the F_i are processed step by step by a lightweight feature aggregator and a prompt generator to obtain multiple groups of prompts and their corresponding semantic categories. For the prompt generator, this paper adopts two different structures: anchor-based and query-based.

Feature Aggregator

SAM is a prompt-based, category-agnostic segmentation model. In order to obtain semantically relevant and discriminative features without increasing the computational complexity of the prompter, this paper introduces a lightweight feature aggregation module. As shown in the figure below, this module learns to extract semantic features from the various intermediate feature layers of the SAM ViT backbone, which can be described recursively as:

[Figure and formula: recursive aggregation of intermediate SAM ViT features]
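A minimal sketch of such a recursive aggregator, with assumed channel sizes (the paper's exact block design may differ): each intermediate map is reduced to a small channel width, then folded into the running aggregate one layer at a time.

```python
import torch
import torch.nn as nn

class FeatureAggregator(nn.Module):
    """Lightweight recursive aggregation over SAM ViT mid-layer features."""
    def __init__(self, in_dim=768, dim=32, n_layers=3):
        super().__init__()
        # cheap 1x1 reductions keep the prompter's compute low
        self.reduce = nn.ModuleList(nn.Conv2d(in_dim, dim, 1) for _ in range(n_layers))
        self.merge = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
            for _ in range(n_layers - 1)
        )

    def forward(self, feats):              # feats: list of (B, in_dim, H, W)
        agg = self.reduce[0](feats[0])
        for i in range(1, len(feats)):     # recursively fold in the next layer
            agg = self.merge[i - 1](agg + self.reduce[i](feats[i]))
        return agg                         # (B, dim, H, W) semantic feature map

agg = FeatureAggregator()
x = [torch.randn(1, 768, 8, 8) for _ in range(3)]
print(agg(x).shape)  # torch.Size([1, 32, 8, 8])
```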

Anchor Prompter

Architecture

First, an anchor-based Region Proposal Network (RPN) is used to generate candidate object boxes. Next, visual feature representations of individual objects are obtained from position-encoded feature maps via RoI pooling. Three heads are derived from these visual features: a semantic head, a localization head and a prompt head. The semantic head identifies the object category, while the localization head establishes a matching criterion between the generated prompt representation and the object instance mask, i.e. localization-based greedy matching. The prompt head generates the prompt embeddings required by the SAM mask decoder. The whole process is shown in the figure below and can be expressed by the following formula:

[Figure and formula: the anchor-based prompter]
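The three heads can be sketched on pooled RoI features as follows; the RPN and RoI pooling are omitted, and the dimensions, head depths and prompt count are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AnchorPromptHeads(nn.Module):
    """The three heads of the anchor-based prompter, on pooled RoI features."""
    def __init__(self, roi_dim=256, n_classes=10, n_prompts=5, embed_dim=256):
        super().__init__()
        self.semantic_head = nn.Linear(roi_dim, n_classes)      # object category
        self.localization_head = nn.Linear(roi_dim, 4)          # box, for greedy matching
        self.prompt_head = nn.Linear(roi_dim, n_prompts * embed_dim)  # SAM prompt embeddings
        self.n_prompts, self.embed_dim = n_prompts, embed_dim

    def forward(self, roi_feats):                               # (R, roi_dim), R RoIs
        cls = self.semantic_head(roi_feats)                     # (R, n_classes)
        box = self.localization_head(roi_feats)                 # (R, 4)
        prompts = self.prompt_head(roi_feats).view(-1, self.n_prompts, self.embed_dim)
        return cls, box, prompts                                # prompts: (R, n_prompts, embed_dim)

heads = AnchorPromptHeads()
cls, box, prompts = heads(torch.randn(7, 256))
print(cls.shape, box.shape, prompts.shape)
```

Each RoI yields one group of prompt embeddings, which the frozen SAM mask decoder turns into one instance mask, while the semantic head supplies its class label.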
Losses

The losses of this model include the binary classification and localization losses of the RPN, the classification loss of the semantic head, the regression loss of the localization head, and the segmentation loss of the frozen SAM mask decoder. The total loss can be expressed as:

[Formula: total loss of the anchor-based prompter]

Query Prompter

Architecture

The anchor-based prompter is relatively complex, involving the use of bounding box information for mask matching and supervised training. To simplify this process, a query-based prompter built on optimal transport matching is proposed. The query prompter mainly consists of a lightweight Transformer encoder and decoder. The encoder extracts high-level semantic features from the image, while the decoder converts preset learnable queries into the prompt embeddings required by SAM through attention interaction with the image features. The whole process is shown in the figure below and can be expressed as:

[Figure and formula: the query-based prompter]
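A minimal sketch of a query-based prompter using PyTorch's Transformer decoder; the query count, dimensions and head design here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class QueryPrompter(nn.Module):
    """Learnable queries attend to image features and become SAM prompt
    embeddings plus per-query class logits (DETR-style)."""
    def __init__(self, dim=64, n_queries=10, n_classes=10, n_prompts=5):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))  # preset learnable queries
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cls_head = nn.Linear(dim, n_classes + 1)             # +1 for "no object"
        self.prompt_head = nn.Linear(dim, n_prompts * dim)
        self.n_prompts, self.dim = n_prompts, dim

    def forward(self, img_tokens):                                # (B, N, dim) encoder features
        B = img_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(q, img_tokens)                         # (B, n_queries, dim)
        cls = self.cls_head(out)
        prompts = self.prompt_head(out).view(B, -1, self.n_prompts, self.dim)
        return cls, prompts                  # one prompt group (and class) per query

qp = QueryPrompter()
cls, prompts = qp(torch.randn(2, 16, 64))
print(cls.shape, prompts.shape)  # torch.Size([2, 10, 11]) torch.Size([2, 10, 5, 64])
```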
Loss

The training process of the query prompter involves two key steps: (i) matching the masks decoded by the SAM mask decoder with the ground truth instance masks; (ii) supervised training using the matched labels. For optimal transport matching, we define the matching cost over the predicted classes and masks as follows:

[Formula: the matching cost over predicted classes and masks]
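The idea of a class-plus-mask matching cost followed by an optimal assignment can be illustrated with a toy, pure-Python version (the weights, the IoU-style mask term, and the brute-force matcher are simplifications of the paper's formulation):

```python
from itertools import permutations

def match_cost(pred, gt, w_cls=1.0, w_mask=1.0):
    """Toy matching cost: a class-disagreement term plus a mask-overlap term."""
    cls_cost = 0.0 if pred["cls"] == gt["cls"] else 1.0
    inter = sum(p * g for p, g in zip(pred["mask"], gt["mask"]))
    union = sum(p + g for p, g in zip(pred["mask"], gt["mask"])) - inter
    mask_cost = 1.0 - inter / union if union else 1.0  # 1 - IoU on flat binary masks
    return w_cls * cls_cost + w_mask * mask_cost

def optimal_match(preds, gts):
    """Brute-force minimum-cost assignment (fine for toy sizes; a
    Hungarian/optimal-transport matcher scales much better)."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best  # best[g] = index of the prediction matched to ground truth g

preds = [{"cls": 1, "mask": [1, 1, 0, 0]}, {"cls": 2, "mask": [0, 0, 1, 1]}]
gts = [{"cls": 2, "mask": [0, 0, 1, 1]}]
print(optimal_match(preds, gts))  # [1]
```

Unmatched predictions are supervised toward the "no object" class, which is what allows the query prompter to dispense with box-based greedy matching.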
Once each predicted instance is paired with its corresponding ground truth, the supervision terms can be applied. These mainly include multi-class classification and binary mask classification losses, as described below:

[Formula: the classification and mask supervision losses]

Experiments

Three public remote sensing instance segmentation datasets are used in this paper: the WHU Building Extraction dataset, the NWPU VHR-10 dataset and the SSDD dataset. The WHU dataset involves single-class building extraction and segmentation, NWPU VHR-10 involves multi-class object detection and segmentation, and SSDD involves ship detection and segmentation in SAR imagery. Model performance is evaluated using mAP.

Results on WHU

[Tables and figures: quantitative and qualitative results on the WHU dataset]

Results on NWPU

[Tables and figures: quantitative and qualitative results on the NWPU VHR-10 dataset]

Results on SSDD

[Tables and figures: quantitative and qualitative results on the SSDD dataset]

Summary

In this paper, we introduced RSPrompter, a prompt learning method for instance segmentation of remote sensing images built on the SAM foundation model. The goal of RSPrompter is to learn how to generate prompt inputs for SAM, enabling it to automatically obtain semantic instance-level masks. In contrast, the original SAM requires additional manual prompts and is a category-agnostic segmentation method. The design philosophy of RSPrompter is not limited to SAM, but can also be applied to other foundation models. Based on this concept, we designed two specific implementations: RSPrompter-anchor, based on preset anchors, and RSPrompter-query, based on queries and optimal transport matching. Furthermore, we surveyed and proposed various methods and variants for this task from the SAM community and compared them with our prompt learning approach. The effectiveness of each component of RSPrompter is verified by ablation experiments. Meanwhile, experimental results on three public remote sensing datasets demonstrate that our method outperforms other state-of-the-art instance segmentation techniques as well as some SAM-based methods.

References

https://mp.weixin.qq.com/s/CkJ6vlH9nbhWjj0rt68sDg


Origin: blog.csdn.net/weixin_42990464/article/details/131508773