Segmentation foundation model RSPrompter: paper reading notes

Paper link

RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model

Open source code link

RSPrompter

Paper reading

Summary

Abstract—Leveraging vast training data (SA-1B), the foundation Segment Anything Model (SAM) proposed by Meta AI Research exhibits remarkable generalization and zero-shot capabilities. Nonetheless, as a category-agnostic instance segmentation method, SAM heavily depends on prior manual guidance involving points, boxes, and coarse-grained masks. Additionally, its performance on remote sensing image segmentation tasks has yet to be fully explored and demonstrated. In this paper, we consider designing an automated instance segmentation approach for remote sensing images based on the SAM foundation model, incorporating semantic category information. Inspired by prompt learning, we propose a method to learn the generation of appropriate prompts for SAM input. This enables SAM to produce semantically discernible segmentation results for remote sensing images, which we refer to as RSPrompter. We also suggest several ongoing derivatives for instance segmentation tasks, based on recent developments in the SAM community, and compare their performance with RSPrompter. Extensive experimental results on the WHU building, NWPU VHR-10, and SSDD datasets validate the efficacy of our proposed method. Our code is accessible at https://kyanchen.github.io/RSPrompter.

insert image description here

Introduction

Deep learning algorithms have exhibited remarkable potential in instance segmentation for remote sensing images, demonstrating their ability to extract deep, discernible features from raw data [16–19]. Presently, the primary instance segmentation algorithms comprise two-stage R-CNN series algorithms (e.g., Mask R-CNN [20], Cascade Mask R-CNN [21], Mask Scoring R-CNN [22], HTC [23], and HQ-ISNet [1]), as well as one-stage algorithms (e.g., YOLACT [24], BlendMask [25], EmbedMask [26], Condinst [27], SOLO [28], and Mask2former [29]). However, the complexity of remote sensing image backgrounds and the diversity of scenes limit the generalization and adaptability of these algorithms. Therefore, devising instance segmentation models capable of coping with such complex and varied scenes remains an open problem.


In recent years, substantial progress has been made in foundation models, such as GPT-4 [30], Flamingo [31], and SAM [32], significantly contributing to the advancement of human society. Despite remote sensing being characterized by its big data attributes since its inception [11, 33], foundation models tailored for this field have not yet emerged. In this paper, our primary aim is not to develop a universal foundation model for remote sensing, but rather to explore the applicability of the SAM segmentation foundation model from the computer vision domain to instance segmentation in remote sensing imagery. We anticipate that such foundation models will foster the continued growth and progress of the remote sensing field.


Since SAM is a category-agnostic segmentation model, the deep feature maps of the image encoder are unlikely to contain rich semantic category information. To overcome this, we extract features from the intermediate layers of the encoder to form the input of the prompter, which generates prompts containing semantic category information. Secondly, SAM prompts include points (foreground/background points), boxes, or masks. Considering that generating point coordinates requires searching in the original SAM prompt’s manifold, which greatly limits the optimization space of the prompter, we further relax the representation of prompts and directly generate prompt embeddings, which can be understood as the embeddings of points or boxes, instead of the original coordinates. This design also avoids the obstacle of gradient flow from high-dimensional to low-dimensional and then back to high-dimensional features, i.e., from high-dimensional image features to point coordinates and then to positional encodings.


This paper also provides a comprehensive survey and summary of current advances and derived approaches in the SAM modeling community. These mainly include methods based on SAM backbone networks, methods integrating SAM with classifiers, and techniques combining SAM with detectors.

Related Work

Instance Segmentation Based on Deep Learning

The objective of instance segmentation is to identify the location of each target instance within an image and provide a corresponding semantic mask, making this task more challenging than object detection and semantic segmentation [18, 38]. Current deep learning-based instance segmentation approaches can be broadly categorized into two-stage and single-stage methods. The former primarily builds on the Mask R-CNN [20] series, which evolved from the two-stage Faster R-CNN [39] object detector by incorporating a parallel mask prediction branch. As research advances, an increasing number of researchers are refining this framework to achieve improved performance. PANet [16] streamlines the information path between features by introducing a bottom-up path based on FPN [40]. In HTC [23], a multi-task, multi-stage hybrid cascade structure is proposed, and the spatial context is enhanced by integrating the segmentation branch, resulting in significant performance improvements over Mask R-CNN and Cascade Mask R-CNN [21]. The Mask Scoring R-CNN [22] incorporates a mask IoU branch within the Mask R-CNN framework to assess segmentation quality. The HQ-ISNet [1] introduces an instance segmentation method for remote sensing imagery based on Cascade Mask R-CNN, which fully leverages multi-level feature maps and preserves the detailed information contained within high-resolution images.


Though two-stage methods can yield refined segmentation results, achieving the desired latency in terms of segmentation speed remains challenging. With the growing popularity of single-stage object detectors, numerous researchers have endeavored to adapt single-stage object detectors for instance segmentation tasks. YOLACT [24], for example, approaches the instance segmentation task by generating a set of prototype masks and predicting mask coefficients for each instance. CondInst [27] offers a fresh perspective on the instance segmentation problem by employing a dynamic masking head. This novel approach has outperformed existing methods like Mask R-CNN in terms of instance segmentation performance. SOLO [28] formulates the instance segmentation problem as predicting semantic categories and generating instance masks for each pixel in the feature map. With the widespread adoption of Transformers [41], DETR [42] has emerged as a fully end-to-end object detector. Drawing inspiration from the task modeling and training procedures used in DETR, Maskformer [43] treats segmentation tasks as mask classification problems, but it suffers from slow convergence speed. Mask2former [29] introduces masked attention to confine cross-attention to the foreground region, significantly improving network training speed.


Prompt Learning

For many years, machine learning tasks primarily focused on fully supervised learning, wherein task-specific models were trained solely on labeled examples of the target task [63, 64]. Over time, the learning of these models has undergone a significant transformation, shifting from fully supervised learning to a “pre-training and fine-tuning” paradigm for downstream tasks. This allows models to utilize the general features obtained during pre-training [65–68]. As the field has evolved, the “pre-training and fine-tuning” approach is being increasingly replaced with a “pre-training and prompting” paradigm [61, 62, 69–72]. In this new paradigm, researchers no longer adapt the model exclusively to downstream tasks but rather redesign the input using prompts to reconstruct downstream tasks to align with the original pre-training task.

Prompt learning can help reduce semantic differences (bridging the gap between pre-training and fine-tuning) and prevent overfitting of the head. Since the introduction of GPT-3 [55], prompt learning has progressed from traditional discrete [71] and continuous prompt construction [61, 72] to large-scale model-oriented in-context learning [31], instruction tuning [73–75], and chain-of-thought approaches [76–78]. Current methods for constructing prompts mainly involve manual templates, heuristic-based templates, generation, word embedding fine-tuning, and pseudo tokens [71, 79]. In this paper, we propose a prompt generator that generates SAM-compatible prompt inputs. This prompt generator is category-related and produces semantic instance segmentation results.


Method

In this section, we will present our proposed RSPrompter, a learning-to-prompt method based on the SAM framework for remote sensing image instance segmentation. We will cover the following aspects: revisiting SAM, incorporating simple extensions to tailor SAM for the instance segmentation task, and introducing both anchor-based and query-based RSPrompter along with their respective loss functions.


1. Revisiting SAM

SAM is an interactive segmentation framework that generates segmentation results based on given prompts, such as foreground/background points, bounding boxes, or masks. It comprises three main components: an image encoder (Φi-enc), a prompt encoder (Φp-enc), and a mask decoder (Φm-dec). SAM employs a pre-trained Masked AutoEncoder (MAE) [60] based on Vision Transformer (ViT) [80] to process images into intermediate features, and encodes the prior prompts as embedding tokens. Subsequently, the cross-attention mechanism within the mask decoder facilitates the interaction between image features and prompt embeddings, ultimately resulting in a mask output. This process can be illustrated in Fig. 2 and expressed as:


insert image description here
Fig. 2. A schematic diagram of SAM, which includes an image encoder, a prompt encoder, and a mask decoder. SAM generates corresponding object masks based on the input prompts provided.
insert image description here
where $\mathcal{I} \in \mathbb{R}^{H \times W \times 3}$ represents the original image, $F_{\text{img}} \in \mathbb{R}^{h \times w \times c}$ denotes the intermediate image features, $\{p\}$ encompasses the sparse prompts including foreground/background points and bounding boxes, and $T_{\text{prompt}} \in \mathbb{R}^{k \times c}$ denotes the sparse prompt tokens encoded by $\Phi_{\text{p-enc}}$. Additionally, $F_{\text{c-mask}} \in \mathbb{R}^{h \times w \times c}$ refers to the representation from the coarse mask, which is an optional input for SAM, while $T_{\text{out}} \in \mathbb{R}^{5 \times c}$ consists of the pre-inserted learnable tokens representing four different masks and their corresponding IoU predictions. Finally, $\mathcal{M}$ corresponds to the predicted masks. In this task, diverse outputs are not required, so we directly select the first mask as the final prediction.

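The equation image above did not survive extraction. Based on the definitions in the surrounding text, it plausibly takes a form like the following (a reconstruction, not necessarily the paper's exact notation):

$$
F_{\text{img}} = \Phi_{\text{i-enc}}(\mathcal{I}), \qquad
T_{\text{prompt}} = \Phi_{\text{p-enc}}(\{p\}), \qquad
\mathcal{M} = \Phi_{\text{m-dec}}\big(F_{\text{img}},\, F_{\text{c-mask}},\, [T_{\text{out}};\, T_{\text{prompt}}]\big)
$$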

2. Extending SAM for Instance Segmentation

We have conducted a survey within the SAM community and, in addition to the RSPrompter proposed in this paper, have also introduced three other SAM-based instance segmentation methods for comparison, as shown in Fig. 3 (a), (b), and (c), to assess their effectiveness in remote sensing image instance segmentation tasks and inspire future research. These methods include: an external instance segmentation head, classifying mask categories, and using detected object boxes, which correspond to Fig. 3 (a), (b), and (c), respectively. In the following sections, we will refer to these methods as SAM-seg, SAM-cls, and SAM-det, respectively.

insert image description here
Fig. 3. From left to right, the figure illustrates SAM-seg, SAM-cls, SAM-det, and RSPrompter as alternative solutions for applying SAM to remote sensing image instance segmentation tasks. (a) An instance segmentation head is added after SAM’s image encoder. (b) SAM’s “everything” mode generates masks for all objects in an image, which are subsequently classified into specific categories by a classifier. (c) Object bounding boxes are first produced by an object detector and then used as prior prompts input to SAM to obtain the corresponding masks. (d) The proposed RSPrompter in this paper creates category-relevant prompt embeddings for instance segmentation masks. The snowflake symbol in the figure signifies that the model parameters in this part are kept frozen.


1) SAM-seg:

In SAM-seg, we make use of the knowledge present in SAM’s image encoder while maintaining the cumbersome encoder frozen. We extract intermediate-layer features from the encoder, conduct feature fusion using convolutional blocks, and then perform instance segmentation tasks with existing instance segmentation heads, such as Mask R-CNN [20] and Mask2Former [29]. This process can be described using the following equations:

insert image description here
where $\{F_i\} \in \mathbb{R}^{k \times h \times w \times c}, i \in \{1, \cdots, k\}$ represents multi-layer semantic feature maps from the ViT backbone. DownConv refers to a $1 \times 1$ convolution operation that reduces channel dimensions, while $[\cdot]$ denotes concatenation along the channel axis. FusionConv is a stack of three convolutional layers with $3 \times 3$ kernels and a ReLU activation following each layer. $\Phi_{\text{ext-dec}}$ represents the externally inherited instance segmentation head, such as Mask R-CNN [20] and Mask2Former [29]. It is important to note that the image encoder remains frozen, and this training method does not utilize multi-scale loss supervision, as in Mask R-CNN and Mask2Former.

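As a rough illustration of the fusion just described, here is a minimal PyTorch-style sketch; the module and argument names are placeholders, and the external instance head and the frozen ViT are assumed to exist elsewhere (this is not the authors' released code):

```python
import torch
import torch.nn as nn

class SAMSegFusion(nn.Module):
    """Sketch of SAM-seg feature fusion: per-layer 1x1 DownConv, channel-wise
    concatenation, then FusionConv (three 3x3 Conv-ReLU blocks). The fused map
    is handed to an external instance head (e.g. Mask R-CNN / Mask2Former)."""

    def __init__(self, num_layers: int, in_channels: int, mid_channels: int, out_channels: int):
        super().__init__()
        # One 1x1 convolution per intermediate ViT feature map to reduce channels.
        self.down_convs = nn.ModuleList(
            [nn.Conv2d(in_channels, mid_channels, kernel_size=1) for _ in range(num_layers)]
        )
        # Three 3x3 Conv-ReLU layers over the concatenated features.
        self.fusion = nn.Sequential(
            nn.Conv2d(num_layers * mid_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        # feats: list of k tensors from the frozen SAM ViT, each of shape (B, C, h, w).
        reduced = [conv(f) for conv, f in zip(self.down_convs, feats)]
        return self.fusion(torch.cat(reduced, dim=1))  # input to the external head
```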

2) SAM-cls:

In SAM-cls, we first utilize the “everything” mode of SAM to segment all potential instance targets within the image. Internally, this is achieved by uniformly scattering points throughout the image and treating each point as a prompt input for an instance. After obtaining the masks of all instances in the image, we can assign labels to each mask using a classifier. This process can be described as follows:

insert image description here
where $(x_i, y_j)$ represents the point prompt. For every image, we consider $32 \times 32$ points to generate category-agnostic instance masks. $\Phi_{\text{ext-cls}}$ denotes the external mask classifier, and $c_{(i,j)}$ refers to the labeled category. For convenience, we directly use the lightweight ResNet18 [68] to label the masks. It performs classification by processing the original image patch cropped by the mask. When cropping the image, we first enlarge the crop box by 2 times and then blur non-mask areas to enhance the discriminative capability. To achieve better performance, the mask’s classification representation could be extracted from the image encoder’s intermediate features, but we chose not to follow that approach for the sake of simplicity in our paper. Alternatively, a pre-trained CLIP model can be leveraged, allowing SAM-cls to operate in a zero-shot regime without additional training.

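A simplified sketch of the SAM-cls pipeline is shown below, assuming the official `segment_anything` package and a torchvision ResNet-18; the checkpoint path, the 224×224 resize, and zeroing (rather than blurring) non-mask pixels are assumptions made for brevity:

```python
import numpy as np
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from torchvision.models import resnet18

NUM_CLASSES = 10  # e.g. NWPU VHR-10 has 10 categories

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # hypothetical checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)  # "everything" mode, 32x32 grid

classifier = resnet18(num_classes=NUM_CLASSES)  # lightweight external classifier
classifier.eval()

def sam_cls(image_rgb: np.ndarray):
    """Label every SAM-generated mask with a semantic category (simplified sketch)."""
    results = []
    for m in mask_generator.generate(image_rgb):
        x, y, w, h = m["bbox"]                        # mask bounding box (x, y, w, h)
        cx, cy = x + w / 2.0, y + h / 2.0
        # Enlarge the crop box roughly 2x around its center, clipped to image bounds.
        x0, y0 = int(max(cx - w, 0)), int(max(cy - h, 0))
        x1, y1 = int(min(cx + w, image_rgb.shape[1])), int(min(cy + h, image_rgb.shape[0]))
        crop = image_rgb[y0:y1, x0:x1].copy()
        # The paper blurs non-mask pixels; here they are simply zeroed for brevity.
        crop[~m["segmentation"][y0:y1, x0:x1]] = 0
        patch = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        patch = F.interpolate(patch, size=(224, 224), mode="bilinear", align_corners=False)
        with torch.no_grad():
            label = classifier(patch).argmax(dim=1).item()
        results.append((m["segmentation"], label))
    return results
```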

3) SAM-det:

The SAM-det method is more straightforward to implement and has gained widespread adoption in the community. First, we train an object detector to identify the desired targets in the image, and then input the detected bounding boxes as prompts into SAM. The entire process can be described as follows:

insert image description here
where $b_i$ represents bounding boxes detected by the external pre-trained object detector $\Phi_{\text{ext-det}}$. Here, we employ Faster R-CNN [39] as the detector.

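A minimal sketch of SAM-det under similar assumptions (official `segment_anything` package; the detector interface is hypothetical and stands in for a trained Faster R-CNN):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # hypothetical checkpoint path
predictor = SamPredictor(sam)

def sam_det(image_rgb: np.ndarray, detector):
    """SAM-det sketch: boxes from an external detector become SAM box prompts."""
    boxes, labels = detector(image_rgb)   # hypothetical interface returning (x0, y0, x1, y1) boxes
    predictor.set_image(image_rgb)
    instances = []
    for box, label in zip(boxes, labels):
        masks, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        instances.append({"mask": masks[0], "label": label, "score": float(scores[0])})
    return instances
```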

3. RSPrompter

1) Overview:

The proposed RSPrompter’s structure is illustrated in Fig. 3 (d). Suppose we have a training dataset, i.e., $\mathcal{D}_{\text{train}} = \{(\mathcal{I}_1, y_1), \cdots, (\mathcal{I}_N, y_N)\}$, where $\mathcal{I}_i \in \mathbb{R}^{H \times W \times 3}$ denotes an image, and $y_i = \{b_i, c_i, m_i\}$ represents its corresponding ground-truth annotations, including the coordinates of $n$ object bounding boxes ($b_i \in \mathbb{R}^{n_i \times 4}$), their associated semantic categories ($c_i \in \mathbb{R}^{n_i \times C}$), and binary masks ($m_i \in \mathbb{R}^{n_i \times H \times W}$). Our objective is to train a prompter for SAM that can process any image from a test set ($\mathcal{I}_k \sim \mathcal{D}_{\text{test}}$), simultaneously localizing the objects and inferring their semantic categories and instance masks, which can be expressed as follows:

insert image description here
where the image is processed by the frozen SAM image encoder to generate $F_{\text{img}} \in \mathbb{R}^{h \times w \times c}$ and multiple intermediate feature maps $\{F_i\} \in \mathbb{R}^{k \times h \times w \times c}$. $F_{\text{img}}$ is used by the SAM decoder to obtain prompt-guided masks, while $\{F_i\}$ is progressively processed by an efficient feature aggregator ($\Phi_{\text{aggregator}}$) and a prompt generator ($\Phi_{\text{prompter}}$) to acquire multiple groups of prompts ($T_j \in \mathbb{R}^{k_p \times c}, j \in \{1, \cdots, N_p\}$) and corresponding semantic categories ($c_j \in \mathbb{R}^{C}, j \in \{1, \cdots, N_p\}$). $k_p$ defines the number of prompt embeddings needed for each mask generation. We will employ two distinct structures to design the prompt generator, namely anchor-based and query-based.

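The pipeline equation image is not rendered; assembled from the definitions above, it is plausibly of the following form (a reconstruction, so the paper's exact notation may differ):

$$
F_{\text{img}},\, \{F_i\} = \Phi_{\text{i-enc}}(\mathcal{I}), \qquad
\{(T_j, c_j)\}_{j=1}^{N_p} = \Phi_{\text{prompter}}\big(\Phi_{\text{aggregator}}(\{F_i\})\big), \qquad
\mathcal{M}_j = \Phi_{\text{m-dec}}\big(F_{\text{img}}, T_j\big)
$$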
It is important to note that $T_j$ only contains foreground target instance prompts, with the semantic category given by $c_j$. A single $T_j$ is a combination of multiple prompts, i.e., representing an instance mask with multiple point embeddings or a bounding box. For simplicity, we will omit the superscript $k$ when describing the proposed model.


2) Feature Aggregator:

SAM is a category-agnostic segmentation model based on prompts. To obtain semantically relevant and discriminative features without increasing the computational complexity of the prompter, we introduce a lightweight feature aggregation module. This module learns to represent semantic features from various intermediate feature layers of the SAM ViT backbone, as depicted in Fig. 4.

insert image description here

The module can be described recursively as follows:
insert image description here
where $F_i \in \mathbb{R}^{h \times w \times c}$ and $F'_i \in \mathbb{R}^{\frac{h}{2} \times \frac{w}{2} \times \frac{c}{16}}$ indicate the SAM backbone’s features and the down-sampled features generated by $\Phi_{\text{DownConv}}$. This process first employs a $1 \times 1$ Convolution-ReLU block to reduce the channels from $c$ to $c/16$, followed by a $3 \times 3$ Convolution-ReLU block with a stride of 2 to decrease the spatial dimensions. Since we believe that only coarse information about the target location is necessary, we boldly further reduce the spatial dimension size to minimize computational overhead. $\Phi_{\text{Conv2D}}$ denotes a $3 \times 3$ Convolution-ReLU block, while $\Phi_{\text{FusionConv}}$ represents the final fusion convolutional layers comprising two $3 \times 3$ convolution layers and one $1 \times 1$ convolution layer to restore the original channel dimensions of SAM’s mask decoder.

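A rough PyTorch-style sketch of the aggregator as described, with the recursion and layer sizes inferred from the text rather than taken from the released code:

```python
import torch
import torch.nn as nn

class FeatureAggregator(nn.Module):
    """Lightweight aggregator over frozen SAM ViT intermediate features (sketch).

    DownConv = 1x1 Conv-ReLU (c -> c/16) followed by a stride-2 3x3 Conv-ReLU;
    each down-sampled map is merged recursively with the running state via a
    3x3 Conv-ReLU; FusionConv (two 3x3 convs and one 1x1 conv) restores the
    channel width expected by SAM's mask decoder."""

    def __init__(self, num_layers: int, c: int):
        super().__init__()
        cr = c // 16
        self.down = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, cr, 1), nn.ReLU(inplace=True),
                nn.Conv2d(cr, cr, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            for _ in range(num_layers)
        ])
        self.merge = nn.ModuleList([
            nn.Sequential(nn.Conv2d(cr, cr, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(num_layers)
        ])
        self.fusion = nn.Sequential(
            nn.Conv2d(cr, cr, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cr, cr, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cr, c, 1),  # restore SAM decoder channel dimension
        )

    def forward(self, feats):
        # feats: list of intermediate maps from the frozen ViT, each (B, c, h, w).
        state = None
        for down, merge, f in zip(self.down, self.merge, feats):
            x = down(f)                                   # (B, c/16, h/2, w/2)
            state = merge(x if state is None else x + state)
        return self.fusion(state)                         # (B, c, h/2, w/2)
```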

3) Anchor-based Prompter:

Upon obtaining the fused semantic features, we can employ the prompter to generate prompt embeddings for the SAM mask decoder. In this section, we introduce the anchor-based approach for generating prompt embeddings.


Architecture: First, we generate candidate object boxes using the anchor-based Region Proposal Network (RPN). Next, we obtain the individual object’s visual feature representation from the positional encoding feature map via RoI Pooling [20], according to the proposal. From the visual feature, we derive three perception heads: the semantic head, the localization head, and the prompt head. The semantic head determines a specific object category, while the localization head establishes the matching criterion between the generated prompt representation and the target instance mask, i.e., greedy matching based on localization (Intersection over Union, or IoU). The prompt head generates the prompt embedding required for the SAM mask decoder. The entire process is illustrated in Fig. 5 and can be represented by the following equation:

insert image description here
It can be expressed by the following formula:
insert image description here
To align the embeddings from SAM's prompt encoder with the generated prompt embeddings, we use sine functions to directly generate high-frequency information rather than predicting it through the network. This is because neural networks have difficulty predicting high-frequency information. The effectiveness of this design has been confirmed through subsequent experiments.

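To illustrate the idea of letting fixed sine/cosine functions supply the high-frequency content of the prompt embeddings, here is a small, purely illustrative sketch; predicting per-prompt scalar coordinates and the frequency layout are assumptions, not the paper's exact head design:

```python
import math
import torch
import torch.nn as nn

class SinePromptHead(nn.Module):
    """Illustrative prompt head: the network regresses low-dimensional values,
    and a fixed sine/cosine expansion produces the high-frequency structure of
    the prompt embeddings, so the network never has to predict it directly."""

    def __init__(self, in_dim: int, k_p: int, embed_dim: int, temperature: float = 10000.0):
        super().__init__()
        assert embed_dim % 2 == 0
        self.k_p, self.embed_dim, self.temperature = k_p, embed_dim, temperature
        # Assumption for illustration: predict k_p scalar "coordinates" per instance.
        self.to_coords = nn.Linear(in_dim, k_p)

    def forward(self, instance_feat: torch.Tensor) -> torch.Tensor:
        coords = self.to_coords(instance_feat).sigmoid()               # (N, k_p) in [0, 1]
        half = self.embed_dim // 2
        freqs = self.temperature ** (
            torch.arange(half, dtype=torch.float32, device=coords.device) / half)
        angles = coords.unsqueeze(-1) * 2 * math.pi / freqs            # (N, k_p, half)
        # Fixed sine/cosine functions supply the high-frequency components.
        return torch.cat([angles.sin(), angles.cos()], dim=-1)         # (N, k_p, embed_dim)
```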

Loss Function: In the anchor-based prompter, the primary framework adheres to the structure of Faster R-CNN [39]. The various losses incorporated within this model include binary classification loss and localization loss for the RPN network, classification loss for the semantic head, regression loss for the localization head, and segmentation loss for the frozen SAM mask decoder. Consequently, the overall loss can be expressed as follows:
insert image description here
where Lcls represents the Cross-Entropy loss calculated between the predicted category and the target, while Lreg denotes the SmoothL1 loss computed based on the predicted coordinate offsets and the target offsets between the ground truth and the prior box. Furthermore, Lseg indicates the binary cross-entropy loss between the SAM-decoded mask and the target instance mask label, where the matching criteria are determined by the IoU of the boxes. The indicator function 1 is employed to confirm a positive match. Lastly, Lrpn signifies the region proposal loss.

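The loss equation image is not rendered; from the description above, the total loss is plausibly of the form (any weighting coefficients are omitted):

$$
\mathcal{L}_{\text{anchor}} = \mathcal{L}_{\text{rpn}} + \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}} + \mathbb{1} \cdot \mathcal{L}_{\text{seg}}
$$

where $\mathbb{1}$ indicates a positive box match.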

4) Query-based Prompter:

The anchor-based prompter pipeline is relatively complex, involving mask matching and supervised training using box information. To simplify this process, we propose a query-based prompter based on optimal transport.

Architecture: The query-based prompter primarily consists of a lightweight Transformer encoder and decoder internally. The encoder is employed to extract high-level semantic features from the image, while the decoder is utilized to transform the preset learnable query into the requisite prompt embedding for SAM via cross-attention interaction with image features. The entire process is depicted in Fig. 6, as follows:

insert image description here
It can be expressed as follows:
insert image description here
where $\mathrm{PE} \in \mathbb{R}^{h \times w \times c}$ refers to the positional encoding, while $F_{\text{query}} \in \mathbb{R}^{N_p \times c}$ represents the learnable tokens, which are initialized as zero. $\Phi_{\text{mlp-cls}}$ constitutes an MLP layer employed to obtain class predictions ($C \in \mathbb{R}^{N_p \times C}$). Meanwhile, $\Phi_{\text{mlp-prompt}}$ comprises a two-layer MLP designed to acquire the projected prompt embeddings ($T \in \mathbb{R}^{N_p \times k_p \times c}$). $N_p$ denotes the number of prompt groups, i.e., the number of instances. $k_p$ defines the number of embeddings per prompt, i.e., the number of prompts necessary to represent an instance target. Furthermore, $\Phi_{\text{T-enc}}$ and $\Phi_{\text{T-dec}}$ symbolize the Transformer encoder and decoder, respectively.

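A minimal sketch of such a query-based prompter; layer widths, head counts, and the omission of positional encodings and multi-scale supervision are simplifications, not the authors' implementation (the model dimension must be divisible by the number of attention heads):

```python
import torch
import torch.nn as nn

class QueryPrompter(nn.Module):
    """Query-based prompter sketch: learnable queries attend to aggregated image
    features and are projected into class logits and SAM prompt embeddings."""

    def __init__(self, c: int, num_queries: int, k_p: int, num_classes: int):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(num_queries, c))        # F_query, zero-initialized
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=c, nhead=8, batch_first=True), num_layers=1)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=c, nhead=8, batch_first=True), num_layers=4)
        self.cls_head = nn.Linear(c, num_classes)                        # Φ_mlp-cls
        self.prompt_head = nn.Sequential(                                # Φ_mlp-prompt (two layers)
            nn.Linear(c, c), nn.ReLU(inplace=True), nn.Linear(c, k_p * c))
        self.k_p = k_p

    def forward(self, fused_feats: torch.Tensor):
        # fused_feats: (B, c, h, w) map from the feature aggregator.
        b, c, h, w = fused_feats.shape
        tokens = fused_feats.flatten(2).transpose(1, 2)                  # (B, h*w, c)
        memory = self.encoder(tokens)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)            # (B, Np, c)
        out = self.decoder(queries, memory)                              # cross-attention with image
        class_logits = self.cls_head(out)                                # (B, Np, num_classes)
        prompts = self.prompt_head(out).view(b, -1, self.k_p, c)         # (B, Np, k_p, c)
        return class_logits, prompts
```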

Loss Function: The training process for the query-based prompter primarily involves two key steps: (i) matching Np masks, decoded by the SAM mask decoder, to K ground-truth instance masks (typically, Np > K); (ii) subsequently conducting supervised training using the matched labels. While executing optimal transport matching, we define the matching cost, taking into account the predicted category and mask, as detailed below:

insert image description here
where $\omega$ represents the assignment relationship, while $\hat{y}_i$ and $y_i$ correspond to the prediction and the label, respectively. We employ the Hungarian algorithm [81] to identify the optimal assignment between the $N_p$ predictions and $K$ targets. The matching cost considers the similarity between predictions and ground-truth annotations. Specifically, it incorporates the class classification matching cost ($\mathcal{L}_{\text{cls}}$), mask cross-entropy cost ($\mathcal{L}_{\text{seg-ce}}$), and mask dice cost ($\mathcal{L}_{\text{seg-dice}}$).

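The matching-cost equation image is not rendered; given the description, the optimal assignment is plausibly computed as (a reconstruction; cost weights are omitted):

$$
\hat{\omega} = \arg\min_{\omega} \sum_{i=1}^{K}
\Big[ \mathcal{L}_{\text{cls}}\big(\hat{y}_{\omega(i)}, y_i\big)
+ \mathcal{L}_{\text{seg-ce}}\big(\hat{y}_{\omega(i)}, y_i\big)
+ \mathcal{L}_{\text{seg-dice}}\big(\hat{y}_{\omega(i)}, y_i\big) \Big]
$$

In practice, such an assignment over the $N_p \times K$ cost matrix is typically computed with `scipy.optimize.linear_sum_assignment`.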
Once each predicted instance is paired with its corresponding ground truth, we can readily apply the supervision terms. These primarily comprise multi-class classification and binary mask classification, as described below:

insert image description here
where $\mathcal{L}_{\text{cls}}$ denotes the cross-entropy loss computed between the predicted category and the target, while $\mathcal{L}_{\text{seg}}$ denotes the binary cross-entropy loss between the SAM-decoded mask and the matched ground-truth instance mask. Additionally, $\mathbb{1}$ represents the indicator function.

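The supervision-loss image is likewise not rendered; from the description it plausibly takes the form:

$$
\mathcal{L}_{\text{query}} = \sum_{i}
\Big[ \mathcal{L}_{\text{cls}}\big(\hat{y}_i, y_{\hat{\omega}(i)}\big)
+ \mathbb{1} \cdot \mathcal{L}_{\text{seg}}\big(\hat{y}_i, y_{\hat{\omega}(i)}\big) \Big]
$$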

Experimental Results and Analysis

Implementation Details

The proposed method concentrates on learning to prompt remote sensing image instance segmentation using the SAM foundation model. In our experiments, we employ the ViT-Huge backbone of SAM, unless otherwise indicated.


1) Architecture Details:

The SAM framework generates various segmentations for a single prompt; however, our method anticipates only one instance mask for each learned prompt. Consequently, we select the first mask as the final output. For each group of prompts, we set the number of prompts to 4, i.e., $k_p = 4$. In the feature aggregator, to reduce computation costs, we use input features from every 3 layers after the first 8 layers of the backbone, rather than from every layer. For the anchor-based prompter, the RPN network originates from Faster R-CNN [39], and other hyperparameters in the training remain consistent. For the query-based prompter, we employ a 1-layer transformer encoder and a 4-layer transformer decoder, implementing multi-scale training for category prediction from the outputs of the decoder at 4 different levels. However, we do not apply multi-scale training to instance mask prediction in order to maintain efficiency. We determine the number of learnable tokens based on the distribution of target instances in each image, i.e., $N_p$ = 90, 60, 30 for the WHU, NWPU, and SSDD datasets, respectively.


2) Training Details:

During the training phase, we adhere to the image size of 1024 × 1024, in line with SAM’s original input. Concurrently, we utilize horizontal flipping to augment the training samples, without implementing any additional enhancements. Other comparative methods also follow the same settings unless specified otherwise. We only train the parameters of the prompter component while maintaining the parameters of other parts of the network as frozen. During the testing phase, we predict up to 100 instance masks per image for evaluation purposes.


For optimization, we employ the AdamW optimizer with an initial learning rate of 2e-4 to train our model. We set the mini-batch size to 24. The total training epochs are 700/1000 for the WHU dataset and 1500/2500 for both the NWPU and SSDD datasets (RSPrompter-anchor/RSPrompter-query). We implement a Cosine Annealing scheduler [92] to decay the learning rate. Our proposed method is developed using PyTorch, and we train all the extra-designed modules from scratch.

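A minimal sketch of the training configuration described above; `prompter`, `train_loader`, and `compute_loss` are placeholders supplied by the caller, and only the prompter parameters are optimized while SAM stays frozen:

```python
import torch

def train_prompter(prompter, train_loader, compute_loss, num_epochs=700):
    """Training-loop sketch following the text: AdamW, lr 2e-4, cosine decay.

    `prompter` is the trainable prompt-generation module; the SAM backbone and
    mask decoder are assumed frozen inside `compute_loss`."""
    optimizer = torch.optim.AdamW(prompter.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    for _ in range(num_epochs):                 # e.g. 700-2500 depending on dataset/variant
        for images, targets in train_loader:    # 1024x1024 inputs, mini-batch size 24
            loss = compute_loss(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```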

Comparison with the State-of-the-Art

In this section, we compare our proposed method with several other state-of-the-art instance segmentation methods. These include multi-stage approaches such as Mask R-CNN [20], Mask Scoring (MS) R-CNN [22], HTC [23], Instaboost [87], PointRend [88], SCNet [90], CATNet [5], and HQ-ISNet [1], as well as single-stage methods like SOLO [28], SOLOv2 [89], CondInst [27], BoxInst [91], and Mask2Former [29]. Among these, SOLOv2 [89], CondInst [27], BoxInst [91], and Mask2Former [29] are filter-based methods, while CATNet [5] and HQ-ISNet [1] are Mask R-CNN-based remote sensing instance segmentation methods. For extending instance segmentation methods on SAM, we carry out SAM-seg (Mask R-CNN) and SAM-seg (Mask2Former) for SAM-seg with Mask R-CNN and Mask2Former heads and training regimes. SAM-cls is considered a minimalistic instance segmentation method that leverages the “everything” mode of SAM to obtain all instances in the image and employs a pre-trained ResNet18 [68] to label all instance masks. SAM-det denotes first training a Faster R-CNN [39] detector to acquire boxes and subsequently generating corresponding instance masks by SAM with the box prompts. RSPrompter-query and RSPrompter-anchor respectively represent the query-based and anchor-based prompters. All the cited methods are implemented following the official publications using PyTorch.


1) Quantitative Results on the WHU Dataset:

The results of RSPrompter in comparison to other methods on the WHU dataset are presented in Tab. I, with the best performance highlighted in bold. The task involves performing single-class instance segmentation of buildings in optical RGB band remote sensing images. RSPrompter-query attains the best performance for both box and mask predictions, achieving APbox and APmask values of 70.36/69.21. Specifically, SAM-seg (Mask2Former) surpasses the original Mask2Former (60.40/62.77) with 67.84/66.66 on APbox and APmask, while SAM-seg (Mask R-CNN) exceeds the original Mask R-CNN (56.11/60.75) with 67.15/64.86. Furthermore, both RSPrompter-query and RSPrompter-anchor improve the performance to 70.36/69.21 and 68.06/66.89, respectively, outperforming SAM-det, which carries out detection before segmentation.

insert image description here
These observations suggest that the learning-to-prompt approach effectively adapts SAM to the instance segmentation task for optical remote sensing images. Furthermore, they demonstrate that a SAM backbone trained on a broad range of data can provide valuable guidance for instance segmentation even when fully frozen (as demonstrated by SAM-seg).

2) Quantitative Results on NWPU Dataset:

We conduct comparison experiments on the NWPU dataset to further validate RSPrompter’s effectiveness. Unlike the WHU dataset, this one is smaller in size but encompasses more instance categories, amounting to 10 classes of remote sensing objects. The experiment remains focused on optical RGB band remote sensing image instance segmentation. Tab. II exhibits the overall results of various methods on this dataset.

It can be observed that RSPrompter-anchor, when compared to other approaches, generates the best results on box and mask predictions (68.54/67.64). In comparison to Mask R-CNN-based methods, single-stage methods display a substantial decline in performance on this dataset, particularly the Transformer-based Mask2Former. This may be because the dataset is relatively small, making it challenging for single-stage methods to achieve adequate generalization across the full data domain, especially for Transformer-based methods that require a large amount of training data. Nonetheless, it is worth noting that the performance of SAM-based SAM-seg (Mask2Former) and RSPrompter-query remains impressive. The performance improves from 29.60/35.02 for Mask2Former to 40.56/45.11 for SAM-seg (Mask2Former) and further to 58.79/65.29 for RSPrompter-query.

These findings imply that SAM, when trained on a large amount of data, can exhibit significant generalization ability on a small dataset. Even when there are differences in the image domain, SAM's performance can be enhanced through the learning-to-prompt approach.

insert image description here

4) Qualitative Visual Comparisons:

To facilitate a more effective visual comparison with other methods, we present a qualitative analysis of the segmentation results obtained from SAM-based techniques and other state-of-the-art instance segmentation approaches. Fig. 7, 8, and 9 depict sample segmentation instances from the WHU dataset, NWPU dataset, and SSDD dataset, respectively. It can be observed that the proposed RSPrompter yields notable visual improvements in instance segmentation. Compared to alternative methods, the RSPrompter generates superior results, exhibiting sharper edges, more distinct contours, enhanced completeness, and a closer resemblance to the ground-truth references.

[Figures 7–9: qualitative instance segmentation comparisons on the WHU, NWPU, and SSDD datasets]

Origin: blog.csdn.net/qq_41627642/article/details/131549056