Paper Reading: RSPrompter, prompt learning for remote sensing instance segmentation based on a visual foundation model


Introduction

Title: "RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model"

Date: 2023.6.28

Affiliations: Beihang University, Beijing Key Laboratory of Digital Media, State Key Laboratory of Virtual Reality Technology and Systems, Shanghai Artificial Intelligence Laboratory

Paper address: https://arxiv.org/abs/2306.16269

GitHub: https://github.com/KyanChen/RSPrompter

Author:

Chen Keyan

Personal homepage: https://kyanchen.github.io/


  • Abstract
    Trained on a large dataset (SA-1B), the Segment Anything Model (SAM), a foundation model proposed by Meta AI Research, shows excellent generalization and zero-shot capabilities. Nonetheless, as a class-agnostic instance segmentation method, SAM relies heavily on prior manual guidance in the form of points, boxes, and coarse-grained masks. In addition, its performance on remote sensing image segmentation tasks has yet to be fully explored and demonstrated. In this paper, we consider designing an automatic instance segmentation method for remote sensing images based on the SAM foundation model and incorporating semantic category information. Inspired by prompt learning, we propose a method that learns to generate appropriate prompts as input to SAM. This enables SAM to produce semantically discernible segmentation results for remote sensing images; we call this method RSPrompter. We also propose several derivatives for instance segmentation tasks based on recent developments in the SAM community and compare their performance with RSPrompter. Extensive experimental results on the WHU building, NWPU VHR-10 and SSDD datasets verify the effectiveness of our proposed method.

Target

  • Background
    Because of its interactive framework, SAM must be given prior prompts such as points, boxes or masks, and it behaves as a class-agnostic segmentation method, as shown in panel (a) of the figure described below. These limitations make SAM unsuitable for fully automatic interpretation of remote sensing images.

(a) Instance segmentation results for point-based prompts, box-based prompts, SAM's "everything" mode (segmenting all objects in the image), and RSPrompter. SAM performs class-agnostic instance segmentation and relies on manually provided prior prompts. (b) Segmentation results for a single point prompt, a two-point prompt and a box prompt at different positions. The type, location, and number of prompts strongly influence SAM's results.

Furthermore, we observe that complex background interference and lack of well-defined object edges in remote sensing image scenes pose significant challenges to the segmentation capabilities of SAM. It is difficult for SAM to achieve complete segmentation of remote sensing image targets, and its results heavily depend on prompt type, location and number. In most cases, careful manual prompting is crucial to achieve the desired effect, as shown in (b) above. This indicates that SAM has considerable limitations when applied to instance segmentation of remote sensing images.

  • Goal and motivation:
    Enhance SAM's capability on image segmentation tasks. Each group of prompts yields one instance mask. If multiple category-related prompts can be generated automatically, the SAM decoder can produce multiple instance-level masks with category labels. The paper therefore proposes RSPrompter, which learns to generate prompts that enhance the capabilities of the SAM framework.
    Specifically:
    1. Source of the category-related prompts: features are extracted from the intermediate layers of the SAM ViT backbone and fed into a lightweight feature aggregator.
    2. Output form of the generated prompts: prompt embeddings rather than coordinates. The authors argue that generating coordinates would limit the optimization space; predicting embeddings also avoids obstructing gradient flow along the path from high-dimensional image features down to low-dimensional point coordinates and back up to high-dimensional positional encodings.
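
The second point refers to the detour from point coordinates to positional encodings. A minimal sketch of that last step is given here, assuming a random Fourier feature encoding similar in spirit to the one SAM's prompt encoder applies to point prompts. The function name `encode_points`, the feature count and the seed are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def encode_points(points_xy, image_size, num_feats=128, seed=0):
    """Map pixel coordinates to sin/cos positional encodings (illustrative)."""
    rng = np.random.default_rng(seed)
    # Random projection matrix; a real model would keep this as a fixed buffer.
    gaussian = rng.normal(size=(2, num_feats))
    # Normalize pixel coordinates to [0, 1], then shift to [-1, 1].
    coords = np.asarray(points_xy, dtype=np.float64) / float(image_size)
    coords = 2.0 * coords - 1.0
    projected = 2.0 * np.pi * (coords @ gaussian)
    # Concatenate sine and cosine parts: the output has 2 * num_feats channels.
    return np.concatenate([np.sin(projected), np.cos(projected)], axis=-1)

embeddings = encode_points([[512, 512], [100, 900]], image_size=1024)
print(embeddings.shape)  # (2, 256)
```

Each point is only two numbers, yet it is expanded into a much wider vector before reaching the mask decoder. Predicting that wide embedding directly, as RSPrompter does, skips the two-number bottleneck.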

Contributions

  1. An automated instance segmentation method that also incorporates semantic category information
  2. Prompt learning built on SAM
  3. A survey of work in the SAM community, with several SAM-based variants proposed for the instance segmentation task
  4. Experimental validation on three remote sensing datasets (which differ in data volume, number of categories, and modality)

Method

Schematic diagram of SAM, which consists of an image encoder, a prompt encoder and a mask decoder. SAM generates the corresponding object masks based on the input prompts provided.

In addition to the RSPrompter proposed in the paper, three other SAM-based instance segmentation methods are introduced for comparison, as shown in Figure 3 (a), (b) and (c), to evaluate their effectiveness on the remote sensing instance segmentation task and to provide inspiration for future research. These methods are: attaching an external instance segmentation head, classifying mask categories, and using detected object boxes as prompts, corresponding to Figure 3 (a), (b) and (c) respectively. In the following sections, these methods are referred to as SAM-seg, SAM-cls, and SAM-det respectively.

The figure shows, from left to right, SAM-seg, SAM-cls, SAM-det and RSPrompter as alternative ways of applying SAM to remote sensing instance segmentation. (a) An instance segmentation head is added after the SAM image encoder. (b) SAM's "everything" mode generates masks for all objects in the image, which a classifier then assigns to specific categories. (c) An object detector first produces object bounding boxes, which are then fed to SAM as prior prompts to obtain the corresponding masks. (d) The RSPrompter proposed in the paper creates category-related prompt embeddings for instance segmentation masks. The snowflake symbol in the figure indicates that the parameters of that part of the model are frozen.

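  • SAM's mask generation process: the image encoder extracts image features, the prompt encoder embeds the input prompts, and the mask decoder combines the two to predict the masks.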

  • SAM-seg

SAM-seg exploits the knowledge in the SAM image encoder while keeping the encoder frozen. It extracts intermediate-layer features from the encoder, fuses them with convolutional blocks, and then applies an existing instance segmentation head (Mask R-CNN or Mask2Former) to perform instance segmentation.

  • SAM-cls

In SAM-cls, SAM's "everything" mode is first used to segment all potential instance objects in the image. This is done by distributing points evenly over the image and treating each point as the prompt for a potential instance. Once all instance masks in the image have been obtained, a classifier assigns a label to each mask.

For convenience, the paper directly uses a lightweight ResNet18 to label the masks. Alternatively, a pre-trained CLIP model can be used so that SAM-cls runs without additional training and achieves zero-shot results.
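
The "evenly distributed points" step can be sketched as follows. This is a minimal illustration, assuming a regular grid with half-cell margins; the helper name `build_point_grid`, the grid density and the 1024×1024 image size are assumptions made for the example.

```python
import numpy as np

def build_point_grid(points_per_side):
    """Return an (n*n, 2) array of evenly spaced (x, y) points in [0, 1]."""
    # Half-cell offset keeps the points away from the image border.
    offset = 1.0 / (2 * points_per_side)
    ticks = np.linspace(offset, 1.0 - offset, points_per_side)
    xs, ys = np.meshgrid(ticks, ticks)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

grid = build_point_grid(4)
pixel_prompts = grid * 1024  # scale to a hypothetical 1024x1024 image
print(grid.shape)  # (16, 2)
```

Each row of `pixel_prompts` would then serve as one point prompt, and every resulting mask would be passed to the classifier.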

  • SAM-det

The SAM-det method is simpler and more direct and has been widely adopted by the community. An object detector is first trained to identify the desired objects in the image, and then the detected bounding boxes are input into SAM as prompts.


  • RSPrompter

The image is processed by the frozen SAM image encoder to produce F_img, and a set of semantically rich intermediate-layer features {F_i} is extracted from the backbone. {F_i} is passed through a lightweight feature aggregator Φ_aggregator to obtain a dense feature map F_agg. F_agg is fed into the prompter, which generates multiple groups of prompt embeddings (T_j) and their corresponding categories (c_j). Finally, each T_j is fed into the mask decoder to generate an instance mask.

  • Feature Aggregator

The proposed lightweight feature aggregator extracts semantic information from the large ViT backbone and fuses it with a lightweight process.

The semantic features F_i extracted from the intermediate layers of the ViT backbone are downsampled (64×64×1280 → 32×32×32); residual connections let information flow between layers; finally, a fusion convolution Φ_FusionConv produces the dense feature F_agg.
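
To make the shapes concrete, here is a rough NumPy sketch of such an aggregator. It is not the authors' implementation: random weights replace learned ones, a per-pixel matrix product stands in for a 1×1 convolution, 2×2 average pooling stands in for the learned spatial reduction, and the number of layers, the names `down_conv` and `aggregate`, and the output channel count are all assumptions.

```python
import numpy as np

def down_conv(feat, weight):
    """Reduce channels with a 1x1 projection, then halve the spatial size.

    feat: (H, W, C_in); weight: (C_in, C_out). Average pooling is an
    assumed stand-in for a learned strided convolution.
    """
    projected = feat @ weight  # 1x1 conv is a per-pixel matmul
    h, w, c = projected.shape
    return projected.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def aggregate(features, down_weights, mix_weights, fusion_weight):
    """Fuse per-layer features with residual accumulation."""
    merged = down_conv(features[0], down_weights[0])
    for feat, w_down, w_mix in zip(features[1:], down_weights[1:], mix_weights):
        # Residual update: keep the running map, add a transformed copy
        # of it, and add the next downsampled layer feature.
        merged = merged + np.maximum(merged @ w_mix, 0.0) + down_conv(feat, w_down)
    return merged @ fusion_weight  # stand-in for the fusion convolution

rng = np.random.default_rng(0)
num_layers, c_in, c_mid = 4, 1280, 32  # num_layers is an assumption

def rand(*shape, scale=1.0):
    return scale * rng.standard_normal(shape, dtype=np.float32)

features = [rand(64, 64, c_in) for _ in range(num_layers)]
down_weights = [rand(c_in, c_mid, scale=0.02) for _ in range(num_layers)]
mix_weights = [rand(c_mid, c_mid, scale=0.02) for _ in range(num_layers - 1)]
fusion_weight = rand(c_mid, c_mid, scale=0.02)

f_agg = aggregate(features, down_weights, mix_weights, fusion_weight)
print(f_agg.shape)  # (32, 32, 32)
```

The point of the sketch is the bookkeeping: each 64×64×1280 feature shrinks to 32×32×32 before fusion, so the aggregator stays small next to the frozen backbone.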

Two different types of prompters

  • Anchor-based Prompter

An RPN head recalls objects from the dense features and generates proposals. RoI Pooling turns each proposal into a visual feature vector, which then passes through three perception heads: a semantic head, a localization head and a prompt head. These are used, respectively, to determine the object category, to establish the matching criterion (IoU) between the generated prompt representation and the target instance mask, and to generate the prompt embedding.

When generating the prompt embeddings, a sine transformation is applied to align the generated prompt embeddings with the embedding space of SAM's prompt encoder. The original prompt encoder produces high-frequency signals, whereas the prompts generated by the MLP are smooth signals, so the sine function is used to map low frequencies to high frequencies and align the two representation spaces.

Loss function: the losses of this model include the binary classification loss and localization loss of the RPN, the classification loss of the semantic head, the regression loss of the localization head, and the segmentation loss computed through the frozen SAM mask decoder. The total loss combines these terms.

  • Query-based Prompter

Loss Function:

The training process mainly involves two key steps:

(1) The N predicted masks are matched against the k ground-truth instances using the Hungarian matching algorithm.
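
The matching step can be illustrated on a toy cost matrix. The sketch below is a brute-force stand-in, not the Hungarian algorithm itself: it enumerates every assignment, so it returns the same optimal matching on small inputs but does not scale. In practice a solver such as `scipy.optimize.linear_sum_assignment` is commonly used. The cost values and the name `match_predictions` are made up for the example.

```python
from itertools import permutations

def match_predictions(cost):
    """Return (pred_idx, gt_idx) pairs minimizing the total matching cost.

    cost[i][j] is the cost of assigning prediction i to ground truth j,
    with N = len(cost) predictions and k = len(cost[0]) ground truths, N >= k.
    """
    num_preds, num_gts = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Try every way of picking k distinct predictions, one per ground truth.
    for chosen in permutations(range(num_preds), num_gts):
        total = sum(cost[pred][gt] for gt, pred in enumerate(chosen))
        if total < best_total:
            best_total, best_assignment = total, chosen
    return [(pred, gt) for gt, pred in enumerate(best_assignment)], best_total

# Hypothetical costs for N = 4 predicted masks and k = 2 ground-truth instances.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
    [0.5, 0.4],
]
pairs, total = match_predictions(cost)
print(pairs)  # [(1, 0), (0, 1)]
```

Here prediction 1 is matched to ground truth 0 and prediction 0 to ground truth 1; predictions 2 and 3 stay unmatched. In DETR-style training, unmatched predictions are typically supervised as background.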

(2) Supervised training, mainly comprising multi-class classification and binary mask classification.

Experiments

  • Datasets

    1. WHU building extraction dataset: 1 class, RGB, 5K training images
    2. NWPU VHR-10 dataset: 10 classes, RGB, 600 training images
    3. SAR Ship Detection dataset (SSDD): 1 class, SAR, 900 training images

    Three public remote sensing instance segmentation datasets are used: the WHU building extraction dataset, the NWPU VHR-10 dataset and the SSDD dataset. WHU covers single-class building extraction and segmentation, NWPU VHR-10 covers multi-class object detection and segmentation, and SSDD covers SAR ship detection and segmentation.

  • Evaluation metric: mAP (box & mask)
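
Both box AP and mask AP rest on the intersection over union (IoU) between a prediction and a ground-truth instance: a prediction counts as correct at a given threshold only if its IoU reaches that threshold. A minimal sketch of the two IoU computations follows, with toy masks and boxes chosen for the example. Full AP evaluation additionally sorts predictions by confidence and integrates precision over recall, which is omitted here.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def box_iou(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 4x4 squares shifted by one pixel: 9 shared pixels, 23 in the union.
pred = np.zeros((8, 8), dtype=bool)
gt = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True
gt[3:7, 3:7] = True

iou = mask_iou(pred, gt)
print(round(iou, 3))  # 0.391
print(iou >= 0.5)     # False
```

With an IoU of about 0.39, this toy prediction would not count as a match at the 0.5 threshold.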

  • Comparison with the SOTA: WHU

The table compares the proposed method with other state-of-the-art methods on the WHU dataset, reporting box and mask AP (%) at different IoU thresholds.

  • Comparison with the SOTA: NWPU VHR-10

  • Comparison with the SOTA: SSDD

Observations from Tables 1-3: (1) AP improves significantly; (2) the method generalizes well to small datasets and to different domains; (3) the anchor-based and query-based prompters perform differently on different datasets (query outperforms anchor on medium and large datasets).


  • Ablation experiments

Various image encoders and their parameter counts, together with their segmentation performance on the NWPU dataset.

Impact on segmentation performance of feeding features from different levels of SAM's backbone into the feature aggregator. The notation [start: end: step] specifies the indices of the feature maps taken from start to end at intervals of step.
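
The [start: end: step] notation reads like Python slicing. A small sketch follows, assuming a 32-block backbone and an exclusive end index as in Python; the slices shown are arbitrary examples rather than the settings evaluated in the paper.

```python
num_blocks = 32  # assumed depth of the ViT backbone
block_indices = list(range(num_blocks))

def select_layers(start, end, step):
    """Indices of the backbone blocks whose feature maps are kept."""
    return block_indices[start:end:step]

print(select_layers(0, 32, 8))   # [0, 8, 16, 24]
print(select_layers(16, 32, 4))  # [16, 20, 24, 28]
```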

Effects of downsampling and residual connections in the feature aggregator on segmentation performance. The first row describes the final approach taken. Rs: reduce spatial dimension; Rc: reduce channel size; ARC: add residual connections; Pc: parallel architecture with feature concatenation.

Effect of varying the number of transformer encoder and decoder layers in the prompter on segmentation performance.

Impact of the number of queries and the number of prompt embeddings in the prompter on segmentation performance.

Impact on segmentation performance of sine regularization in the prompter, of adding extra trainable components to the mask decoder, and of using a multi-scale training mechanism.

Summary

  • Conclusion
    In this paper, we introduced RSPrompter, a prompt learning method for remote sensing instance segmentation built on the SAM foundation model. The goal of RSPrompter is to learn how to generate prompt inputs for SAM so that it can automatically produce semantic instance-level masks. In contrast, the original SAM requires additional manual prompts and is a class-agnostic segmentation method. The design concept of RSPrompter is not limited to SAM and can also be applied to other foundation models. Based on this concept, we designed two concrete implementations: RSPrompter-anchor, based on preset anchors, and RSPrompter-query, based on queries and optimal transport matching. Furthermore, we survey and propose various methods and variants for this task in the SAM community and compare them with our prompt learning method. The effectiveness of each component of RSPrompter was verified through ablation experiments. Experimental results on three public remote sensing datasets show that our method outperforms other state-of-the-art instance segmentation techniques as well as some SAM-based methods.

  • Discussion

    1. The decoder is computationally heavy; consider redesigning the head.
    2. The query-based prompter is direct and lightweight and performs better on medium and large datasets, but it converges slowly; consider optimizing it.
    3. When the dataset is small, prompt learning on top of a large model gives better performance.


Origin blog.csdn.net/Transfattyacids/article/details/132909195