SAM [2]: Personalize-SAM


Foreword

PerSAM is a training-free personalization method for the Segment Anything Model (SAM): it customizes SAM efficiently using only one-shot data, i.e. a user-provided image and a coarse mask. At the same time, PerSAM is more an attempt to select prompts efficiently than to fine-tune SAM, so it may not be a good choice if you want to apply it to datasets other than natural images.

Original paper link: Personalize Segment Anything Model with One Shot


1. Abstract & Introduction

1.1. Abstract

Despite its generality, the problem of tailoring SAM to specific visual concepts without human prompting remains unexplored. This paper proposes a training-free personalization method for SAM, called PerSAM.

Given only a single image with a reference mask, PerSAM first localizes the target concept with a location prior, and then segments it in other images or videos using three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement.

To further alleviate mask ambiguity, the paper proposes PerSAM-F, an efficient one-shot fine-tuning variant. With the whole of SAM frozen, it introduces two learnable weights for the multi-scale masks, improving performance by training only 2 parameters within 10 seconds.

Furthermore, the method can also enhance DreamBooth for personalizing Stable Diffusion in text-to-image generation: background noise is discarded so the model learns the target appearance better.

1.2. Introduction

SAM is a recently proposed promptable segmentation paradigm with strong zero-shot capability: it takes a user prompt as input and returns the expected mask. The prompts SAM accepts are generic enough to include points, boxes, masks, and free-form text, so anything in a visual scene can be segmented. For details on SAM itself, see my other post: SAM [1]: Segment Anything.

However, SAM by itself cannot segment a specific visual concept on its own. For every image, the user must locate the target object against a different background and then activate SAM with precise prompts to perform the segmentation.

1.2.1. PERSAM

To this end, the paper proposes PerSAM, a training-free method for personalizing the Segment Anything Model (its operation is shown in the figure below).

(Figure: overview of the PerSAM pipeline)

PerSAM customizes SAM efficiently with one-shot data only (i.e. a user-supplied image and a rough mask). Specifically, PerSAM first uses SAM's image encoder and the given mask to encode an embedding of the target object in the reference image. Then PerSAM computes the feature similarity between the target object and all pixels of the new test image. On this basis, two points are selected as a positive-negative pair, encoded as prompt tokens, and used as SAM's location prior.

  • Target-guided Attention
    • The calculated feature similarity is used to guide the token-to-image cross-attention layers in SAM's decoder
    • This forces prompt tokens to be mainly concentrated in foreground object regions for efficient feature interaction
  • Target-semantic Prompting
    • To provide SAM with higher-level target semantics, the paper fuses the original low-level prompt tokens with the target object's embedding, giving the decoder richer visual cues for personalized segmentation
  • Cascaded Post-refinement
    • In order to obtain finer segmentation results, this paper adopts a two-step post-refinement strategy
    • SAM itself is used to progressively refine the mask it generates; this only costs an extra 100 ms

1.2.2. PerSAM-F

With the structural designs above, PerSAM can provide good personalized segmentation of the target subject in most situations. However, when the subject has a hierarchical structure and the scale is ambiguous, PerSAM struggles to decide which mask size to output (SAM provides 3 segmentation masks at different scales as candidates).

(Figure: mask-scale ambiguity and the PerSAM-F solution)

The fine-tuning variant PerSAM-F effectively alleviates this problem: the whole of SAM is frozen to preserve its pre-trained knowledge, and only 2 parameters are fine-tuned within 10 seconds.

Specifically, SAM is made to generate multiple segmentation results at different mask scales. To adaptively select the best scale for each object, learnable relative weights are assigned to the scales and their weighted sum is taken as the final mask output.

1.2.3. Improve DreamBooth

The paper proposes to segment the target object efficiently with PerSAM and to apply the Stable Diffusion supervision only to the foreground regions of the few captured images, leading to more diverse and higher-fidelity synthesis.

This effectively helps DreamBooth fine-tune Stable Diffusion better for personalized text-to-image generation.

(Figure: PerSAM-assisted DreamBooth)


2. Method

2.1. Preliminary

2.1.1. A Revisit of Segment Anything

SAM consists of three main parts: a prompt encoder, an image encoder, and a lightweight mask decoder, referred to in the paper as $Enc_P$, $Enc_I$, and $Dec_M$.

As a promptable framework, SAM takes an image $I$ and a set of prompts $P$ (such as foreground or background points, bounding boxes, or a rough mask to be refined). SAM first uses $Enc_I$ to obtain the input image features and then uses $Enc_P$ to encode the human-given prompts into $c$-dimensional tokens, i.e.

$$ F_I = Enc_I(I), \quad T_P = Enc_P(P) \tag{1} $$

The meaning of the parameters is as follows:

  • $F_I \in \mathbb{R}^{h \times w \times c}$: the encoded image features
    • The dimension ordering depends on how the image is read (SAM's source code reads images with cv2)
  • $T_P \in \mathbb{R}^{k \times c}$: the encoded prompt tokens
  • $h, w$: the spatial resolution of the image features
  • $k$: the number of prompt tokens

The encoded image and prompts are then fed into the decoder $Dec_M$ for attention-based feature interaction. SAM prepends several learnable mask tokens $T_M$ to the prompt tokens to build the decoder's input tokens; these mask tokens are responsible for generating the final mask output. The decoding process can be expressed as follows, where $M$ is the zero-shot mask predicted by SAM:

$$ M = Dec_M\big(F_I,\ \mathrm{Concat}(T_M, T_P)\big) \tag{2} $$
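
As a concrete illustration of Equations 1-2, here is a minimal sketch using the official segment-anything package (an assumption of this post, not code from the paper); the checkpoint path, image path, and example point are placeholders:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Enc_I: encode the image once and cache its features (F_I in Eq. 1).
image = cv2.cvtColor(cv2.imread("test.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Enc_P + Dec_M: encode a point prompt and decode the masks (Eq. 1-2).
point = np.array([[320, 240]])   # (x, y) pixel coordinate, placeholder
label = np.array([1])            # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,       # return 3 candidate masks at different scales
)
```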

2.1.2. Personalized Segmentation Task

This paper defines a new personalized segmentation task in which the user only provides a reference image and a mask marking the target visual concept. The given mask can be either a precise segmentation or a rough sketch drawn online by the user.

2.2. Training-free PerSAM

The overall process of training-free PerSAM is as follows:

(Figure: overall pipeline of training-free PerSAM)

2.2.1. Positive-negative Location Prior

Conditioned on the user-provided image $I_R$ and mask $M_R$, PerSAM uses SAM to obtain a location prior for the target object on a new test image $I$; the specific process is shown in the figure below:

  • One-shot: besides the test image, a reference image must be provided
  • Encode both the test image and the one-shot reference image with the image encoder, then compute their similarity
  • Select the two points with the highest and lowest similarity as the location prior
    • Highest -> Positive, Lowest -> Negative
  • SAM will tend to segment continuous regions around positive points while discarding negative points on test images

(Figure: positive-negative location prior)

The specific formula is derived as follows

The paper applies SAM's pre-trained image encoder to extract the visual features of $I$ and $I_R$:

$$ F_I = Enc_I(I), \quad F_R = Enc_I(I_R) \tag{3} $$

where $F_I, F_R \in \mathbb{R}^{h \times w \times c}$.

Then, the reference mask $M_R \in \mathbb{R}^{h \times w \times 1}$ is used to pick out, from $F_R$, the features of the pixels inside the target visual concept, and average pooling aggregates them into a global visual embedding $T_R \in \mathbb{R}^{1 \times c}$:

$$ T_R = \mathrm{AvgPool}\big(M_R \circ F_R\big) \tag{4} $$

where $\circ$ denotes spatial (element-wise) multiplication, which is often used to merge or modulate the pixel values of two images or feature maps: each element of the result is the product of the corresponding elements of the inputs.

With the target embedding $T_R$, a location confidence map can be obtained by computing the cosine similarity $S$ between $T_R$ and the test image features $F_I$:

$$ S = F_I\, T_R^{\top} \in \mathbb{R}^{h \times w} \tag{5} $$

To provide SAM with a location prior on the test image, the two pixel coordinates with the highest and the lowest similarity values in $S$ are selected and denoted $P_h$ and $P_l$. The former marks the most likely foreground position of the target object, while the latter conversely marks the background. They are then treated as a positive-negative point pair and fed into the prompt encoder, i.e.

$$ T_P = Enc_P\big(P_h, P_l\big) \tag{6} $$

where $T_P \in \mathbb{R}^{2 \times c}$ serves as the prompt tokens for SAM's decoder.

In this way, SAM tends to segment the contiguous region around the positive point while excluding the negative point on the test image.
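
The derivation above (Equations 3-6) boils down to a masked average pooling followed by a cosine-similarity argmax/argmin. Below is a minimal PyTorch sketch; `location_prior` is a hypothetical helper, and the feature maps are assumed to come from SAM's image encoder:

```python
import torch
import torch.nn.functional as F

def location_prior(feat_ref, feat_test, mask_ref):
    """Target embedding T_R, similarity map S, and positive/negative points (Eq. 3-6).

    feat_ref, feat_test: (h, w, c) features of the reference / test image from Enc_I
    mask_ref:            (h, w)    binary reference mask M_R, values in {0, 1}
    """
    h, w, c = feat_test.shape

    # Eq. 4: masked average pooling -> global target embedding T_R of shape (c,)
    masked = feat_ref * mask_ref.unsqueeze(-1)             # spatial multiplication M_R ∘ F_R
    t_r = masked.sum(dim=(0, 1)) / mask_ref.sum().clamp(min=1)
    t_r = F.normalize(t_r, dim=-1)

    # Eq. 5: cosine similarity between T_R and every test pixel -> S of shape (h, w)
    f_i = F.normalize(feat_test.reshape(-1, c), dim=-1)
    s = (f_i @ t_r).reshape(h, w)

    # Eq. 6 inputs: the highest / lowest similarity pixels become P_h / P_l
    idx_max, idx_min = int(s.argmax()), int(s.argmin())
    p_h = (idx_max // w, idx_max % w)                       # positive (foreground) prior
    p_l = (idx_min // w, idx_min % w)                       # negative (background) prior
    return t_r, s, p_h, p_l
```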

2.2.2. Target-guided Attention

The paper further proposes a more explicit way to guide the cross-attention mechanism in SAM's decoder, namely concentrating the feature aggregation on the foreground object region.

The similarity map $S$ from Equation 5 clearly indicates which pixels of the test image belong to the target visual concept, so $S$ is used to modulate the token-to-image attention maps in the cross-attention layers:

(Equation 7: each token-to-image cross-attention map is computed with an additional bias term, derived from the similarity map $S$, added to the attention scores before the softmax.)

Through this attention bias, the tokens are forced to capture more visual semantics related to the target object rather than the unimportant background. This makes the feature interaction in the attention layers more efficient and improves PerSAM's final segmentation accuracy.
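
The exact form of the bias is not spelled out here, so the sketch below only illustrates the idea: add a term derived from $S$ to the attention scores before the softmax. The balancing scalar `alpha` and the flattened similarity map are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def guided_cross_attention(q, k, v, sim_map, alpha=1.0):
    """Token-to-image cross-attention biased toward high-similarity pixels.

    q:       (n_tokens, c) query tokens (mask + prompt tokens)
    k, v:    (h*w, c)      keys / values from the image features
    sim_map: (h*w,)        similarity map S flattened to the feature resolution
    alpha:   illustrative balancing weight for the bias (assumption)
    """
    c = q.shape[-1]
    scores = q @ k.T / c ** 0.5                       # standard scaled dot-product scores
    scores = scores + alpha * sim_map.unsqueeze(0)    # bias every token toward the target region
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```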

2.2.3. Target-semantic Prompting

The original SAM can only receive prompts carrying low-level location information, such as the coordinates of points or boxes. To add more personalized cues, the paper additionally uses the visual embedding $T_R$ of the target concept as a high-level semantic prompt for PerSAM.

Specifically, the target embedding is first concatenated with all the input tokens of Equation 2:

(Equation 8: the target embedding $T_R$ is fused, via concatenation, with the input tokens $\mathrm{Concat}(T_M, T_P)$ of Equation 2.)

The fused tokens are then fed into each decoder block as shown in the figure below, which also marks where the Target-guided Attention of Section 2.2.2 is inserted in each decoder block.

(Figure: decoder block with target-semantic prompting and target-guided attention)

With this simple token integration, PerSAM obtains not only the low-level location prior but also high-level target semantics as auxiliary visual cues.
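
A short sketch of the token fusion, under the assumption that $T_R$ is simply concatenated to the decoder's input tokens of Equation 2 (shapes are illustrative):

```python
import torch

c = 256                       # SAM's token dimension
t_m = torch.randn(4, c)       # learnable mask tokens T_M
t_p = torch.randn(2, c)       # positive/negative point tokens T_P (Eq. 6)
t_r = torch.randn(1, c)       # target semantic embedding T_R (Eq. 4)

# Low-level location prompts + high-level semantic prompt fed to Dec_M together
decoder_tokens = torch.cat([t_m, t_p, t_r], dim=0)    # (7, 256)
```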

2.2.4. Cascaded Post-refinement

With the techniques above, an initial segmentation mask of the test image is obtained from SAM's decoder, but it may still contain rough edges and isolated background noise.

To further improve the result, the initial segmentation mask is sent back into SAM for a two-step post-refinement:

  • In the first step, the decoder is prompted with the initial segmentation mask together with the previous positive and negative location priors
  • In the second step, the bounding box of the first-step mask is computed and used to prompt the decoder for more precise object localization

The above post-processing operation only takes 100ms, but iterative refinement can be achieved efficiently
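
Using the segment-anything predictor from the earlier sketch, the two refinement steps might look as follows; `point`, `label`, and the first-pass outputs are assumed to carry over, and the box extraction is illustrative:

```python
import numpy as np

# Pick the best first-pass mask and its low-resolution logits (illustrative choice)
best = int(np.argmax(scores))
logits_init = logits[best][None, :, :]        # (1, 256, 256) as required by mask_input

# Step 1: prompt again with the initial mask plus the positive/negative point priors
masks1, scores1, logits1 = predictor.predict(
    point_coords=point,
    point_labels=label,
    mask_input=logits_init,
    multimask_output=False,
)

# Step 2: compute the bounding box of the step-1 mask and add it as a box prompt
ys, xs = np.nonzero(masks1[0])
box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])   # xyxy format
masks2, _, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    box=box,
    mask_input=logits1,                        # already (1, 256, 256)
    multimask_output=False,
)
final_mask = masks2[0]
```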

2.3. Fine-tuning of PerSAM-F

2.3.1. Ambiguity of Mask Scales

Training-free PerSAM handles most cases with satisfactory segmentation accuracy. However, some target objects have a hierarchical structure, which yields plausible masks at several different scales, and PerSAM is then sometimes unable to pick the mask of the appropriate scale.

(Figure: mask-scale ambiguity on the teapot example)

As shown in the figure above, the teapot on top of the platform consists of two parts: the lid and the body. If the positive prior (green star) lands on the body while the negative prior (red star) fails to exclude the similarly colored platform, PerSAM faces ambiguity in the segmentation.

SAM's own solution is to simultaneously generate masks at three scales, corresponding to the whole object, a part, and a sub-part. The user then has to pick one of the three masks manually, which works but costs extra human effort.
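
For reference, this is how the three candidate scales show up through the official predictor (continuing the earlier sketch); without PerSAM-F the user, or a heuristic, has to pick the index:

```python
import numpy as np

# An ambiguous point prompt -> SAM returns three masks (whole / part / sub-part)
masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,
)
print(masks.shape, scores)                # (3, H, W) plus one confidence score per scale
chosen = masks[int(np.argmax(scores))]    # a heuristic stand-in for the manual choice
```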

2.3.2. Learnable Scale Weights

To achieve adaptive segmentation at the appropriate mask scale, the paper introduces the fine-tuned variant PerSAM-F.

Following SAM's solution, PerSAM-F first outputs masks at three scales, denoted $M_1$, $M_2$, and $M_3$. On top of this, two learnable mask weights $w_1$ and $w_2$ are introduced, and the final mask is computed as a weighted summation:

$$ M = w_1 \cdot M_1 + w_2 \cdot M_2 + (1 - w_1 - w_2) \cdot M_3 \tag{9} $$

Here $w_1$ and $w_2$ are both initialized to 1/3. To learn the optimal weights, one-shot fine-tuning is performed on the reference image, with the given mask treated as the ground truth (GT).

Note that the entire SAM model is frozen to preserve its pre-trained knowledge; only the two parameters $w_1$ and $w_2$ are fine-tuned, which takes about 10 seconds.
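
A minimal sketch of that 2-parameter fine-tuning; the frozen mask logits, the loss choice, and the optimizer settings are illustrative, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

# Frozen SAM mask logits at three scales for the reference image (placeholders)
m1, m2, m3 = (torch.randn(1, 256, 256) for _ in range(3))
gt = (torch.rand(1, 256, 256) > 0.5).float()    # user-given mask treated as GT

# The only trainable parameters: two scale weights, both initialized to 1/3
w = torch.nn.Parameter(torch.tensor([1 / 3, 1 / 3]))
opt = torch.optim.AdamW([w], lr=1e-3)

for _ in range(1000):                            # finishes in seconds, even on CPU
    m = w[0] * m1 + w[1] * m2 + (1 - w[0] - w[1]) * m3   # Eq. 9 weighted summation
    loss = F.binary_cross_entropy_with_logits(m, gt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```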

2.4. Better Personalization of Stable Diffusion

2.4.1. A Revisit of DreamBooth

DreamBooth fine-tunes a pre-trained text-to-image model (such as Stable Diffusion) to synthesize images of a specific visual concept given by the user.

However, DreamBooth computes the L2 loss between the entire reconstructed image and the ground truth, so redundant background information is memorized and ends up overriding newly generated backgrounds.

2.4.2. PerSAM-assisted DreamBooth

PerSAM or PerSAM-F is used to segment the foreground object, and gradient back-propagation is discarded for pixels in the background region.

Stable Diffusion is thus fine-tuned only to memorize the visual appearance of the target object, with no supervision on the background, which preserves background diversity.

(Figure: PerSAM-assisted DreamBooth, masking the background out of the reconstruction loss)
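
A sketch of that masked reconstruction loss; the noise-prediction tensors and the latent-resolution foreground mask are placeholders for whatever the actual DreamBooth training loop produces:

```python
import torch
import torch.nn.functional as F

def masked_dreambooth_loss(noise_pred, noise_target, fg_mask):
    """DreamBooth L2 loss restricted to the PerSAM foreground mask.

    noise_pred, noise_target: (b, c, h, w) predicted / target noise in latent space
    fg_mask: (b, 1, h, w) foreground mask from PerSAM, resized to the latent resolution
    """
    loss = F.mse_loss(noise_pred, noise_target, reduction="none")
    loss = loss * fg_mask                          # background pixels contribute no gradient
    return loss.sum() / fg_mask.sum().clamp(min=1)
```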


Summary

The most prominent contributions of this paper are as follows:

  • With one-shot customization, the model no longer needs a hand-made prompt for every image, which is an important reference for domains such as medical imaging, where labels are hard to obtain.
  • At the same time, fine-tuning such a large model is heavy work; the method here customizes SAM with only two parameters, pointing out a new direction for subsequent SAM fine-tuning work.


Origin blog.csdn.net/HoraceYan/article/details/131761288