Foreword
PerSAM is a training-free personalization method for the Segment Anything Model (SAM): using only one-shot data, i.e. a user-provided image and a coarse mask, it customizes SAM efficiently. At the same time, PerSAM is more an attempt to efficiently select prompts than a fine-tuning of SAM, so it may not be a good choice if you want to apply it to datasets other than natural images.
Original paper link: Personalize Segment Anything Model with One Shot
1. Abstract & Introduction
1.1. Abstract
Despite SAM's generality, the problem of tailoring it to specific visual concepts without human prompting remains unexplored. This paper proposes a training-free personalization method for SAM, called PerSAM.
Given only one image with a reference mask, PerSAM first localizes the target concept via a location prior, then applies target-guided attention, target-semantic prompting, and cascaded post-refinement to segment the target concept in other images or videos.
To further alleviate mask ambiguity, this paper proposes an efficient one-shot fine-tuning variant, PerSAM-F. With the whole SAM frozen, we introduce two learnable weights for the multi-scale masks; training only these 2 parameters for 10 seconds improves the performance.
Furthermore, our method can also enhance DreamBooth for personalized Stable Diffusion text-to-image generation, discarding background noise and better learning the target appearance.
1.2. Introduction
SAM is a recently proposed promptable segmentation paradigm with strong zero-shot capabilities: it takes the user's prompt as input and returns the expected mask. The prompts SAM accepts are generic enough to include points, boxes, masks, and free-form text, so anything in the visual environment can be segmented. For the specifics of SAM, you can refer to my other blog: SAM [1]: Segment Anything
However, SAM by itself cannot segment a specific visual concept. For each image, the user needs to locate the target object against a different background, and then activate SAM with precise prompts to perform the segmentation.
1.2.1. PerSAM
To this end, this paper proposes PerSAM (the specific procedure is shown in the figure below), a training-free personalization method for the Segment Anything Model.
PerSAM customizes SAM efficiently using only one-shot data (i.e. a user-supplied image and a rough mask). Specifically, PerSAM first uses SAM's image encoder and the given mask to encode an embedding of the target object in the reference image. Then, PerSAM computes the feature similarity between the target object and all pixels of the new test image. On this basis, PerSAM selects two points as a positive-negative pair, encodes them as prompt tokens, and uses them as location priors for SAM.
Target-guided Attention
- The computed feature similarity guides the cross-attention in each token-to-image layer of SAM's decoder
- This forces the prompt tokens to concentrate mainly on foreground object regions for efficient feature interaction
Target-semantic Prompting
- To provide SAM with higher-level target semantics, this paper fuses the original low-level prompt tokens with the target object's embedding, which gives the decoder richer visual cues for personalized segmentation
Cascaded Post-refinement
- To obtain finer segmentation results, this paper adopts a two-step post-refinement strategy
- The mask generated by SAM is refined step by step, and this process takes only an additional 100ms
1.2.2. PerSAM-F
Through the above design, PerSAM can deliver good personalized segmentation of the target subject in various situations. However, if the subject to be segmented has a hierarchical structure and is therefore ambiguous, PerSAM will find it difficult to pick a mask of the appropriate scale as the segmentation output (SAM provides 3 segmentation masks at different scales as alternatives).
The fine-tuning variant PerSAM-F effectively alleviates this problem: it freezes the whole SAM to preserve its pre-trained knowledge and fine-tunes only 2 parameters within 10 seconds.
Specifically, this paper lets SAM generate multiple segmentation results at different mask scales. To adaptively select the best scale for different objects, we adopt a learnable relative weight for each scale and take the weighted summation as the final mask output.
1.2.3. Improve DreamBooth
This paper proposes to use PerSAM to efficiently segment the target object and to apply the Stable Diffusion supervision only to the foreground regions of the few captured images, leading to more diverse and higher-fidelity synthesis. This effectively helps DreamBooth fine-tune Stable Diffusion better for personalized text-to-image generation.
2. Method
2.1. Preliminary
2.1.1. A Revisit of Segment Anything
SAM consists of three main parts: a prompt encoder, an image encoder, and a lightweight mask decoder, denoted in this paper as $Enc_P$, $Enc_I$, and $Dec_M$.
As a promptable framework, SAM takes as input an image $I$ and a set of prompts $P$ (such as foreground or background points, bounding boxes, or rough masks to be refined). SAM first uses $Enc_I$ to obtain the input image features, and then uses $Enc_P$ to encode the human-given prompts into $c$-dimensional tokens, i.e.
$$F_I = Enc_I(I), \quad T_P = Enc_P(P) \tag{1}$$
The meaning of the parameters is as follows:
- $F_I \in \mathbb{R}^{h \times w \times c}$
  - The arrangement of the dimensions depends on how the data is read, and differs between reading methods (the SAM source code uses cv2 to read data)
- $T_P \in \mathbb{R}^{k \times c}$
- $h, w$: the resolution of the image features
- $k$: the length of the prompt
Then, the encoded image and prompts are fed into the decoder $Dec_M$ for attention-based feature interaction. SAM prefixes the prompt tokens with several learnable mask tokens $T_M$ to build the decoder's input tokens; these mask tokens are responsible for generating the final mask output. The decoding process can be expressed as
$$M = Dec_M\big(F_I, \mathrm{Concat}(T_M, T_P)\big) \tag{2}$$
where $M$ is SAM's zero-shot mask prediction.
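The encode-and-decode pipeline above can be sketched with mock encoders. This is a NumPy toy that only reproduces the tensor shapes, not SAM's real API; `enc_i`, `enc_p`, and `dec_m` are placeholder names standing in for $Enc_I$, $Enc_P$, and $Dec_M$:

```python
import numpy as np

h, w, c, k = 64, 64, 256, 2             # feature resolution, channels, prompt count
rng = np.random.default_rng(0)

# Mock stand-ins for the three SAM components (shapes only, not the real API)
def enc_i(image):                        # image -> F_I ∈ R^{h×w×c}
    return rng.random((h, w, c))

def enc_p(prompts):                      # k prompts -> T_P ∈ R^{k×c}
    return rng.random((len(prompts), c))

def dec_m(F_I, tokens):                  # (F_I, Concat(T_M, T_P)) -> mask M
    return np.zeros((h, w))              # placeholder zero-shot mask

F_I = enc_i(None)
T_P = enc_p([(10, 20), (30, 40)])        # two point prompts
T_M = rng.random((4, c))                 # learnable mask tokens prefixed by SAM
M = dec_m(F_I, np.concatenate([T_M, T_P], axis=0))
```

The mask tokens are prepended along the token axis, so the decoder sees a single `(4 + k, c)` token sequence.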
2.1.2. Personalized Segmentation Task
This paper defines a new personalized segmentation task where the user only needs to provide a reference image and a mask representing the target visual concept. The given mask can be either a precise segmentation mask or a rough sketch drawn by the user online.
2.2. Training-free PerSAM
The overall process of training-free PerSAM is as follows:
2.2.1. Positive-negative Location Prior
Taking the user-provided image $I_R$ and mask $M_R$ as conditions, PerSAM uses SAM to obtain a location prior for the target object on the new test image $I$; the specific process is shown in the figure below:
- One-shot: in addition to the test image, a reference image needs to be provided
- Calculate the similarity after encoding the test image and one-shot image through Image Encoder
- Select the two points with the highest and lowest similarity as the location prior
- Highest -> Positive, Lowest -> Negative
- SAM will tend to segment continuous regions around positive points while discarding negative points on test images
The specific formulas are derived as follows.
This paper applies SAM's pre-trained image encoder to extract the visual features of $I$ and $I_R$:
$$F_I = Enc_I(I), \quad F_R = Enc_I(I_R) \tag{3}$$
where $F_I, F_R \in \mathbb{R}^{h \times w \times c}$
Then, using the reference mask $M_R \in \mathbb{R}^{h \times w \times 1}$, the features of the pixels within the target visual concept are taken from $F_R$ and aggregated by average pooling into a global visual embedding $T_R \in \mathbb{R}^{1 \times c}$:
$$T_R = \mathrm{Pooling}(M_R \circ F_R) \tag{4}$$
where $\circ$ denotes spatial (element-wise) multiplication. Spatial multiplication is often used to merge or modify the pixel values of two images or feature maps: each element of the result is the product of the corresponding elements of the inputs.
With the target embedding $T_R$, the location confidence map can be obtained by computing the cosine similarity $S$ between $T_R$ and the test image features $F_I$ (both L2-normalized):
$$S = F_I T_R^{T} \in \mathbb{R}^{h \times w} \tag{5}$$
To provide SAM with the location prior on the test image, this paper selects from $S$ the two pixel coordinates with the highest and the lowest similarity values, denoted $P_h$ and $P_l$. The former represents the most likely foreground position of the target object, while the latter inversely represents the background. They are then treated as a positive-negative point pair and fed into the prompt encoder, i.e.
$$T_P = Enc_P(P_h, P_l) \in \mathbb{R}^{2 \times c} \tag{6}$$
which serves as the prompt tokens for SAM's decoder.
In this way, SAM tends to segment the continuous region around the positive point while discarding the negative point on the test image.
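The masked pooling, cosine similarity, and point selection described above can be sketched in NumPy. The shapes here are assumed toy values; real PerSAM operates on SAM encoder features:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c = 64, 64, 256
F_I = rng.random((h, w, c))            # test image features
F_R = rng.random((h, w, c))            # reference image features
M_R = np.zeros((h, w, 1))
M_R[20:40, 20:40] = 1                  # reference mask of the target concept

# Masked average pooling -> global target embedding T_R
T_R = (M_R * F_R).sum(axis=(0, 1)) / M_R.sum()
T_R /= np.linalg.norm(T_R)             # L2-normalize for cosine similarity

# Cosine similarity between T_R and every pixel of F_I -> confidence map S
F_n = F_I / np.linalg.norm(F_I, axis=-1, keepdims=True)
S = F_n @ T_R                          # S ∈ R^{h×w}

# Positive / negative location priors
P_h = np.unravel_index(S.argmax(), S.shape)   # most likely foreground pixel
P_l = np.unravel_index(S.argmin(), S.shape)   # most likely background pixel
```

`P_h` and `P_l` are then the positive-negative point pair fed to SAM's prompt encoder.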
2.2.2. Target-guided Attention
This paper further proposes a more explicit way to guide the cross-attention mechanism in SAM's decoder, namely concentrating the feature aggregation on the foreground object region.
The similarity map $S$ computed in Equation 5 clearly indicates the pixels belonging to the target visual concept on the test image, so this paper uses $S$ to condition the token-to-image attention maps in the cross-attention layers.
Through this attention bias, the tokens are forced to capture more visual semantics related to the target object rather than the unimportant background. This enables more efficient feature interaction in the attention layers and improves PerSAM's final segmentation accuracy.
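One plausible single-head sketch of such an attention bias is below; the exact modulation (scaling, normalization, and which layers are biased) in the paper may differ, so treat this as an illustration of the idea rather than the paper's formula:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
h, w, c, n_tok = 8, 8, 32, 6
S = rng.random((h, w))                 # similarity map from Equation 5
Q = rng.random((n_tok, c))             # token queries
K = rng.random((h * w, c))             # flattened image keys

A = Q @ K.T / np.sqrt(c)               # raw token-to-image attention logits
bias = softmax(S.reshape(-1))          # normalized similarity as additive bias
A_guided = softmax(A + bias[None, :])  # every token is nudged toward target pixels
```

Pixels with high similarity to the target receive a larger additive term, so each token's attention distribution shifts toward the foreground region.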
2.2.3. Target-semantic Prompting
The original SAM can only receive prompts carrying low-level location information, such as the coordinates of points or boxes. To add more personalized cues, this paper proposes to additionally use the visual embedding $T_R$ of the target concept as high-level semantic prompt tokens for PerSAM.
Specifically, we first concatenate the target embedding with all the input tokens of Equation 2, and then feed them into each decoder block as shown in the figure below; the position where the Target-guided Attention of Section 2.2.2 is inserted in each decoder block is also shown in the figure below.
With this simple token integration, PerSAM obtains not only the low-level location prior information but also high-level target semantics as auxiliary visual cues.
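The token integration is a concatenation along the token axis. A minimal sketch follows; the ordering of tokens here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 256
T_M = rng.random((4, c))    # SAM's learnable mask tokens
T_P = rng.random((2, c))    # positive/negative prompt tokens
T_R = rng.random((1, c))    # high-level target semantic embedding

# Concatenate the target embedding with the decoder's input tokens
tokens = np.concatenate([T_R, T_M, T_P], axis=0)   # (7, c)
```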
2.2.4. Cascaded Post-refinement
With the above techniques, the user can obtain an initial segmentation mask of the test image from SAM's decoder, but it may contain rough edges and isolated background noise.
To further improve the segmentation result, this paper sends the initial segmentation mask back into SAM for two-step post-processing:
- In the first step, the decoder is prompted with the initial segmentation mask together with the previous positive-negative location priors
- In the second step, the bounding box of the first-step mask is computed and used to prompt the decoder for more precise object localization
The above post-processing takes only an extra 100ms, yet achieves iterative refinement efficiently.
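The two refinement steps can be sketched with a hypothetical `decode` function standing in for a call to SAM's decoder (this is not the real segment-anything API; `decode` and `mask_to_box` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(points=None, mask=None, box=None):
    """Hypothetical stand-in for SAM's decoder (not the real API)."""
    return rng.random((256, 256)) > 0.5          # a binary mask

def mask_to_box(mask):
    """Tight bounding box (x0, y0, x1, y1) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

priors = [((10, 20), 1), ((200, 220), 0)]        # (point, pos/neg label)
mask0 = decode(points=priors)                    # initial mask from the priors

# Step 1: prompt again with the initial mask plus the same point priors
mask1 = decode(points=priors, mask=mask0)

# Step 2: prompt with the bounding box of the step-1 mask
mask2 = decode(points=priors, box=mask_to_box(mask1))
```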
2.3. Fine-tuning of PerSAM-F
2.3.1. Ambiguity of Mask Scales
Training-free PerSAM can handle most cases and achieve satisfactory segmentation accuracy. However, some target objects contain hierarchical structures that yield multiple masks at different scales during segmentation, so PerSAM is sometimes unable to choose a mask at the appropriate scale.
As shown in the figure above, the teapot on top of the platform consists of two parts: the lid and the body. If the positive prior (green star) lies on the pot body but the negative prior (red star) fails to exclude the similarly colored platform, PerSAM's segmentation will be ambiguous.
The solution given by SAM is to simultaneously generate masks at three scales, corresponding to the whole object, a part, and a sub-part. The user then has to manually select one of the three masks, which works but costs extra manpower.
2.3.2. Learnable Scale Weights
To achieve adaptive segmentation with an appropriate mask scale, this paper introduces a fine-tuned variant, PerSAM-F.
PerSAM-F first follows SAM's solution and outputs masks at three scales, denoted $M_1$, $M_2$, and $M_3$. On this basis, this paper adopts two learnable mask weights $w_1$, $w_2$ and computes the final mask output by weighted summation:
$$M = w_1 \cdot M_1 + w_2 \cdot M_2 + (1 - w_1 - w_2) \cdot M_3 \tag{7}$$
where $w_1$ and $w_2$ are both initialized to $1/3$. To learn the optimal weights, we perform one-shot fine-tuning on the reference image, treating the given mask as the ground truth (GT).
Note that this paper freezes the entire SAM model to preserve its pre-trained knowledge, and only fine-tunes the two parameters $w_1$ and $w_2$, which takes 10 seconds.
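The weighted summation over the three scales is a one-liner; in the real method `w1` and `w2` are the only trainable parameters, optimized against the reference mask, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = 1 / 3, 1 / 3                  # the ONLY two learnable parameters
M1 = rng.random((256, 256))            # whole-object mask
M2 = rng.random((256, 256))            # part mask
M3 = rng.random((256, 256))            # sub-part mask

# Final mask as the weighted summation of the three scale candidates
M = w1 * M1 + w2 * M2 + (1 - w1 - w2) * M3
```

At initialization ($w_1 = w_2 = 1/3$) this is simply the mean of the three masks; fine-tuning shifts the weights toward the scale that best matches the given ground-truth mask.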
2.4. Better Personalization of Stable Diffusion
2.4.1. A Revisit of DreamBooth
DreamBooth fine-tunes a pre-trained text-to-image model (such as Stable Diffusion) to synthesize images of a specific visual concept specified by the user.
However, DreamBooth computes the L2 loss between the entire reconstructed image and the ground truth, so redundant background information is baked into the model, which then overrides newly generated backgrounds.
2.4.2. PerSAM-assisted DreamBooth
PerSAM or PerSAM-F is used to segment the foreground object, and gradient back-propagation is discarded for the pixels in the background region.
Stable Diffusion is thus fine-tuned to memorize only the visual appearance of the target object; with no supervision on the background, its diversity is preserved.
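The masked supervision can be sketched in pixel space as below; in the real pipeline the reconstruction loss lives in the diffusion latent space, so this is only an illustration of restricting the L2 loss to the foreground:

```python
import numpy as np

rng = np.random.default_rng(0)
recon = rng.random((64, 64, 3))        # image reconstructed during fine-tuning
gt = rng.random((64, 64, 3))           # user-provided training image
fg = np.zeros((64, 64, 1))
fg[16:48, 16:48] = 1                   # foreground mask produced by PerSAM

# Vanilla DreamBooth: L2 loss over the whole image (background included)
loss_full = ((recon - gt) ** 2).mean()

# PerSAM-assisted: supervise foreground pixels only, so background
# gradients are discarded and background diversity is preserved
loss_fg = ((fg * (recon - gt)) ** 2).sum() / (fg.sum() * 3)
```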
Summary
The most prominent contributions of this paper are as follows:
- Through one-shot learning, the model no longer needs a prompt generated from a mask for every image, which is of great reference value for medical images, whose labels are difficult to obtain.
- At the same time, fine-tuning such a large-scale model is heavy work. The method proposed in this paper customizes SAM with only two parameters, which points out a new direction for subsequent SAM fine-tuning work.