CVPR 2022 | Image Segmentation Using Text and Image Prompts (CLIPSeg)


Target task: referring/zero-shot/one-shot segmentation
Target dataset: PhraseCut

Main Purpose


Based on CLIP's powerful zero-shot text and image encoding capabilities, this paper designs a new system that generates image segmentations from arbitrary prompts (arbitrary text or image prompts) at test time. The overall setup is very similar to that of few-shot segmentation.

This strategy addresses a limitation of the existing fixed-category segmentation paradigm: current image segmentation models are trained on a fixed set of target categories, which makes it inflexible to extend them to additional categories or more complex queries later, since the model has to be retrained to cover those new cases.

Such a design makes it easy to build a unified model that, after a single training run, can handle three common segmentation tasks:

  1. Referring expression segmentation: the target to segment is specified by a text prompt. All categories are seen during training, so generalization to unknown categories is not a concern.
  2. Zero-shot segmentation: the test set contains categories that are unseen during training and must be segmented by relating them to known categories. Existing methods establish this association through, e.g., category text embeddings.
  3. One-shot segmentation: at test time, segmenting the target category relies on a single related sample provided as a prompt (support data), usually an image with its corresponding mask.


Note that "unknown category" here means that the model does not know the set of possible categories in advance; its segmentation can still be guided by manually providing information about the target category.

Main Content


The whole method is built on top of CLIP, using CLIP (ViT-B/16) as the backbone and adding a small Transformer-based conditional segmentation decoder for the dense prediction task. As shown in Figure 2 of the paper, the parameters of both CLIP encoders are frozen, and only the green part is learnable.

The model relies on CLIP's joint text-visual embedding space to guide learning, which ensures that prompts in both text and image form can be processed. The idea is to teach the decoder to relate the activations inside CLIP to the output segmentation, while introducing as little dataset bias as possible and preserving CLIP's excellent and broad predictive power.

After training on an extended version of the PhraseCut dataset, the model produces a binary segmentation map corresponding to the query provided at the input, i.e., arbitrary text or a prompt image.

The authors note that this hybrid input form enables dynamic adaptation to the three segmentation tasks above, and also lets the model handle any binary segmentation task that can be expressed through a specific text or image. This reflects an important point: a text description is inherently somewhat subjective and ambiguous, while a specific guide image is inevitably constrained by its form and by the state of the target it shows, so neither alone is fully adequate for guiding the model toward a deterministic segmentation.

Decoder

The overall process is illustrated in Figure 2 of the paper. All CLIP parameters are frozen, and only a compact interface needs to be trained: with a feature embedding dimension of 64, CLIPSeg has only slightly more than 1M learnable parameters.

  • The query image is passed through the frozen CLIP visual encoder. When the prompt is visual, the support image and its segmentation mask are first merged into an engineered visual prompt before being fed to CLIP; this preprocessing of the query is called visual prompt engineering in the paper.
  • Intermediate activations of CLIP are read out and projected to the decoder's embedding dimension. The extracted activations (including the CLS token) are added before each transformer block of the decoder, so the number of decoder blocks equals the number of activations extracted from CLIP. In the actual model, activations are taken from CLIP layers 3, 7, and 9, so the decoder has only three layers.
  • The decoder generates binary segmentation using a linear projection.
  • To decode a specific segmentation target, FiLM is used to modulate the decoder's input with a conditional vector (see the sketch after this list).
    • FiLM comes from https://arxiv.org/pdf/1709.07871.pdf .
    • The conditional vector can be obtained in two ways: by embedding a text query with the CLIP text encoder, or by embedding a (visually prompted) image query with the CLIP visual encoder.
    • In fact, since CLIP builds a shared embedding space for images and text, the authors use a random linear interpolation between the two as the conditional vector and treat this as a data augmentation strategy during training (image-text interpolation). That is, the conditional vector is a linear combination of the image embedding $s_i$ and the text embedding $t_i$: $x_i = \alpha s_i + (1 - \alpha) t_i$, with $\alpha \in [0, 1]$.
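
The following is a minimal PyTorch sketch of these two ingredients, FiLM conditioning and image-text interpolation; the module and function names, dimensions, and exact placement of the modulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FiLMConditionedBlock(nn.Module):
    """Illustrative decoder block whose input tokens are modulated by a
    conditional vector via FiLM: x -> gamma(c) * x + beta(c)."""
    def __init__(self, dim=64, cond_dim=512, heads=4):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, dim)  # FiLM scale
        self.to_beta = nn.Linear(cond_dim, dim)   # FiLM shift
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, tokens, cond):
        # tokens: (B, N, dim) decoder tokens; cond: (B, cond_dim) conditional vector
        gamma = self.to_gamma(cond).unsqueeze(1)
        beta = self.to_beta(cond).unsqueeze(1)
        return self.block(gamma * tokens + beta)

def interpolate_condition(img_emb, txt_emb):
    """Image-text interpolation used as a training-time augmentation:
    x_i = alpha * s_i + (1 - alpha) * t_i with alpha ~ U[0, 1]."""
    alpha = torch.rand(img_emb.shape[0], 1, device=img_emb.device)
    return alpha * img_emb + (1 - alpha) * txt_emb
```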

PhraseCut+

PhraseCut originally contains 340,000 phrases with corresponding image segmentations, but no visual support samples. The authors extended this dataset:

  • To add a visual support image for a prompt p, samples are drawn randomly from the set S_p of all samples that share prompt p. For prompts with only a single sample, the model depends on the text prompt alone.
  • In addition, negative samples are introduced into the dataset, i.e., samples whose prompt has no matching target in the image. Specifically, the phrase of a sample is randomly replaced with a different phrase with a fixed probability q_neg.

Phrases are augmented randomly using a fixed set of prefixes (as suggested by the authors of CLIP). On images, we apply random cropping taking into account object positions, ensuring that objects remain at least partially visible. This paper refers to this extended dataset as PhraseCut+ (abbreviated as PC+).
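
A rough sketch of how such a training prompt might be assembled is given below; the prefix templates, the probability value, and the function name are hypothetical and only illustrate the negative-sample and prefix augmentation described above.

```python
import random

# Illustrative prefix templates; the actual set follows the CLIP authors' suggestions.
PREFIXES = ["a photo of {}.", "a photograph of {}.", "an image of {}.", "{}"]

def build_pcplus_sample(image, phrase, all_phrases, q_neg=0.2):
    """Hypothetical construction of one PhraseCut+ text prompt:
    with probability q_neg the phrase is swapped for a non-matching one
    (negative sample, empty target mask), and a random prefix is applied."""
    is_negative = random.random() < q_neg
    if is_negative:
        phrase = random.choice([p for p in all_phrases if p != phrase])
    prompt = random.choice(PREFIXES).format(phrase)
    return image, prompt, is_negative
```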

In contrast to the original PhraseCut dataset, which specifies the target through text only, PC+ supports training with image-text interpolation, so a joint model can be trained that handles both text and visual input.

One thing not mentioned here: how should the interpolation be handled when both text and an image are provided at test time?

Visual Prompt Engineering

Prompt-based learning essentially aims to introduce target-specific information as guidance, so how this information is obtained is a key question. It is discussed here from two perspectives: the feature level and the input level.


To evaluate different masking strategies, the authors run evaluations based on CLIP's text-image alignment on the LVIS dataset.

CLIP is used to compute the text embeddings t_i of the object names occurring in an image. These are compared with the visual embedding s_o of the original image and with the visual embedding s_h obtained after highlighting the target object through a modified image or an attention mask. The alignment vector (which can be understood as pairwise similarities) is normalized with a softmax; the resulting distributions are shown in Figure 3 of the paper.

The evaluation metric for the different masking strategies is defined as $\Delta P(\text{object}) = s_h^\top t_0 - s_o^\top t_0$, where $t_0$ is the text embedding of the target category. It measures how much the (softmax-normalized) similarity between the image embedding and the target text embedding changes when the target is highlighted, relative to the original image.
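
A small sketch of this metric, assuming pre-computed CLIP embeddings and following the softmax normalization described above (variable names are illustrative):

```python
import torch

def delta_p(s_o, s_h, text_embs, target_idx=0):
    """Change in the softmax-normalized alignment of the target object name
    when the object is highlighted in the image.
    s_o, s_h: (d,) original / highlighted image embeddings
    text_embs: (K, d) embeddings of all object names in the image"""
    p_o = torch.softmax(text_embs @ s_o, dim=0)
    p_h = torch.softmax(text_embs @ s_h, dim=0)
    return (p_h[target_idx] - p_o[target_idx]).item()
```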

Feature-level target-specific information extraction: Masked Pooling

The authors first discuss masked pooling, a technique commonly used in traditional convolution-based schemes to compute a prototype vector from the support image features. The way target-specific prompt information is extracted in this paper is conceptually quite similar.

However, since the architecture is Transformer-based, the semantic information is not only scattered across the feature map but is also aggregated hierarchically into the CLS token. It is therefore not possible to bypass the CLS token and derive the conditional vector directly from masked pooling of the feature map, because doing so would break the compatibility between CLIP's text embeddings and visual embeddings.

In a Transformer, the direct analogue of masked pooling is to apply the mask to the tokens. A vision transformer typically operates on a fixed set of tokens that interact at every layer through multi-head attention: a readout CLS token and the tokens associated with image regions, which are derived from the image patches.


The mask can restrict one (e.g., the last) or several transformer layers so that the CLS token interacts only with the patch tokens inside the mask. The experimental results are shown on the left of Table 2: this masking strategy performs only moderately well, with a slight improvement compared to full masking.
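
As an illustration of this idea (not the authors' code), the sketch below builds a boolean attention mask that lets the CLS token attend only to patch tokens covered by the object mask; the patch size and token layout are assumptions.

```python
import torch
import torch.nn.functional as F

def cls_attention_mask(obj_mask, patch_size=16, num_prefix_tokens=1):
    """Boolean attention mask for one transformer layer in which the CLS token
    attends only to itself and to patch tokens overlapping the object mask.
    obj_mask: (H, W) binary mask; returns (N, N) with True = attention allowed."""
    patch_hit = F.max_pool2d(obj_mask[None, None].float(), kernel_size=patch_size)
    patch_hit = patch_hit.flatten() > 0                        # (num_patches,)
    n_tokens = num_prefix_tokens + patch_hit.numel()
    allow = torch.ones(n_tokens, n_tokens, dtype=torch.bool)   # other tokens unrestricted
    allow[0, num_prefix_tokens:] = patch_hit                   # CLS -> in-mask patches only
    allow[0, 0] = True                                         # CLS -> itself
    return allow
```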

Target-Specific Information Extraction at the Input Level: Visual Prompt Engineering

Instead of applying masking inside the model, the mask and the image can be combined directly into a new image, which is then processed by the vision transformer. Analogous to prompt engineering in NLP, this is called visual prompt engineering here. Since this kind of prompt design is new and it is unclear a priori which form works best, a variety of variants are designed and evaluated, as shown in Table 2; how the mask and image are combined turns out to be very important. Three image operations are identified that improve the alignment between the text prompt and the image: decreasing the background brightness, blurring the background with a Gaussian filter, and cropping to the target. As Table 2 shows, the combination of all three operations gives the best results, and this form is used in the paper.
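
A possible implementation of the three operations is sketched below; the exact parameter values (darkening factor, blur strength, crop margin) are assumptions, not the values reported in the paper.

```python
import torch
import torchvision.transforms.functional as TF

def engineer_visual_prompt(image, mask, darken=0.5, blur_sigma=5.0, crop_margin=0.1):
    """Combine the three operations: darken the background, blur the background
    with a Gaussian filter, and crop around the target.
    image: (3, H, W) float tensor in [0, 1]; mask: (H, W) binary tensor."""
    m = mask[None].float()
    blurred = TF.gaussian_blur(image, kernel_size=[31, 31], sigma=blur_sigma)
    out = m * image + (1 - m) * blurred * darken            # background: blurred and darkened
    ys, xs = torch.nonzero(mask, as_tuple=True)             # bounding box of the target
    h, w = image.shape[1:]
    dy, dx = int(crop_margin * h), int(crop_margin * w)
    y0, y1 = max(int(ys.min()) - dy, 0), min(int(ys.max()) + dy + 1, h)
    x0, x1 = max(int(xs.min()) - dx, 0), min(int(xs.max()) + dx + 1, w)
    return TF.resize(out[:, y0:y1, x0:x1], [h, w], antialias=True)
```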

Experiment Details

Because the single-channel predictions have to be binarized, the authors pay special attention to the choice of threshold: computing IoU for binary segmentation requires a threshold. Although 0.5 is the natural choice in most cases, the optimal value can deviate significantly from 0.5 when the probability that an object matches the query differs between training and inference (the prior probability that one or more objects match the query depends strongly on content and dataset). The paper therefore reports one-shot segmentation performance with the threshold optimized per task and model.
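
A simple way to pick such a threshold on a validation set is sketched below (this is a generic routine, not the authors' exact protocol):

```python
import numpy as np

def best_threshold(prob_maps, gt_masks, thresholds=np.linspace(0.05, 0.95, 19)):
    """Return the binarization threshold that maximizes mean foreground IoU
    over predicted probability maps and ground-truth binary masks."""
    def mean_iou(t):
        ious = []
        for p, g in zip(prob_maps, gt_masks):
            pred = p >= t
            union = np.logical_or(pred, g).sum()
            inter = np.logical_and(pred, g).sum()
            ious.append(inter / union if union > 0 else 1.0)
        return float(np.mean(ious))
    return max(thresholds, key=mean_iou)
```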

Two baseline variants are designed in the experiments here:

  • CLIP-Deconv uses CLIP together with a very simple decoder consisting only of FiLM conditioning, a linear map, and a transposed convolution (a rough sketch of such a head follows this list). This helps estimate how much CLIP by itself contributes to the results.
  • ViTSeg uses the same structure as the proposed CLIPSeg, but the visual encoder is initialized with ImageNet pre-trained parameters instead of CLIP weights, while the text encoder keeps the CLIP initialization. This helps to understand the influence of the specific CLIP weights on performance.
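
The sketch below shows what such a minimal head could look like; the dimensions, layer sizes, and the FiLM placement are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SimpleFiLMDeconvHead(nn.Module):
    """Illustrative CLIP-Deconv-style head: FiLM conditioning, a linear map,
    and a transposed convolution that upsamples patch tokens to a logit map."""
    def __init__(self, clip_dim=768, cond_dim=512, dim=64, patch=16):
        super().__init__()
        self.film = nn.Linear(cond_dim, 2 * clip_dim)        # produces (gamma, beta)
        self.proj = nn.Linear(clip_dim, dim)
        self.up = nn.ConvTranspose2d(dim, 1, kernel_size=patch, stride=patch)

    def forward(self, patch_tokens, cond, grid_hw):
        # patch_tokens: (B, N, clip_dim) from the frozen CLIP visual encoder
        gamma, beta = self.film(cond).unsqueeze(1).chunk(2, dim=-1)
        x = self.proj(gamma * patch_tokens + beta)           # (B, N, dim)
        h, w = grid_hw                                       # N must equal h * w
        x = x.transpose(1, 2).reshape(x.size(0), -1, h, w)
        return self.up(x)                                    # (B, 1, H, W) segmentation logits
```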

Referring Segmentation


You can see in the comparison:

  • Different models have different optimal thresholds on different datasets.
  • ViTSeg's performance shows how important it is to use CLIP for both encoders at the same time.
  • For referring segmentation, training on the extended dataset does not help; training with text prompts only works better.

Zero-shot Segmentation


The evaluation is done directly with a model trained on PhraseCut+ from which the classes that are unseen in Pascal have been excluded. This is achieved by assigning the Pascal classes to WordNet synsets and generating a set of invalid words; prompts containing such a word are dropped from the dataset.

The authors note that Pascal-VOC, which this experiment focuses on, is a multi-label dataset, while the proposed model produces binary predictions. To meet the multi-label requirement, an independent binary map is therefore predicted for each of the 20 Pascal categories, and for each pixel the class with the highest probability among the 20 predictions is chosen.

This can be understood as follows: each image may contain more than one category, so the task is really multi-class, and since it is not known in advance which categories are present in a given input, all 20 category predictions are simply generated at once.
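
A minimal sketch of turning the 20 binary maps into a per-pixel labeling (the background handling via a 0.5 cutoff is an assumption, not necessarily the paper's rule):

```python
import torch

def pascal_labels_from_binary(prob_maps):
    """prob_maps: (20, H, W) foreground probabilities, one map per Pascal class.
    Returns a (H, W) label map with 0 reserved for background."""
    scores, labels = prob_maps.max(dim=0)   # most probable class per pixel
    labels = labels + 1                     # shift so that 0 means background
    labels[scores < 0.5] = 0                # assumed background threshold
    return labels
```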

A very interesting point in the results is that the proposed CLIPSeg performs better on unseen classes than on seen classes. The authors suspect this is because the seen classes are actually harder than the unseen ones; for example, the unseen-4 split uses the four classes airplane/cow/motorbike/sofa, most of which are large and fairly distinctive objects.


One-shot Segmentation

In this task, the target must be identified from a prompt image and its corresponding mask, so text labels cannot be relied upon.

Here the three operations mentioned above are used to process the hint image to highlight the target.

Here the same strategy is used to remove classes that overlap with Pascal.


Open Vocabulary


Here the predictions for three different kinds of text prompts are explored:

  • Generalized prompts, i.e., affordance-like prompts.
  • Descriptive prompts from the PhraseCut test set: the yellow area in the figure above additionally tests prompts that describe the wrong color.

The level of detail (blue box) and the color information (orange box) in the prompt strongly influence which object is predicted.


To quantitatively evaluate the performance on generalized prompts, the authors construct a subset of the LVIS test set that contains only images whose classes correspond to an affordance or attribute, and then ask the model to segment using these as prompts. For example, for the prompt "sit on", the foreground IoU is computed with respect to the armchair, sofa, and loveseat objects. A full list of class mappings can be found in the appendix of the paper.

Experiments show that the CLIPSeg version trained on PC+ performs better than both the CLIP-Deconv baseline and the CLIPSeg version trained on LVIS (which contains only object labels rather than complex phrases).

This result suggests that both dataset variability and model complexity are necessary for generalization.

ViTSeg performs worse, which is expected since it does not use the powerful CLIP backbone known for its generalization ability.

Ablation Study


The figure shows that CLIP's pre-trained weights are very important for the proposed CLIPSeg, and that training with the mixed prompt forms makes the model's performance on the individual input types more balanced.

Dependency analysis of CLIP on input size

Since multi-head attention does not require a fixed number of tokens, the transformer in CLIP can in principle handle inputs of arbitrary size. However, the publicly available CLIP models (ViT-B/16 and ViT-B/32) were trained on 224×224 images. Here CLIP's performance on a classification task is measured as a function of the input image size.

From the two CLIP visual encoders (ViT-B/16 and ViT-B/32), the last-layer CLS vector is extracted; using this vector as input, a logistic regression classifier is trained on an ImageNet subset to distinguish 67 vehicle classes.
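
A hedged sketch of this linear-probe setup using the OpenAI clip package; note that feeding resolutions other than 224 would require interpolating CLIP's positional embeddings first (not shown), and the data variables are placeholders.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from sklearn.linear_model import LogisticRegression

model, preprocess = clip.load("ViT-B/16", device="cpu")
model.eval()

@torch.no_grad()
def cls_features(images):
    """images: (B, 3, S, S) tensor normalized with CLIP statistics.
    encode_image returns the projected last-layer CLS embedding."""
    return model.encode_image(images).float().numpy()

# train_images/train_labels and test_images/test_labels are assumed to be the
# ImageNet vehicle subset (67 classes) resized to the probed resolution S.
# clf = LogisticRegression(max_iter=1000).fit(cls_features(train_images), train_labels)
# accuracy = clf.score(cls_features(test_images), test_labels)
```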


The results show that CLIP generally handles large image sizes well, with the ViT-B/16 version showing better performance at a size of about 350×350.

Text prompt form, target size and category


In order to better understand the performance of the model, different forms of text prompts, target sizes and categories are compared here.

Experiments are performed with CLIPSeg pre-trained on PC+. In all cases different prompt forms are randomly sampled during training. Here the performance on the PhraseCut test set of 5000 samples is evaluated.

  • Several prompt forms have little impact on performance.
  • There is a clear trend towards better performance on larger objects.
  • The performance of the different classes is fairly balanced.

Origin blog.csdn.net/P_LarT/article/details/130298712