
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

Summary

Paper Link
Code Link

  • This paper explores the potential of CLIP to localize different categories using only image-level labels and without further training
  • A new framework CLIP-ES is proposed:
  1. We introduce the softmax function into GradCAM and utilize the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and the background. We also re-explore the text input and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion.
  2. To simplify the CAM refinement stage, we propose a real-time Class-aware Attention-based Affinity (CAA) module based on the Multi-Head Self-Attention (MHSA) inherent in CLIP-ViTs.
  3. When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to focus on confident regions.

Method

[Figure: Overview of the CLIP-ES framework]
The overall framework is shown in the figure above. The softmax function is introduced into GradCAM, and a class-related background set is defined so that classes become mutually exclusive; K and M denote the numbers of categories in the image and in the background set, respectively. Initial CAMs are generated by Grad-CAM combined with well-designed text inputs (e.g., prompt selection and synonym fusion). A CAA module based on the MHSA intrinsic to the transformer then refines the initial CAMs in real time. The entire CAM-generation process requires no training, and the CGL ignores noisy locations when computing the loss, based on a confidence map.

Softmax-GradCAM

Class Activation Mapping (CAM) is widely used to identify discriminative regions of target classes through a weighted combination of feature maps. However, it is only applicable to specific CNN architectures, e.g., models with a Global Average Pooling (GAP) layer immediately after the feature maps. GradCAM instead uses gradient information to combine feature maps and therefore does not depend on a particular network architecture. For the original GradCAM, the class feature weights are computed as:
$$w_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial Y^c}{\partial A_{ij}^k}$$
where $w_k^c$ is the weight of the k-th feature map for the c-th class, Z is the number of pixels in the feature map, $Y^c$ is the score of the c-th class, and $A_{ij}^k$ is the activation of the k-th feature map at position (i, j). The CAM of class c at spatial position (i, j) is then obtained by the formula below, where ReLU discards features that have a negative influence on the target class:
$$M_{ij}^c = \mathrm{ReLU}\left(\sum_{k} w_k^c A_{ij}^k\right)$$
The pre-trained CLIP model comes in two architectures, ResNet-based and ViT-based. Note that Grad-CAM is applicable not only to CNN-based architectures but also to ViTs. In this paper we adopt the ViT-based CLIP model, since the CNN-based CLIP model cannot exploit the global context and is more easily disturbed by local discriminative regions.
Our work adapts GradCAM to CLIP. In vanilla GradCAM, the final score is the logit before the softmax function. Although CLIP is trained with a softmax-based cross-entropy loss, the class-confusion problem still shows up in experiments. We hypothesize that this is because CLIP is trained on image-text pairs rather than on a fixed set of independent categories: the text paired with an image may mention the visual concepts of several classes, so these classes never compete with each other through softmax. We therefore introduce the softmax function into GradCAM so that different categories become mutually exclusive. Specifically, the final score after softmax is computed as:
$$S^c = \frac{e^{Y^c}}{\sum_{c'=1}^{K+M} e^{Y^{c'}}}$$
where $S^c$ is the score of class c after softmax. The processed scores are used to compute the gradients, and the class feature weights become:

$$w_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial S^c}{\partial A_{ij}^k}$$
With softmax, the gradient of the target score also depends on the other classes, so the weights of feature maps that respond to non-target classes are suppressed and the CAM of the target class is rectified by the rest of the classes. However, this competition is limited to the categories defined in the dataset, which is why the class-related background set mentioned above is additionally introduced so that common background concepts can compete as well.
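To make the adaptation concrete, here is a minimal PyTorch sketch of the softmax-GradCAM computation. It is not the authors' implementation: the `feats` and `logits` tensors are hypothetical placeholders for the patch-level feature maps and the image-text similarity scores (K target classes plus the M class-related background classes) that would have to be hooked out of a real CLIP-ViT.

```python
import torch
import torch.nn.functional as F

def softmax_gradcam(feats: torch.Tensor, logits: torch.Tensor, target: int) -> torch.Tensor:
    """Grad-CAM with the class score taken after softmax, so that non-target
    classes and class-related backgrounds compete with the target class."""
    scores = logits.softmax(dim=0)                      # S^c: mutually exclusive scores
    # Gradient of the softmax score w.r.t. the feature maps (dS^c / dA^k_ij)
    grads = torch.autograd.grad(scores[target], feats)[0]
    weights = grads.mean(dim=(1, 2))                    # w^c_k = 1/Z * sum_ij dS^c/dA^k_ij
    cam = F.relu((weights[:, None, None] * feats).sum(dim=0))
    return cam / (cam.max() + 1e-8)                     # normalize to [0, 1]

# Toy usage: random tensors stand in for CLIP-ViT patch features and
# image-text similarity scores for K target classes + M background classes.
feats = torch.randn(64, 14, 14, requires_grad=True)    # (num_feature_maps, h, w)
text_emb = torch.randn(4, 64)                           # fake text embeddings, K + M = 4
logits = text_emb @ feats.mean(dim=(1, 2))              # fake class scores Y^c
cam = softmax_gradcam(feats, logits, target=0)
print(cam.shape)                                        # torch.Size([14, 14])
```

The only change relative to vanilla GradCAM is that the gradient is taken with respect to the softmax score $S^c$ instead of the raw logit $Y^c$.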

Text-driven Strategies

Sharpness-based Prompt Selection

We find that prompt ensembles perform differently on classification tasks and weakly supervised segmentation tasks.
We suspect that this discrepancy is mainly due to the different number of labels per image. Classification datasets, such as ImageNet, are single-label, while segmentation datasets, such as PASCAL VOC, are multi-label. The former aims to assign the maximum score to a unique object class, while the latter needs to consider all object classes in the image. We believe that prompt ensembling makes the highest-scoring object class more salient, which suits classification but not the multi-label setting.
To test this conjecture, we design a metric, sharpness, to measure the distribution of object-class scores for multi-label images under different prompts. The metric is inspired by the coefficient of variation, a widely used measure of dispersion in statistics. Assuming there are n images in the dataset and k (k >= 1) classes in one image, the sharpness of a specific prompt is computed as follows:
$$\mathrm{sharpness} = \frac{1}{n}\sum_{i=1}^{n}\frac{\mathrm{Var}(S_{i1},\ldots,S_{ik})}{\mathrm{Mean}(S_{i1},\ldots,S_{ik})}$$
where $S_{ij}$ denotes the softmax score of the j-th class in the i-th image. Since the coefficient of variation is unstable when the mean is close to 0, we use the variance instead of the standard deviation to highlight the effect of dispersion.
We compare the sharpness and the corresponding segmentation results of 20 ImageNet prompts randomly sampled from CLIP on the PASCAL VOC 2012 training set. The results show that the proposed metric is roughly negatively correlated with segmentation performance. Thus, sharpness can conveniently guide prompt selection, and only image-level labels are required. Through trial and error, we also find that abstract descriptions such as "origami" and "rendering", and adjectives such as "clean", "large" and "weird", have a positive impact on segmentation performance. We finally choose "a clean origami {}" as our prompt, since it has the lowest sharpness.
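As an illustration, here is a small sketch of how the sharpness metric could be computed, assuming we already have, for every image, the softmax scores of its ground-truth classes (image-level labels suffice to know which classes are present). The function name and the toy score values are ours, not from the paper.

```python
import torch

def sharpness(scores_per_image: list) -> float:
    """Average of variance / mean of the per-image class scores
    (coefficient of variation with variance in place of the std)."""
    vals = []
    for s in scores_per_image:                       # s: (k,) softmax scores of one image
        if s.numel() < 2:                            # a single class contributes zero variance
            vals.append(torch.tensor(0.0))
        else:
            vals.append(s.var(unbiased=False) / (s.mean() + 1e-8))
    return torch.stack(vals).mean().item()

# Lower sharpness: scores are spread more evenly over the classes present in the
# image, which correlates with better CAMs in the multi-label segmentation setting.
prompt_a = [torch.tensor([0.70, 0.20, 0.10]), torch.tensor([0.55, 0.45])]   # toy scores, prompt A
prompt_b = [torch.tensor([0.95, 0.03, 0.02]), torch.tensor([0.90, 0.10])]   # toy scores, prompt B
print(sharpness(prompt_a), sharpness(prompt_b))      # prompt A has the lower sharpness
```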

Synonym Fusion

The category names provided in a dataset are limited, so we use synonyms to enrich the semantics and remove ambiguity. There are several possible strategies to incorporate the semantics of different synonyms, e.g., at the sentence level, the feature level, or the CAM level. In this paper, we fuse synonyms at the sentence level; in particular, we put the different synonyms into one sentence, e.g., "a clean origami of a person, person, human being". This helps disambiguate polysemous words and is time-efficient, since the other strategies require multiple forward passes. Synonyms can easily be obtained from WordNet or from word embeddings such as GloVe. In addition, the performance of some classes can be further improved by tailoring specific words. For example, "person" CAMs tend to focus on the face, while the ground-truth segmentation masks cover the entire body; in CLIP, "person" and "clothes" are likely treated as two different concepts, so we replace "person" with "person wearing clothes".
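Below is a tiny sketch of sentence-level synonym fusion. The template follows the selected prompt, while the synonym lists shown are illustrative rather than the exact ones used in CLIP-ES.

```python
# Sentence-level synonym fusion: all synonyms go into a single prompt, so each
# class still needs only one text embedding / forward pass through the encoder.
TEMPLATE = "a clean origami {}."
SYNONYMS = {                                   # illustrative lists, not the paper's exact ones
    "person": ["person wearing clothes", "person", "human being"],
    "tvmonitor": ["tv monitor", "television", "screen"],
}

def build_prompt(class_name: str) -> str:
    names = SYNONYMS.get(class_name, [class_name])
    return TEMPLATE.format(", ".join(names))

print(build_prompt("person"))
# -> a clean origami person wearing clothes, person, human being.
```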

Class-aware Attention-based Affinity (CAA)

Recently, some works have used the attention obtained from a transformer as a semantic-level affinity to refine the initial CAMs. The improvement is limited, however, and they still need an extra network or extra layers to further refine the CAMs. This is because the original Multi-Head Self-Attention (MHSA) is class-agnostic, while CAMs are class-aware; exploiting MHSA directly can amplify noise by propagating noisy pixels to their semantically similar regions during refinement.
We propose Class-aware Attention-based Affinity (CAA) to improve on raw MHSA. Given an image, we obtain a CAM $M^c \in \mathbb{R}^{h\times w}$ for each object class c and the attention weights $W^{attn} \in \mathbb{R}^{hw\times hw}$ from MHSA.
Because the attention weights are asymmetric (queries and keys use different projection layers), we apply Sinkhorn normalization, i.e., alternating row and column normalization, to transform them into a doubly stochastic matrix D, from which a symmetric affinity matrix A is obtained:
$$D = \mathrm{Sinkhorn}(W^{attn}), \qquad A = \frac{D + D^{\top}}{2}$$
For the CAM map $M^c \in \mathbb{R}^{h\times w}$, we obtain a mask for each object class c by thresholding that class's CAM with λ. We then find the connected regions of the mask and cover each connected region with its minimum rectangular bounding box. These boxes mask the affinity matrix A, and each pixel is refined by its semantically similar pixels according to the masked affinity. To address the incompleteness of the initial CAMs, we use bounding-box masks instead of pixel masks so as to cover more object regions. The refinement is repeated several times, and the process can be formalized as follows:
$$\mathrm{vec}(M_{aff}^c) = \left(A \odot B^c\right)^{t}\,\mathrm{vec}(M^c)$$
where $B^c \in \mathbb{R}^{1\times hw}$ is the mask obtained from the CAM of class c, ⊙ is the Hadamard product, t is the number of refinement iterations, and vec(·) denotes matrix vectorization. Note that the attention map and the CAM are extracted in the same forward pass, so CAA refinement runs in real time and, unlike previous work, does not require an additional refinement stage.
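The following PyTorch sketch illustrates the CAA idea under simplifying assumptions: `attn` stands in for the (head-averaged) MHSA of a CLIP-ViT block, and the box-mask step uses a single bounding box over all pixels above the threshold instead of one box per connected region, so it should be read as a toy version of the procedure rather than the authors' code.

```python
import torch

def sinkhorn(attn: torch.Tensor, iters: int = 2) -> torch.Tensor:
    """Alternate row / column normalization to approximate a doubly stochastic matrix D."""
    d = attn.clamp(min=1e-8)
    for _ in range(iters):
        d = d / d.sum(dim=1, keepdim=True)          # row normalization
        d = d / d.sum(dim=0, keepdim=True)          # column normalization
    return d

def caa_refine(cam: torch.Tensor, attn: torch.Tensor, lam: float = 0.4, t: int = 2) -> torch.Tensor:
    """Refine a single-class CAM (h, w) with class-aware, box-masked affinity."""
    h, w = cam.shape
    d = sinkhorn(attn)
    a = (d + d.T) / 2                               # symmetric affinity matrix A

    mask = cam >= lam * cam.max()                   # threshold the CAM with lambda
    ys, xs = mask.nonzero(as_tuple=True)
    box = torch.zeros_like(cam)
    if ys.numel() > 0:                              # bounding box covering the thresholded region
        box[int(ys.min()):int(ys.max()) + 1, int(xs.min()):int(xs.max()) + 1] = 1.0
    b = box.reshape(1, -1)                          # B^c in R^{1 x hw}

    refined = cam.reshape(-1, 1)                    # vec(M^c)
    masked_a = a * b                                # A ⊙ B^c (affinity outside the box zeroed)
    for _ in range(t):                              # apply the masked affinity t times
        refined = masked_a @ refined
    return (refined / (refined.max() + 1e-8)).reshape(h, w)

# Toy usage with random tensors standing in for real attention / CAM.
h = w = 14
attn = torch.rand(h * w, h * w)
cam = torch.rand(h, w)
print(caa_refine(cam, attn).shape)                  # torch.Size([14, 14])
```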

Confidence-guided Loss (CGL)

Each pixel value in a CAM represents the confidence that the location belongs to the target class. Most methods generate pseudo-masks from the CAMs by simply setting a threshold to separate the target object from the background. This can introduce noise into the pseudo-masks, since low-confidence locations are too uncertain to be assigned the correct class. We therefore try to ignore those unconfident positions and propose a confidence-guided loss (CGL) to fully exploit the generated CAMs. Specifically, given the CAM map $X \in \mathbb{R}^{h\times w\times c}$ of an image with c object classes, the confidence map can be obtained as:
[Equations: definition of the confidence map from the CAMs and the resulting confidence-guided loss]
where L(i, j) is the cross-entropy loss between the segmentation model's prediction at pixel (i, j) and the pseudo-mask, and µ is a hyperparameter that controls which low-confidence pixels are ignored.
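Finally, here is a hedged sketch of what a confidence-guided loss can look like. The confidence rule used here (a pixel is confident if its maximum CAM value exceeds µ) is an assumption for illustration; the exact rule in CLIP-ES may differ, but the key idea, masking the per-pixel cross-entropy with a confidence map, is the same.

```python
import torch
import torch.nn.functional as F

def confidence_guided_loss(logits, pseudo_mask, cam, mu: float = 0.7):
    """logits: (B, C, H, W) segmentation predictions,
    pseudo_mask: (B, H, W) integer labels derived from the CAMs,
    cam: (B, C_obj, H, W) CAM values used to build the confidence map.
    Assumed confidence rule: max CAM value over classes > mu."""
    conf = (cam.max(dim=1).values > mu).float()                 # 1 = confident pixel
    per_pixel = F.cross_entropy(logits, pseudo_mask, reduction="none")  # L(i, j)
    return (per_pixel * conf).sum() / conf.sum().clamp(min=1.0)

# Toy usage with random tensors.
logits = torch.randn(2, 21, 32, 32)          # e.g. 20 VOC classes + background
pseudo = torch.randint(0, 21, (2, 32, 32))
cam = torch.rand(2, 20, 32, 32)
print(confidence_guided_loss(logits, pseudo, cam))
```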

Ablation Experiments

[Figure: ablation results from the paper]


Origin blog.csdn.net/qq_45745941/article/details/129946250