Extracting Class Activation Maps from Non-Discriminative Features as well

Prerequisite: you should be familiar with CAM (Class Activation Mapping) before reading.

Summary

Purpose:
Extracting class activation maps (CAMs) from classification models often yields low coverage of foreground objects: only the most distinguishable regions (e.g. the "head" of a "sheep") are identified, while the remaining regions (e.g. the sheep's "legs") are mistaken for background. The root cause is that the classifier weights (used to compute the CAM) capture only the discriminative features of the object.

Our approach:
Explicitly capture non-discriminative features as well, thereby extending the CAM to cover the entire object.

  • Specifically, the last pooling layer of the classification model is removed, and all local features of an object class are clustered, where "local" means "at spatial pixel locations". The resulting K cluster centers are called local prototypes; they represent local semantics such as the "head", "legs", and "body" of a "sheep".
  • Given a new image of the class, its unpooled features are compared against each prototype, resulting in K similarity maps, which are then aggregated into a heatmap (i.e., our CAM).
  • The resulting CAM therefore captures all local features of the class without discrimination.

Code link
Results

Related work

Image classification models are optimized to capture only discriminative local regions (features) of objects, resulting in poor object coverage by their CAMs.

Seed generation

Erasure-based approaches

  1. AE-PSL is an adversarial erasure strategy that operates iteratively: it masks the discriminative regions found in the current iteration to explicitly force the model to discover new regions in the next iteration.
  2. ACoL improves on this with a two-branch framework, in which one branch applies feature-level masking to the other.

However, both methods suffer from over-erasure, especially on small objects.

  1. CSE is a class-specific erasure method: a random object class is masked out based on its CAM, and the prediction of the erased class is then explicitly penalized. In this way, it gradually approaches the object boundary in the image. It not only finds more non-discriminative regions, but also alleviates the over-erasure problem (of the two methods above), because over-erased regions are penalized as well.

Erasure-based methods suffer from inefficiency, since they have to forward-propagate each image many times.

Other methods

  1. RIB explains the low coverage of CAM via the information bottleneck principle: the multi-label classifier is retrained with the activation function of the last layer removed, which encourages information from non-discriminative regions to be passed to the classifier.
  2. Others have empirically observed that classification models taking local image patches as input, rather than the entire image, can discover more discriminative regions. They propose a local-to-global attention transfer method consisting of a local network that generates local attention with rich object details for local views, and a global network that takes the global view as input and distills discriminative attention knowledge from the local network.
  3. There are also researchers exploring the use of contrastive learning, graph neural networks, and self-supervised learning to discover more non-discriminative regions.

Mask refinement

Segmentation methods

  1. Propagate the object regions in seeds to semantically similar pixels in neighboring regions, implemented as a random walk over a transition matrix whose elements are affinity scores. Related methods differ in how they design this matrix.
  2. PSA learns an AffinityNet to predict the semantic affinity between adjacent pixels.
  3. IRN learns an inter-pixel relation network to estimate class boundary maps and computes affinities based on them.
  4. BES learns to predict boundary maps using the CAM as pseudo ground truth.

All of these methods introduce additional network modules on top of the standard CAM.

Other methods

  1. Use saliency maps to extract context cues and combine them with object cues.
  2. EPS proposes a joint training strategy that combines CAM with saliency maps.
  3. EDAM introduces a post-processing step that merges the confident regions of saliency maps into the CAM.

Method

The method in this paper can be integrated into the methods above.

LPCAM Pipeline

(figure: LPCAM pipeline)

  • Use a standard ResNet-50 as the network backbone (i.e., feature encoder) of the multi-label classification model to extract features.
  • Before clustering and extracting local prototypes of the class and its context, we need coarse location information for the foreground and background, which we obtain with the conventional CAM. Given the features f(x) and the corresponding classifier weights w_n in the FC layer, we extract, for each individual class n:

    CAM_n(x) = ReLU(w_n^T f(x)) / max(ReLU(w_n^T f(x)))
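As a minimal NumPy sketch of this conventional-CAM step (the function name `cam` and the max-normalization with an epsilon are illustrative choices, not the paper's actual code):

```python
import numpy as np

def cam(features, w_n):
    """Conventional CAM for class n.

    features: (C, H, W) feature map f(x) from the backbone (before pooling).
    w_n:      (C,) classifier weights of class n from the FC layer.
    """
    heat = np.tensordot(w_n, features, axes=([0], [0]))   # (H, W) weighted sum over channels
    heat = np.maximum(heat, 0)                            # ReLU
    return heat / (heat.max() + 1e-8)                     # normalize to [0, 1]
```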

Clustering
We perform clustering for each individual class. Given an image sample x of class n, we spatially divide the feature map f(x) into a foreground set F and a background set B based on the CAM and a threshold τ:

    F = {f(x)_{i,j} | CAM_n(x)_{i,j} ≥ τ},  B = {f(x)_{i,j} | CAM_n(x)_{i,j} < τ}

We then run K-means clustering on F and B to obtain K cluster centers each, where K is a hyperparameter. We denote the foreground cluster centers by F = {F_1, ..., F_K} and the background cluster centers by B = {B_1, ..., B_K}.
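This split-and-cluster step can be sketched as follows. It is a toy NumPy implementation under stated assumptions: `split_and_cluster`, `tau`, and `K` are hypothetical names, and plain Euclidean K-means is used for brevity, whereas the paper clusters with cosine similarity:

```python
import numpy as np

def split_and_cluster(features, cam_map, tau=0.5, K=4, iters=20, seed=0):
    """Split local features into foreground/background by the CAM, then run
    K-means on each set to obtain K candidate prototypes per set.

    features: (C, H, W) feature map f(x); cam_map: (H, W) normalized CAM.
    Returns (F_centers, B_centers), each of shape (K, C).
    """
    C = features.shape[0]
    flat = features.reshape(C, -1).T              # (H*W, C) local features
    fg_mask = cam_map.reshape(-1) >= tau
    fg, bg = flat[fg_mask], flat[~fg_mask]

    def kmeans(X, K, iters, seed):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), K, replace=False)]  # random init
        for _ in range(iters):
            dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = dists.argmin(1)              # assign each point to a center
            for k in range(K):
                if (labels == k).any():
                    centers[k] = X[labels == k].mean(0)  # update centers
        return centers

    return kmeans(fg, K, iters, seed), kmeans(bg, K, iters, seed)
```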

Selecting Prototypes

Conventional CAMs give inaccurate or incomplete masks, so, for example, background features may be wrongly grouped into F. To address this, we need an "evaluator" to check whether each cluster center qualifies as a prototype. An intuitive choice is to use the classifier w_n as an automatic "evaluator": we use it to compute a prediction score for each cluster center F_i in F:

    z_i = σ(w_n^T F_i)

where σ is the sigmoid function. We then select the centers with high confidence, z_i > μ_f, where μ_f is a threshold, usually a high value such as 0.9. We denote the selected centers by F' = {F'_1, ..., F'_{K1}}. A confident prediction indicates a strong local feature of the class, i.e., a prototype.
Before using these local prototypes to generate LPCAM, we emphasize that our implementation not only preserves non-discriminative features but also suppresses strong contextual features (i.e., false positives), because context prototypes are convenient to extract and apply: similar to class prototypes, but used in the opposite way. We elaborate below. For each B_i in the set of context cluster centers B, we compute the prediction score in the same way as for F_i:

    z_i = σ(w_n^T B_i)

Intuitively, if the model is well trained on the class labels, its prediction for context features should be very low. We therefore choose the centers with z_i < μ_b (where μ_b is usually a value such as 0.5), and denote them as B' = {B'_1, ..., B'_{K2}}. Notably, our method is insensitive to the values of the hyperparameters μ_f and μ_b within reasonable ranges, e.g., μ_f should be a large value around 0.9; we verify this empirically in the ablation study.
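Both selection rules (z_i > μ_f for foreground, z_i < μ_b for background) can be sketched in one helper. This assumes a sigmoid multi-label classifier; `select_prototypes` and its parameters are hypothetical names, not the paper's code:

```python
import numpy as np

def select_prototypes(centers, w_n, mu, keep_high=True):
    """Score each cluster center with the class-n classifier weights and keep
    the confident ones: z > mu for foreground, z < mu for background.

    centers: (K, C) cluster centers; w_n: (C,) classifier weights of class n.
    """
    z = 1.0 / (1.0 + np.exp(-(centers @ w_n)))    # sigmoid prediction scores
    keep = z > mu if keep_high else z < mu
    return centers[keep]
```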

Generate LPCAM

For each prototype, we slide it over all spatial locations of the feature map and compute its similarity to the local feature at each location, using cosine similarity just as in K-means. This yields a cosine-similarity map between each prototype and the features. After computing all similarity maps (by sliding all local prototypes), we aggregate them as follows:

    FG_n = (1/|F'|) Σ_{F'_i ∈ F'} sim(F'_i, f(x)),   BG_n = (1/|B'|) Σ_{B'_i ∈ B'} sim(B'_i, f(x))

FG_n highlights the regions of class n in the input image, while BG_n highlights the context regions. The former should be preserved, and the latter (e.g., pixels highly correlated with the background) should be removed. We can therefore express LPCAM as:

    LPCAM_n = FG_n − BG_n
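A toy NumPy version of this aggregation: cosine-similarity maps are averaged per prototype set, then the background map is subtracted from the foreground map. The final ReLU and max-normalization are assumptions mirroring the conventional CAM, and `lpcam` is a hypothetical name:

```python
import numpy as np

def lpcam(features, fg_protos, bg_protos):
    """Aggregate cosine-similarity maps between local features and the selected
    foreground (F') / background (B') prototypes, then compute FG_n - BG_n.

    features: (C, H, W); fg_protos, bg_protos: (K', C) selected prototypes.
    """
    C, H, W = features.shape
    flat = features.reshape(C, -1).T                      # (H*W, C) local features
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)

    def mean_sim_map(protos):
        p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
        return (flat @ p.T).mean(1).reshape(H, W)         # averaged cosine map

    heat = np.maximum(mean_sim_map(fg_protos) - mean_sim_map(bg_protos), 0)
    return heat / (heat.max() + 1e-8)                     # assumed normalization
```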

Proof

To be filled in when I have time; see the paper for details.

Ablation study

(table: ablation results)
There is also a sensitivity analysis of the foreground and background thresholds μ_f and μ_b:
(figure: threshold sensitivity)

Conclusion

The crux of the low coverage of conventional CAMs is that the global classifier captures only the discriminative features of objects. This paper proposes a new method, LPCAM, that uses both discriminative and non-discriminative local prototypes to generate activation maps that cover objects completely.


Origin blog.csdn.net/qq_45745941/article/details/129911590