Learning Deep Features for Discriminative Localization paper notes

Paper: https://arxiv.org/abs/1512.04150
Project page: http://cnnlocalization.csail.mit.edu

This paper proposes a way to study the interpretability of CNNs: by using global average pooling, the network's implicit attention can be recovered, showing which image regions drive each class prediction.

Motivation

Global average pooling (GAP) has usually been used as a structural regularizer when training convolutional neural networks, but the authors found that it also gives the network a genuine localization ability. Because the fully connected layers at the end of a typical network discard spatial information, their output is hard to interpret spatially. To keep the result spatially interpretable, the authors replace the final fully connected layers with GAP, so that after a simple weighting step the feature maps of the last convolutional layer reveal which regions the network relies on for recognition, i.e., its localization ability.

Methods

The authors propose Class Activation Mapping (CAM), generated as shown in the figure below. The training setup is adjusted: GAP is added after the last convolutional layer of the network, the original fully connected layers are removed, and a single linear layer produces the score for each class, which softmax turns into a probability. After the network converges, the class activation map for class c is obtained by weighting each feature map f_k of the last convolutional layer by the corresponding class weight w_k^c and summing: M_c(x, y) = Σ_k w_k^c · f_k(x, y). Upsampling this map to the input resolution gives a heat map, and overlaying it on the original image produces the visualization shown on the right of the figure.
[Figure: CAM generation pipeline]
The authors also note that the last convolutional layer should keep a spatial resolution of roughly 14×14; the layers after it are removed and the GAP layer is connected directly to it.
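
To make the weighting step concrete, here is a minimal PyTorch sketch of the CAM computation (my own illustration, not the authors' released code). It assumes a torchvision ResNet-18, which already ends in GAP followed by a single fully connected layer and therefore fits the CAM recipe directly; the function name and the normalization at the end are my own choices.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# ResNet-18 already ends in GAP + a single fc layer, so the CAM recipe applies directly.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
backbone = torch.nn.Sequential(*list(model.children())[:-2])  # everything before GAP/fc

def compute_cam(image, class_idx=None):
    """image: (1, 3, H, W) normalized tensor. Returns (CAM upsampled to HxW, class index)."""
    with torch.no_grad():
        feats = backbone(image)                      # feature maps f_k, shape (1, K, h, w)
        logits = model.fc(feats.mean(dim=(2, 3)))    # GAP then fc -> class scores
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        # M_c(x, y) = sum_k w_k^c * f_k(x, y)
        weights = model.fc.weight[class_idx]         # (K,)
        cam = torch.einsum('k,khw->hw', weights, feats[0])
        # Upsample to input resolution and rescale to [0, 1] for visualization
        cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                            mode='bilinear', align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam, class_idx
```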

Experiment

Models: AlexNet, VGGnet, GoogLeNet, GoogLeNet-GMP
Dataset: ILSVRC

  1. Classification performance comparison:
    [figure: classification results]
  2. Localization performance comparison:
    [figure: localization results]
  3. Comparison between GAP and GMP (global average pooling vs. global max pooling):
    [figure: GAP vs. GMP comparison]
  4. In addition, the authors obtain a bounding box for the target by thresholding the CAM (see the sketch after this list). In the weakly supervised setting the result is overall better than the backpropagation method, but the gap to fully supervised methods is still fairly large.
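
For the box-extraction step in item 4, here is a minimal sketch of the thresholding idea, assuming a threshold of about 20% of the CAM's maximum and taking the bounding box of the largest connected component (the exact threshold value and the use of scipy's connected-component labeling are my assumptions about implementation details, not copied from the paper's code):

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam, threshold_ratio=0.2):
    """Turn a 2D CAM (numpy array) into a single (x1, y1, x2, y2) box.

    Keep pixels above threshold_ratio * max(CAM), label connected components,
    and return the bounding box of the largest one.
    """
    mask = cam >= threshold_ratio * cam.max()
    labels, num = ndimage.label(mask)
    if num == 0:
        return None
    # Pick the label of the largest connected component (labels start at 1)
    sizes = ndimage.sum(mask, labels, index=range(1, num + 1))
    largest = 1 + int(np.argmax(sizes))
    ys, xs = np.where(labels == largest)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```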

Thoughts

The idea of this paper is quite good, but I don't think it can be used in my current framework: CAM is fairly restrictive, since it places requirements on the network structure, which is limiting for object detection networks. Grad-CAM, which builds the heat map from gradients, adapts to arbitrary architectures much better than CAM and seems more suitable for my current framework; I will write up that paper in the next post.

Original post: blog.csdn.net/qq_43812519/article/details/105777157