Learning Deep Features for Discriminative Localization: paper reading notes on CAM, a method that enables object localization with weakly supervised learning

Author: Cui Yunlong (18th-term president)

Date: 2020-09-11

Paper: Learning Deep Features for Discriminative Localization

Venue: CVPR 2016

1. Brief introduction:

  • Published at CVPR in 2016, this paper greatly inspired later research on weakly supervised learning.

  • The paper proposes class activation mapping (CAM), a technique for CNNs that use global average pooling (GAP). CAM allows such a network to perform both image classification and object localization without any bounding-box annotations.

  • In supervised learning, classification needs a dataset with category labels, while localization needs a dataset with bounding-box (BBox) labels; the network is trained by computing and optimizing the loss between predictions and ground truth.

  • Training a network to perform both classification and localization from a dataset that provides only classification labels is therefore a weakly supervised learning problem.

2. Problem statement

  1. Convolutional neural networks perform impressively on visual recognition tasks, and several studies have found that they also have a strong ability to localize objects.
  2. However, this localization ability is lost when fully connected (FC) layers are used for classification.
  3. The authors find that replacing the fully connected layers with GAP (global average pooling) preserves the network's ability to localize objects without hurting its classification accuracy, while using far fewer parameters than a fully connected network.
  4. The paper therefore proposes CAM (Class Activation Mapping), which, for a CNN that uses GAP, visualizes the evidence the network used for classification (the category-specific regions of the image) as a heat map over the original image, and uses this heat map as the basis for object localization.

3. Method

1. Class Activation Mapping CAM

CAM is mainly realized through GAP (Global Average Pooling).

Global average pooling (GAP):

GAP was not proposed in this paper; global average pooling first appeared in the paper Network in Network, mainly as a regularizer during training. Much later work adopted GAP, and experiments showed that it can indeed improve the performance of CNNs.
(figure)
Average pooling: a window slides over the feature map (much like a convolution window), and the mean of the values inside the window is taken as the output. This downsamples the feature map and helps reduce overfitting.
(figure)
Global average pooling (GAP): the mean is not taken over a sliding window but over an entire feature map, so each feature map outputs a single value.
(figure)
Generally, GAP sits after the last convolutional layer and is usually followed by the softmax layer.
In earlier convolutional networks, the convolutional and pooling layers (usually max pooling) were followed by one or more fully connected layers and finally a softmax classifier. The drawback is that the fully connected layers contain far too many parameters, which makes the model bloated and prone to overfitting.
Replacing the fully connected layers before softmax with GAP remedies these shortcomings. Concretely, the feature map output by the last convolutional layer has N channels; global average pooling over this feature map yields a vector of length N.
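The difference between windowed average pooling and GAP can be sketched in a few lines of numpy (a toy illustration, not from the paper; the channel count and values are made up):

```python
import numpy as np

# A made-up final conv output: N = 3 channels, each a 4x4 feature map.
feature_maps = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)

# Windowed average pooling (2x2 window, stride 2): downsamples each map to 2x2.
windowed = feature_maps.reshape(3, 2, 2, 2, 2).mean(axis=(2, 4))

# Global average pooling: one scalar per channel, giving a length-N vector
# that feeds the softmax classifier in place of the fully connected layers.
gap = feature_maps.mean(axis=(1, 2))

print(windowed.shape)  # (3, 2, 2)
print(gap.shape)       # (3,)
```

Note that GAP adds no parameters at all, whereas a fully connected layer over the flattened 3x4x4 maps would need a weight per input unit and output class.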

Formal definition:

The following focuses on how to generate a CAM through GAP. (CAM sometimes refers to the process, Class Activation Mapping, and sometimes to the result, a Class Activation Map; here it refers to the latter.)

The global average pooling layer outputs the spatial average of each unit's feature map in the last convolutional layer, and a weighted sum of these values produces the final output. Equivalently, the CAM is obtained by computing the same weighted sum directly over the feature maps of the last convolutional layer.

For a given image, f_k(x, y) denotes the activation at spatial position (x, y) of the k-th channel of the feature map output by the last convolutional layer. GAP then yields, for each channel,

F_k = ∑_{(x,y)} f_k(x, y)

(the paper omits the 1/(H·W) normalization, which does not affect the argument). For a given class c, the input to the softmax is

S_c = ∑_k w_k^c F_k

where w_k^c is the weight connecting the k-th GAP unit to class c and indicates the importance of F_k for class c. The softmax output for class c is then exp(S_c) / ∑_{c'} exp(S_{c'}).

Using the formulas above, we can expand S_c as follows:

S_c = ∑_k w_k^c F_k = ∑_k w_k^c ∑_{(x,y)} f_k(x, y) = ∑_{(x,y)} ∑_k w_k^c f_k(x, y)

We define the CAM for a class c as:

M_c(x, y) = ∑_k w_k^c f_k(x, y)

so that S_c = ∑_{(x,y)} M_c(x, y); that is, M_c(x, y) directly indicates how important the activation at (x, y) is for classifying the image as class c.

Simply put:
M=w_1·f_1+w_2·f_2+...+w_n·f_n

Let w_k denote the weight connecting the k-th output of the GAP layer to the output node corresponding to the predicted image category.
(figure)
The calculation process is shown in the figure.
Finally, the generated CAM is upsampled to the size of the original image.
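The whole pipeline above (GAP, class score, CAM, upsampling) can be sketched in numpy. This is a toy illustration with random feature maps and weights, not the authors' code; the sizes (4 channels, 7x7 maps, 10 classes, 224x224 input) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

K, H, W = 4, 7, 7            # channels / spatial size of the last conv layer (made up)
f = rng.random((K, H, W))    # f_k(x, y): last-layer feature maps
w = rng.random((10, K))      # w_k^c: GAP-to-softmax weights for 10 classes

F = f.mean(axis=(1, 2))      # GAP (mean; the paper's unnormalized sum only rescales S_c)
scores = w @ F               # S_c = sum_k w_k^c F_k, one score per class
c = int(scores.argmax())     # predicted class

# CAM for class c: M_c(x, y) = sum_k w_k^c f_k(x, y)
cam = np.tensordot(w[c], f, axes=1)   # shape (H, W)

# Sanity check: the spatial mean of the CAM recovers the class score.
assert np.isclose(cam.mean(), scores[c])

# Nearest-neighbor upsampling of the CAM to the input size (224 = 32 * 7).
cam_up = cam.repeat(32, axis=0).repeat(32, axis=1)
print(cam.shape, cam_up.shape)
```

In a real network the upsampling is usually bilinear, and `w[c]` comes from the trained GAP-to-softmax weight matrix.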
(figure)
The figure above shows the CAMs of two image classes; they highlight the discriminative regions used for classification, such as the head of the Chow Chow and the barbell.
(figure)
The figure above shows the top-5 CAMs for a single image.

Why use GAP instead of GMP?

Because GMP encourages the network to focus on a single discriminative part, whereas GAP encourages it to identify the full extent of the object. The experiments on the ILSVRC dataset verify this idea: GMP's classification performance is comparable to GAP's, but GAP's localization ability is stronger than GMP's.
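One way to see this intuition is to compare how the gradient of a GAP output and of a GMP output distributes over a feature map (a hand-computed numpy sketch for illustration; the map values are made up):

```python
import numpy as np

# A toy 3x3 feature map (values made up for illustration).
f = np.array([[0.1, 0.9, 0.2],
              [0.3, 0.5, 0.4],
              [0.2, 0.1, 0.3]])

# Gradient of the GAP output w.r.t. each activation is uniform (1 / (H*W)):
# every location receives a training signal, so the network is pushed to
# raise activations over the whole extent of the object.
gap_grad = np.full_like(f, 1.0 / f.size)

# Gradient of the GMP output is 1 at the argmax and 0 elsewhere: only the
# single most discriminative location is updated.
gmp_grad = np.zeros_like(f)
gmp_grad[np.unravel_index(f.argmax(), f.shape)] = 1.0

print((gap_grad > 0).sum(), (gmp_grad > 0).sum())  # all 9 locations vs. 1
```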

2. Experiments on weakly supervised object localization

Setup:

Taking AlexNet, VGGnet and GoogLeNet as backbone architectures, a convolutional layer, GAP and a softmax layer are appended to each, and some convolutional layers are added or removed as appropriate (the authors found that localization works better when the last convolutional layer before GAP has a higher spatial resolution). This yields three networks: AlexNet-GAP, VGGnet-GAP and GoogLeNet-GAP.

Experimental results: evaluating the localization ability of CAM

Classification:

As shown in the figure below, except for AlexNet, each *-GAP network loses only 1-2% classification accuracy compared with its original counterpart. Since AlexNet-GAP's accuracy drops considerably more, an AlexNet*-GAP network is constructed by adding two convolutional layers to the original AlexNet instead of one; its classification accuracy is then roughly on par with the original AlexNet.

(figure)

Localization:

To perform localization, we need to produce a bounding box along with its object category. To generate a bounding box from a CAM, the heat map is segmented with simple thresholding: first, the regions whose values are above 20% of the CAM's maximum are segmented out, and then a bounding box is drawn around the largest connected component of this segmentation.
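The thresholding procedure can be sketched as follows (a simplified pure-Python illustration of the described steps; `cam_to_bbox` and the toy CAM are made up for these notes):

```python
import numpy as np
from collections import deque

def cam_to_bbox(cam, thresh_ratio=0.2):
    """Threshold the CAM at 20% of its max and return the bounding box
    (min_row, min_col, max_row, max_col) of the largest connected component."""
    mask = cam >= thresh_ratio * cam.max()
    seen = np.zeros_like(mask, dtype=bool)
    best = []
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and not seen[i, j]:
                # BFS over 4-connected foreground pixels to collect one component.
                comp, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return min(ys), min(xs), max(ys), max(xs)

cam = np.zeros((8, 8))
cam[2:5, 3:7] = 1.0   # a bright blob (the object)
cam[7, 0] = 0.3       # a small stray activation, ignored as a smaller component
print(cam_to_bbox(cam))  # (2, 3, 4, 6)
```

In practice the connected-component step is usually done with a library routine (e.g. `scipy.ndimage.label`) on the upsampled CAM rather than a hand-rolled BFS.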

The results are shown in the two tables below. The accuracy of GoogLeNet-GAP is close to that of the fully supervised AlexNet (37.1 vs 34.2 top-5 localization error), but the gap to the corresponding fully supervised networks is still large, so there is still a long way to go for weakly supervised localization.


(table)

(table)
CAMs and saliency maps of two images under different networks.
(figure)
Examples of localization for four images with the GoogLeNet-GAP model; ground-truth boxes are green, predicted boxes are red.

(figure)
Localization using the CAM (top two) compared with localization using a saliency map; ground-truth boxes are green, predicted boxes are red.

3. Generality of the deep features for localization

Conclusion: even on unfamiliar datasets (not seen during training), the network can localize objects, and the features it extracts can be used for classification on other datasets. The results are shown below. The feature-extraction ability of the plain GoogLeNet and that of GoogLeNet-GAP are similar, which shows that *-GAP does not reduce feature quality (though it does not improve it either).
(figure)
Now consider its localization ability. Some experimental results are shown below: even though the network was not trained on these datasets, it can still localize the salient feature regions of the objects in the images.
(figure)

Conclusion

Summary:
With CAM, the paper localizes the object to be recognized without using any location annotations. This technique has inspired a great deal of weakly supervised learning research.
Disadvantages:
CAM tends to find only the most salient feature regions of an object, such as a dog's head, so the localization often covers only part of the object. This remains a pressing problem for weakly supervised methods.


Source: blog.csdn.net/cyl_csdn_1/article/details/109061460