Original papers: CVPR 2019 & TPAMI 2020
Paper link: https://arxiv.org/abs/1908.10028
Code link: TensorFlow version https://github.com/junsukchoe/ADL
Table of contents
1. The problem the paper aims to solve
2. Basic ideas
3. Basic approach
4. Advantages and disadvantages of the paper
1. The problem the paper aims to solve
This paper addresses a typical problem in weakly supervised object localization (WSOL): the localizer tends to attend only to the most discriminative part of the target object rather than its full extent.
2. Basic ideas
As shown in Figure 1, this paper mainly includes two key designs:
- Dropping (hiding) the most discriminative region during training forces the model to attend to the less discriminative regions of the object;
- Highlighting informative regions to improve the model's recognition ability.
3. Basic approach
The model design is very simple and can essentially be read off Figure 1 above: take any classification backbone (the paper tries VGG, ResNet, MobileNet V1, and Inception-V3) and insert an ADL module after each selected feature map.
The input to the ADL module is the feature map produced by the i-th layer (a tensor with C channels and H×W spatial resolution).
First, channel-wise average pooling is applied to obtain a 2D attention map of size H×W. From this attention map, a drop mask is computed by thresholding at γ × max(attention map): positions whose attention value exceeds the threshold are set to 0, and all others to 1. The drop threshold γ is a hyperparameter that can be tuned per backbone; in the paper, VGG and Inception-V3 use 0.8, ResNet uses 0.9, and MobileNet uses 0.95.
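To make this step concrete, here is a minimal PyTorch-style sketch (the official code linked above is TensorFlow; the tensor shapes and the value of γ below are purely illustrative):

```python
import torch

# Toy feature map: batch of 2 samples, 512 channels, 14x14 spatial resolution.
feat = torch.randn(2, 512, 14, 14)

# Channel-wise average pooling -> one attention value per spatial position.
attention = feat.mean(dim=1, keepdim=True)                    # shape (2, 1, 14, 14)

# Drop mask: positions whose attention exceeds gamma * max(attention) are set
# to 0 (dropped); all other positions are set to 1 (kept).
gamma = 0.9                                                   # drop threshold, backbone-dependent
threshold = gamma * attention.amax(dim=(2, 3), keepdim=True)  # per-sample maximum
drop_mask = (attention <= threshold).float()                  # shape (2, 1, 14, 14)
```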
In parallel, an importance map is obtained by applying the sigmoid function to the attention map. The sigmoid is a reasonable choice here: it squashes small attention values toward 0 while capping overly large values near 1.
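Just to see the squashing behavior concretely (a throwaway check, not part of the model code):

```python
import torch

# The importance map is simply the sigmoid of the 2D attention map:
#   importance_map = torch.sigmoid(attention)   # same H x W shape as the attention map
# Sigmoid pushes small values toward 0 and saturates large values toward 1:
print(torch.sigmoid(torch.tensor([-3.0, 0.0, 3.0, 10.0])))
# tensor([0.0474, 0.5000, 0.9526, 1.0000])
```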
After the drop mask and the importance map are obtained, one of them is randomly selected at each training step as the final map and multiplied element-wise with the feature map; the result is then fed into the next layer of the network. The random selection is controlled by the hyperparameter drop_rate, set to 75% in the paper, i.e., the drop mask is chosen as the final map with probability 0.75.
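Putting the pieces together, a minimal PyTorch-style sketch of the whole module might look like the following. This is only an illustration of the mechanism described above, not the authors' TensorFlow implementation: the class name, default hyperparameters, the identity behavior at inference time, and the backbone placement in the usage example are all assumptions of this sketch.

```python
import torch
import torch.nn as nn


class ADL(nn.Module):
    """Attention-based Dropout Layer: a minimal sketch of the mechanism described above.

    drop_rate: probability of picking the drop mask instead of the importance map.
    gamma:     drop threshold, as a fraction of the per-sample attention maximum.
    """

    def __init__(self, drop_rate=0.75, gamma=0.9):
        super().__init__()
        self.drop_rate = drop_rate
        self.gamma = gamma

    def forward(self, x):                                    # x: (N, C, H, W) feature map
        if not self.training:
            return x                                         # assumption: identity at inference time

        # Attention map: channel-wise average pooling.
        attention = x.mean(dim=1, keepdim=True)              # (N, 1, H, W)

        # Importance map: sigmoid of the attention map.
        importance_map = torch.sigmoid(attention)

        # Drop mask: erase positions above gamma * max(attention).
        threshold = self.gamma * attention.amax(dim=(2, 3), keepdim=True)
        drop_mask = (attention <= threshold).to(x.dtype)

        # Randomly pick one of the two maps (drop mask with probability drop_rate).
        selected = drop_mask if torch.rand(1).item() < self.drop_rate else importance_map

        # Apply the selected map to the feature map and pass the result to the next layer.
        return x * selected


# Usage sketch: interleave ADL after selected feature maps of a backbone
# (the placement and hyperparameters here are illustrative only).
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    ADL(drop_rate=0.75, gamma=0.8),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    ADL(drop_rate=0.75, gamma=0.8),
)
```

Note that this sketch draws the random choice once per forward pass; whether the official code draws it per mini-batch or per sample may differ, so treat it only as an outline of the idea.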