Weakly supervised learning series: Attention-Based Dropout Layer for Weakly Supervised Single Object Localization and Semantic Segmentation

Papers: CVPR 2019 & TPAMI 2020

https://arxiv.org/abs/1908.10028

Code:

TensorFlow version: https://github.com/junsukchoe/ADL

PyTorch version (implemented in the WSOL evaluation paper): https://github.com/clovaai/wsolevaluation

Table of contents

1. The problem that the article wants to solve

2. Basic ideas

3. Basic approach

4. Advantages and disadvantages of this article


1. The problem that the article wants to solve

        The main problem this paper addresses is a typical issue in weakly supervised object localization (WSOL): the localizer tends to attend only to the most discriminative part of the target object, rather than covering its full extent.

2. Basic ideas

        As shown in Figure 1, this paper mainly includes two key designs:

  1. Blocking the most discriminative regions during training forces the model to focus on less discriminative regions of the object;
  2. Highlighting informative regions improves the recognition ability of the model.

Figure 1. Algorithm architecture. The drop mask blocks the most discriminative area in the attention map, while the importance map highlights the informative areas. During training, either the drop mask or the importance map is randomly selected as the final map and multiplied element-wise with the input feature map.

3. Basic approach

        The model design in this paper is very simple and can basically be understood at a glance from Figure 1 above. Take any classification network (VGG, ResNet, MobileNet V1 and Inception-V3 are tried in the paper) and insert the ADL module after selected feature maps.
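
As a concrete illustration, here is a minimal PyTorch sketch of inserting ADL modules into a VGG-style feature extractor. The helper name, layer indices, and default hyperparameters are this post's own illustrative choices, not the paper's; the `ADL` class it uses is sketched at the end of this section, and the official implementation at https://github.com/clovaai/wsolevaluation picks specific insertion points per backbone.

```python
import torch.nn as nn
import torchvision

# Hypothetical helper: rebuild a VGG feature extractor with ADL modules
# inserted after selected layers. The indices below are illustrative only.
def add_adl_to_vgg_features(features: nn.Sequential,
                            insert_after=(23, 30),
                            drop_rate=0.75, gamma=0.8) -> nn.Sequential:
    layers = []
    for idx, layer in enumerate(features):
        layers.append(layer)
        if idx in insert_after:
            layers.append(ADL(drop_rate=drop_rate, gamma=gamma))  # ADL is sketched below
    return nn.Sequential(*layers)

# Usage sketch: wrap the feature extractor of a torchvision VGG16.
backbone = torchvision.models.vgg16(weights=None)
backbone.features = add_adl_to_vgg_features(backbone.features)
```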

The input of the ADL module is the feature map output by the i-th layer, F_i \in \mathbb{R}^{h_i \times w_i \times c_i}.

First, average pooling across channels yields an attention map A_i \in \mathbb{R}^{h_i \times w_i}. From the attention map, the drop mask is obtained by thresholding at \gamma \cdot \max(A_i): positions whose attention exceeds the threshold are set to 0, and all other positions are set to 1. The hyperparameter \gamma can be tuned for performance; for example, in the original paper VGG and Inception-V3 use 0.8, ResNet uses 0.9, and MobileNet uses 0.95.
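
A small sketch of this step in PyTorch, assuming a feature map tensor of shape (B, C, H, W) and a scalar `gamma` (the function name and shapes are this post's assumptions, not the paper's notation):

```python
import torch

def compute_drop_mask(feat: torch.Tensor, gamma: float = 0.8) -> torch.Tensor:
    """Channel-wise average pooling followed by thresholding, as described above."""
    attention = feat.mean(dim=1, keepdim=True)           # attention map, shape (B, 1, H, W)
    max_val = attention.amax(dim=(2, 3), keepdim=True)   # per-sample maximum attention intensity
    return (attention < gamma * max_val).float()         # 0 above the threshold, 1 elsewhere

# Toy check with a random feature map.
mask = compute_drop_mask(torch.randn(2, 512, 14, 14), gamma=0.8)
```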

 At the same time, the importance map is obtained by applying a sigmoid function to the attention map. The sigmoid is a reasonable choice here: on the one hand it suppresses smaller values in the attention map (pushing them toward 0), and on the other it caps excessively large values (pushing them toward 1).
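
Continuing the same sketch, the importance map is just the sigmoid of the channel-averaged attention map:

```python
import torch

def compute_importance_map(feat: torch.Tensor) -> torch.Tensor:
    """Sigmoid-normalized attention map, squashed into (0, 1)."""
    attention = feat.mean(dim=1, keepdim=True)   # (B, 1, H, W)
    return torch.sigmoid(attention)
```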

After the drop mask and importance map are obtained, one of them is randomly selected as the final map at each training step and multiplied spatially (broadcast over channels) with the feature map F_i \in \mathbb{R}^{h_i \times w_i \times c_i}; the output is then used as the input to the next layer of the network. The random selection is controlled by the hyperparameter drop_rate, which is set to 75% in the original paper, meaning the drop mask is chosen as the final map with probability 0.75.
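
Putting the pieces together, below is a minimal ADL module sketch in PyTorch based on the description above. It is an illustration rather than the official implementation (see https://github.com/clovaai/wsolevaluation for that); it assumes, as in the paper, that ADL is applied only during training and simply passes features through at evaluation time.

```python
import torch
import torch.nn as nn

class ADL(nn.Module):
    """Minimal sketch of the Attention-based Dropout Layer described above."""

    def __init__(self, drop_rate: float = 0.75, gamma: float = 0.8):
        super().__init__()
        self.drop_rate = drop_rate   # probability of picking the drop mask
        self.gamma = gamma           # threshold ratio relative to the attention maximum

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Assumed behavior: ADL is only active during training.
        if not self.training:
            return feat

        # Attention map: channel-wise average pooling -> (B, 1, H, W)
        attention = feat.mean(dim=1, keepdim=True)

        # Drop mask: zero out positions above gamma * max(attention)
        max_val = attention.amax(dim=(2, 3), keepdim=True)
        drop_mask = (attention < self.gamma * max_val).float()

        # Importance map: sigmoid-normalized attention
        importance_map = torch.sigmoid(attention)

        # Randomly pick one map and re-weight the feature map spatially
        # (the single-channel map broadcasts over the channel dimension).
        selected = drop_mask if torch.rand(1).item() < self.drop_rate else importance_map
        return feat * selected
```

In this sketch one map is sampled per forward pass for the whole batch; details such as per-sample sampling may differ in the official code.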

Source: blog.csdn.net/yangyehuisw/article/details/123847820