CAM paper translation

Original paper: Learning Deep Features for Discriminative Localization

Abstract

  In this work, we revisit the global average pooling layer proposed in [13] and clarify how it explicitly enables a convolutional neural network to have remarkable localization ability despite being trained only on image-level labels. While this technique was previously proposed as a means of regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we achieve a top-5 error rate of 37.1% on the ILSVRC 2014 object localization task, which is remarkably close to the 34.2% top-5 error rate achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions even though it was not trained for that task.

1 Introduction

  Recent research by Zhou et al. [33] shows that the convolutional units in each layer of convolutional neural networks (CNNs) can actually serve as object detectors, although no supervisory information about the location of the object is provided. Although convolutional layers have significant ability in localizing objects, this ability is lost when using fully connected layers for classification. Recently, some popular fully convolutional neural networks, such as Network-in-Network (NIN) [13] and GoogLeNet [24], were proposed to avoid using fully connected layers, thus reducing the number of parameters while maintaining high performance.
  To achieve this, [13] uses global average pooling as a structural regularizer that prevents overfitting during training. In our experiments, we found that the advantages of this global average pooling layer extend beyond simply acting as a regularizer: in fact, with a little tweaking, the network can retain its remarkable localization ability until the final layer. This tweaking allows the network to easily identify the discriminative image regions in a single forward pass for a wide variety of tasks, even those the network was not originally trained for. As shown in Figure 1(a), a convolutional neural network trained on object classification is successfully able to localize the discriminative regions for action classification as the objects the humans are interacting with, rather than the humans themselves.

Figure 1: With a simple modification of the global average pooling layer, combined with our class activation mapping (CAM) technique, a classification-trained CNN can both classify the image and localize class-specific image regions in a single forward pass, e.g., the toothbrush for brushing teeth and the chainsaw for cutting trees.

  Despite the apparent simplicity of our approach, our best network achieves a top-5 test error of 37.1% on the weakly supervised object localization task of the ILSVRC benchmark, which is remarkably close to the 34.2% top-5 test error achieved by the fully supervised AlexNet [10]. Furthermore, we demonstrate that the localizability of the deep features in our approach can be easily transferred to other recognition datasets for generic classification, localization, and concept discovery.

1.1. Related work

  Convolutional neural networks (CNNs) have achieved impressive performance on various visual recognition tasks [10, 34, 8]. Recent studies have shown that CNNs have significant ability to localize objects despite being trained on image-level labels [1, 16, 2, 15]. In this work, we show that by using the right architecture, we can generalize this ability beyond just locating objects and start to identify exactly which regions in the image are used for discrimination. Here, we discuss the two research directions most relevant to this paper: weakly supervised object localization and visualizing internal representations of CNNs.
   Weakly supervised object localization: There have been a number of recent studies exploring weakly supervised object localization with CNNs [1, 16, 2, 15]. Bergamo et al. [1] propose a self-taught object localization technique that involves masking out image regions to identify the regions causing the maximal activations in order to localize objects. Cinbis et al. [2] combine multiple-instance learning with CNN features for object localization. Oquab et al. [15] propose a method for transferring mid-level image representations and show that some object localization can be achieved by evaluating the output of a CNN on multiple overlapping patches. However, these authors do not actually evaluate the localization ability. While these methods yield promising results, they are not trained end-to-end and require multiple forward passes through the network to localize objects, which makes them difficult to scale to real-world datasets. Our approach is trained end-to-end and can localize objects in a single forward pass.
   The method most similar to ours is the work of Oquab et al. [16] based on global max pooling. Instead of using global average pooling, they apply global max pooling to locate a point on the object. However, their localization is limited to a point located at the boundary of the object rather than determining the complete extent of the object. We believe that while the max and mean functions are quite similar, using average pooling encourages the network to recognize the full range of objects. The basic intuition behind this is that compared to max pooling, the loss of average pooling is more beneficial when the network recognizes all discriminative regions of the object. This is explained in more detail in Section 3.2 and confirmed experimentally. Furthermore, unlike [16], we show that this localization capability is general and can be observed even for problems for which the network has not been trained.
   We use “class activation map” to refer to the weighted activation map generated for each image, as described in Section 2. We would like to emphasize that although global average pooling is not a new technique that we propose here, to the best of our knowledge, finding that it can be used for precise discriminative localization is our unique observation. We believe the simplicity of this technique makes it portable and can be applied to a variety of computer vision tasks, enabling fast and accurate localization.
   Visualizing CNNs: There have been a number of recent works [29, 14, 4, 33] that try to better understand CNNs by visualizing the internal representations they learn. Zeiler et al. [29] use deconvolutional networks to visualize the patterns that activate each unit. Zhou et al. [33] show that CNNs learn object detectors when trained to recognize scenes, and demonstrate that the same network can perform both scene recognition and object localization in a single forward pass. Both of these works only analyze the convolutional layers and ignore the fully connected layers, thus painting an incomplete picture of the full story. By removing the fully connected layers while retaining most of the performance, we are able to understand our network from beginning to end. Mahendran et al. [14] and Dosovitskiy et al. [4] analyze the visual encoding of CNNs by inverting deep features at different layers. While these approaches can invert the fully connected layers, they only show what information is preserved in the deep features, without highlighting the relative importance of this information. Unlike [14] and [4], our approach can highlight exactly which regions of an image are important for discrimination. Overall, our approach provides another perspective into the inner workings of CNNs.

2. Class activation mapping

  In this section, we describe the steps to generate class activation maps (CAMs) in CNNs using global average pooling (GAP). A class activation map for a specific category shows the discriminative image regions used by the CNN to identify that category (e.g., Figure 3). The steps to generate these mappings are shown in Figure 2.
Figure 3: Class activation maps (CAMs) for four categories from ILSVRC [20]. The maps highlight the discriminative image regions used for image classification, e.g., the head of the animal for briard and hen, the plates for barbell, and the bell for bell cote.

Figure 2: Class Activation Maps (CAMs): The predicted class scores are mapped back to the previous convolutional layer to generate class activation maps (CAMs). CAM highlights category-specific discriminative regions.

  We use a network architecture similar to Network in Network [13] and GoogLeNet [24]: the network mainly consists of convolutional layers, and just before the final output layer (a softmax layer in the case of classification), we perform global average pooling on the convolutional feature maps and use the result as features for a fully connected layer that produces the desired output (categorical or otherwise). Given this simple connectivity structure, we can identify the importance of image regions by projecting the weights of the output layer back onto the convolutional feature maps, a technique we call class activation mapping.
  As shown in Figure 2, global average pooling outputs the spatial average of the feature map of each unit of the last convolutional layer. A weighted sum of these values is used to generate the final output. Likewise, we compute a weighted sum of the feature maps of the last convolutional layer to obtain our class activation map. We describe the softmax case more formally below; the same technique can be applied to regression and other loss functions.
  For a given image, let $f_k(x, y)$ denote the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$. Then, for unit $k$, the result of performing global average pooling, $F_k$, is $\sum_{x,y} f_k(x, y)$. Thus, for a given class $c$, the input to the softmax, $S_c$, is $\sum_k w_k^c F_k$, where $w_k^c$ is the weight corresponding to class $c$ for unit $k$. Essentially, $w_k^c$ indicates the importance of $F_k$ for class $c$. Finally, the softmax output for class $c$, $P_c$, is given by $\frac{\exp(S_c)}{\sum_c \exp(S_c)}$. Here we ignore the bias term: we explicitly set the input bias of the softmax to 0, since it has little or no impact on the classification performance.
  By plugging $F_k = \sum_{x,y} f_k(x, y)$ into the class score $S_c$, we obtain
$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y). \quad\quad (1)$$
  We define $M_c$ as the class activation map for class $c$, where each spatial element is given by
$$M_c(x, y) = \sum_k w_k^c f_k(x, y). \quad\quad (2)$$
  Thus, $S_c = \sum_{x,y} M_c(x, y)$, and hence $M_c(x, y)$ directly indicates the importance of the activation at spatial location $(x, y)$ in classifying the image as class $c$.
  Intuitively, based on previous studies [33, 29], we expect each unit to be activated by some visual pattern within its receptive field. Thus $f_k$ is a map of the presence of this visual pattern. The class activation map is simply a weighted linear sum of the presence of these visual patterns at different spatial locations. By simply upsampling the class activation map to the size of the input image, we can identify the image regions most relevant to the particular class.
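
To make Eq. (2) concrete, here is a minimal sketch (not the authors' code) of how a CAM could be computed in PyTorch. It assumes `feature_maps` holds the last convolutional layer's activations for a single image and `fc_weights` holds the weights of the fully connected layer that follows GAP; the function name, the bilinear upsampling, and the output size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def compute_cam(feature_maps: torch.Tensor, fc_weights: torch.Tensor,
                class_idx: int, output_size=(224, 224)) -> torch.Tensor:
    """Class activation map M_c(x, y) = sum_k w_k^c * f_k(x, y), as in Eq. (2).

    feature_maps: (K, H, W) activations of the last conv layer for one image.
    fc_weights:   (C, K) weights of the fully connected layer after GAP.
    class_idx:    index c of the class to visualize.
    """
    w_c = fc_weights[class_idx]                           # (K,)
    cam = torch.einsum('k,khw->hw', w_c, feature_maps)    # weighted sum over units
    # Upsample to the input image size to identify the relevant image regions.
    cam = F.interpolate(cam[None, None], size=output_size,
                        mode='bilinear', align_corners=False)[0, 0]
    return cam
```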
  In Figure 3, we show some examples of class activation maps (CAMs) output using the above method. We can see that the distinctive areas of the various categories of images are highlighted. In Figure 4, we highlight the differences in CAMs for a single image when mapping is generated using different categories c. We observe that even for a given image, the discriminative regions are different for different categories. This shows that our method achieves the expected results. We will show this quantitatively in the next section.

Figure 4: Examples of class activation maps (CAMs) generated from the top-5 predicted classes of a given image whose ground-truth class is "dome". The predicted class and its score are shown above each class activation map. We observe that the highlighted regions vary across the predicted classes, e.g., "dome" activates the upper round part while "palace" activates the lower flat part of the building.

   **Global average pooling (GAP) vs. global max pooling (GMP):** Given the prior work [16] on using GMP for weakly supervised object localization, we believe it is important to highlight the intuitive difference between GAP and GMP. We argue that the GAP loss encourages the network to identify the full extent of an object, whereas GMP, in contrast, encourages it to identify just one discriminative part. This is because, when averaging a feature map, the value can be maximized by finding all discriminative parts of an object, since all low activations reduce the output of that particular map. For GMP, on the other hand, low scores in all image regions except the most discriminative one do not affect the score, since only the max is taken. We verify this experimentally on the ILSVRC dataset in Section 3: while GMP achieves classification performance similar to GAP, GAP outperforms GMP for localization.
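
The intuition above can be made concrete with a toy example (the numbers are made up for illustration and are not from the paper): two 7 × 7 activation maps with the same peak value, one firing on a single discriminative part and one firing over the full extent of an object, are indistinguishable under global max pooling but receive very different scores under global average pooling.

```python
import torch

peak_only = torch.zeros(7, 7)
peak_only[3, 3] = 1.0            # only one discriminative part fires

full_extent = torch.zeros(7, 7)
full_extent[2:5, 1:6] = 1.0      # the whole extent of the object fires

print(peak_only.max().item(), full_extent.max().item())    # 1.0 vs 1.0    -> GMP cannot tell them apart
print(peak_only.mean().item(), full_extent.mean().item())  # ~0.02 vs ~0.31 -> GAP rewards full coverage
```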

3. Weakly supervised object localization

   In this section, we evaluate the localization capabilities of CAM when trained on the ILSVRC 2014 benchmark dataset [20]. The experimental setup and various CNNs used are first described in Section 3.1. Then, in Section 3.2, we verify that our technique does not adversely affect classification performance when learning localization and provide detailed results on weakly supervised object localization.

3.1. Experimental setup

   For our experiments, we evaluate the impact of using CAM on the following popular CNNs: AlexNet [10], VGGnet [23] and GoogLeNet [24]. In general, for each of these networks, we remove the fully connected layers before the final output and replace them with global average pooling (GAP) followed by a fully connected softmax layer.
   We find that the localization ability of the network is improved when the last convolutional layer before global average pooling (GAP) has a higher spatial resolution, which we call mapping resolution. To do this, we removed several convolutional layers from some networks. Specifically, we made the following modifications: for AlexNet, we removed the layers after conv5 (i.e., from pool5 to prob), resulting in a mapping resolution of 13 × 13. For VGGnet, we remove the layers after conv5-3 (i.e. from pool5 to prob), resulting in a mapping resolution of 14 × 14. For GoogLeNet, we removed the layers after inception4e (i.e. from pool4 to prob), resulting in a mapping resolution of 14 × 14. For each of the above networks, we added a convolutional layer with 1024 units of size 3 × 3, stride 1, and padding 1, followed by a GAP layer and a softmax layer. We then fine-tuned these networks for 1000 categories of object classification using 1.3 million training images from ILSVRC [20], resulting in our final networks AlexNet-GAP, VGGnet-GAP and GoogLeNet-GAP.
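
As an illustration of the kind of head modification described above, the sketch below builds a VGGnet-GAP-style network on top of torchvision's VGG16. The layer slicing and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGnetGAP(nn.Module):
    """Sketch of a VGGnet-GAP-style network (not the authors' code): keep the
    backbone up to conv5-3 (14x14 maps for 224x224 inputs), then add a 3x3 conv
    with 1024 units (stride 1, pad 1), global average pooling, and a classifier."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        vgg = models.vgg16(weights=None)
        # Drop pool5, the last MaxPool2d of the feature extractor.
        self.features = nn.Sequential(*list(vgg.features.children())[:-1])
        self.extra_conv = nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.fc = nn.Linear(1024, num_classes)    # its weights w_k^c are used for the CAM

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = torch.relu(self.extra_conv(self.features(x)))    # (N, 1024, 14, 14)
        return self.fc(self.gap(f).flatten(1))                # (N, num_classes)
```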
   For classification, we compare our approach against the original AlexNet [10], VGGnet [23] and GoogLeNet [24], and also provide results for Network in Network (NIN) [13]. For localization, we compare against the original GoogLeNet, NIN, and using backpropagation [22] instead of CAMs. Furthermore, to compare global average pooling against global max pooling, we also provide results for GoogLeNet trained with global max pooling (GoogLeNet-GMP).
   We use the same error metrics (top-1, top-5) as ILSVRC to evaluate the classification and localization performance of our network. For the classification task, we evaluate on the validation set of ILSVRC; while for the localization task, we evaluate on the validation set and test set.

3.2. Results

   We first report the results of object classification to demonstrate that our method does not significantly degrade the classification performance. We then show that our method is effective in weakly supervised object localization.
    Classification: Table 1 summarizes the classification performance of the original networks and our GAP networks. We find that in most cases there is a small performance drop of 1-2% when removing the additional layers from the various networks. We observe that AlexNet is the most affected by the removal of the fully connected layers. To compensate, we add two convolutional layers just before GAP, resulting in the AlexNet*-GAP network, which we find performs comparably to AlexNet. Thus, overall, our GAP networks largely retain their classification performance. Furthermore, we observe that GoogLeNet-GAP and GoogLeNet-GMP have similar classification performance, as expected. Note that it is important for the networks to perform well on classification in order to achieve high localization performance, since this involves identifying both the object category and the bounding box location accurately.

Table 1. Classification error rate on ILSVRC validation set

    Localization: In order to perform localization, we need to generate a bounding box and its associated object category. To generate bounding boxes from the CAMs, we use a simple thresholding technique to segment the heatmap. We first segment the regions whose value is above 20% of the maximum value of the CAM, and then take the bounding box that covers the largest connected component in the segmentation map. We do this for each of the top-5 predicted classes for the top-5 localization evaluation metric. Figure 6(a) shows some example bounding boxes generated using this technique. The localization performance on the ILSVRC validation set is shown in Table 2, with example outputs in Figure 5.
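
A minimal sketch of this thresholding heuristic (not the authors' exact code), assuming `cam` is a 2D array already upsampled to the image size:

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam: np.ndarray, threshold: float = 0.2):
    """Bounding box from a CAM: keep pixels above 20% of the CAM maximum and
    take the box covering the largest connected component, as described above."""
    mask = cam > threshold * cam.max()
    labeled, num = ndimage.label(mask)
    if num == 0:
        return None
    # Pick the connected component with the largest area.
    sizes = ndimage.sum(mask, labeled, index=range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labeled == largest)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())  # (x1, y1, x2, y2)
```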
Figure 6. a) Localization examples from GoogLeNet-GAP. b) Comparison of the localization from GoogLeNet-GAP (upper two) and the backpropagation method using AlexNet (lower two). Green denotes the ground-truth bounding boxes and red the predicted bounding boxes from the class activation maps.

Table 2. Localization error on the ILSVRC validation set. Backprop refers to using [22] for localization instead of CAM.


Figure 5. Class activation maps from CNN-GAP and class-specific saliency maps from the backpropagation method.

  We observe that our GAP networks perform best among all the baseline approaches, with GoogLeNet-GAP achieving a top-5 localization error of 43%, which is remarkable given that this network was not trained on a single annotated bounding box. We observe that our CAM approach significantly outperforms the backpropagation approach of [22] (see Figure 6(b) for a comparison of the outputs). Furthermore, we observe that GoogLeNet-GAP significantly outperforms GoogLeNet on the localization task, even though the opposite holds for classification. We believe the low mapping resolution of GoogLeNet (7 × 7) prevents it from obtaining accurate localizations. Finally, we observe that GoogLeNet-GAP performs significantly better than GoogLeNet-GMP on the localization task, illustrating the importance of global average pooling over global max pooling for identifying the extent of objects.


  To further compare our approach with existing weakly supervised [22] and fully supervised [24, 21, 24] CNN methods, we evaluate the performance of GoogLeNet-GAP on the ILSVRC test set. We adopt a slightly different bounding box selection strategy here: we select two bounding boxes (one tight and one loose) from the class activation maps of the first- and second-ranked predicted classes, and one loose bounding box from the third-ranked predicted class. We find that this heuristic helps improve performance on the validation set. The performance summary is shown in Table 3. GoogLeNet-GAP with this heuristic achieves a top-5 error rate of 37.1% in the weakly supervised setting, which is surprisingly close to the top-5 error rate of AlexNet (34.2%) in the fully supervised setting. While impressive, we still have a long way to go in terms of localization when compared to fully supervised networks with the same architecture (i.e., weakly supervised GoogLeNet-GAP vs. fully supervised GoogLeNet).

Table 3. Localization errors of different weakly supervised and fully supervised methods on the ILSVRC test set.


4. Deep features for generic localization

  The responses of the higher layers of CNNs (e.g., fc6 and fc7 of AlexNet) have been shown to be very effective generic features, with state-of-the-art performance on a variety of image datasets [3, 19, 34]. Here, we show that the features learned by our GAP CNNs also perform well as generic features and, as a bonus, identify the discriminative image regions used for categorization, even though the networks were not trained for those particular tasks. To obtain weights similar to those of the original softmax layer, we simply train a linear support vector machine (SVM) on the output of the GAP layer.
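
A sketch of this step under the stated setup: fit a linear SVM on the GAP-layer outputs and reuse its learned weights in place of the softmax weights $w_k^c$ of Eq. (2). The variable names and the scikit-learn choice are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_svm_and_cam(gap_features, labels, feature_maps, class_idx):
    """gap_features: (N, K) GAP outputs extracted from the GAP network.
    labels: (N,) dataset labels.
    feature_maps: (K, H, W) last-conv activations of the image to explain.
    class_idx: index into svm.classes_ of the class to visualize."""
    svm = LinearSVC(C=1.0).fit(gap_features, labels)
    w_c = svm.coef_[class_idx]                                # (K,) SVM weights play the role of w_k^c
    cam = np.tensordot(w_c, feature_maps, axes=([0], [0]))    # (H, W) weighted sum of feature maps
    return svm, cam
```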
  First, we will compare the performance of our method with some baseline methods on the following scene and object classification benchmarks: SUN397 [27], MIT Indoor67 [18], Scene15 [11], SUN Attribute [17], Caltech101 [6], Caltech256 [9], Stanford Action40 [28] and UIUC Event8 [12]. The experimental setup is the same as in [34]. In Table 5, we compare the performance of features from our best network GoogLeNet-GAP with AlexNet’s fc7 features and GoogLeNet’s ave pool features.

Table 5. Classification accuracy of different deep features on representative scene and object datasets.

  As expected, GoogLeNet-GAP and GoogLeNet perform significantly better than AlexNet. Furthermore, we observe that GoogLeNet-GAP and GoogLeNet perform similarly despite the former having fewer convolutional layers. Overall, we find that GoogLeNet-GAP features are competitive with the state-of-the-art as general-purpose visual features.
  More importantly, we wanted to explore whether the localization maps generated using our CAM technology and GoogLeNet-GAP are informative in this context. Figure 8 shows some example plots for various datasets. We observe that the most discriminative regions are highlighted in all datasets. Overall, our approach is effective for generating localizable deep features for general tasks.
Figure 8. Generic discriminative localization using our GoogLeNet-GAP deep features (which were trained to recognize objects). We show 2 images from each of 3 classes for 4 datasets, together with their class activation maps below them. We observe that the discriminative regions of the images are often highlighted, e.g., in Stanford Action40 the mop is localized for cleaning the floor, while for cooking the pan and bowl are localized; similar observations can be made on the other datasets. This demonstrates the generic localization ability of our deep features.

  In Section 4.1, we explore fine-grained recognition of birds and show how we evaluate the generic localization ability and use it to further improve performance. In Section 4.2, we show how GoogLeNet-GAP can be used to identify generic visual patterns in images.

4.1. Fine-grained recognition

  In this section, we apply our generic localizable deep features to identifying 200 bird species in the CUB-200-2011 [26] dataset. The dataset contains 11,788 images, of which 5,994 are used for training and 5,794 for testing. We chose this dataset because it also contains bounding box annotations, allowing us to evaluate our localization ability. Table 4 summarizes the results.

Table 4. Fine-grained classification performance on CUB200 dataset. GoogLeNet-GAP can successfully locate important image areas and improve classification performance.

  We find that when using the full image without any bounding box annotations, GoogLeNet-GAP performs comparably to existing approaches, achieving an accuracy of 63.0%. When using bounding box annotations, this accuracy increases to 70.5%. Now, given the localization ability of our network, we can use a method similar to Section 3.2 (i.e., the thresholding method) to first identify bird bounding boxes in both the training and test sets. GoogLeNet-GAP is then used again to extract features from the image regions inside the bounding boxes, for training and testing. We find that this improves the performance considerably, to 67.8%. This localization ability is particularly important for fine-grained recognition, since the distinctions between categories are subtle and having a more focused image crop allows for better discrimination.
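
The crop-and-retrain pipeline described above could be sketched as follows; it reuses the hypothetical `cam_to_bbox` helper from the Section 3.2 sketch, and `extract_features` is a placeholder for running GoogLeNet-GAP on an image crop, not part of the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def crop_by_cam(image: np.ndarray, cam: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Crop the image to the CAM-derived bounding box (cam_to_bbox as sketched in Sec. 3.2)."""
    bbox = cam_to_bbox(cam, threshold)
    if bbox is None:                      # nothing above threshold: fall back to the full image
        return image
    x1, y1, x2, y2 = bbox
    return image[y1:y2 + 1, x1:x2 + 1]

def finegrained_svm(extract_features, train_data, test_data):
    """train_data: list of (image, cam, label); test_data: list of (image, cam).
    extract_features maps a cropped image to its GAP feature vector."""
    X_train = np.stack([extract_features(crop_by_cam(img, cam)) for img, cam, _ in train_data])
    y_train = [label for _, _, label in train_data]
    svm = LinearSVC().fit(X_train, y_train)
    X_test = np.stack([extract_features(crop_by_cam(img, cam)) for img, cam in test_data])
    return svm.predict(X_test)
```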
  In addition, we find that GoogLeNet-GAP accurately localizes the bird in 41.0% of the images under the 0.5 intersection over union (IoU) criterion, compared to a chance performance of 5.5%. We show some examples in Figure 7. This further validates the localization ability of our approach.
Figure 7. Class activation maps (CAM) and inferred bounding boxes (red) for selected images of four bird species in the CUB200 dataset. In Section 4.1, we quantitatively evaluate the quality of bounding boxes (41.0% accuracy at 0.5 IoU). We found that extracting GoogLeNet-GAP features and retraining the SVM in these CAM bounding boxes improved bird class classification accuracy by approximately 5% (see Table 4).

4.2. Pattern discovery

  In this section, we explore whether our technique can identify common elements or patterns in images that go beyond objects, such as text or high-level concepts. Given a set of images that contain a common concept, we want to identify the regions that our network identifies as important and determine whether this corresponds to the input pattern. We adopt a similar approach to the previous one: train a linear SVM on the GAP layer of the GoogLeNet-GAP network, and apply CAM technology to identify important regions. We conduct three pattern discovery experiments using our deep features. The results are summarized below. Note that in this case we have no partitioning of training and test datasets - we are just using our CNN for visual pattern discovery.
   Discovering informative objects in scenes: We select 10 scene categories from the SUN dataset [27] that contain at least 200 fully annotated images, for a total of 4675 fully annotated images. We train a one-vs-all linear SVM for each scene category and compute the CAMs using the weights of the linear SVM. In Figure 9, we plot the CAM for the predicted scene category and list the top 6 objects that most frequently overlap with the high-activation regions for two scene categories. We observe that the high-activation regions frequently correspond to objects indicative of the particular scene category.
Figure 9. Information-rich objects for two scene categories. For the Restaurant and Bathroom categories, we show an example of the original image (top), along with a list of the 6 most frequent objects in that scene category, and how often they appear. The bottom shows the CAM and a list of the 6 objects that most frequently overlap with high activation areas.
   Concept localization in weakly labeled images: Using the hard-negative mining algorithm of [32], we learn concept detectors and apply our CAM technique to localize the concepts in images. To train a concept detector for a phrase, the positive set consists of images whose text caption contains the phrase, while the negative set consists of randomly selected images without any relevant words in their caption. In Figure 10, we show the top-ranked images and CAMs for two concept detectors. Note that the CAM localizes the informative regions for these concepts, even though the phrases are much more abstract than typical object names.
Figure 10. Information regions for concepts learned from weakly labeled images. Although these concepts are quite abstract, our GoogLeNet-GAP network is able to fully localize these concepts.

   Weakly supervised text detector: We train a weakly supervised text detector using 350 Google StreetView images containing text from the SVT dataset [25] as the positive set and randomly sampled outdoor scene images from the SUN dataset [27] as the negative set. As shown in Figure 11, our approach highlights the text accurately without using bounding box annotations.
Figure 11. Learning a weakly supervised text detector. Although our network was not trained on text or any bounding box annotations, it was able to accurately detect text on images.

   Interpreting visual question answering: We use our approach and localizable deep features in a baseline model for visual question answering, which achieves an overall accuracy of 55.89% on the test set of the Open-Ended track. As shown in Figure 12, our approach highlights the image regions relevant to the predicted answers.
Figure 12: Example of image areas highlighted by predicted answer categories in visual question answering.

5. Visualizing class-specific units

  Zhou et al. [33] have shown that the convolutional units of the various layers of a CNN act as visual concept detectors, identifying low-level concepts such as textures or materials as well as high-level concepts such as objects or scenes. Deeper into the network, the units become increasingly discriminative. However, given the fully connected layers in many networks, it is difficult to identify the importance of different units for identifying different classes. Here, using global average pooling and the ranked softmax weights, we can directly visualize the units that are most discriminative for a given class. We call them the class-specific units of a CNN. Figure 13 shows the class-specific units for AlexNet*-GAP trained on the ILSVRC dataset for object recognition (top) and on the Places database for scene recognition (bottom). We follow a procedure similar to [33] for estimating the receptive fields and segmenting the top activation images of each unit in the final convolutional layer. We then simply use the softmax weights to rank the units for a given class. From the figure we can identify the object parts that are most discriminative for classification, and exactly which units detect these parts. For example, the unit detecting dog faces and body fur is important for lakeland terrier; the unit detecting sofas, tables, and fireplaces is important for living room. Thus we can infer that the CNN actually learns a bag of words, where each word is a discriminative class-specific unit. A combination of these class-specific units guides the CNN in classifying each image.
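
Under the setup described above, ranking the class-specific units is simply a matter of sorting the channels of the last convolutional layer by their softmax weight for the class of interest. A minimal sketch with illustrative names:

```python
import torch

def top_class_specific_units(fc_weights: torch.Tensor, class_idx: int, top_k: int = 5):
    """Rank the units (channels) of the last conv layer for one class by their
    softmax weight w_k^c. fc_weights has shape (C, K)."""
    scores, unit_ids = torch.topk(fc_weights[class_idx], top_k)
    return unit_ids.tolist(), scores.tolist()
```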


6. Summary

  In this work, we propose a general technique called Class Activation Mapping (CAM) for Convolutional Neural Networks (CNN) with global average pooling. This enables a classification-trained CNN to learn to perform object localization without using any bounding box annotations. Class activation maps allow us to visualize the predicted class scores on any given image, highlighting the discriminative parts of objects detected by the CNN. We evaluate our method on the ILSVRC benchmark, demonstrating that our global average pooling CNN is capable of accurate object localization. Furthermore, we demonstrate that the CAM localization technique generalizes to other visual recognition tasks, i.e., our technique produces universally localizable deep features that can help other researchers understand the discriminative basis of CNNs for their tasks.
