Network Dissection paper reading notes

1. Introduction

  This is a CVPR 2017 paper on the interpretability of deep learning. The author quantifies the interpretability of CNN hidden representations by evaluating the correspondence between individual hidden neurons (units) and a set of semantic concepts.

2. Network analysis

2.1 Measures of Interpretability of Deep Visual Representations

  1. Identify a broad set of human-labeled visual concepts.
  2. Collect hidden neuron responses to known concepts.
  3. Quantify the alignment between hidden units and concepts.

2.2 Dataset

  The author constructed a test dataset called Broden (Broadly and Densely Labeled Dataset), in which every image carries pixel-wise annotations at the scene, object, part, material, texture, and color levels. A sample from the Broden dataset is shown in the figure below.
[Figure: sample image with pixel-wise annotations from the Broden dataset]

2.3 Interpretable neuron scoring

  Each image in the dataset is fed into the network under analysis, the responses on each feature map are collected, and the semantic concepts corresponding to the feature maps of each layer are then analyzed and summarized. The overall pipeline is shown in the figure below.
[Figure: overview of the Network Dissection scoring pipeline]

For each input image $x$ in the Broden dataset, the activation map $A_k(x)$ of every internal convolutional unit $k$ is collected. Then the distribution of individual unit activations $a_k$ is computed. For each unit $k$, the top quantile threshold $T_k$ is determined such that $P(a_k > T_k) = 0.005$ over every spatial location of the activation map across the dataset.
  To compare the low-resolution activation map of a unit with the input-resolution annotation mask $L_c$ of some concept $c$, the activation map $A_k(x)$ is scaled up to the input-mask resolution with bilinear interpolation, giving $S_k(x)$, anchoring the interpolation at the center of each unit's receptive field.
  $S_k(x)$ is then thresholded into a binary segmentation $M_k(x) \equiv S_k(x) \geq T_k$, which selects all regions of the activation map that exceed the threshold $T_k$. For every pair $(k, c)$, the intersection $M_k(x) \cap L_c(x)$ is computed, evaluating each unit against every concept $c$ in the dataset.
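As a concrete illustration, below is a minimal sketch of this step, assuming NumPy and SciPy: the top-quantile threshold $T_k$, the upsampling to $S_k(x)$, and the binary mask $M_k(x)$. This is not the paper's released code; the shapes and names (`acts_k`, `mask_hw`) are toy assumptions, and the plain `zoom` call ignores the receptive-field-centered anchoring described above.

```python
# Rough re-implementation sketch (not the authors' released code);
# shapes and names are illustrative assumptions.
import numpy as np
from scipy.ndimage import zoom

def top_quantile_threshold(acts_k, quantile=0.005):
    """T_k such that P(a_k > T_k) = quantile over all spatial locations.

    acts_k: activations of one unit over the dataset, shape (num_images, H, W).
    """
    return np.quantile(acts_k.ravel(), 1.0 - quantile)

def binary_mask(act_map, t_k, mask_hw):
    """Upsample one activation map A_k(x) to the annotation resolution
    (order-1 spline ~ bilinear), then threshold: M_k(x) = S_k(x) >= T_k."""
    h, w = act_map.shape
    s_k = zoom(act_map, (mask_hw[0] / h, mask_hw[1] / w), order=1)
    return s_k >= t_k

# Toy usage: 100 fake 13x13 feature maps, 113x113 annotation masks.
acts_k = np.random.rand(100, 13, 13)
t_k = top_quantile_threshold(acts_k)
m_k = binary_mask(acts_k[0], t_k, mask_hw=(113, 113))
print(t_k, m_k.shape, m_k.mean())
```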
  The score of each unit $k$ as a detector for concept $c$ is computed with the following IoU:
$$IoU_{k,c} = \frac{\sum|M_k(x) \cap L_c(x)|}{\sum|M_k(x) \cup L_c(x)|}$$
where $|\cdot|$ denotes the cardinality of a set. Because the dataset contains some categories of labels that are absent from some subsets of images, the sums are computed only over the subset of images that have at least one concept labeled in the same category as $c$. The value of $IoU_{k,c}$ measures the accuracy of unit $k$ in detecting concept $c$; if $IoU_{k,c}$ exceeds a threshold (set to 0.04 in the paper), unit $k$ is considered a detector for concept $c$. Note that a unit may be a detector for multiple concepts (and a concept may be detected by multiple units); for the analysis, the top-ranked label is chosen. To quantify the interpretability of a layer, the number of units that detect unique concepts is counted, called the number of unique detectors.
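Below is a minimal sketch of the IoU scoring and the unique-detector count, again assuming NumPy; the names (`unit_masks`, `iou_table`) and the toy sizes are illustrative, not taken from the paper's code.

```python
# Sketch of IoU_{k,c} and the unique-detector count; illustrative only.
import numpy as np

def iou_score(unit_masks, concept_masks):
    """IoU_{k,c} = sum|M_k ∩ L_c| / sum|M_k ∪ L_c| over the image subset
    that carries labels of the same category as concept c.

    unit_masks, concept_masks: boolean arrays of shape (num_images, H, W).
    """
    inter = np.logical_and(unit_masks, concept_masks).sum()
    union = np.logical_or(unit_masks, concept_masks).sum()
    return inter / union if union > 0 else 0.0

def unique_detectors(iou_table, threshold=0.04):
    """Number of distinct concepts detected by a layer.

    iou_table: IoU_{k,c} values, shape (num_units, num_concepts).
    A unit counts as a detector if its best IoU exceeds the threshold;
    only its top-ranked concept is kept.
    """
    best_concept = iou_table.argmax(axis=1)
    best_iou = iou_table.max(axis=1)
    return len(set(best_concept[best_iou > threshold].tolist()))

# Toy usage with arbitrary sizes (256 units, 1200 concepts).
iou_table = np.random.rand(256, 1200) * 0.05
print(unique_detectors(iou_table))
```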

3. Experiment

3.1 Human Evaluation of Interpretations

[Figure: human evaluation of unit interpretations across layers]
  At the lowest layers, the low-level color and texture concepts available in Broden match only a few well-interpreted units. Human agreement is also highest at conv5, suggesting that humans are better at recognizing and agreeing on high-level visual concepts such as objects and parts than on the shapes and textures that emerge in lower layers.

3.2 Measurement of Axis-Aligned Interpretability

  To explore whether the interpretability of a network depends on how concepts are aligned with its individual units, the author applies a random linear combination $Q$ to all units of a given layer, i.e., scrambles the unit basis, and can undo the scrambling with $Q^{-1}$, then observes how the detected concepts change. The setup is shown in the figure below:
[Figure: illustration of the random rotation Q applied to the units of a layer]
Here, the size of the rotation controls how strongly the random $Q$ mixes the units; because the transformation is invertible, scrambling the units in this way does not affect the final output of the network and does not change its discriminative power.
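A minimal sketch of this rotation experiment, assuming NumPy: a random orthogonal matrix $Q$ mixes the units of a layer, and applying $Q^{-1} = Q^{T}$ afterwards restores the original activations, which is why the discriminative power is unchanged. Unlike the paper, which grows the rotation gradually to control its size, this sketch applies one full random rotation; all names and shapes are illustrative.

```python
# Sketch of rotating a layer's units by a random orthogonal Q; illustrative only.
import numpy as np

def random_rotation(num_units, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((num_units, num_units)))
    return q

def rotate_units(acts, q):
    """Mix the unit dimension: acts has shape (num_images, num_units, H, W)."""
    return np.einsum('uk,nkhw->nuhw', q, acts)

acts = np.random.rand(4, 256, 13, 13)   # toy conv5-like activations
q = random_rotation(256)
rotated = rotate_units(acts, q)          # interpretability is measured on these
recovered = rotate_units(rotated, q.T)   # Q^{-1} = Q^T undoes the rotation
print(np.allclose(recovered, acts))      # True: the output is unaffected
```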

[Figure: number of unique detectors under increasing rotation]
  The results show that as the rotation gradually increases, the number of unique detectors drops sharply, so the interpretability of a CNN depends on how concepts align with its individual units.

3.3 Understanding Layer Concepts

[Figure: distribution of detected concept categories across layers]
  Confirming the intuition, color and texture concepts dominate the lower layers conv1 and conv2, while more object and part detectors emerge in conv5.

3.4 Network Architecture and Supervision

[Figure: number of unique detectors for different network architectures]
  In terms of network architecture, interpretability ranks ResNet > VGG > GoogLeNet > AlexNet. Deeper architectures seem to allow greater interpretability.

[Figure: interpretability of self-supervised versus supervised models]
  Self-supervised models create many texture detectors, but relatively few object detectors; clearly, self-supervised learning tasks are much less interpretable than supervised learning tasks on large annotated datasets.

3.5 Training Conditions vs Interpretability

[Figure: interpretability of baseline model snapshots across training iterations]
  The figure above plots the interpretability of baseline model snapshots at different training iterations. We can see that the object and part detectors start to emerge at about 10,000 iterations (256 images per iteration). We found no evidence of transitions between different concept categories during training. For example, units in conv5 do not become texture or material detectors before they become object or part detectors.
[Figure: interpretability under different training conditions (random initialization, dropout, batch normalization)]
  Repeat1, repeat2, and repeat3 in the figure above are three runs trained from different random weight initializations, and the results indicate:

  1. Comparing different random initializations, the models converge to similar levels of interpretability in terms of both the number of unique and total detectors;
  2. For the network without Dropout, there are more texture detectors, but fewer object detectors;
  3. Batch normalization seems to significantly reduce interpretability.

3.6 Network Classification vs Interpretability

[Figure: classification accuracy versus number of unique detectors]
  As can be seen from the figure above, there is a positive correlation between classification ability and interpretability.

3.7 Layer Width vs Interpretability

[Figure: effect of widening conv5 on the number of unique detectors]
  Increasing the number of conv5 convolution kernels from 256 to 768 gives similar classification accuracy to standard AlexNet on the validation set, but many more unique detectors and total detectors in conv5; increasing the number of conv5 units further to 1024 and 2048 does not significantly increase the number of unique concepts. This may indicate that AlexNet has a limited capacity to separate explanatory factors, or that limiting the number of disentangled concepts helps to solve the primary task of scene classification.

4. Q&A

  References [2], [3], and [4] below record some questions answered by the author himself, which can help with understanding the paper.

References

[1] Network Dissection: Quantifying Interpretability of Deep Visual Representations (CVPR 2017)

[2] Paper notes: "Network Dissection: Quantifying Interpretability of Deep Visual Representations" - CSDN
[3] Explain its black box characteristics from the perspective of the essence of deep neural networks - Zhihu
[4] Zhou Bolei (Zhihu): Using "Network Dissection" to analyze the interpretability of convolutional neural networks
