[Paper Notes] ZFNet

Introduction

Contribution

  • Proposes a method for visualizing the feature maps of the intermediate layers of a convolutional network, showing that they are not random, uninterpretable patterns.
  • Uses these visualizations to debug the model and guide architecture design, and verifies the resulting improvements experimentally
  • Demonstrates the effectiveness of transfer learning

Summary

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

Method

Deconvolution visualization

  • To understand what activates the feature maps in the intermediate layers of a CNN, the authors propose mapping a given feature map back into the input pixel space through a deconvolutional network (deconvnet), producing an image of the same size as the input

    • CNN: pixels -> features
    • Deconvnet: features -> pixels
  • The deconvnet needs no training of its own; it reuses the parameters of the already-trained convolutional network

  • The authors design (approximate) inverse operations for the convolution, activation, and pooling steps of the network's forward pass.

  • The overall architecture is shown in the figure below: each layer of the network has a corresponding deconvnet layer (left)

    • [Figure: a deconvnet layer (left) attached to each convnet layer (right)]
  • Operating procedure

    • An image is fed through the network in a normal forward pass, producing feature maps at every layer.
    • Pick a layer, select the activations of interest (the feature map produced by a particular convolution kernel), set all the remaining values to zero, and feed the result into the deconvnet to obtain a reconstructed image (see the sketch after this list)
    • My personal understanding: each convolution kernel extracts its own feature map, and kernels in different layers capture different semantics. For example, some kernels respond strongly to tires, while others respond more to people
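
A rough illustration of the selection step, as a minimal sketch assuming PyTorch; the tensor sizes and the channel index are arbitrary, not the paper's actual configuration:

```python
import torch

# Hypothetical activations of some intermediate layer: (batch, channels, H, W)
feats = torch.randn(1, 256, 13, 13)

k = 42                                   # convolution kernel (channel) of interest -- arbitrary choice
selected = torch.zeros_like(feats)
selected[:, k] = feats[:, k]             # keep only that kernel's feature map; everything else stays zero

# `selected` is what would then be pushed back through the deconvnet
# (unpooling -> relu -> transposed convolution, described below).
```
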
Unpooling
  • Max pooling loses information (only the maximum of each window is kept; the other values are discarded), so it is not invertible. The authors therefore record the position of each maximum in "switch" variables (see the sketch below)
  • During unpooling, each pooled value is placed back at the position stored in its switch variable (bottom left in the figure below)
  • [Figure: max pooling with recorded switches and the corresponding unpooling (bottom left)]
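
A minimal sketch of the switch idea, assuming PyTorch: the indices returned by max pooling play the role of the paper's switch variables, and MaxUnpool2d places each value back at the recorded position.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, switches = pool(x)           # `switches` records where each maximum came from
restored = unpool(pooled, switches)  # maxima go back to their recorded positions; other pixels are zero

print(pooled.shape, restored.shape)  # torch.Size([1, 1, 2, 2]) torch.Size([1, 1, 4, 4])
```
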
ReLU activation function
  • In the forward pass, the ReLU nonlinearity is applied so that the feature maps are non-negative
  • When reconstructing the signal, the authors still apply ReLU (a minimal snippet follows this list)
    • Personally, I feel a more rigorous choice would be some inverse of ReLU, but in practice it does not matter: the signal being reconstructed is non-negative by default, so the negative range is never used
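
In code form this is just the same ReLU applied on the way back up (a sketch, assuming PyTorch; the tensor is a stand-in for a partial reconstruction):

```python
import torch
import torch.nn.functional as F

recon = torch.randn(1, 16, 8, 8)  # stand-in for a signal partway through reconstruction
recon = F.relu(recon)             # same rectification as the forward pass: negatives are zeroed
```
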
Transposed convolution
  • In the forward pass, a learned convolution kernel convolves the previous layer's feature maps to produce this layer's feature maps.
  • For the inverse operation, the transpose of the same convolution kernel is used (see the sketch below)
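
A minimal sketch, assuming PyTorch: the same weight tensor used for the forward convolution is passed to conv_transpose2d to map features back toward pixel space. The shapes here are illustrative, not ZFNet's actual layer sizes.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)                # toy input image
weight = torch.randn(16, 3, 3, 3)          # stand-in for a trained kernel: (out_ch, in_ch, kH, kW)

feat = F.conv2d(x, weight, padding=1)                # forward pass: pixels -> features, (1, 16, 8, 8)
recon = F.conv_transpose2d(feat, weight, padding=1)  # same kernel, transposed: features -> pixel space, (1, 3, 8, 8)

print(feat.shape, recon.shape)
```
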
Principle
  • First look at ordinary convolution, which slides a window over the image
  • This process can be unrolled into a matrix multiplication $Ax = b$
    • In effect, flatten the original image into a vector $x$; each row of $A$ records which kernel weight multiplies each pixel of the original image for one output position
  • Logically, to recover the original image $x$ you would need the inverse of $A$, i.e. $x = A^{-1}b$; the paper instead uses the transpose of the convolution kernel, i.e. $A^T$, which only restores the shape of the original image (see the toy example after this list)
    • $A^{-1} = A^T$ holds only when $A$ is an orthogonal matrix
    • For this reason, describing transposed convolution as "deconvolution" is viewed as a misnomer here: deconvolution proper means inverting the convolution
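
A toy numpy example of the matrix view (my own illustration, not code from the paper): for a 3x3 kernel sliding over a 4x4 image with stride 1 and no padding, the convolution is $b = Ax$ with $A$ of shape (4, 16), and multiplying by $A^T$ only brings $b$ back to the input's shape, not back to the input.

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
k = np.arange(9, dtype=float).reshape(3, 3)      # toy 3x3 kernel

# Build A: one row per output position, one column per input pixel.
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]       # top-left corners of the 2x2 output
A = np.zeros((4, 16))
for row, (i, j) in enumerate(corners):
    placed = np.zeros((4, 4))
    placed[i:i + 3, j:j + 3] = k                 # which kernel weight hits which input pixel
    A[row] = placed.ravel()

b = A @ img.ravel()                  # convolution (cross-correlation) written as a matrix product
back = (A.T @ b).reshape(4, 4)       # "transposed convolution": same shape as the input, not the input itself

print(b.shape, back.shape)           # (4,) (4, 4)
```
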
Summary
  • The image reconstructed from a feature map through the deconvnet is only partially similar to the original image, because pooling in the forward pass discards information.
  • The reconstruction is weighted by how much each part of the input contributes to the selected feature map: regions of the original image that match the selected filter more strongly appear brighter in the reconstruction.
    • For example, if a convolution kernel responds to tires, then reconstructing from that kernel's feature map (with all other feature maps set to 0) yields an image in which the tire regions are brighter

Visualization

The paper contains many experiments; the authors run them on the ImageNet validation set.

Feature visualization

  • For various convolution kernels in layers 1–5, the authors find the 9 images in the dataset that produce the strongest response for each kernel, and compare the reconstructions of the corresponding feature maps with the original image patches (a retrieval sketch follows this list)
  • [Figure: top-9 activating image patches and their reconstructions for a set of layer-2 kernels]
    • For example, here each red box corresponds to one of 16 layer-2 convolution kernels and contains the nine images that produce its largest response; the right side shows the corresponding original image patches and the left side the reconstructions.
  • This experiment shows that high-level feature maps can be invariant to input transformations.
    • High-level features extract similar semantics even if the input images are not exactly the same
  • The feature map of a specific convolution kernel picks out specific parts of the input.
    • In the reconstructed images, only those matching parts are bright; the rest is dark.
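
A hedged sketch of the retrieval step, assuming PyTorch: `model_up_to_layer` and the data loader are placeholders standing in for a truncated copy of the network and the dataset, not the paper's code.

```python
import torch

def top9_images_for_kernel(model_up_to_layer, loader, channel):
    """Return the 9 images whose strongest activation on `channel` is largest."""
    scores, images = [], []
    with torch.no_grad():
        for imgs, _ in loader:
            feats = model_up_to_layer(imgs)                      # (B, C, H, W) activations of the chosen layer
            scores.append(feats[:, channel].amax(dim=(1, 2)))    # peak response of that kernel per image
            images.append(imgs)
    scores, images = torch.cat(scores), torch.cat(images)
    return images[scores.topk(9).indices]
```
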

Feature evolution during training

[Figure: evolution of feature-map reconstructions for each layer over training epochs]

  • Comparing how the layers converge, the lower layers converge within a few epochs, while the higher layers converge slowly (their feature reconstructions show almost nothing in the first few epochs) and need dozens of epochs to converge.

Feature invariance

[Figure: feature distances and prediction changes under translation, scaling, and rotation]

  • The authors apply translation (a1), scaling (b1), and rotation (c1) to image patches.
    • They compute the Euclidean distance between the feature vectors (flattened feature maps) obtained before and after each transformation (see the sketch after this list)
      • The second column is for layer 1, and the third column is for layer 7.
    • (4th column) change in the predicted probability
  • Layer-1 feature distances change a lot under these transformations, while the layer-7 distance changes very little even under drastic transformations (note that the scales of the columns differ)
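
A small sketch of the measurement itself, assuming PyTorch; the `extract` module is a random stand-in for a truncated layer-1 or layer-7 network, included only to make the snippet runnable.

```python
import torch
import torch.nn as nn

def feature_shift_distance(extract, img, dx):
    """Euclidean distance between flattened features of an image and a shifted copy."""
    shifted = torch.roll(img, shifts=dx, dims=-1)      # horizontal translation (circular, for simplicity)
    f0 = extract(img.unsqueeze(0)).flatten()
    f1 = extract(shifted.unsqueeze(0)).flatten()
    return torch.dist(f0, f1)

extract = nn.Conv2d(3, 8, kernel_size=3, padding=1)    # stand-in feature extractor
img = torch.randn(3, 32, 32)
print(feature_shift_distance(extract, img, dx=5))
```
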

Architecture Selection

(omitted in these notes)

Occlusion Sensitivity

(omitted in these notes)

Correspondence Analysis

(omitted in these notes)

Origin blog.csdn.net/u011459717/article/details/128337667