Visual Understanding of Convolutional Neural Networks

Visualizing and Understanding Convolutional Networks

Zeiler, M.D., Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham. https://doi.org/10.1007/978-3-319-10590-1_53

Reference: Tongji Zihao's [Intensive Reading AI Papers] video on ZFNet, a deep-learning image classification algorithm


Method

A standard fully supervised convolutional neural network maps an input image, through a series of layers, to a probability vector over the output classes.

Layer structure:

  • The output of the previous layer is convolved with a set of learned convolution kernels
  • A ReLU nonlinearity is applied
  • [Optional] local max pooling
  • [Optional] normalization across feature maps
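The per-layer pipeline above (convolution, ReLU, max pooling) can be sketched in plain NumPy for a single channel. This is an illustrative toy, not the paper's code; the input, kernel, and helper names are made up here.

```python
import numpy as np

def conv2d(x, k):
    """'Valid' cross-correlation of a 2-D input with a 2-D kernel."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i+kh, j:j+kw] * k).sum()
    return out

def relu(x):
    return np.maximum(x, 0)           # nonlinearity keeps maps non-negative

def max_pool(x, k=2):
    h, w = x.shape
    # Crop to a multiple of k, then take the max over each k x k block
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)
kern = np.array([[-1., 0.], [0., 1.]])   # simple diagonal-difference filter
out = max_pool(relu(conv2d(x, kern)))    # one layer: conv -> ReLU -> pool
```

On this ramp input the diagonal difference is constant, so the layer output is a uniform 2x2 map; with real images the same pipeline produces edge-like feature maps.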

Experimental setup:

  • Dataset: $\{x, y\}$, where $y$ is a discrete variable giving the class label
  • A cross-entropy loss compares the network output with the ground-truth label
  • The network parameters (convolution kernels, weights and biases of the FC layers) are trained by backpropagating the loss and updated by gradient descent
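The training setup above can be sketched with a minimal NumPy example: a linear classifier trained with softmax cross-entropy and plain gradient descent. The data, model size, and hyperparameters are all illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 64, 10, 3                       # samples, feature dim, classes
W_true = rng.normal(size=(d, c))
X = rng.normal(size=(n, d))
y = np.argmax(X @ W_true, axis=1)         # synthetic integer class labels

W = np.zeros((d, c))
lr = 0.5
for _ in range(200):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(n), y]).mean()     # cross-entropy loss
    grad = p.copy()
    grad[np.arange(n), y] -= 1.0                  # dLoss/dLogits
    W -= lr * (X.T @ grad) / n                    # gradient descent step

acc = float((np.argmax(X @ W, axis=1) == y).mean())
```

The gradient `p - onehot(y)` is the standard softmax cross-entropy gradient; in the real network the same loss gradient is backpropagated through all convolutional layers.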

Visualization with deconvolution

  1. To understand the operation of a convolutional network, one must interpret the feature activity in the intermediate layers.
  2. These activities are mapped back to the input pixel space, showing which input patterns cause a given activation in the feature maps.
  3. This is done with a Deconvolutional Network (deconvnet), originally proposed for unsupervised learning. In this paper the deconvnet does no learning of its own; it serves only as a probe of an already trained network.

The process is as follows:

  1. An input image is presented to the convnet and features are computed through the layers
  2. To examine a given convnet activation, all other activations in that layer are set to zero, and the feature map is passed as input to the attached deconvnet layer
  3. The activity in the layer beneath that gave rise to the chosen activation is reconstructed via successive (i) unpool, (ii) rectify, and (iii) filter operations
  4. The previous step is repeated until input pixel space is reached
    [Figure: a deconvnet layer attached to a convnet layer]
  • Unpooling: the max pooling operation in a convnet is non-invertible. An approximate inverse is obtained by recording the locations of the maxima within each pooling region ("switches") and placing the reconstructed values back at those locations.
    [Figure: unpooling via recorded switch locations]
  • Rectification: the convnet uses ReLU nonlinearities, which ensure the feature maps are always positive. To obtain valid feature reconstructions at each layer, the reconstructed signal is also passed through a ReLU.
  • Filtering: the convnet convolves the feature maps of the previous layer with learned kernels. To approximately invert this, the deconvnet applies transposed versions of the same kernels to the rectified maps.
  • No contrast normalization is performed anywhere in the reconstruction path.
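The unpooling step above can be sketched in NumPy: max pooling records where each maximum came from, and unpooling places values back at those recorded positions (non-maximal entries are lost and filled with zeros). A minimal single-channel sketch with illustrative names:

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records argmax locations ("switches")."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k, 2), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            block = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i, j] = block[r, c]
            switches[i, j] = (i*k + r, j*k + c)   # position in the input
    return pooled, switches

def unpool(pooled, switches, out_shape):
    """Approximate inverse: maxima go back to their recorded positions."""
    out = np.zeros(out_shape)                      # non-maxima are lost
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 3., 0., 2.],
              [4., 2., 1., 0.],
              [0., 1., 5., 6.],
              [2., 0., 7., 1.]])
p, s = max_pool_with_switches(x)
recon = unpool(p, s, x.shape)
```

Note the reconstruction preserves only the maxima and their positions, which is exactly why unpooling is only an approximate inverse.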

Network structure

[Figure: network architecture]

  • A 224x224 input image is convolved in the first layer
  • Layers 1-5 each consist of:
    1. a convolution
    2. a ReLU activation
    3. [optional] 3x3 max pooling with stride 2
    4. [optional] a normalization operation
  • These are followed by two fully connected layers and a softmax classifier

Convolutional Neural Network Visualization

Feature visualization

  1. The visualizations reveal a hierarchy of features in the network
  2. Layer 2: color and edge information
  3. Layer 3: more complex invariances, capturing similar textures and text
  4. Layer 4: significant variation; more class-specific features
  5. Layer 5: entire objects

[Figure: feature visualizations from a fully trained model]

Feature evolution during training

The evolution, during training, of the strongest activation (across all training examples) in a given feature map, projected back into input pixel space:

  • Lower-layer features converge within the first few epochs
  • Upper-layer features require a considerable number of epochs to converge
    [Figure: feature evolution over training epochs]

Architecture choice

Visualizing the first layers of a trained AlexNet revealed a problem: the 11x11, stride-4 convolution kernels in the first layer produce aliasing artifacts. ZFNet therefore switches to a 7x7 kernel with stride 2.
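As a quick sanity check (not from the paper), the standard convolution-arithmetic formula `out = (W - K + 2P) // S + 1` shows how the first-layer change affects spatial resolution. The padding values below are assumptions for illustration:

```python
def conv_out(w, k, s, p=0):
    """Spatial output size of a convolution on a w x w input."""
    return (w - k + 2 * p) // s + 1

old = conv_out(224, k=11, s=4)        # AlexNet-style first layer, no padding
new = conv_out(224, k=7, s=2, p=1)    # ZFNet-style first layer, padding assumed
```

The smaller kernel and stride retain a finer-grained (higher-resolution) first-layer feature map, which is part of why the aliasing artifacts disappear.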

Occlusion experiment

To test whether the model truly identifies the location of the object in the image, rather than just the surrounding context, portions of the input image are systematically occluded with a grey square while the classifier output is monitored.

[Figure: occlusion sensitivity maps]
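The occlusion experiment can be sketched as a simple loop: slide a grey patch over the image and record the model's score for the true class at each position. The classifier here (`score_fn`) is a stand-in toy, not a trained network; names and sizes are illustrative.

```python
import numpy as np

def occlusion_map(img, score_fn, patch=4, stride=4, fill=0.5):
    """Score the true class with a grey patch at each grid position."""
    h, w = img.shape
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = img.copy()
            occluded[i*stride:i*stride+patch, j*stride:j*stride+patch] = fill
            heat[i, j] = score_fn(occluded)   # low score => region mattered
    return heat

# Toy "classifier": the score is the mean brightness of the top-left
# quadrant, so occluding that quadrant should give the lowest scores.
img = np.ones((16, 16))
score = lambda x: x[:8, :8].mean()
heat = occlusion_map(img, score)
```

Positions where the score drops mark regions the classifier depends on, which is exactly the evidence the paper uses to argue the model localizes objects.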


Experiments

1. ImageNet classification

[Figure: ImageNet classification results]

2. Removing some layers (ablation)

3. Model generalization (transfer learning)

[Figure: transfer-learning results on other datasets]

  • Good performance is achieved with only a small amount of target training data
    [Figure: accuracy vs. number of training images]
  • This no longer holds when the target dataset differs substantially from the pre-training data
    [Figure: results on a dissimilar dataset]

4. Are the features from different layers of the network effective for classification?

  • The softmax layer is replaced with an SVM trained on each layer's features
    [Figure: layer-wise classification accuracy with an SVM]
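The layer-wise probe above can be sketched as a linear SVM (hinge loss, trained by subgradient descent) fitted on fixed "features". Pure NumPy; the random features standing in for frozen convnet activations, and all hyperparameters, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
feats = rng.normal(size=(n, d))          # stand-in for frozen layer features
y = np.sign(feats[:, 0] + feats[:, 1])   # synthetic labels in {-1, +1}

w, b = np.zeros(d), 0.0
lr, lam = 0.1, 1e-3                      # step size, L2 regularization
for _ in range(300):
    margins = y * (feats @ w + b)
    viol = margins < 1                   # examples violating the margin
    gw = lam * w - (y[viol, None] * feats[viol]).sum(axis=0) / n
    gb = -y[viol].sum() / n
    w -= lr * gw                         # subgradient step on hinge loss
    b -= lr * gb

acc = float((((feats @ w + b) * y) > 0).mean())
```

In the paper's setting the same probe is run once per layer; rising accuracy with depth indicates increasingly discriminative feature hierarchies.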


Original post: blog.csdn.net/qq_38869560/article/details/128320042