Visual Understanding of Convolutional Neural Networks
Visualizing and Understanding Convolutional Networks
Zeiler, M.D., Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham. https://doi.org/10.1007/978-3-319-10590-1_53
Method
A standard, fully supervised convolutional network maps an input image, through a series of layers, to a probability vector over the output classes.
Layer structure:
- The output of the previous layer is convolved with a set of learnable convolution kernels
- A non-linear activation function (ReLU) is applied
- [Optional] Local max pooling
- [Optional] Normalization across feature maps
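A single layer's forward pass can be sketched in NumPy (a minimal, unbatched, single-channel illustration, not the paper's implementation):

```python
import numpy as np

def conv2d_single(x, k):
    """Valid 2-D 'convolution' (really cross-correlation, as in convnets)
    of one feature map x with one kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2, stride=2):
    """Local max pooling over non-overlapping-by-default windows."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

# One layer: convolve -> ReLU -> (optional) max pool
x = np.random.randn(8, 8)
k = np.random.randn(3, 3)
feat = max_pool(relu(conv2d_single(x, k)))
print(feat.shape)  # (3, 3)
```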
Experimental setup:
- Dataset: $\{x, y\}$, where $y$ is a discrete variable holding the class label
- A cross-entropy loss compares the network output with the ground-truth label
- The network parameters (convolution kernels, weights and biases of the FC layers) are trained by backpropagating the loss and updated with gradient descent
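The training setup can be illustrated with a toy linear classifier in place of the convnet: the same cross-entropy loss and gradient-descent update apply, just without the convolutional layers (a sketch on random data, not the paper's training code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": logits = X @ W + b, trained with cross-entropy
# and plain gradient descent, standing in for backprop through a convnet.
n, d, c = 64, 10, 3
X = rng.standard_normal((n, d))
y = rng.integers(0, c, size=n)
W = np.zeros((d, c))
b = np.zeros(c)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

lr = 0.5
loss_before = cross_entropy(softmax(X @ W + b), y)
for _ in range(100):
    p = softmax(X @ W + b)
    p[np.arange(n), y] -= 1          # dL/dlogits for softmax cross-entropy
    W -= lr * X.T @ p / n            # gradient step on the weights
    b -= lr * p.mean(axis=0)         # gradient step on the biases
loss_after = cross_entropy(softmax(X @ W + b), y)
```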
Visualization with deconvolution
- To understand the operation of a convolutional neural network, we need to interpret the feature activity in its intermediate layers.
- Mapping these activities back to the input pixel space shows what input pattern originally caused a given activation in the feature maps.
- Deconvolutional Networks were originally proposed for unsupervised learning. In this paper the deconvnet does no learning of its own; it serves purely as a probe of an already-trained network.
The process, as shown in the figure below:
- The input image is passed through the convolutional network to compute the features
- To examine a given convnet activation, set all other activations in that layer to zero and pass the feature maps as input to the attached deconvnet layer
- The deconvnet then reconstructs the activity in the layer beneath that gave rise to the chosen activation via (i) unpool, (ii) rectify, and (iii) filter operations
- Repeat the previous step layer by layer until the input pixel space is reached
- Unpooling: The max pooling operation in a convolutional network is non-invertible. An approximate inverse is obtained by recording the location of the maximum within each pooling region (a "switch") and placing the pooled value back at that location during reconstruction. As shown below:
- Rectification: Convolutional networks use the ReLU non-linearity, which ensures the feature maps are always positive. To obtain valid feature reconstructions at each layer, the reconstructed signal is also passed through a ReLU.
- Filtering: The convnet convolves the feature maps of the previous layer with learned kernels. To invert this, the deconvnet applies transposed versions of the same kernels to the rectified maps.
- In addition, no normalization operation is used in the whole reconstruction process.
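The unpool and rectify steps above can be sketched directly: max pooling records a "switch" per pooling region, and unpooling places each value back at its recorded location (a minimal NumPy illustration; the transposed-filter step is omitted for brevity):

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Max pool, recording the location ('switch') of each maximum."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    sw = np.zeros((oh, ow, 2), dtype=int)
    for i in range(oh):
        for j in range(ow):
            patch = x[i*size:(i+1)*size, j*size:(j+1)*size]
            r, c = np.unravel_index(patch.argmax(), patch.shape)
            out[i, j] = patch[r, c]
            sw[i, j] = (i*size + r, j*size + c)
    return out, sw

def unpool(p, sw, shape):
    """Approximate inverse of max pooling: place each pooled value
    back at its recorded switch location, zeros elsewhere."""
    x = np.zeros(shape)
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            x[sw[i, j, 0], sw[i, j, 1]] = p[i, j]
    return x

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 4))
pooled, sw = max_pool_with_switches(a)
recon = np.maximum(unpool(pooled, sw, a.shape), 0)  # unpool, then rectify (ReLU)
```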
Network structure
- Input: 224x224 image
- Layers 1-5 (each layer, in order):
  - convolution operation
  - ReLU activation function
  - [some layers] 3x3 max pooling with a stride of 2
  - [some layers] normalization operation
- Two fully connected layers, followed by a softmax classifier
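The spatial size of each feature map follows the standard formula out = (in - kernel + 2*pad) / stride + 1. A quick walkthrough of the first two stages, using hypothetical ZFNet-style hyperparameters (the notes only state a 224x224 input and the 7x7/stride-2 first layer; padding here is an assumption):

```python
def out_size(n, k, s, p=0):
    """Spatial size after a conv/pool with kernel k, stride s, padding p."""
    return (n - k + 2 * p) // s + 1

n = 224
n = out_size(n, 7, 2, 1)   # conv1: 7x7, stride 2, pad 1 (assumed)
n = out_size(n, 3, 2)      # 3x3 max pool, stride 2
print(n)
```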
Convolutional Neural Network Visualization
Feature visualization
- The projections reveal the hierarchical nature of the features in the network
- Layer 2: color and edge information
- Layer 3: more complex invariances; captures similar textures and text
- Layer 4: significant variation; more class-specific
- Layer 5: entire objects
Feature evolution during training
For a given feature map, the strongest activation (across all training examples) is projected back into the input pixel space at different points during training
- Low-level features converge within the first few epochs
- High-level features require considerably more epochs to converge
Architecture choice
An 11x11 convolution kernel with stride 4 produces aliasing artifacts in the first-layer features → changed to a 7x7 kernel with stride 2
Occlusion experiment
Tests whether the model actually localizes the object in the image, or merely relies on the surrounding context, by systematically occluding portions of the input with a gray square and monitoring the classifier output
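The occlusion procedure can be sketched as a loop that slides a gray patch over the image and records the score at each position; `score_fn` below is a hypothetical stand-in for a trained classifier's probability for the true class:

```python
import numpy as np

def occlusion_map(img, score_fn, patch=4, stride=4, fill=0.5):
    """Slide a gray square over the image and record the classifier's
    score for the true class at each occluder position."""
    H, W = img.shape
    heat = []
    for i in range(0, H - patch + 1, stride):
        row = []
        for j in range(0, W - patch + 1, stride):
            occluded = img.copy()
            occluded[i:i+patch, j:j+patch] = fill  # gray square
            row.append(score_fn(occluded))
        heat.append(row)
    return np.array(heat)

# Toy "model": scores by the brightness of the top-left quadrant,
# so occluding that region should lower the score the most.
img = np.ones((16, 16))
score = lambda x: x[:8, :8].mean()
heat = occlusion_map(img, score)
```

Low values in `heat` mark the regions the "model" actually relies on, which is how the paper shows the network localizes objects rather than using context.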
Experiments
1. ImageNet classification
2. Ablation: removing some layers
3. Model generalization (transfer learning)
   - With only a small amount of target data, the pretrained features achieve good performance
   - This no longer holds when the target dataset differs substantially from the source
4. Whether the features of different layers in the network are effective for classification
   - Replace the softmax layer with an SVM trained on the fixed features of each layer
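A linear SVM probe on fixed features can be sketched with hinge-loss subgradient descent; the Gaussian blobs below are a toy stand-in for activations taken from some layer of a trained convnet (the paper's actual setup trains the SVM on real layer features):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class "features": separable Gaussian blobs with labels +1 / -1.
X = np.vstack([rng.standard_normal((50, 5)) + 2,
               rng.standard_normal((50, 5)) - 2])
y = np.array([1] * 50 + [-1] * 50)

w = np.zeros(5)
b = 0.0
lr, lam = 0.1, 0.01           # learning rate and L2 regularization
for _ in range(200):
    margins = y * (X @ w + b)
    viol = margins < 1        # examples violating the margin
    gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
    gb = -y[viol].sum() / len(X)
    w -= lr * gw
    b -= lr * gb

acc = ((X @ w + b > 0) == (y > 0)).mean()
```

Repeating this probe with features from successive layers measures how linearly separable (and thus how class-discriminative) each layer's representation is.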