Research paper notes on fine-grained image classification (2015)

The Treasure beneath Convolutional Layers: Cross-convolutional-layer Pooling for Image Classification

Abstract

Most existing methods use the activations of a fully connected layer of a deep convolutional network as the image or region representation, but such representations have limited discriminative power.

This paper instead builds the representation from convolutional layer activations, through a cross-convolutional-layer pooling technique.

Introduction

A pre-trained deep convolutional network provides a generic image representation, but how to derive from it a representation suited to a specific task is still an open problem. The common practice is to use the activations of a fully connected layer as the new image representation.
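For reference, a minimal sketch of this common practice (not code from the papers discussed here); the model choice (torchvision AlexNet), layer choice (fc7), and file name are illustrative assumptions:

```python
# Using fully connected layer activations of a pre-trained DCNN as the
# image representation. Model (AlexNet) and layer (fc7) are assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# Keep everything up to the second fully connected layer (fc7) and its ReLU,
# dropping the final fc8 classifier.
fc7 = torch.nn.Sequential(
    model.features, model.avgpool, torch.nn.Flatten(1),
    *list(model.classifier.children())[:6],
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    representation = fc7(img)  # (1, 4096) image representation
```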

Some studies have also shown that using convolutional layer activations as image features does not work well.

This paper argues that if used properly, convolutional layer activations can also form a powerful image representation.

Therefore, this paper proposes cross-convolutional-layer pooling.

This structure relies on two key components:

  1. Use convolutional layer activations in the local feature setting, where sub-arrays of convolutional layer activations are extracted as region descriptors;
  2. The extracted local features are pooled using the activations of two consecutive convolutional layers.

The first component is motivated by prior work showing that DCNN activations are not translation invariant, and that it is beneficial to extract fully connected activations of a DCNN over local regions and pool multiple such activations to represent an image.

Therefore, this paper uses sub-arrays of the convolutional layer activations as region descriptors.

The second component builds on parts-based pooling, which creates one pooling channel for each detected part region; the final image representation is obtained by concatenating the pooling results of all channels.

In this paper, each feature map of a convolutional layer is used as the detection response map of a part detector, and during pooling this map weights the region descriptors extracted from the previous convolutional layer.

The pooling results from multiple channels, one for each feature map, are concatenated to obtain the final image representation.

Proposed Method

Convolutional layers vs. fully-connected layers

A major difference between convolutional layer activations and fully connected layer activations is that the former embeds rich spatial information.


In this paper, local spatial units are taken from the convolutional feature maps and their activations across all filters are concatenated vertically into a single local feature vector; this extraction window is then translated across the whole map, as sketched below.
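A minimal sketch (not the authors' code) of reading a convolutional feature map as a set of local features; the window size and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def conv_to_local_features(feature_map, window=1):
    """feature_map: (C, H, W) activations of convolutional layer t.
    Returns (N, C * window**2) local features, one per spatial position."""
    # F.unfold slides a window x window region over the map and flattens each
    # extracted patch (all C channels concatenated vertically) into a column
    patches = F.unfold(feature_map.unsqueeze(0),
                       kernel_size=window, padding=window // 2)
    return patches.squeeze(0).t()  # (N, C * window**2)

conv_t = torch.randn(256, 13, 13)    # e.g. conv5 of an AlexNet-like network
x = conv_to_local_features(conv_t)   # (169, 256): one descriptor per position
```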

Cross-convolutional-layer Pooling

After extracting local features from a convolutional layer, one can directly apply max pooling or sum pooling to obtain an image-level representation.
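For reference, this direct pooling baseline, reusing the local features `x` from the previous sketch:

```python
# Direct pooling of the N local features x (shape (N, C)) from above:
sum_pooled = x.sum(dim=0)         # sum-pooling -> a C-dimensional vector
max_pooled = x.max(dim=0).values  # max-pooling -> a C-dimensional vector
```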

This paper proposes an alternative pooling method.

Multiple ROIs are detected first, and then the local features in the ROIs are pooled. Finally these vectors are concatenated together to form the image representation.

We define the pooled feature obtained from the $k$-th ROI as $P_k^t$, which under sum-pooling is computed as:

$$P_k^t = \sum_{i=1}^{N_t} x_i \, I_{i,k}$$

where $x_i$ denotes the $i$-th local feature and $I_{i,k}$ indicates whether $x_i$ falls inside the $k$-th ROI.
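A small sketch of this ROI-guided sum-pooling; the sizes and the random indicator matrix are hypothetical, purely for illustration:

```python
import torch

N, C, K = 169, 256, 4                 # illustrative: N positions, K ROIs
x = torch.randn(N, C)                 # local features x_i
I = (torch.rand(N, K) > 0.5).float()  # hypothetical indicators I_{i,k}

P = I.t() @ x             # (K, C): row k is P_k^t = sum_i x_i * I_{i,k}
image_repr = P.flatten()  # concatenate the K pooled vectors
```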

However, obtaining such ROIs is not straightforward. Fortunately, the activations of a convolutional layer turn out to be semantically meaningful:

(Figure: feature maps randomly sampled from the 256 feature maps of conv5, overlaid on the original image for better visualization.)

Therefore, the feature maps of the $(t+1)$-th convolutional layer, denoted $D^{t+1}$, are used as indicator maps.

Then, the image can be represented as:

$$P^t = \left[\, P_1^t,\; P_2^t,\; \ldots,\; P_{D_{t+1}}^t \,\right], \qquad P_k^t = \sum_{i=1}^{N_t} x_i^t \, a_{i,k}^{t+1}$$

where $D_{t+1}$ denotes the number of feature maps (pooling channels) of the $(t+1)$-th convolutional layer.

Variable definitions:

  - $P^t$: the pooled feature of the $t$-th convolutional layer
  - $P_k^t$: the $k$-th pooling channel of $P^t$
  - $x_i^t$: the $i$-th local feature of the $t$-th convolutional layer
  - $a_{i,k}^{t+1}$: the activation of the $k$-th feature map of the $(t+1)$-th convolutional layer at the $i$-th spatial position

Here, local feature extraction is performed on the $t$-th convolutional layer, yielding $N_t$ local features. The extraction can be viewed as the same kind of convolution the $(t+1)$-th layer performs, so each local feature corresponds spatially to one position of the $(t+1)$-th layer's output.

After that, the natural step is to multiply each local feature by its corresponding response in the indicator maps and then sum-pool, as in the sketch below.
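A minimal sketch of cross-convolutional-layer pooling as described above, assuming the layer-$(t+1)$ activations have already been spatially aligned with the layer-$t$ extraction positions; the tensor shapes are illustrative:

```python
import torch

def cross_layer_pooling(x_t, a_t1):
    """x_t:  (N, C) local features x_i^t of layer t, one per spatial position.
    a_t1: (N, K) activations a_{i,k}^{t+1}: the K feature maps of layer t+1,
    spatially aligned with the N extraction positions in layer t.
    Returns P^t = [P_1^t, ..., P_K^t] of dimension K * C."""
    # One matmul computes P_k^t = sum_i a_{i,k}^{t+1} * x_i^t for all k at once
    P = a_t1.t() @ x_t   # (K, C)
    return P.flatten()

x_t  = torch.randn(169, 256)              # layer-t local features
a_t1 = torch.relu(torch.randn(169, 256))  # ReLU keeps indicator weights >= 0
P_t  = cross_layer_pooling(x_t, a_t1)     # 65,536-dimensional representation
```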

This is also the reason B-CNN adopts the outer product: the two methods are essentially the same in how they process features, except that B-CNN's two factors can come from the same source, while cross-layer pooling draws them from two different sources (consecutive layers).
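Written side by side (the notation loosely follows each paper; $\otimes$ denotes the outer product), the two poolings differ only in where the two factors come from:

```latex
% Cross-convolutional-layer pooling: the two factors come from consecutive
% layers t and t+1 of one network (dual source)
P^t = \sum_{i=1}^{N_t} a_i^{t+1} \otimes x_i^t

% B-CNN: the two factors come from feature extractors f_A and f_B evaluated
% at the same location l of image I (possibly the same network: same source)
\Phi(I) = \sum_{l} f_A(l, I)^{\top} f_B(l, I)
```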

Bilinear CNN Models for Fine-grained Visual Recognition (end-to-end feature encoding)

This article is the pioneering work of the single-stage method.

Abstract

This paper proposes a bilinear model, which consists of two feature extractors.

Their outputs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor.

Such a structure can model local, pairwise feature interactions (in a translation-invariant manner), which is especially useful for fine-grained classification.

Such a structure also generalizes various orderless texture descriptors, such as the Fisher vector, VLAD, and O2P.

The bilinear form simplifies gradient computation and allows the two networks to be trained end-to-end using only image labels.

Note

The term "outer product" used below may differ from its usage in typical linear algebra textbooks. The product defined via the right-hand rule is the cross product, which mostly appears in analytic geometry; here "outer product" refers to the product between two vectors that yields a matrix.
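Concretely (a standard definition, stated here for reference):

```latex
% Outer product of two column vectors: a rank-one matrix
x \in \mathbb{R}^{m},\; y \in \mathbb{R}^{n}
\;\Longrightarrow\;
x\,y^{\top} \in \mathbb{R}^{m \times n},
\qquad (x\,y^{\top})_{ij} = x_i\, y_j
```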

Introduction

Current methods mainly detect parts first and then model the appearance of those parts. The biggest flaw of such methods is their heavy dependence on manual part annotations, which are costly to obtain and not necessarily optimal for the recognition task.

Another approach is to apply a more powerful holistic image representation. Traditional representations include the Fisher vector, VLAD, and SIFT-based encodings; the currently stronger option is the convolutional neural network.

This model is well suited to texture recognition and fine-grained tasks, since it operates in a translation-invariant manner.

Comparative analysis: the second approach avoids the biggest drawback of the first, but its performance has been worse, especially on images with small objects or cluttered scenes. Moreover, the effect of end-to-end training on such representations for these tasks has not been well studied.

The representation proposed in this paper captures pairwise interactions between local features.

This paper argues that its architecture is related to the two-stream hypothesis of visual processing in the human brain: the ventral stream handles object detection and recognition, while the dorsal stream handles the spatial positions of objects relative to the observer.

Bilinear models for image classification

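The paper defines a bilinear model as a quadruple $\mathcal{B} = (f_A, f_B, \mathcal{P}, \mathcal{C})$: two feature functions, a pooling function, and a classification function. Below is a minimal sketch (not the authors' code) of the bilinear combination with sum-pooling over locations; the shapes are illustrative, while the signed square root and $\ell_2$ normalization follow the paper:

```python
import torch

def bilinear_pool(fa, fb):
    """fa: (N, C_A) and fb: (N, C_B): the two streams' features at the same
    N image locations. Returns the pooled bilinear descriptor of size C_A*C_B."""
    phi = (fa.t() @ fb).flatten()  # sum over locations of the outer products
    # Signed square root and l2 normalization, as used in the B-CNN paper
    phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-10)
    return phi / (phi.norm() + 1e-10)

fa = torch.randn(196, 512)   # e.g. 14x14 conv5 locations of stream A
fb = torch.randn(196, 512)   # stream B at the same locations
phi = bilinear_pool(fa, fb)  # 262,144-dimensional image descriptor
```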
