Research papers on fine-grained image classification (2017)

Higher-order Integration of Hierarchical Convolutional Activations for Fine-grained Visual Categorization (by end-to-end feature encoding)

Abstract

The success of fine-grained visual classification tasks relies on modeling the appearance and interconnection of various semantic parts.

This property makes the FGVC task very challenging for three reasons:

  1. Part annotation and detection require expert guidance;
  2. Parts vary in size;
  3. Interactions between parts are complex and high-order.

To address the above issues, this paper proposes an end-to-end framework based on higher-order integration of hierarchical convolutional activations.

By using convolutional activations as local descriptors, hierarchical convolutional activations can serve as representations of local parts at different scales. (What is a convolutional activation? Why can it be used as a local descriptor?)

In this paper, a predictor based on polynomial kernels is proposed to capture higher-order statistics for modeling interactions between parts. To model interactions between parts across layers, the paper extends the polynomial predictor to integrate the activations of multiple layers through kernel fusion.

Introduction

Fully connected networks are not suitable for FGVC, and local, discriminative patterns in CNNs are critical to obtain more powerful representations.

Current methods generally locate the parts first and then model their appearance, extracting the corresponding local features from deeper layers.

The global appearance structure is then incorporated by "pooling these regional features" (probably meaning that the local features are extracted after the global features are extracted). This approach is effective, but it has the following problems:

  1. Many part-based models rely heavily on meticulous part annotations to train accurate part detectors, which limits their scalability to large-scale datasets; moreover, identifying discriminative parts for specific fine-grained objects is difficult and often requires interaction with humans or experts.
  2. Discriminative and semantic parts usually appear at different scales; each spatial unit corresponds to a specific receptive field, so a single convolutional layer is limited in describing parts of different sizes;
  3. Exploiting joint configurations of individual object parts is important for object appearance modeling, and some work adds geometric constraints. Existing methods only describe the first-order occurrence and relationships of a very small number of parts; once there are more parts, such descriptions become less effective.

To sum up, there are three problems: part detection depends on manual annotation, parts appear at different scales, and the appearance description is insufficient.

Therefore, this paper proposes to focus on high-order statistics at different scales to provide a more flexible way to model the global appearance without requiring local annotations.

About kernels

Recent research work argues that convolutional filters can be viewed as weak local detectors whose activation results are viewed as detection results.

In other words, a convolution responds directly to local regions, so there is no need to pick out those regions explicitly; it is enough to work with the convolution outputs.

This paper provides a matching kernel perspective to understand mapping and pooling structures combined with linear classifiers.

Linear mapping and direct pooling can only capture whether a part appears. To capture high-order relationships between parts, it is better to use local nonlinear matching kernels that represent high-order part interactions (co-occurrence). The difficulty lies in integrating such kernels into the CNN architecture in a sensible way. This paper uses polynomial kernels to model higher-order part interactions.
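
To make the matching-kernel view concrete, here is a minimal NumPy sketch. It is not taken from the paper: the array sizes are arbitrary, and only the degree-2 case is shown. It checks that summing a second-order polynomial kernel over all pairs of locations gives the same value as a plain inner product between sum-pooled outer products, which is why a linear classifier on pooled second-order features can "see" part co-occurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # image A: 6 spatial locations, 4-channel activations (toy sizes)
Y = rng.standard_normal((5, 4))   # image B: 5 spatial locations

# Matching kernel: a degree-2 polynomial kernel summed over all pairs of locations.
kernel_value = ((X @ Y.T) ** 2).sum()

# Explicit form: sum-pool the outer products x x^T over locations,
# then take an ordinary inner product between the two pooled descriptors.
phi_X = (X.T @ X).ravel()
phi_Y = (Y.T @ Y).ravel()
explicit_value = phi_X @ phi_Y

print(np.isclose(kernel_value, explicit_value))
```

The explicit descriptor has 4 x 4 = 16 entries here; with real channel counts it grows quadratically, which is what motivates compact or factorized predictors.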

About Multiscale

Existing studies extensively demonstrate that fusing hierarchical features in neural networks is beneficial, mainly because multiple convolutional layers offer different discriminative capabilities and coarse-to-fine descriptions.

However, existing methods simply concatenate or add multiple activations to represent the whole, or employ decision-level fusion to combine the outputs of different layers. Such methods are limited in exploiting the intrinsic high-order relations of convolutional activations within or between layers.

Kernelized convolutional activations

In this paper, the convolutional filters are regarded as local detectors, and the convolutional activations at each spatial location are regarded as part descriptions. Therefore, this paper introduces a polynomial predictor that integrates a series of local matching kernels to model high-order part interactions.

Matching kernel and polynomial predictor

[Image omitted]

The learning method and the remaining details are not repeated here.

Hierarchical convolutional activations

Higher-order integration using kernel fusion

This section focuses on how the paper handles multiple scales.

[Image omitted]
The angle brackets in the figure above denote the inner product, and the functions inside them denote vectors.

In short, intra-layer and inter-layer fusion are performed on the outputs of different layers, and the formula appears to denote an element-wise operation.

Personal summary

The contributions of this paper are twofold:

  1. Introduction of mixed polynomial kernel;
  2. Scale Fusion Strategy.

Kernel Pooling for Convolutional Neural Networks (by end-to-end feature encoding)

Abstract

Convolutional neural networks with bilinear pooling were first proposed in their full form and later adopted compact representations.

Key to the success of this model is the spatially invariant modeling of pairwise (second-order) feature interactions.

In this paper a general pooling framework is proposed. Higher-order interactions of features are captured in the form of kernels.

This paper demonstrates how to approximate kernels such as Gaussian RBFs to a given order using parameter-free compact explicit feature maps.

Introduction

The concept of interactions between features is widely used as a high-order representation in learning tasks.

The motivation behind this is that subsequent linear classifiers operate on higher-dimensional feature maps, allowing for higher-order interactions. There are generally two ways to create higher-order interactions.

The most common is the kernel trick.

But it has two disadvantages:

  1. Both the required storage and evaluation time are proportional to the amount of training data, making it inefficient on large datasets;
  2. The construction of the kernel makes it difficult to use stochastic learning methods such as SGD.

Another approach is to explicitly map the feature vectors to a high-dimensional space using products of features.

The disadvantage of this method is obvious: if p-th order interactions are computed on a d-dimensional feature vector, the dimension of the feature map reaches O(d^p), which is impractical in real applications.
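
A quick calculation shows how fast this grows. The value d = 512 is only an illustrative channel count chosen for this note, not a number from the paper.

```python
def explicit_dim(d: int, p: int) -> int:
    """Number of entries in the p-fold tensor product of a d-dimensional vector."""
    return d ** p

for p in (1, 2, 3):
    print(p, explicit_dim(512, p))
# 1 512
# 2 262144
# 3 134217728
```

Already at third order the explicit feature has more than 10^8 entries per image, before any classifier weights are counted.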

The former obtains high-dimensional information directly from low-dimensional data without ever forming the high-dimensional features, while the latter computes the high-dimensional features explicitly.

This paper proposes a compact and differentiable method for generating explicit feature maps. In practice, the required circular convolution is computed in the frequency domain using the fast Fourier transform and its inverse. Prior work has shown, both theoretically and empirically, that this method approximates polynomial kernels compactly.
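
The frequency-domain trick relies on the convolution theorem. The following small NumPy check (the function names are made up for this note) compares a direct circular convolution with the FFT route.

```python
import numpy as np

def circular_conv_direct(a, b):
    """O(d^2) circular convolution: out[k] = sum_i a[i] * b[(k - i) mod d]."""
    d = len(a)
    out = np.zeros(d)
    for k in range(d):
        for i in range(d):
            out[k] += a[i] * b[(k - i) % d]
    return out

def circular_conv_fft(a, b):
    """Same result in O(d log d): multiply the spectra, then invert."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

rng = np.random.default_rng(0)
a, b = rng.standard_normal(8), rng.standard_normal(8)
print(np.allclose(circular_conv_direct(a, b), circular_conv_fft(a, b)))
```

Because the FFT, the element-wise product, and the inverse FFT are all differentiable, the operation can sit inside a network trained with backpropagation.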

[Image omitted]
The above figure is the specific implementation of the kernel pooling proposed in this paper. For each position in the feature map, Count Sketch is used to generate a compact feature map.

After applying kernel pooling, the inner product between two features can capture high-order feature interactions. This is equivalent to the formula in the figure below.

[Image omitted]

This makes the subsequent linear classifier highly discriminative.

The final feature vector is the result of global average pooling.

The work in this paper has two contributions:

  1. Propose a general kernel pooling method via compact explicit feature maps;
  2. The proposed kernel pooling method is differentiable and can be combined with CNN for joint optimization.

Kernel Pooling

This paper defines the concept of pooling as the process of encoding and aggregating feature maps into a global feature vector.

AlexNet/VGG adopts the strategy of fully connected layer + ReLU, which has a large amount of calculation and many parameters;

Inception/Residual Learning uses global average pooling, which has a considerable amount of calculation but does not capture high-order feature interactions;

The bilinear model directly generates c^2-dimensional features, corresponding to a second-order polynomial kernel, which are then approximated with Tensor Sketch.

The model proposed in this paper goes beyond the bilinear model and captures high-order feature interactions. First, a Taylor series kernel is defined whose explicit feature map can be approximated compactly. The paper then shows how to use the compact feature projections of Taylor series kernels to approximate commonly used kernels such as the Gaussian radial basis function (RBF).

Explicit feature projection via Tensor product

[Image omitted]
Because the dimensionality of the explicit representation is relatively high, a method for compressing the approximation is needed.

Compact approximation

[Image omitted]

Taylor series kernel

We define the Count Sketch of x as:
[Image omitted]

C(x) is a d-dimensional vector computed with two hash functions, h and s, whose outputs lie in {1, 2, ..., d} and {+1, -1} respectively.

The p-th order tensor product of x can be approximated as:
[Image omitted]

Here the small circle denotes the element-wise product.

Therefore, as the order increases, the total feature dimension increases linearly.
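
Since the formulas above are only available as images, here is a hedged NumPy sketch of the standard Count Sketch / Tensor Sketch construction that this section describes. The sizes c, d, p and the random hash tables are illustrative assumptions, buckets are 0-based rather than 1-based, and using independent hashes per order is a choice of this sketch rather than something confirmed by the text.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """C(x)[j] = sum of s[i] * x[i] over all i with h[i] == j."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)   # unbuffered scatter-add, handles repeated buckets
    return out

def tensor_sketch(x, hs, ss, d):
    """Order-p sketch: element-wise product of the FFTs of p Count Sketches,
    followed by an inverse FFT (equivalent to circular convolution)."""
    spectrum = np.ones(d, dtype=complex)
    for h, s in zip(hs, ss):
        spectrum *= np.fft.fft(count_sketch(x, h, s, d))
    return np.fft.ifft(spectrum).real

rng = np.random.default_rng(0)
c, d, p = 16, 64, 3                                         # input dim, sketch size, max order
hs = [rng.integers(0, d, size=c) for _ in range(p)]         # h: index -> bucket
ss = [rng.choice([-1.0, 1.0], size=c) for _ in range(p)]    # s: index -> sign
x = rng.standard_normal(c)

# Concatenate the sketches of orders 1..p into one descriptor.
feature = np.concatenate([tensor_sketch(x, hs[:q], ss[:q], d) for q in range(1, p + 1)])
print(feature.shape)   # (192,), i.e. p * d
```

The concatenated descriptor has p * d entries, which matches the remark that the total dimension grows linearly with the order instead of as c^p.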

Gaussian RBF kernel

This section mainly shows that the Taylor series kernel can approximate the Gaussian RBF kernel. The details are not repeated here.
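
For readers who want the gist anyway, the following sketch checks the underlying identity numerically. It assumes unit-norm inputs, for which ||x - y||^2 = 2 - 2<x, y>, so exp(-gamma ||x - y||^2) = exp(-2 gamma) * exp(2 gamma <x, y>), and the second factor has an ordinary Taylor expansion in <x, y>. The value gamma = 0.5 and the function names are arbitrary choices for this note.

```python
import math
import numpy as np

def rbf(x, y, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * float(np.sum((x - y) ** 2)))

def rbf_taylor(x, y, gamma, order):
    """Truncated series exp(-2*gamma) * sum_i (2*gamma*<x, y>)^i / i!,
    valid when x and y both have unit L2 norm."""
    t = float(x @ y)
    series = sum((2 * gamma * t) ** i / math.factorial(i) for i in range(order + 1))
    return math.exp(-2 * gamma) * series

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
x /= np.linalg.norm(x)
y = rng.standard_normal(8)
y /= np.linalg.norm(y)
gamma = 0.5
errors = [abs(rbf(x, y, gamma) - rbf_taylor(x, y, gamma, p)) for p in (1, 2, 4, 8)]
print(errors)   # the error shrinks quickly as the order grows
```

Each power <x, y>^i is a polynomial kernel of order i, so it can be approximated by the order-i sketch, with the Taylor coefficients acting as per-order weights.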

Summary

Feature interaction was introduced by B-CNN in 2015 (there was earlier work as well), which is an explicit application of the kernel technique with a polynomial kernel; this paper extends it further to Taylor series kernels.

Therefore, this paper essentially studies how a new kernel technique performs when applied to feature interaction.

Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition (by localization-classification subnetwork)

Abstract

Fine-grained classification tasks are difficult due to the challenges of localizing discriminative regions and learning fine-grained features.

Existing methods mostly address these issues independently, while ignoring the correlation between the two.

In this paper, we propose a novel recurrent attention convolutional neural network that recursively learns discriminative region attention and region-based feature representations at multiple scales in a mutually reinforcing way.

Learning at each scale consists of a classification subnetwork and an attention proposal subnetwork (APN). The APN starts from the full image and iteratively generates region attention from coarse to fine, using the previous predictions as reference. The finer-scale network takes as input an amplified attended region from the previous scale.

Introduction

There are two main challenges for fine-grained recognition, namely:

  1. Localizing discriminative regions;
  2. Learning fine-grained features from those regions.

Previous work has achieved some success by introducing part-based recognition frameworks, which mainly consist of two stages:

  1. Detect possible target regions by analyzing the convolutional responses obtained in the neural network;
  2. Extract discriminative features from each region and encode them into a compressed vector for recognition.

These results are impressive, but further improvement is limited for two reasons:

  1. Human-defined parts, or parts learned with supervision, are not necessarily optimal for machine classification;
  2. Subtle visual differences within local regions are still difficult to learn.

This paper finds that region detection and fine-grained feature learning are correlated and can therefore reinforce each other.

This paper proposes a method that needs no bounding box or part annotations: a recurrent attention convolutional neural network (RA-CNN).

RA-CNN is a stacked network whose inputs range from the full image to fine-grained local regions at multiple scales.

First, the networks at different scales share the same structure but have their own parameters at each scale, to suit inputs of different resolutions.

The learning process at different scales consists of a classification subnetwork and an attention proposal subnetwork (enabling sufficient discriminative power at each scale and generating accurate attention regions for the next finer scale).

Second, a finer-scale network dedicated to high-resolution regions takes the amplified attended regions as input to extract more fine-grained features.

Finally, an intra-scale softmax loss is used to guide the classification network, and an inter-scale pairwise ranking loss is used to guide the attention proposal network.

Approach

[Image omitted]

Attention Proposal Network

Multi-task formulation

Inspired by the Region Proposal Network (RPN), this paper proposes an Attention Proposal Network.

Given an image X, first extract region-based deep features, expressed as W_c * X.

The first task is to generate a probability distribution p, which can be expressed as: p(X) = f(W_c * X)

The function f consists of a fully connected layer that maps the convolutional features to a feature vector, followed by a softmax layer that turns this vector into class probabilities.

The second task is to predict a set of box coordinates for the attended region at the next finer scale. The attended region is approximated by a square with three parameters: [t_x, t_y, t_l] = g(W_c * X),
where t_x and t_y are the center coordinates of the square and t_l is half of its side length.

The function g is implemented as two stacked fully connected layers. Notably, the APN is trained in a weakly supervised manner.

Attention localization and amplification

Once the location of the ROI is hypothesized, we crop the ROI with a higher resolution and zoom in to a finer scale to extract finer-grained features.

To ensure that the APN can be optimized during training, this paper approximates the cropping operation by proposing a 2D boxcar function as an attention mask.

[Image omitted]

Based on the above characterization, the cropping operation can be performed by an element-wise multiplication of an original image at a coarse scale and an attention mask, which can be described as:

[Image omitted]

where
[Image omitted]

When k is large, this function behaves like a step function: it is approximately 1 when x is greater than 0 and approximately 0 when x is less than 0.

This idea is really wonderful!

The boxcar function has two advantages:

  1. It approximates the cropping operation well;
  2. It establishes an analytical relationship between the attended region and the box coordinates.
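
The formulas in this subsection survive only as images, so the following is a hedged NumPy sketch of a boxcar mask built from two smooth steps per axis. It assumes the smooth step is the logistic function 1 / (1 + exp(-k z)) and that t_l is used as the half side length of the square; the image size, the box parameters, and k = 10 are illustrative values.

```python
import numpy as np

def logistic(z, k=10.0):
    """h(z) = 1 / (1 + exp(-k z)); approaches a unit step as k grows."""
    return 1.0 / (1.0 + np.exp(-k * z))

def attention_mask(height, width, tx, ty, tl, k=10.0):
    """Soft indicator of the square centred at (tx, ty) with half side tl."""
    x = np.arange(width)[None, :]    # column coordinates
    y = np.arange(height)[:, None]   # row coordinates
    mx = logistic(x - (tx - tl), k) - logistic(x - (tx + tl), k)
    my = logistic(y - (ty - tl), k) - logistic(y - (ty + tl), k)
    return my * mx                   # shape (height, width)

image = np.ones((32, 32))
mask = attention_mask(32, 32, tx=16.0, ty=12.0, tl=5.0)
cropped = image * mask               # element-wise product as a differentiable crop
print(mask[12, 16].round(3), mask[0, 0].round(3))
```

Because the mask is a smooth function of t_x, t_y and t_l, gradients of the downstream loss can flow back to the box parameters, which a hard crop would not allow.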

Although the attended regions have been localized, it is sometimes still difficult to extract effective feature representations from such highly localized regions. Therefore, the region is enlarged by adaptive zooming. Specifically, this paper uses bilinear interpolation to compute the amplified output from the four nearest inputs in X^{att} through a linear mapping:

[Image omitted]
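
As a rough illustration of the zoom step, here is a generic bilinear resize in NumPy. It is a textbook version written for this note (corner-aligned sampling, 2-D input), not the exact formula from the paper.

```python
import numpy as np

def bilinear_zoom(patch, out_h, out_w):
    """Resize a 2-D patch so that corner pixels map onto corner pixels."""
    in_h, in_w = patch.shape
    ys = np.linspace(0, in_h - 1, out_h)   # source row coordinate of each output row
    xs = np.linspace(0, in_w - 1, out_w)   # source column coordinate of each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # weight of the row below
    wx = (xs - x0)[None, :]                # weight of the column to the right
    top = patch[y0][:, x0] * (1 - wx) + patch[y0][:, x1] * wx
    bottom = patch[y1][:, x0] * (1 - wx) + patch[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

patch = np.arange(16, dtype=float).reshape(4, 4)
zoomed = bilinear_zoom(patch, 7, 7)
print(zoomed.shape)   # (7, 7)
```

Every output value is a weighted average of the four nearest input values, so the enlarged region is still a differentiable function of the masked input.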

Classification and Ranking

The loss function for an image sample is defined as:
[Image omitted]

Y^{s} denotes the predicted label distribution at scale s, and Y^{*} denotes the ground-truth label.

L_cls is the classification loss, which mainly optimizes the parameters of the convolutional and classification layers.

In the pairwise ranking loss, p_t^{s} denotes the predicted probability of the correct label t at scale s. Specifically, this ranking loss can be expressed as:
[Image omitted]

This enforces that the prediction at a finer scale assigns a higher probability to the correct label than the prediction at the preceding coarser scale.
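
A minimal sketch of these two terms, assuming the ranking loss is the usual hinge form max(0, p_coarse - p_fine + margin) and that the classification loss is the negative log-probability of the correct class. The margin value and the example probabilities are made up for illustration.

```python
import math

def pairwise_ranking_loss(p_coarse, p_fine, margin=0.05):
    """Zero only when the finer scale beats the coarser one by at least `margin`."""
    return max(0.0, p_coarse - p_fine + margin)

def total_loss(true_class_probs, margin=0.05):
    """Sum of per-scale classification losses plus ranking losses between adjacent scales."""
    cls = sum(-math.log(p) for p in true_class_probs)
    rank = sum(pairwise_ranking_loss(a, b, margin)
               for a, b in zip(true_class_probs, true_class_probs[1:]))
    return cls + rank

# Probability of the correct class at three scales, coarse to fine (toy values).
probs = [0.60, 0.70, 0.72]
print(pairwise_ranking_loss(probs[0], probs[1]))   # 0.0: gain of 0.10 exceeds the margin
print(pairwise_ranking_loss(probs[1], probs[2]))   # about 0.03: gain of 0.02 is below the margin
print(total_loss(probs))
```

The ranking term is what drives the attention proposal network: it is only satisfied when zooming in actually makes the correct class more confident.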

Summary

The overall framework of this article is a coarse-to-fine recurrent network, and each step outputs the image region that should be fed into the next step.

Noteworthy details include:

  1. A classification loss is computed on the output of each step, and the accuracy of the finer-scale predictions is constrained by the ranking loss;
  2. The cropping operation is approximated by a boxcar function built from smooth step functions, which gives a trainable crop strategy;
  3. The fine-scale image input is enlarged by interpolation.

Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition (by localization-classification subnetwork)

Abstract

Identifying fine-grained categories is highly dependent on discriminative part localization and part-based fine-grained feature learning.

Existing methods mainly address these problems independently while ignoring that part localization and fine-grained feature learning are interrelated.

This paper proposes a novel part learning method based on a multi-attention convolutional neural network (MA-CNN), in which the two reinforce each other.

MA-CNN consists of convolution, channel grouping, and part classification subnetworks. The channel grouping network takes the feature channels output by the convolutional layers as input and generates multiple parts from spatially correlated channels through clustering, weighting, and pooling.

Part classification networks further classify images by each individual part, through which more discriminative fine-grained features can be learned.

This paper also proposes two loss functions to guide channel grouping and part classification.

Novel points in the paper:

  1. Different channels of the feature map carry different visual information and have different peak response areas; channels with similar response areas are clustered to obtain part attention maps;
  2. The paper proposes a channel grouping loss whose purpose is to make distances within a part small (intra-class similarity) and distances between different parts as large as possible (inter-class separability).

Introduction

The main problem with localization-classification based methods is that they rely on manual local annotations and such annotations are not necessarily good.

The current solution is divided into two paths, the first is to directly extract more refined features, and the second is to learn weakly supervised local detection.

The paper argues that, without explicit part constraints, the discriminative ability of category-level CNNs limits the performance of part localization and feature learning (this refers to the second path). Furthermore, it finds that part localization and fine-grained feature learning are interrelated and can reinforce each other.

For example, an initial head localization promotes the learning of specific patterns around the head, which in turn helps refine the head localization.

In this paper, we propose a part learning approach based on a multi-attention convolutional neural network for fine-grained recognition without bounding box or part annotations.

First, each convolutional feature channel is associated with a specific visual pattern. The channel grouping subnetwork therefore clusters and weights spatially correlated patterns into part attention maps, drawing on channels whose peak responses occur at neighboring locations. The diverse high-response locations yield multiple part attention maps, from which part proposals of a fixed size can be cropped.

Then, once the part proposals are obtained, the part classification network further classifies images based on part features. These features are extracted from fully convolutional feature maps. Such a design removes the dependence on other parts and specifically optimizes the set of feature channels related to a given part.

Finally, two loss functions are jointly used to guide the multi-task learning of channel grouping and part-based classification, which motivates MA-CNN to generate discriminative parts from feature channels and to learn finer-grained features from those parts in a mutually reinforcing manner.

Approach

[Image omitted]

  1. The whole picture is sent to CNN to extract local feature representation;
  2. Generate multiple local attention maps (e) through channel grouping and weighting layers, and then use the sigmoid function to generate probabilities;
  3. Use the spatial attention mechanism to pool the local-based feature representation to obtain the feature (f);
  4. Finally, the probability score is calculated for each feature through the fully connected layer and the softmax layer.

What we need to focus on is how to get the local attention map and how to pool it.

Multi-Attention CNN for Part Localization

Although a convolutional feature map can correspond to a certain type of visual pattern, it is difficult to represent rich local information with only a single channel.

Therefore, this paper proposes a channel grouping and weighting subnetwork to cluster spatially correlated subtle patterns for compact and discriminative localization. In a nutshell, the process finds groups of channels whose peak responses fall at neighboring locations.

Each channel can be represented as a position vector whose elements are the coordinates of its peak responses over all training images, expressed as: [t_x^1, t_y^1, ..., t_x^{\Omega}, t_y^{\Omega}]

Each pair (t_x^i, t_y^i) gives the coordinates of the peak response on the i-th image, and \Omega is the size of the training set.

We can use the position vectors as features and cluster the channels into N groups, which act as N part detectors.

[Image omitted]
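
The clustering idea can be sketched in NumPy as follows. The toy feature maps, the tiny k-means routine, and its initialization from channels 0 and 3 are assumptions made for this note; the text above does not say which clustering algorithm is used.

```python
import numpy as np

def peak_position_vectors(feature_maps):
    """feature_maps: (num_images, channels, H, W).
    Returns (channels, 2 * num_images): [tx^1, ty^1, ..., tx^Omega, ty^Omega] per channel."""
    n, c, h, w = feature_maps.shape
    flat_idx = feature_maps.reshape(n, c, h * w).argmax(axis=2)   # (n, c)
    ty, tx = np.divmod(flat_idx, w)                               # row, column of each peak
    coords = np.stack([tx, ty], axis=2)                           # (n, c, 2)
    return coords.transpose(1, 0, 2).reshape(c, 2 * n).astype(float)

def kmeans(points, centers, iters=10):
    """Plain k-means with given initial centers; returns one label per point."""
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        centers = np.stack([points[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(len(centers))])
    return labels

rng = np.random.default_rng(0)
maps = rng.random((4, 6, 8, 8)) * 0.1    # 4 images, 6 channels, 8x8 maps (toy data)
maps[:, :3, 1, 1] = 1.0                  # channels 0-2 peak near the top-left
maps[:, 3:, 6, 6] = 1.0                  # channels 3-5 peak near the bottom-right
vectors = peak_position_vectors(maps)
labels = kmeans(vectors, centers=vectors[[0, 3]].copy())
print(labels)                            # [0 0 0 1 1 1]
```

Channels whose peaks consistently land in the same place across images end up in the same group, and each group then plays the role of one part detector.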

To ensure that the channel grouping can be optimized during training, this paper proposes channel grouping layers that approximate this grouping with fully connected layers, which regress the permutation over channels.

We define a set of fully connected layers F = [f_1, ..., f_N], each of which takes convolutional features as input and outputs channel weights.
[Image omitted]

In this paper, the grouping result d_i(X) is obtained in two steps:

  1. Pre-training parameters in formula (3), which is supervised learning by combining formula (2);
  2. Optimize through end-to-end learning.

That is, pre-training followed by fine-tuning.

Then, the attention map of each part is represented as follows:
[Image omitted]

[·]_j denotes the j-th feature channel, and the result is a part attention map. That is, each channel is multiplied by its weight and the results are summed to obtain the following weight map:
[Image omitted]

The final feature representation of part i is obtained by element-wise multiplication of each channel with the weight map, and finally summing the weighted channel maps:

[Image omitted]

M_i(X) marks the region that deserves attention. For example, if the head area is 1 and everything else is 0, the multiplication extracts exactly the convolutional features of the head area.
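
A hedged NumPy sketch of this attention-weighted pooling is given below. Several details are assumptions of this note because the formulas are only available as images: the attention map is taken to be a sigmoid of the weighted channel sum normalised to sum to 1, and the weighted channel maps are reduced over spatial positions to give one value per channel. The channel weights are hypothetical.

```python
import numpy as np

def part_attention(features, channel_weights):
    """features: (C, H, W) conv activations; channel_weights: (C,) grouping weights.
    Returns an (H, W) map: sigmoid of the weighted channel sum, normalised to sum to 1."""
    weighted_sum = np.tensordot(channel_weights, features, axes=1)   # (H, W)
    attention = 1.0 / (1.0 + np.exp(-weighted_sum))
    return attention / attention.sum()

def part_feature(features, attention):
    """Weight every channel by the attention map and sum over space, giving (C,)."""
    return (features * attention[None, :, :]).sum(axis=(1, 2))

rng = np.random.default_rng(0)
features = rng.random((6, 5, 5))                      # toy activations: 6 channels on a 5x5 grid
weights = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])    # hypothetical group made of channels 0 and 1
attention = part_attention(features, weights)
feature_vec = part_feature(features, attention)
print(attention.shape, feature_vec.shape)             # (5, 5) (6,)
```

With all-zero weights the attention becomes uniform and the part feature reduces to plain global average pooling, which is a useful sanity check on the implementation.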

Multi-task Formulation

Loss function

The loss function is defined as:
[Image omitted]

Among them, the channel grouping loss is as follows:

[Image omitted]

Dis is the distance function and Div is the diversity function. The Dis function encourages a compact distribution, and its specific form is as follows:
[Image omitted]
t_x and t_y are the coordinates of the peak response on the attention map; positions with a large response should therefore lie close to this peak.

The purpose of Div is to encourage diversity between the attention maps of different parts:
[Image omitted]
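
Since the exact formulas are not reproduced here, the following NumPy sketch shows one plausible form of the two terms, consistent with the description above: Dis weights each response by its squared distance to the peak, and Div penalises a part for responding where another part responds strongly. The margin value and the toy maps are assumptions.

```python
import numpy as np

def dis_loss(m):
    """Sum of m(x, y) * squared distance to the peak of m; small when m is compact."""
    ty, tx = np.unravel_index(m.argmax(), m.shape)
    ys, xs = np.indices(m.shape)
    return float((m * ((xs - tx) ** 2 + (ys - ty) ** 2)).sum())

def div_loss(maps, i, margin=0.02):
    """Sum of m_i(x, y) * (max over the other parts of m_k(x, y) - margin);
    small when part i responds where the other parts do not."""
    others = np.delete(maps, i, axis=0).max(axis=0)
    return float((maps[i] * (others - margin)).sum())

compact = np.zeros((8, 8))
compact[1, 1] = 1.0                       # all response at a single position
spread = np.full((8, 8), 1.0 / 64)        # response smeared over the whole map
other = np.zeros((8, 8))
other[5, 5] = 1.0                         # a second part, far from the first

print(dis_loss(compact), dis_loss(spread))
print(div_loss(np.stack([compact, other]), 0), div_loss(np.stack([compact, compact]), 0))
```

A compact map gets a lower Dis value than a spread-out one, and two parts that attend to different places get a lower Div value than two parts that coincide.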

Alternative optimization

This paper learns in a mutually reinforcing way.

First, the convolutional layers are fixed, and stage (d) in the figure is optimized with formula (7). The purpose of this step is to locate the parts.

Then, the channel grouping layers are fixed and the classification loss is used to train stage (b) in the figure, aiming at fine-grained feature learning.

The two steps alternate until the loss no longer changes.

Joint Part-based Feature Representation

We already know that each channel group pools one part feature, so the final representation concatenates these part features with the global feature to get the final integrated feature (research shows that this is beneficial).

Summary

The idea of this article is quite satisfactory, a bit similar to the end-to-end methods of 2015. Taking the meaning of each CNN channel as the starting point, it designs a channel grouping layer and obtains the features of each part by pooling, then pieces them together into the final feature.

Points worth noting in this article:

  1. Dedicated loss functions are still set for both classification and part feature acquisition;
  2. Its alternating training strategy is very clever.

Low-rank Bilinear Pooling for Fine-Grained Classification (by end-to-end feature encoding)

Abstract

Statistical pooling of second-order local features to form a high-dimensional bilinear feature has shown state-of-the-art performance.

To address the computational demands of high dimensions, this paper proposes to represent the covariance features as a matrix and apply a low-rank bilinear classifier.

The resulting classifier can be evaluated without explicitly computing the bilinear feature map, which greatly reduces computation time and the number of parameters that need to be learned.
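
Why no explicit bilinear feature is needed can be seen from a small identity. The sketch below assumes, for illustration, a classifier matrix of the form W = U_pos U_pos^T - U_neg U_neg^T with low-rank factors; the factor names, the rank, and the sizes are hypothetical and not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n, r = 32, 49, 4                    # channels, spatial locations, rank (illustrative)
X = rng.standard_normal((c, n))        # local features of one image, one column per location
U_pos = rng.standard_normal((c, r))    # hypothetical factor of the positive part
U_neg = rng.standard_normal((c, r))    # hypothetical factor of the negative part

# Explicit route: build the c x c bilinear feature and the c x c classifier matrix.
bilinear_feature = X @ X.T
W = U_pos @ U_pos.T - U_neg @ U_neg.T
score_explicit = np.sum(W * bilinear_feature)

# Low-rank route: project the local features first; nothing of size c x c is formed.
score_low_rank = np.sum((U_pos.T @ X) ** 2) - np.sum((U_neg.T @ X) ** 2)

print(np.isclose(score_explicit, score_low_rank))
```

The low-rank route only needs r x n projections, so both the computation and the parameter count scale with c * r instead of c^2.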

To further compress the model, we propose a classifier co-decomposition that factorizes the collection of bilinear classifiers into a common factor and compact per-class terms. The co-decomposition can be implemented with two convolutional layers and trained end to end.

This paper also proposes a simple and efficient initialization method that avoids explicitly first training and factorizing larger bilinear classifiers.

Fine-Grained Recognition as HSnet Search for Informative Image Parts (by localization-classification subnetwork)

Abstract

Our work is based on the assumption that when dealing with subtle differences between object classes, it is critical to identify and interpret the few informative image parts, because the rest of the image context may not only be uninformative but may also impair recognition.

This motivates us to formulate the problem as a sequential search for informative parts on the deep feature maps generated by a deep convolutional neural network.

A state of this search is a set of proposal bounding boxes in the image, whose informativeness is evaluated by the H function, and new candidate boxes are generated by the S function.

These two functions are unified through an LSTM into a new deep recurrent structure called HSnet.

Therefore, HSnet generates proposals for informative image parts and fuses all proposals for the final fine-grained recognition. This paper specifies both supervised and weakly supervised training of HSnet, depending on the availability of part annotations.

Origin blog.csdn.net/weixin_46365033/article/details/127840354