Research papers on fine-grained image classification (2019)

Cross-X Learning for Fine-Grained Visual Categorization (by end-to-end)

Abstract

Current work tackles the fine-grained image classification problem in a weakly supervised manner: object parts are first detected, and then the corresponding part-specific features are extracted for fine-grained classification.

However, these methods usually handle the part-specific features of each image in isolation, ignoring the relationships between them.

This paper proposes Cross-X learning, a simple yet effective approach that exploits the relationships between different images as well as between different network layers to achieve robust multi-scale feature learning.

Our approach consists of two new components:

  1. A cross-category cross-semantic regularizer (C3S) that guides the extracted features to represent semantic parts;
  2. A cross-layer regularizer (CL) that improves the robustness of multi-scale features by matching the predicted distributions of multiple layers.

Introduction

There are two main methods of weakly supervised FGVC. One is to use the relationship between fine-grained labels to adjust feature learning; the other is to locate discriminative parts for feature extraction.

Compared with label relationship-based methods, localization-based methods have the advantage of extracting fine-grained features locally.

Early localization-based methods generally first train a DCNN and exploit its learned internal representations to obtain part detectors, and then extract part features for fine-grained classification.

More recent approaches merge these two stages into an end-to-end learning framework, using the final objective to jointly optimize part localization and fine-grained classification.

However, these methods define semantic parts independently in each image and ignore the relationships between part-specific features across different images. An exception is "Multi-attention multi-class constraint for fine-grained image recognition" (MAMC), which proposes a soft-attention-based model to explore the relationships between parts.

This paper proposes Cross-X learning, which exploits the relationships between different images and between different network layers to achieve robust fine-grained recognition. Similar to MAMC, the method first generates attention-region features through multiple excitation modules, but it further introduces two new components: the cross-category cross-semantic regularizer (C3S) and the cross-layer regularizer (CL).

C3S guides the attention features from different excitation modules to represent different semantic parts.

Specifically, attention features of the same semantic part should ideally be more correlated than attention features of different semantic parts; the intended effect is shown in the orange box of the figure below.
[Figure omitted]

C3S therefore works by maximizing the correlation between attention features extracted by the same excitation module, while decorrelating attention features extracted by different excitation modules.

At the same time, this paper exploits the relationship between different network layers for robust multi-scale feature learning.

To further improve the robustness of multi-scale features, this paper introduces the CL regularizer, which matches the predicted distributions of mid-level and high-level features by minimizing their KL divergence.

Approach

Cross-X learning consists of two parts: C3S, which learns semantic part features by exploiting correlations between different images, and CL, which learns robust features by matching prediction distributions between different layers.

The overview looks like this:

[Figure omitted: framework overview]

Preliminaries

The OSME part will not be repeated.

Although the OSME module can generate attention-specific features, it is challenging to guide these features to carry semantics.

The original paper solves this problem by optimizing a metric loss that pulls attention features from the same branch closer and pushes attention features from different branches apart. However, optimizing such a loss remains challenging and requires a sample-selection process.

Cross-Category Cross-Semantic Regularizer

Instead of optimizing a metric loss, this paper proposes to learn semantic features by exploring the correlations between feature maps across different images and different excitation modules.

We want features extracted by the same excitation module to share the same semantics even if they come from different images with different class labels, and features extracted by different excitation modules to carry different semantics even if they come from the same image.

From OSME we obtain the attention feature $U_p$. Global average pooling is applied to obtain the corresponding pooled feature $f_p \in \mathbb{R}^C$, followed by $\ell_2$ normalization. A matrix is then formed for all pairs of excitation modules:

[Equation omitted: similarity computed over all pairs of excitation modules]

where $F_p=[f_{p,1},\ldots,f_{p,N}]\in\mathbb{R}^{C\times N}$ and $N$ is the number of samples.

[Equation omitted: the C3S loss]

This feels odd to me: if the two $F$ matrices come from different excitation modules, the diagonal entries represent the same sample under different excitation modules. To make the features of different excitation modules carry different semantics, shouldn't those entries be made less correlated?
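To make the C3S idea concrete, here is a minimal PyTorch sketch of such a regularizer, written from the description above rather than from the authors' code; the tensor shapes and the exact penalty form are my assumptions.

```python
import torch
import torch.nn.functional as F

def c3s_regularizer(features):
    """C3S-style regularizer (sketch, not the authors' exact loss).

    features: (P, N, C) tensor of globally average-pooled attention
    features from P excitation modules over a batch of N samples.
    Encourages high correlation within a module and decorrelation
    across modules.
    """
    P, N, C = features.shape
    f = F.normalize(features, dim=-1)        # l2-normalize, so dot = cosine
    loss = 0.0
    for p in range(P):
        for q in range(P):
            corr = f[p] @ f[q].t()           # (N, N) cosine similarities
            if p == q:
                loss = loss + (1.0 - corr).mean()   # pull same module together
            else:
                loss = loss + corr.abs().mean()     # push different modules apart
    return loss / (P * P)

# toy usage: 2 excitation modules, batch of 8, 512-d pooled features
feats = torch.randn(2, 8, 512, requires_grad=True)
print(c3s_regularizer(feats))
```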

Cross-Layer Regularizer

Using the semantic features of different convolutional layers in a CNN is a natural choice. The most straightforward strategy for fine-grained recognition is to combine the predictions of different layers as the final prediction, but the experiments in this paper show that this does not work.

This paper believes that there are two reasons for this:

  1. Mid-level features are more sensitive to input variations, which makes them less robust to fine-grained recognition under large intra-class variations;
  2. Relationships between feature predictions are not exploited.

This paper adopts a feature pyramid network (FPN) to integrate features from different layers, and proposes CL to learn robust features by matching prediction distributions between layers.

Its operation is shown in Figure 1, where $U^G$ integrates the fine spatial resolution of the middle layers and the rich high-level semantics of the top layer.

Afterwards, the KL divergence between the predicted distributions of different layers is minimized to exploit the relationship between their predictions:

$$D_{KL}\!\left(p^{L}\,\middle\|\,p^{L-1}\right)=\sum_{k}p^{L}(k)\log\frac{p^{L}(k)}{p^{L-1}(k)}$$

The same can be done between $U^L$ and $U^G$.

The CL regularizer can be viewed as a kind of knowledge distillation, which uses the soft targets from $U^L$ to influence $U^{L-1}$ and $U^G$.
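A minimal sketch of the CL regularizer as described, assuming it is the standard distillation-style KL loss with the top stream $U^L$ as the (detached) teacher:

```python
import torch
import torch.nn.functional as F

def cl_regularizer(logits_student, logits_teacher):
    """Cross-layer (CL) regularizer (sketch).

    Matches the predicted class distribution of a lower stream
    (e.g. U^{L-1} or U^G) to the soft targets of the top stream U^L
    by minimizing the KL divergence, as in knowledge distillation.
    The teacher distribution is detached so gradients only flow
    into the student stream.
    """
    log_p_student = F.log_softmax(logits_student, dim=1)
    p_teacher = F.softmax(logits_teacher, dim=1).detach()
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# toy usage: batch of 4, 200 classes (e.g. CUB-200)
s = torch.randn(4, 200, requires_grad=True)
t = torch.randn(4, 200)
print(cl_regularizer(s, t))
```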

Optimization

Our final predictions are obtained by combining three feature maps:

[Equation omitted: the final prediction combines the three streams]

Besides the excitation-module loss and the cross-layer distillation terms, the loss function also includes the classification loss on the data:

[Equation omitted: overall loss]

Learning a Mixture of Granularity-Specific Experts for Fine-Grained Categorization (by localization-classification subnetwork)

Abstract

The purpose of this paper is to partition the problem space of fine-grained recognition into specific subspaces. To achieve this, the paper develops a unified framework based on a mixture of experts. Because the data available for fine-grained recognition is limited, it is infeasible to learn different experts with a data-partitioning strategy. To address this, the paper combines a progressive expert-enhancement learning strategy with a Kullback-Leibler divergence based constraint to promote diversity among experts.

These methods drive experts to learn tasks from different aspects, enabling them to deal with different subspace problems.

Introduction

This paper proposes a unified framework based on a mixture of neural-network experts (ME).

ME usually follows a divide-and-conquer strategy. Early ME approaches often train each expert on a distinct subset of the data, but because each subset is small, this often leads to overfitting.

To overcome the difficulty of learning different experts from limited data, this paper introduces a gradual boosting strategy along with KL divergence constraints to encourage diversity among experts.

The main idea of the progressive enhancement strategy is that each new expert learns with additional knowledge obtained from the previous experts, and is therefore more specialized.

The key question, then, is how to transfer task-related knowledge to the later experts.

This paper selects attention maps from the convolutional model as the carrier of knowledge, because they show how the network associates certain image regions with the target task.

Another way to promote expert diversity is to penalize similarity between their probability distributions, which can be achieved simply by maximizing the KL divergence between experts.

Related Works

Mixture of experts is built on the divide-and-conquer principle, and so far it has only preliminary applications in FGVC.

Approach

Our method consists of several experts and a gating network; the gating network combines the experts to make the final decision.

This paper designs the experts based on two principles.

One is that good fine-grained recognition requires learning a representation that captures detailed information. To this end, both large-part and small-part features are extracted, and each expert makes decisions based on their combination.

The second is that an expert can generate prior knowledge for building the next expert.

[Figure omitted: framework overview]

Experts for Fine-Grained Recognition

First, we need to build a powerful feature extractor.

For expert $E_t$, a deep convolution module with global average pooling extracts the feature $f_g^t$ from the large-part region, and a shallow convolution module with global max pooling extracts the feature $f_l^t$ from the small-part region.

By applying different global pooling methods, they will learn different kinds of features from the same image.

The final joint feature $f^t$ is obtained by:

[Equation omitted: combination of $f_g^t$ and $f_l^t$]

The expert classification loss mainly includes two auxiliary losses and one decision loss:

[Equation omitted: expert classification loss]

All losses are cross-entropy losses. For each sample, a loss is computed on each of the three feature outputs, and the total is averaged over the number of samples.

Each later expert learns from the data using prior information from the previous expert, transferred via gradient-based attention. The attention maps are constructed following Grad-CAM, which uses the gradient information of a chosen convolutional layer to measure the importance of each neuron for the decision.

Grad-CAM

Grad-CAM draws a heat map (region of interest) of the network, which helps study what the network attends to.

Its overall process is as follows:

[Figure omitted: Grad-CAM pipeline]

After the classification score is obtained, backpropagation gives the contribution of each position of the feature map A to the result. Global average pooling over each gradient channel yields the weights; the feature maps are then compressed into a single map as a weighted sum; finally, ReLU and normalization are applied to obtain the heat map.
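Since this procedure is standard Grad-CAM, it can be sketched directly in PyTorch; hooking `layer4` of a torchvision ResNet is my choice for illustration, and the ReLU on the gradients follows this paper's modification described below.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, image, target_class=None):
    """Minimal Grad-CAM sketch for a torchvision ResNet.

    image: (1, 3, H, W) tensor. Returns an (H, W) heat map in [0, 1].
    """
    feats = {}

    def hook(module, inp, out):
        feats["A"] = out                                 # feature maps A
        out.register_hook(lambda g: feats.__setitem__("dA", g))

    h = model.layer4.register_forward_hook(hook)         # last conv block
    scores = model(image)
    if target_class is None:
        target_class = scores.argmax(dim=1).item()       # predicted label at test time
    scores[0, target_class].backward()
    h.remove()

    A, dA = feats["A"][0], feats["dA"][0]                # (K, h, w) each
    # standard Grad-CAM: weights = dA.mean(dim=(1, 2))
    # this paper applies ReLU to the gradients before pooling:
    weights = F.relu(dA).mean(dim=(1, 2))
    cam = F.relu((weights[:, None, None] * A).sum(dim=0))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return F.interpolate(cam[None, None], size=image.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]

model = models.resnet18(weights=None).eval()
heat = grad_cam(model, torch.randn(1, 3, 224, 224))
print(heat.shape)  # torch.Size([224, 224])
```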

We now return to this paper.

The weight calculation for category c and the kth channel is as follows:

$$w_k^c=\frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A^k_{ij}}$$

Before global pooling, perform ReLU operation:

$$w_k^c=\frac{1}{Z}\sum_{i}\sum_{j}\mathrm{ReLU}\!\left(\frac{\partial y^c}{\partial A^k_{ij}}\right)$$

Thus, the attention map of expert $E_t$ can be expressed as:
$$M_t=\sum_{k}w_k^c A^k$$

After obtaining the attention map, it is further normalized by scaling to [0, 1]:

$$\tilde{M}_t=\frac{M_t-\min(M_t)}{\max(M_t)-\min(M_t)}$$

The attention map is then upsampled to the input image size.

In the training phase, we backpropagate from the ground-truth class to compute the attention map; in the testing phase, since the ground truth is not accessible, the predicted label is used instead.

Next, we describe how the prior knowledge of the previous expert is transferred to the next expert.

Given the attention map (heat map), this paper uses a weakly supervised localization technique to construct the input for the next expert. To include more effective regions rather than only the detected partial regions, the largest connected region is covered with a bounding box, and the image is cropped according to the bounding-box coordinates.

The idea of this part is essentially the same as in RA-CNN.
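A sketch of this crop step, assuming the attention map has already been normalized to [0, 1] and upsampled to the image size; the 0.5 threshold is my choice, not from the paper.

```python
import numpy as np
from scipy import ndimage

def crop_from_attention(image, attn, threshold=0.5):
    """Sketch: build the next expert's input from an attention map.

    image: (H, W, 3) array; attn: (H, W) attention map scaled to [0, 1].
    Thresholds the map, keeps the largest connected region, and crops
    the image with that region's bounding box.
    """
    mask = attn >= threshold
    labeled, num = ndimage.label(mask)                  # connected components
    if num == 0:
        return image
    sizes = ndimage.sum(mask, labeled, range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1                 # label of biggest region
    ys, xs = np.where(labeled == largest)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1]

img = np.random.rand(224, 224, 3)
att = np.random.rand(224, 224)
print(crop_from_attention(img, att).shape)
```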

KL-Divergence

To promote diversity among experts, this paper introduces a constraint based on KL divergence to penalize experts who produce the same probability distribution on input images.

The KL divergence is expressed as follows:

$$D_{KL}\!\left(P_i\,\middle\|\,P_j\right)=\sum_{k}P_i(k)\log\frac{P_i(k)}{P_j(k)}$$

Due to limited training data, each expert tends to make a very confident prediction, resulting in a one-hot vector.

Such results do not reflect the model's description of the intrinsic structure of the data. Therefore, the maximum element is removed and the rest is renormalized into a new distribution, which better reflects how the model describes the data. Maximizing the KL divergence between two such distributions is then equivalent to encouraging the two models to describe the data differently.

Specifically, the model does the following operations:

First build a binary mask:

[Equation omitted: a binary mask that removes the class element of the distribution]

This can also be seen as a gating operation to select distributions for optimization.

Then, the constraints based on KL divergence are as follows:

[Equation omitted: the KL-divergence-based constraint]

Then, the loss function can be expressed as:

[Equation omitted: expert loss]

In other words, outside the class element, the gap between the experts' distributions should be as large as possible.
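Putting the mask, renormalization, and KL constraint together, here is a hedged PyTorch sketch for a pair of experts; taking the masked element to be the ground-truth class (which a confident expert also predicts as the maximum) follows my reading of the text above.

```python
import torch
import torch.nn.functional as F

def diversity_loss(logits_a, logits_b, labels):
    """Sketch of the KL-based diversity constraint between two experts.

    Removes the ground-truth class element from each predicted
    distribution, renormalizes the rest, and returns the negative KL
    divergence, so minimizing this loss maximizes the divergence.
    """
    pa, pb = F.softmax(logits_a, dim=1), F.softmax(logits_b, dim=1)
    mask = torch.ones_like(pa)
    mask.scatter_(1, labels[:, None], 0.0)              # zero out the gt class
    pa = (pa * mask) / (pa * mask).sum(dim=1, keepdim=True)
    pb = (pb * mask) / (pb * mask).sum(dim=1, keepdim=True)
    kl = (pa * (torch.log(pa + 1e-8) - torch.log(pb + 1e-8))).sum(dim=1)
    return -kl.mean()

logits1 = torch.randn(4, 200, requires_grad=True)
logits2 = torch.randn(4, 200, requires_grad=True)
labels = torch.randint(0, 200, (4,))
print(diversity_loss(logits1, logits2, labels))
```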

Mixture of Experts

The final optimization goal is as follows:

[Equation omitted: the final objective]

The third term is the gating loss function, which is used to learn the gating network:
[Equation omitted: gating loss]

Summary

The overall framework is essentially the same as RA-CNN, except that the attention comes from the heat maps extracted by Grad-CAM.

The starting point of this paper is to advance ME methods, focusing on solving the overfitting problem of ME.

Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition (TASN) (by localization-classification subnetwork)

Abstract

Previous attention-based methods are limited by two major problems: the large number of parts and the computational cost. This paper proposes a new model, TASN:

  1. Trilinear attention module: generates attention maps by modeling inter-channel relationships (localizes details);
  2. Attention-based sampler: highlights the attended parts at high resolution (extracts details);
  3. Feature extractor: distills part features into object-level features through weight-sharing and feature-preserving strategies (optimizes details).

Introduction

Most efforts in existing methods have focused on learning to better represent such subtle and discriminative details.

Existing attention-based or part-based methods try to solve this problem by learning part detectors, cropping and amplifying the parts, and concatenating the part features for recognition.

There are still some issues:

  1. The number of attention regions is limited and predefined, which limits the effectiveness and flexibility of the model;
  2. Due to the lack of part annotations, it is difficult to learn multiple consistent attention maps (i.e., attending to the same part in every sample);
  3. Learning a separate CNN for each part is not efficient.

The TASN proposed in this paper learns details from hundreds of part proposals and efficiently distills the learned features into a single network.

TASN consists of three modules: a trilinear attention module, an attention-based sampler, and a feature extractor.

First, the trilinear attention module takes feature maps as input and generates attention maps via a self-trilinear product, which pools the feature channels according to their relation matrix. Since each channel of the feature map is converted into an attention map, hundreds of part proposals can be extracted. (What is the relation matrix? Why can part proposals be extracted?)

Second, the attention-based sampler takes an attention map and an image as input and "highlights" the attended parts at high resolution. In each iteration, it generates a detail-preserved image based on a randomly selected attention map and a structure-preserved image based on the average attention map. The former is for learning part-specific fine-grained features, while the latter captures the global structure and contains all the important details. (What does high resolution mean here? How is the highlighting done?)

Finally, the local network and the main network can be viewed as a teacher-student pair: the local network learns fine-grained features from the detail-preserved image and distills the learned features into the main network.

The main network takes the structure-preserved image as input and refines specific parts in each iteration. This refinement is achieved through weight-sharing and feature-preserving strategies. (How are the two strategies implemented?)

It is worth noting that this paper adopts knowledge distillation instead of simply concatenating part features, because the parts are numerous and not predefined.

The advantages of these approaches are:

  1. Stochastic detail optimization becomes feasible, which makes it practical to learn details from hundreds of proposals;
  2. Inference is efficient, since only the main network is needed for recognition at test time.

This work is the first attempt to learn fine-grained features from hundreds of local proposals.

Method

[Figure omitted: TASN overview]

Details Localization by Trilinear Attention

Each channel of a convolutional feature map corresponds to a visual pattern; however, such feature maps cannot be used directly as attention maps because they lack consistency and robustness.

Influenced by the channel grouping network in 2017, this paper transforms feature maps into attention maps by pooling feature channels according to their spatial structure.

This process is achieved through a trilinear formulation, hence the name trilinear attention module.

Before computing the attention maps, this paper first uses a CNN for feature extraction. To obtain high-resolution feature maps, the two downsampling operations of ResNet-18 are removed. In addition, to improve the robustness of the convolutional responses, two sets of dilated convolutions with multiple dilation rates are added to enlarge the receptive field.

First, the input is reshaped: $\mathbb{R}^{c\times h\times w}\to\mathbb{R}^{c\times hw}$.

Then, the trilinear function can be described as follows:
$$M(X)=(XX^\top)X$$

where $XX^\top$ is a bilinear feature that represents the spatial relationships between channels.

To make the features more consistent and robust, this paper further performs the multiplication above, obtaining the trilinear attention maps.

[Figure omitted]

In fact, the idea is similar to the channel grouping network, but this function is realized through the trilinear formula.

This paper further applies normalization to improve the effectiveness of trilinear attention:

$$M(X)=\mathcal{N}(\mathcal{N}(XX^\top)X)$$

where $\mathcal{N}$ denotes softmax normalization. The first $\mathcal{N}$ keeps every channel at the same scale; the second is a relational normalization.

Finally, $M$ is reshaped back to the original 3-D shape.
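A small PyTorch sketch of the trilinear attention following the formula above; exactly where the two softmax normalizations are applied is my reading of the description, not the authors' code.

```python
import torch
import torch.nn.functional as F

def trilinear_attention(X):
    """Sketch of trilinear attention M(X) = N(N(XX^T)X).

    X: feature maps of shape (c, h, w). Returns attention maps of the
    same shape, one per channel.
    """
    c, h, w = X.shape
    x = X.reshape(c, h * w)
    x = F.softmax(x, dim=1)              # spatial normalization of each channel
    rel = F.softmax(x @ x.t(), dim=1)    # (c, c) inter-channel relation matrix
    M = rel @ x                          # pool channels by their relations
    return M.reshape(c, h, w)

A = trilinear_attention(torch.randn(512, 28, 28))
print(A.shape)  # torch.Size([512, 28, 28])
```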

Details Extraction by Attention Sampling (Important)

By sampling different attention maps, different types of maps can be obtained:

$$\text{structure-preserved: } S(I, A(M)),\qquad \text{detail-preserved: } S(I, R(M))$$
where $S$ is the non-uniform sampling function, $A$ is average pooling across channels, and $R$ randomly selects one channel from the input.

The sampling process can be described as follows:

[Figure omitted: the sampling process, panels (a)–(e)]

Given the attention map in (a), we first decompose it into two one-dimensional distributions by taking the maximum over the x-axis (b1) and the y-axis (b2). The integrals of (b1) and (b2) are then computed, shown in (c1) and (c2). We further invert (c1) and (c2) numerically: sampling points uniformly along one axis and reading off the corresponding values along the other, as indicated by the red and blue arrows in (c1) and (c2).

Finally, we sample the image at the intersections of the blue dotted lines to get (e); the region the attention map focuses on has clearly been enlarged!
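This inverse-transform view of the sampling can be sketched in a few lines of NumPy for a single-channel image; rounding to the nearest pixel rather than interpolating is a simplification of the paper's sampler. High-attention rows and columns occupy more of the cumulative distribution, so uniform steps land more densely there.

```python
import numpy as np

def attention_sample(image, attn, out_size=224):
    """Sketch of attention-based non-uniform sampling (1 channel, numpy).

    image: (H, W) array, attn: (H, W) non-negative attention map.
    Regions with high attention receive more sample points, so they
    appear enlarged in the output.
    """
    H, W = attn.shape
    # (b) decompose the map into two 1-D distributions via max-projection
    px = attn.max(axis=0) + 1e-6
    py = attn.max(axis=1) + 1e-6
    # (c) integrate to cumulative distributions, normalized to [0, 1]
    cx = np.cumsum(px) / px.sum()
    cy = np.cumsum(py) / py.sum()
    # (d) invert the CDFs: uniform steps map back to source coordinates,
    # densely where the attention is high
    u = np.linspace(0, 1, out_size)
    xs = np.interp(u, cx, np.arange(W)).round().astype(int)
    ys = np.interp(u, cy, np.arange(H)).round().astype(int)
    # (e) sample the image at the grid intersections
    return image[np.ix_(ys, xs)]

img = np.random.rand(448, 448)
att = np.zeros((448, 448)); att[150:250, 150:250] = 1.0   # attend to a block
print(attention_sample(img, att).shape)  # (224, 224)
```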

Details Optimization by Knowledge Distilling

This part takes the detail-preserved image and the structure-preserved image as input, and transfers the learned details from the local network to the main network by knowledge distillation.

The two images are first fed through ResNet-50, a fully connected layer, and softmax to obtain the classification vectors $z_s, z_d$ and probability vectors $q_s, q_d$. It is worth mentioning that the softmax step is somewhat special:

$$q_i=\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$

where $T$ is a parameter called the temperature, which is set to 1 for ordinary classification. Here a relatively large $T$ is chosen, because it produces a softer probability distribution.

Then soft classification loss and hard classification loss are used as loss functions:

[Equations omitted: the soft loss between $q_s$ and $q_d$, and the hard cross-entropy loss against the ground-truth labels]

Attention-based sampling randomly selects one part in each iteration, so over the course of training all the fine-grained details can be distilled into the main network.
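A sketch of the distillation objective as I read it, with a tempered soft loss and a hard cross-entropy loss; the temperature value, the weighting, and the use of KL for the soft term are my assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def tasn_distill_loss(z_s, z_d, labels, T=10.0, alpha=0.5):
    """Sketch of the soft + hard loss distilling part details into the main net.

    z_s: main-net logits on the structure-preserved image,
    z_d: part-net logits on the detail-preserved image.
    """
    # soft loss: match the main net to the part net's tempered distribution
    soft = F.kl_div(F.log_softmax(z_s / T, dim=1),
                    F.softmax(z_d / T, dim=1).detach(),
                    reduction="batchmean") * T * T
    # hard loss: ordinary cross-entropy with the ground-truth labels
    hard = F.cross_entropy(z_s, labels)
    return alpha * soft + (1 - alpha) * hard

zs, zd = torch.randn(4, 200, requires_grad=True), torch.randn(4, 200)
y = torch.randint(0, 200, (4,))
print(tasn_distill_loss(zs, zd, y))
```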

Summary

This paper starts from the visual patterns captured by CNN channels and replaces the traditional channel grouping network with a trilinear formulation to obtain compact part attention maps;

After that, the biggest highlight of the paper is feeding the attention map back to warp the raw image, enlarging key parts;

Finally, feature extraction is performed on the warped image to obtain the classification result, incorporating the idea of knowledge distillation.

Because each detail-preserved image enlarges only one part, the main network learns one part at a time; as training progresses, it can naturally absorb hundreds of local features.

Learning Deep Bilinear Transformation for Fine-grained Image Representation (by end-to-end)

Abstract

This paper proposes deep bilinear transformation (DBT) blocks that can be stacked deeply in convolutional neural networks to learn fine-grained image representations. A DBT block evenly divides the input channels into several groups; since the bilinear transformation only computes pairwise interactions within each group, the computational cost is greatly reduced.

The output of each block is obtained by aggregating the within-group bilinear features with a residual connection from the block's input features.

Introduction

Pairwise interactions can be learned in multiple layers to enhance feature discrimination;

By learning semantic groups and computing within-group bilinear transformations, the model obtains pairwise interactions only among the most discriminative feature channels.
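The grouping idea can be illustrated with a small PyTorch sketch. Note this is not the actual DBT block, which outputs a feature map aggregated with a residual; this pooled version only shows how grouping cuts the cost of pairwise interactions from $C^2$ to $G\cdot(C/G)^2$ terms.

```python
import torch
import torch.nn.functional as F

def group_bilinear(x, groups=16):
    """Sketch of a within-group bilinear transformation in the spirit of DBT.

    x: (B, C, H, W) feature map; channels are split evenly into `groups`
    groups and pairwise interactions are only computed inside each group.
    Returns (B, G*(C/G)^2) pooled bilinear features.
    """
    B, C, H, W = x.shape
    g = C // groups
    x = x.reshape(B, groups, g, H * W)
    # (B, G, g, g): outer products of channels, averaged over locations
    inter = torch.einsum("bgip,bgjp->bgij", x, x) / (H * W)
    out = inter.reshape(B, -1)
    # signed square-root + l2 normalization, as is common for bilinear features
    out = torch.sign(out) * torch.sqrt(out.abs() + 1e-8)
    return F.normalize(out, dim=1)

feat = torch.randn(2, 512, 14, 14)
print(group_bilinear(feat).shape)  # torch.Size([2, 16384])
```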

Selective Sparse Sampling for Fine-grained Image Recognition (by localization-classification subnetwork)

Abstract

This paper proposes a Selective Sparse Sampling module to capture diverse, fine-grained features.

It is implemented with convolutional neural networks, and the resulting models are abbreviated S3Ns.

With only image-level supervision, S3Ns collect peaks (local maxima) from class response maps to estimate informative receptive fields, and then learn a set of sparse attentions to capture detailed visual evidence while preserving context.

Such evidence is selectively sampled, significantly enriching the learnable features and guiding the network to discover more subtle cues.

Previous methods crop the local regions directly, destroying the corresponding context. This paper instead amplifies the local features while retaining the surrounding context information.
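The peak-collection step can be sketched with the common max-pooling trick for finding local maxima; the window size and number of peaks below are arbitrary illustrative choices, not the paper's values.

```python
import torch
import torch.nn.functional as F

def select_peaks(response, k=4, window=3):
    """Sketch of sparse peak selection from a class response map.

    response: (H, W) class response map. A location is a peak if it
    equals the maximum of its local window; the k strongest peaks are
    returned as (row, col) coordinates.
    """
    r = response[None, None]                       # (1, 1, H, W)
    local_max = F.max_pool2d(r, window, stride=1, padding=window // 2)
    peaks = (r == local_max).squeeze() & (response > 0)
    coords = peaks.nonzero()                       # (num_peaks, 2)
    scores = response[peaks]
    top = scores.topk(min(k, len(scores))).indices
    return coords[top]

resp = torch.randn(14, 14).clamp(min=0)
print(select_peaks(resp))
```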

Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification from the Bottom Up (by localization-classification subnetwork)

Abstract

Deep convolutional networks trained with image-level labels tend to only focus on the most discriminative parts, while ignoring other informative object parts.

In this paper, a complementary parts model is built in a weakly supervised manner to retrieve the information suppressed by the dominant object parts detected by the convolutional neural network.

Given image-level labels, coarse object instances are first extracted by weakly supervised object detection and instance segmentation, using Mask R-CNN and CRF-based segmentation.

The optimal parts model is then searched and estimated for each object instance while maintaining as much diversity as possible.

Finally, a bidirectional LSTM is constructed to fuse the features of these complementary parts for the final classification.

[Title missing] (by localization-classification subnetwork)

Abstract

The localization of discriminative regions is critical for fine-grained visual classification, which raises two problems:

  1. Which regions are discriminative and representative enough to distinguish the subcategory from others;
  2. How many regions are needed for optimal classification performance.

In this paper, we propose a multi-scale and multi-granularity deep reinforcement learning approach. The method learns multi-grained discriminative region attention and multi-scale region-based feature representation. The main contributions are as follows:

  1. A multi-granularity discriminative localization is proposed, performed by a two-stage deep reinforcement learning method that discovers discriminative regions at multiple granularities in a hierarchical manner and determines the number of regions automatically and adaptively;
  2. Multi-scale representation learning helps locate regions at different scales and encode images at different scales, thereby improving fine-grained visual classification performance;
  3. A semantic reward function is proposed, jointly considering attention and category information;
  4. Further exploration of unsupervised discriminative localization.

Origin: blog.csdn.net/weixin_46365033/article/details/127966439