Fine-grained image classification with attention: Recurrent-Attention-CNN

Recurrent Attention Convolutional Neural Network (RA-CNN)

Published: 2017

Fine-grained image classification

Concept

Compared with traditional image classification, the discriminative parts that fine-grained image classification relies on often occupy only a small area of the image.

A traditional image classification network treats the whole image equally, no matter how much of it the important discriminative region occupies. For images in which the discriminative region takes up only a small proportion of the area, applying the same feature extraction everywhere trains in a large amount of irrelevant background information, which increases the difficulty of classification and reduces its accuracy.

The concept of fine-grained image classification was introduced to solve this problem: attend to the small differences within an image and achieve more accurate classification, for example distinguishing closely related varieties of vegetables and fruit, or classifying plant diseases and insect pests. The conventional approach is to first locate the region of the target of interest and then classify that target finely, so that the network can better understand the object being classified.

Difficulties

The difficulties of fine-grained image classification fall into two parts:
1. Discriminative region localization: accurately locating the key regions;
2. Fine-grained feature learning: extracting effective information from those key regions.

Region localization methods

At present there are two main types of training methods for fine-grained image classification: strongly supervised learning and weakly supervised learning.

  1. Strongly supervised learning
    Extra bounding-box annotations are given to the network so that it can learn the location of the target under strong supervision.
    Disadvantages:
    (1) Image annotation requires a great deal of human labor.
    (2) The regions marked by humans are not necessarily accurate, i.e. they are not necessarily the regions that actually matter for discrimination.

  2. Weakly supervised learning
    Weakly supervised learning is not unsupervised learning: it builds on a basic image classification network and needs only the image-level category label. The network locates the discriminative region on its own, then focuses on the feature differences within that region to identify the target category.
    A common approach is attention-based image classification, which obtains the position of the discriminative region by analyzing the most salient parts of the feature maps, as in the RA-CNN described in this article; a small sketch of this idea follows.
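To make the idea concrete, here is a minimal sketch (assuming PyTorch; the helper name is illustrative, not from the paper) of locating the peak-response position in a convolutional feature map:

    import torch

    def peak_response_center(feat):
        """feat: (B, C, H, W) conv features -> (row, col) of the strongest response."""
        energy = feat.abs().sum(dim=1)            # aggregate channel responses: (B, H, W)
        flat = energy.flatten(1).argmax(dim=1)    # per-image index of the peak cell
        W = energy.shape[2]
        row = torch.div(flat, W, rounding_mode='floor')
        return torch.stack((row, flat % W), dim=1)

The returned cell can be mapped back to image coordinates by multiplying by the backbone stride (32 for VGG-19 conv features); the same idea reappears below when pre-training the APN.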

Recurrent-Attention-CNN basic structure

The network consists of 3 scales, and the sub-networks at each scale share the same structure but have separate parameters. From the convolutional features of each sub-network, an Attention Proposal Network (APN) produces a region of attention; the attended region is cropped, enlarged by bilinear interpolation, and fed as the input of the next sub-network, recursively. Finally, the convolutional features of the three sub-networks are fused (normalized, then concatenated), and the fused features are classified through a fully connected layer and a softmax layer.
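The following is a minimal sketch of this structure, assuming PyTorch. `RACNNSketch`, `make_backbone`, and `crop_and_zoom` are illustrative names, not the authors' code; to keep the shapes simple, every scale is kept at the same 448x448 input size, with `make_backbone()` assumed to return a VGG-19 conv trunk producing (B, 512, 14, 14) features:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RACNNSketch(nn.Module):
        def __init__(self, num_classes, make_backbone):
            super().__init__()
            # three sub-networks with the same structure but separate parameters
            self.backbones = nn.ModuleList([make_backbone() for _ in range(3)])
            self.classifiers = nn.ModuleList(
                [nn.Linear(512, num_classes) for _ in range(3)])
            # APN: two fully connected layers producing (tx, ty, tl)
            self.apns = nn.ModuleList([
                nn.Sequential(nn.Flatten(),
                              nn.Linear(512 * 14 * 14, 1024), nn.Tanh(),
                              nn.Linear(1024, 3))
                for _ in range(2)])

        def forward(self, x):                        # x: (B, 3, 448, 448)
            logits, pooled_feats = [], []
            for s in range(3):
                feat = self.backbones[s](x)          # conv features: (B, 512, 14, 14)
                pooled = feat.mean(dim=(2, 3))       # global average pooling: (B, 512)
                pooled_feats.append(pooled)
                logits.append(self.classifiers[s](pooled))
                if s < 2:
                    t = self.apns[s](feat)           # attention proposal (tx, ty, tl)
                    x = crop_and_zoom(x, t)          # crop + bilinear zoom (sketched below)
            # fusion: normalize each scale's descriptor, then concatenate;
            # a final fully connected + softmax layer would classify `fused`
            fused = torch.cat([F.normalize(p, dim=1) for p in pooled_feats], dim=1)
            return logits, fused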

  • The feature extraction part of the classification sub-network uses VGG-19
  • The region proposal sub-network is implemented with two fully connected layers. The last fully connected layer has 3 output channels, (tx, ty, tl), which give the center coordinates of the proposed attention region and half of its side length (a crop-and-zoom sketch follows this list).
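Below is a simplified `crop_and_zoom` to pair with the sketch above. This hard crop is only meant to show the geometry: the paper itself makes the crop differentiable with respect to (tx, ty, tl) via a smooth mask so that the APN can be trained by backpropagation.

    import torch
    import torch.nn.functional as F

    def crop_and_zoom(x, t, out_size=448):
        """x: (B, C, H, W) images; t: (B, 3) proposals (tx, ty, tl) in pixels."""
        B, C, H, W = x.shape
        crops = []
        for i in range(B):
            tx, ty, tl = t[i]
            half = float(tl.abs().clamp(min=16))    # keep the crop non-degenerate
            cx = float(tx.clamp(0, W - 1))
            cy = float(ty.clamp(0, H - 1))
            x0, x1 = int(max(cx - half, 0)), int(min(cx + half, W))
            y0, y1 = int(max(cy - half, 0)), int(min(cy + half, H))
            x1, y1 = max(x1, x0 + 1), max(y1, y0 + 1)
            region = x[i:i + 1, :, y0:y1, x0:x1]
            # enlarge the attended region with bilinear interpolation
            crops.append(F.interpolate(region, size=(out_size, out_size),
                                       mode='bilinear', align_corners=False))
        return torch.cat(crops, dim=0)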

Loss function

The loss function of the network consists of two parts: intra-scale classification loss and inter-scale ranking loss.
The inter-scale ranking loss serves as the supervision signal for training the APN; it ensures that the network learns discriminative regional attention without manual region labels.

  • The loss for a sample X is the sum of the classification losses at the three scales plus the ranking losses between adjacent scales:
    L(X) = \sum_{s=1}^{3} L_{cls}(Y^{(s)}, Y^{*}) + \sum_{s=1}^{2} L_{rank}(p_t^{(s)}, p_t^{(s+1)})
    where Y^{(s)} is the predicted label distribution at scale s, Y^{*} is the ground-truth label, and L_{cls} is the softmax classification loss.

  • The ranking loss between scales is defined as follows:
    L_{rank}(p_t^{(s)}, p_t^{(s+1)}) = \max\{0, \, p_t^{(s)} - p_t^{(s+1)} + margin\}
    L_{rank} is a pairwise ranking loss; see the red braces in Figure 2 of the paper: the first-scale and second-scale networks form one L_{rank} term, and the second-scale and third-scale networks form another.
    In p_t^{(s)}, t denotes the ground-truth label category and s denotes the scale; for example, p_t^{(2)} is the probability that the second-scale network assigns to the true label (the green part of the last predicted-probability bar chart in Figure 2). The L_{rank} loss is small when the p_t of a later-scale network exceeds the p_t of the adjacent earlier-scale network by at least the margin, so the training goal is for the prediction of the later-scale networks to become more accurate. The margin parameter is set to 0.05 in the experiments. A code sketch of both loss terms follows.
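Here is a minimal sketch of the two-part loss, assuming the per-scale `logits` list returned by the forward sketch above and integer class labels in `target` (illustrative names):

    import torch
    import torch.nn.functional as F

    def racnn_loss(logits, target, margin=0.05):
        # intra-scale classification loss: softmax cross-entropy at every scale
        cls_loss = sum(F.cross_entropy(l, target) for l in logits)
        # p_t^{(s)}: probability assigned to the true class at scale s
        probs = [F.softmax(l, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
                 for l in logits]
        # inter-scale pairwise ranking loss between adjacent scales:
        # max(0, p_t^{(s)} - p_t^{(s+1)} + margin)
        rank_loss = sum(torch.clamp(probs[s] - probs[s + 1] + margin, min=0).mean()
                        for s in range(len(probs) - 1))
        return cls_loss + rank_loss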

Training strategy

  1. Initialize the classification sub-networks at all scales with the same VGG network pre-trained on ImageNet.

  2. Search the original image for the region with the highest response in the last convolutional layer, obtain smaller regions at the other scales in the same way, and use these selected regions to pre-train the Attention Proposal Network (APN).

  3. Train the classification sub-networks and the region proposal sub-networks alternately: first fix the APN and optimize the softmax losses at the three scales until convergence; then fix the parameters of the classification sub-networks and optimize the ranking loss. The two learning processes alternate until both losses stop changing (see the sketch after this list).
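A sketch of this alternating schedule, assuming the `RACNNSketch` model from the earlier sketches plus hypothetical `cls_optimizer`, `apn_optimizer`, `loader`, and `num_rounds` objects; convergence checks are simplified to fixed passes. Note that for phase 2 to actually update the APN, the crop must be differentiable (unlike the hard crop in the earlier sketch):

    import torch
    import torch.nn.functional as F

    def set_requires_grad(modules, flag):
        for m in modules:
            for p in m.parameters():
                p.requires_grad = flag

    for _ in range(num_rounds):
        # phase 1: fix the APN, optimize the softmax losses at all three scales
        set_requires_grad(model.apns, False)
        set_requires_grad(list(model.backbones) + list(model.classifiers), True)
        for images, labels in loader:
            logits, _ = model(images)
            loss = sum(F.cross_entropy(l, labels) for l in logits)
            cls_optimizer.zero_grad(); loss.backward(); cls_optimizer.step()

        # phase 2: fix the classification sub-networks, optimize the ranking loss
        set_requires_grad(model.apns, True)
        set_requires_grad(list(model.backbones) + list(model.classifiers), False)
        for images, labels in loader:
            logits, _ = model(images)
            probs = [F.softmax(l, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
                     for l in logits]
            loss = sum(torch.clamp(probs[s] - probs[s + 1] + 0.05, min=0).mean()
                       for s in range(2))
            apn_optimizer.zero_grad(); loss.backward(); apn_optimizer.step()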

Origin blog.csdn.net/weixin_42764932/article/details/112214174