Research paper on fine-grained image classification (2014)


Since 2014, fine-grained classification has entered the deep learning era, and distinct lines of work have gradually emerged.

Part-Based R-CNNs for Fine-grained Category Detection (localization-classification subnetwork approach)

Abstract

Semantic part localization can facilitate fine-grained classification.

Pose-normalized representation methods have been proposed before, but because object detection is difficult, they assume that the object bounding box is given at test time.

This paper overcomes this limitation by computing deep convolutional features on bottom-up region proposals, which roughly means removing the dependence on a given object box.

The proposed method learns whole-object and part detectors, enforces geometric constraints between them, and predicts the fine-grained category from a pose-normalized representation.

In short: find the parts, align them, and extract features.

Introduction

Locating the parts of an object is critical. It establishes correspondences between instances and, to some extent, compensates for pose changes and differences in camera viewpoint. (As long as the parts are found, whatever pose the object takes in the image has little impact.)

The bottleneck of many pose-normalized representations is essentially the accuracy of localization. Poselets and DPM (a star structure with a root filter for object localization and deformable part filters for part localization; the deformable parts are the key point here) have been used to obtain accurate part localization. However, these methods need the labeled bounding box to be given at test time for the localization to be accurate enough.

This paper removes the model's need for a bounding box at test time.

In past work, part localization and part description were handled separately; here they are unified with the same deep convolutional features.

Part-Based R-CNNs

In the R-CNN approach, for a specific object category, a candidate detection $x$ with CNN feature descriptor $\phi(x)$ is assigned the score $w_0^T\phi(x)$, where $w_0$ is the learned vector of SVM weights for that object category.
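As a minimal sketch of this scoring step (the array shapes and names below are my assumptions, not code from the paper), each proposal's CNN descriptor is scored by a dot product with the learned SVM weights:

```python
import numpy as np

def score_proposals(features, w, b=0.0):
    """Linear SVM scores for candidate detections.

    features : (num_proposals, d) array of CNN descriptors phi(x)
    w        : (d,) learned SVM weight vector (w_0 for the object,
               or w_i for part i)
    b        : scalar SVM bias
    """
    return features @ w + b

# Hypothetical usage: ~2000 bottom-up proposals with 4096-d features.
phi = np.random.randn(2000, 4096).astype(np.float32)  # stand-in for real CNN features
w0 = np.random.randn(4096).astype(np.float32)          # stand-in for learned weights
best_object_box = int(np.argmax(score_proposals(phi, w0)))
```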

In our approach, we assume a strongly supervised setting: during training we are given not only the bounding box of the whole object but also the bounding boxes of its semantic parts $\{p_1, p_2, ..., p_n\}$.

Given these part annotations, the whole object and each of its parts are initially treated as separate object categories during training: we train a one-versus-all linear SVM on features extracted from the region proposals.

A proposal whose overlap (IoU) with the ground-truth box exceeds 0.7 is labeled as a positive example, and one whose overlap is below 0.3 is labeled as a negative example. Thus, for a single object category, we learn whole-object SVM weights $w_0$ as well as part SVM weights $\{w_1, w_2, ..., w_n\}$.
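A rough sketch of this labeling-and-training rule (the 0.7 / 0.3 thresholds come from the text above; the helper names and the scikit-learn usage are my own assumptions, not the paper's implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def make_labels(proposals, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """Label proposals against one ground-truth box (object or part).
    Returns 1 for positives, 0 for negatives, -1 for ignored proposals."""
    labels = np.full(len(proposals), -1, dtype=np.int8)
    overlaps = np.array([iou(p, gt_box) for p in proposals])
    labels[overlaps > pos_thresh] = 1
    labels[overlaps < neg_thresh] = 0
    return labels

def train_detector(features, labels):
    """Train one linear SVM per 'category': the whole object (w_0) or a part (w_i)."""
    keep = labels >= 0                      # drop ignored proposals
    clf = LinearSVC(C=0.001)
    clf.fit(features[keep], labels[keep])
    return clf.coef_.ravel(), clf.intercept_[0]
```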

At test time, for each candidate box, we compute the scores of all root and part SVMs.

In short, this paper uses R-CNN to train detectors for both the whole object and its parts.

Geometric Constraints

Our goal is to identify object locations and part locations in test images.

Choose the boxes that maximize the score.


First, the choice of boxes needs to be restricted: select a candidate object box and the part boxes that lie inside it. Then define a scoring function over the whole configuration:

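As a sketch of its general form (my notation, not necessarily the paper's exact formula): with $x_0$ the whole-object window and $x_1, \ldots, x_n$ the part windows, choose

$$
X^{*} = \arg\max_{X=\{x_0, x_1, \ldots, x_n\}} \; \Delta(X)\sum_{i=0}^{n} w_i^{T}\phi(x_i),
$$

where $\Delta(X)$ encodes the constraints. The simplest box constraint only allows configurations in which every part window lies inside the chosen object window:

$$
\Delta_{\text{box}}(X)=\prod_{i=1}^{n} c_{x_0}(x_i),\qquad
c_{x_0}(x_i)=\begin{cases}1 & \text{if } x_i \text{ lies inside } x_0\\ 0 & \text{otherwise.}\end{cases}
$$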

Next, the geometry needs to be constrained further. A window with a high detector score is not necessarily correct, especially when there is occlusion. So an additional scoring function is used to enforce a constrained layout of the parts.

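As a sketch of this stronger constraint (again my notation): the box constraint is multiplied by a prior $\delta_i(x_i)$ on where part $i$ tends to appear relative to the object box (for example a Gaussian mixture or nearest-neighbor model fit on training data),

$$
\Delta_{\text{geometric}}(X)=\Delta_{\text{box}}(X)\left(\prod_{i=1}^{n}\delta_i(x_i)\right)^{\alpha},
$$

so a high-scoring but implausibly placed part window (for example, when the true part is occluded) is down-weighted.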

Fine-Grained Categorization

The features used here are the same deep features used when predicting the part locations. The main point of this paragraph is that the CNN is pre-trained and then fine-tuned for the task.
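A minimal sketch of what "pre-trained, then fine-tuned" means in practice (the paper worked with a Caffe-era network; the PyTorch/AlexNet choice below is my substitution, and `num_classes` and `loader` are hypothetical):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 200  # e.g. a 200-way fine-grained bird dataset (assumed)

# Start from an ImageNet-pre-trained network, replace the classifier head,
# and fine-tune on the fine-grained labels.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(4096, num_classes)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_one_epoch(loader):
    net.train()
    for images, labels in loader:  # `loader` yields cropped regions + labels
        optimizer.zero_grad()
        loss = criterion(net(images), labels)
        loss.backward()
        optimizer.step()
```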

  1. Generate candidate boxes;
  2. Extract features and compute detector scores for all windows;
  3. Filter the windows using the constraints;
  4. Classify using the features of the final windows (these features are now pose-normalized).
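Putting the four steps together as a sketch (a greedy simplification rather than the paper's joint maximization; `boxes`, `feature_fn`, the score arrays, and `category_svm` are assumed inputs produced by the pieces described above):

```python
import numpy as np

def inside(part, obj):
    """True if box `part` lies inside box `obj`; boxes are (x1, y1, x2, y2)."""
    return (part[0] >= obj[0] and part[1] >= obj[1]
            and part[2] <= obj[2] and part[3] <= obj[3])

def pick_windows(boxes, object_scores, part_scores):
    """Steps 1-3 (greedy simplification): take the best-scoring object box,
    then, for each part, the best-scoring proposal inside it."""
    obj_idx = int(np.argmax(object_scores))
    chosen = [boxes[obj_idx]]
    for scores in part_scores:                       # one score array per part SVM
        allowed = [j for j, b in enumerate(boxes) if inside(b, boxes[obj_idx])]
        if not allowed:                              # fall back if no box fits
            allowed = list(range(len(boxes)))
        chosen.append(boxes[max(allowed, key=lambda j: scores[j])])
    return chosen                                    # [object box, part boxes...]

def classify(feature_fn, windows, category_svm):
    """Step 4: concatenate the pose-normalized features and classify."""
    pose_normalized = np.concatenate([feature_fn(b) for b in windows])
    return category_svm.predict(pose_normalized[None, :])[0]
```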

Compared with plain R-CNN, this adds part detection and the geometric constraints that tie the parts to the object.

Training the part-scoring stage relies on the part annotations.


Origin: blog.csdn.net/weixin_46365033/article/details/127750050