Research papers on fine-grained image classification (2012)

A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization

Abstract

Most current classification methods designed for general objects perform poorly on fine-grained image classification. This is mainly because codebook-based image representations quantize local features into discrete visual words, which loses the detailed image information that is crucial for fine-grained classification.

One way to solve this problem is to introduce human-annotated object attributes or keypoints.

This paper proposes a codebook-free and annotation-free fine-grained image classification method.

Instead of using vector-quantized codewords, image representations are obtained through a high-throughput template matching process.

Introduction


Current work represents images either with a global codebook or by manually annotating key object parts.

To avoid the drawbacks of both approaches, this paper proposes a new method. It captures subtle differences between object classes by directly matching image regions with an efficient template-matching algorithm. Images are then represented as the response maps of these high-throughput feature templates. Finally, the method builds the final classifier by aggregating a set of classifiers using a novel bagging-based algorithm.

Our Approach

Template Matching Based Representation

Intuition

Rectangular regions are extracted from all training images to generate a large number of templates. Each image is then represented by the response scores of matching it against every template.
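
As a rough illustration of this step, here is a minimal Python sketch that crops random rectangular patches from training images to serve as templates (the patch size, count per image, and grayscale input are assumptions for illustration, not the paper's settings):

```python
# A minimal sketch of the template-generation step described above,
# assuming grayscale images as 2-D NumPy arrays.
import numpy as np

rng = np.random.default_rng(0)

def sample_templates(images, n_per_image=10, size=(24, 24)):
    """Crop random rectangular regions from each training image."""
    h_t, w_t = size
    templates = []
    for img in images:
        h, w = img.shape
        for _ in range(n_per_image):
            y = rng.integers(0, h - h_t + 1)  # top-left corner, uniform at random
            x = rng.integers(0, w - w_t + 1)
            templates.append(img[y:y + h_t, x:x + w_t].copy())
    return templates
```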


The method uses continuous template-matching scores instead of discrete visual codewords to represent images, which allows it to capture subtle differences between similar image patches. The approach is therefore both codebook-free and annotation-free.


Templates and the Matching Approach

An image is denoted $O$, and its appearance is described by $S$ different types of features $\{O^s\}_{s=1}^S$.

A template is expressed as $\mathcal{T} = \{\{O^s\}_{s=1}^S, \{(r, s)\}\}$, where each pair $(r, s)$ refers to the image feature at position $r$ in feature map $O^s$.

The similarity between a template and the image region $I$ of $O$ centered at position $c$ is computed by accumulating per-feature similarities over the template's entries, i.e. a score of the form

$$M(\mathcal{T}, c) = \sum_{(r, s)} f\big(\mathcal{T}(r, s),\ O^s(c + r)\big),$$

where the similarity function $f$ follows other papers cited in the article.

To handle object scale variation, we consider different scales for each template. The template matching step generates response score maps for each template at each scale.
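
The paper defers the choice of $f$ to cited work; the sketch below uses simple stand-ins, taking $f$ to be the negative sum of squared differences and handling scale by resizing the image before matching (both are assumptions for illustration):

```python
# Illustrative template matching with a stand-in similarity: the response
# at position c is the negative SSD between the template and the window
# at c. Assumes 2-D arrays and that the image exceeds the template size
# at every scale.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def response_map(image, template):
    """Score every window position of `image` against `template`."""
    windows = sliding_window_view(image, template.shape)  # (H', W', h_t, w_t)
    diff = windows - template
    return -np.sum(diff ** 2, axis=(-1, -2))              # higher = better match

def multiscale_responses(image, template, scales=(0.75, 1.0, 1.25)):
    """One response score map per scale (nearest-neighbour resize for brevity)."""
    h, w = image.shape
    maps = []
    for s in scales:
        ys = (np.arange(int(h * s)) / s).astype(int).clip(0, h - 1)
        xs = (np.arange(int(w * s)) / s).astype(int).clip(0, w - 1)
        maps.append(response_map(image[np.ix_(ys, xs)], template))
    return maps
```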

Each response map is converted into a feature vector by max pooling over a two-level spatial pyramid.

The final image representation is formed by concatenating all template pooling results at all image scales.
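
A minimal sketch of this pooling, assuming the two pyramid levels are the whole map (1×1) plus a 2×2 grid, i.e. five max-pooled values per response map; the exact grid layout is an assumption:

```python
# Two-level spatial pyramid max pooling over a response map, followed by
# concatenation over all templates and scales. Assumes each response map
# is at least 2x2.
import numpy as np

def pyramid_max_pool(resp):
    """Level 0: max over the whole map. Level 1: max over each 2x2 cell."""
    h, w = resp.shape
    cells = [resp]
    for i in range(2):
        for j in range(2):
            cells.append(resp[i * h // 2:(i + 1) * h // 2,
                              j * w // 2:(j + 1) * w // 2])
    return np.array([c.max() for c in cells])  # 5 pooled scores per map

def image_feature(response_maps):
    """Concatenate pooled scores over all templates and all image scales."""
    return np.concatenate([pyramid_max_pool(m) for m in response_maps])
```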

Bagging Based Classification Algorithm

Motivation

The algorithm uses a large number of image templates, so each image gets a very high-dimensional feature vector. Using more templates is advantageous because it provides a richer image representation.

However, this feature vector is over-complete: it is redundant, and some of its elements are not discriminative (not every template comes from an informative region). These two problems are referred to as redundancy and non-discrimination.

Formulation of the Algorithm

This paper proposes to train a set of SVM classifiers on this feature representation and to aggregate the outputs of all classifiers.
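
A hedged sketch of the idea: each SVM sees only a random subset of the high-dimensional template features, and decision values are summed across the ensemble. scikit-learn's `LinearSVC` stands in for the paper's SVMs, and this subset-sampling scheme is a simplification of the paper's bagging algorithm, not its exact formulation:

```python
# Bagging-style aggregation of linear SVMs trained on random subsets of
# the feature dimensions. Assumes a multi-class problem (>2 classes).
import numpy as np
from sklearn.svm import LinearSVC

def train_bagged_svms(X, y, n_classifiers=50, subset_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    models = []
    for _ in range(n_classifiers):
        dims = rng.choice(d, size=max(1, int(subset_frac * d)), replace=False)
        models.append((dims, LinearSVC(C=1.0).fit(X[:, dims], y)))
    return models

def predict_bagged(models, X):
    """Sum decision values over the ensemble, then take the best class."""
    scores = sum(clf.decision_function(X[:, dims]) for dims, clf in models)
    return models[0][1].classes_[np.argmax(scores, axis=1)]
```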

Discovering Localized Attributes for Fine-Grained Recognition (note: I did not fully understand this paper)

Abstract

The distinguishing attributes are often local, but how to choose these local attributes is largely unexplored. This paper proposes an interactive method to discover discriminative and semantically meaningful local attributes from image data labeled with fine-grained category labels and object bounding boxes.

Our method uses a latent conditional random field model to discover detectable and discriminative candidate attributes. A recommender system is then used to select attributes that are likely to have semantic meaning.

Human interaction is used to provide semantic names for those discovered attributes.

Introduction

At present, most image classification and object recognition methods learn statistical models over low-level visual features such as SIFT and HOG. While these methods produce decent results, the features themselves are not semantically meaningful, which limits humans' ability to understand the models. Recent work introduces visual attributes as features.

Most attribute features are generated by manual labeling, which is time-consuming and labor-intensive. Some work has proposed automatic attribute discovery, but it is limited to global attributes (e.g., "city" and "wild") and is therefore not suitable for fine-grained tasks.

Finding useful local attributes is difficult: automatically discovered regions tend to be discriminative but not semantically meaningful, while manually defined parts tend to be meaningful but hard for a machine to detect.

This paper presents an interactive system that discovers machine-detectable, human-understandable, and discriminative local attributes from an image dataset annotated with fine-grained class labels and object bounding boxes.

At each iteration of the process, the two most confused categories are identified based on the attributes discovered so far. A latent conditional random field model is then used to automatically discover candidate attributes that distinguish these two classes.
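
One concrete reading of this step, as a sketch: from the confusion matrix of the current attribute-based classifier, pick the pair of classes most often confused with each other (the criterion below is an assumption, not necessarily the paper's exact measure):

```python
# Pick the most-confused class pair from a confusion matrix, where
# conf[i, j] counts how often class i is predicted as class j.
import numpy as np

def most_confused_pair(conf):
    sym = conf + conf.T        # confusion in either direction
    np.fill_diagonal(sym, 0)   # ignore correct predictions
    i, j = np.unravel_index(np.argmax(sym), sym.shape)
    return i, j
```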

A recommender system then identifies which candidate attributes are likely to be semantically meaningful to humans; these are presented to the user to collect attribute names.

Candidates the user can name are added to the attribute pool; those the user cannot name are discarded.

In either case, the semantic model of the recommender system is updated based on the user's responses. Once the pool of local attributes is built, these attributes are detected in new images and used for classification.


Origin blog.csdn.net/weixin_46365033/article/details/127614964