Image Retrieval | Classic Paper Reading | Quick Start | Survey Notes

Hello everyone, this is [Come to a Scallion Cake]. This time I'm bringing you a study of an image retrieval survey. Welcome to follow along and share~

This article is mainly based on the survey paper "Deep Learning for Instance Retrieval: A Survey" (updated in 2022; the main content dates from 2020). Interested readers may first take a look at my other note on image retrieval: Image Retrieval | Classic Methods | Quick Start | Overview.

Abstract

With the growth of image data come new challenges for data processing, especially the problem of searching a database for similar content, known as Content-Based Image Retrieval (CBIR): a long-established research field in which the efficiency and accuracy of real-time content retrieval still need to be improved.

This survey reviews recent instance retrieval work based on deep learning algorithms and techniques, organized by deep network architecture type, deep features, feature embedding and aggregation methods, and network fine-tuning strategies. Finally, it covers commonly used benchmarks, evaluation results, common challenges, and promising future directions.

1 INTRODUCTION

Content-based image retrieval (CBIR) is the problem of searching for relevant images in a gallery by analyzing visual content (color, texture, shape, objects, etc.). CBIR has been a long-standing research topic in fields such as computer vision and multimedia. With the exponential growth of image data, developing appropriate information systems to efficiently manage such large image collections is crucial, and image search is one of the most indispensable techniques. Thus, CBIR has almost endless potential applications, such as person/vehicle re-identification, landmark retrieval, remote sensing, medical image search, and online product search.

Generally, CBIR methods can be divided into two distinct tasks: category-level image retrieval (CIR) and instance-level image retrieval (IIR). The goal of CIR is to find an arbitrary image of the same category as the query (e.g., dog, car). In contrast, in the IIR task, a query image of a specific instance (e.g., the Eiffel Tower, my neighbor's dog) is given, and the goal is to find images containing the same instance, possibly captured under different conditions such as imaging distance, viewing angle, background, lighting, and weather (the paradigm of re-identifying the same instance). This survey focuses on the IIR task.

In the past two decades, amazing progress has been made in image representation, spanning two important periods: feature engineering and deep learning. In the feature-engineering era, the field was largely dominated by various landmark handcrafted image representations, such as SIFT and Bag of Visual Words (BoW). The deep learning era was kindled in 2012, when a deep convolutional neural network (DCNN) known as "AlexNet" won first place in the ImageNet classification competition with a breakthrough reduction in classification error rate. Since then, the dominant role of SIFT-style local descriptors has been taken over by data-driven deep neural networks (DNNs), which can learn powerful feature representations with multiple levels of abstraction directly from data.

In light of this period of rapid development, the goal of this paper is to provide a comprehensive survey of recent achievements in IIR. Compared with existing excellent surveys on traditional image retrieval, as shown in Table 1, the focus of this paper is to review deep learning-based IIR methods, especially with respect to retrieval accuracy and efficiency.

1.1 Summary of Progress since 2012

Four network-level and feature-level perspectives form the basis of this investigation:

Improvements in Network Architectures

Deep Feature Extraction

Feature Embedding and Aggregation

Network Fine-tuning for Learning Representations


1.2 Key Challenges

While significant progress has been made in achieving accuracy and efficiency, these two competing goals remain challenging.

2 GENERAL FRAMEWORK OF IIR

We summarize a general framework of deep learning-based IIR in Fig. 2, including four main stages.


Network feedforward scheme

Deep feature extraction

Feature embedding and aggregation

Feature matching

These four stages of IIR (from input data to output ranking list) all rely on a DCNN as the backbone architecture.

3 POPULAR BACKBONE DCNN ARCHITECTURES

For IIR, four network models serve as the basis for feature extraction: AlexNet, VGG, GoogLeNet, and ResNet.

4 RETRIEVAL WITH OFF-THE-SHELF DCNN MODELS

(off-the-shelf deep learning models)

This approach has limitations, the most fundamental being the model-transfer or domain-shift challenge between tasks: from a model trained for classification, we must extract features that are well suited for image retrieval. In particular, a classification decision can succeed as long as the features stay within the class decision boundaries, so such features may not be discriminative enough for retrieval, where the features themselves are matched directly rather than passed through a classifier. This section surveys strategies developed to improve the quality of feature representations, especially feature extraction/fusion and feature embedding/aggregation.


4.1 Deep Feature Extraction

The main problem of feature extraction is the mechanism for extracting retrieval features from off-the-shelf DCNNs, which mainly involves three aspects: the network feed-forward scheme, feature selection, and feature fusion.

4.1.1 Network Feedforward Scheme


The network feed-forward scheme focuses on how to feed images into a DCNN, which includes single-pass and multi-pass.

Single Feedforward Pass Methods. These methods feed the entire image into an off-the-shelf model once to extract features, which makes them relatively efficient. In these methods, both the fully connected layers and the last convolutional layer can be used as feature extractors.
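
As an illustration, here is a minimal single-pass extraction sketch in PyTorch. The choice of ResNet-50, the preprocessing constants, and the file name `query.jpg` are assumptions for this example, not choices prescribed by the survey.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet-50; drop the avgpool + classifier so the last conv
# block's activations are exposed.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    fmap = extractor(img)                 # (1, 2048, 7, 7) local conv features
    global_desc = fmap.mean(dim=(2, 3))   # average-pool into a global descriptor
print(global_desc.shape)                  # torch.Size([1, 2048])
```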

Multiple Feedforward Pass Methods. Compared with the single-pass scheme, the multi-pass approach is more time-consuming because several patches are generated, each fed into the network, and their features then aggregated. These representations are typically produced in two stages: patch detection and patch description. Multi-scale image patches are obtained using sliding windows or a spatial pyramid model (SPM), as shown in Figure 4. For example, Zheng et al. used SPM to segment images and extract features, achieving an integration of global, regional, and local contextual information.
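
A rough sketch of the multi-pass scheme follows, reusing the `extractor` and `preprocess` objects from the previous sketch; the scales and stride fraction are arbitrary illustrative choices, not values from the paper.

```python
import torch

def patch_descriptors(pil_img, extractor, preprocess,
                      scales=(1.0, 0.5), stride_frac=0.5):
    """Crop multi-scale patches with a sliding window and describe each one."""
    W, H = pil_img.size
    feats = []
    for s in scales:
        pw, ph = int(W * s), int(H * s)
        step_x = max(1, int(pw * stride_frac))
        step_y = max(1, int(ph * stride_frac))
        for top in range(0, H - ph + 1, step_y):
            for left in range(0, W - pw + 1, step_x):
                patch = pil_img.crop((left, top, left + pw, top + ph))
                x = preprocess(patch).unsqueeze(0)
                with torch.no_grad():
                    feats.append(extractor(x).mean(dim=(2, 3)))  # one vector per patch
    return torch.cat(feats)               # (num_patches, 2048)

# Aggregate the patch descriptors into one image-level vector, e.g. by sum-pooling:
# desc = patch_descriptors(img, extractor, preprocess).sum(dim=0)
```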

4.1.2 Deep Feature Selection

Feature selection determines the receptive fields of the extracted features, i.e., global features from fully connected layers and regional (local) features from convolutional layers.

Extraction from fully connected layers. Choosing a fully connected layer as the global feature extractor is straightforward: after PCA dimensionality reduction and normalization, image similarity can be measured directly. Because the layer is fully connected, its output is an image-level descriptor, which leads to two obvious limitations for IIR: it includes irrelevant information, and it lacks local geometric invariance.

Extraction from convolutional layers. Features from convolutional layers (usually the last one) preserve more structural detail, which is particularly beneficial for instance retrieval. Neurons in convolutional layers only respond to a local region, and this smaller receptive field ensures that the resulting features retain more local structural information and are more robust to image transformations. Many image retrieval methods therefore use convolutional layers as feature extractors.
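
The two options can be contrasted in code. Below is a hedged sketch using torchvision's VGG-16 (layer names are torchvision's attributes); the PCA step follows the description above, with a toy component count for the tiny stand-in batch.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.decomposition import PCA

backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
x = torch.randn(4, 3, 224, 224)                     # stand-in preprocessed batch
with torch.no_grad():
    conv = backbone.features(x)                     # (4, 512, 7, 7) regional features
    flat = torch.flatten(backbone.avgpool(conv), 1)
    fc = backbone.classifier[:-1](flat)             # (4, 4096) image-level fc7 descriptor

# PCA dimensionality reduction + L2-normalization before similarity measurement.
desc = PCA(n_components=4).fit_transform(fc.numpy())  # toy component count for 4 samples
desc /= np.linalg.norm(desc, axis=1, keepdims=True)
```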


4.1.3 Feature Fusion Strategies

Fusion exploits the complementarity of different features and is explored at both the layer level and the model level.

Layer-level Fusion. Given the differences and complementarity between layers, it is worth considering which layer combinations are most conducive to fusion (see the sketch below).
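
For instance, a layer-level fusion sketch might concatenate pooled activations from two stages of one network. The layer names are torchvision's ResNet-50 attributes; the pooling and concatenation choices are illustrative, not the survey's prescription.

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.mean(dim=(2, 3))   # global-average-pool each map
    return hook

backbone.layer3.register_forward_hook(make_hook("layer3"))  # 1024-d stage
backbone.layer4.register_forward_hook(make_hook("layer4"))  # 2048-d stage

x = torch.randn(1, 3, 224, 224)                    # stand-in for a preprocessed image
with torch.no_grad():
    backbone(x)
fused = torch.cat([captured["layer3"], captured["layer4"]], dim=1)  # (1, 3072)
fused = torch.nn.functional.normalize(fused, dim=1)  # L2-normalize before matching
```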

Model-level Fusion. Features from different models can also be combined; this kind of fusion pays more attention to the complementarity of models, and the methods divide into intra-model and inter-model fusion. Intra-model fusion combines multiple deep models with similar or highly compatible structures, while inter-model fusion involves models with different structures.

4.2 Feature Embedding and Aggregation

The main purpose of feature embedding and aggregation is to further improve the discriminative power of the features extracted from DCNNs, obtaining final global and/or local features for retrieving specific instances.

4.2.1 Matching with Global Features

Convolutional features can be interpreted as descriptors of local regions, so many works use embedding methods, including BoW, VLAD, and FV, to encode the regional feature vectors and then aggregate them (e.g., simply by a summing operation) into a global descriptor.

Traditionally, pooling-based aggregation methods (e.g., those in Figure 5) are plugged directly into deep networks, and the entire model is then used end-to-end. The three embedding methods (BoW, VLAD, FV), in contrast, initially rely on a large pre-defined vocabulary trained in advance.
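
As a concrete but simplified illustration of pooling-based aggregation, the sketch below sum-pools (SPoC-style) or max-pools (MAC-style) a conv feature map into a single L2-normalized global descriptor; it omits the learned codebooks that BoW/VLAD/FV require.

```python
import torch
import torch.nn.functional as F

def aggregate(fmap, mode="sum"):
    # fmap: (B, C, H, W) activations from the last convolutional layer.
    if mode == "sum":
        desc = fmap.sum(dim=(2, 3))        # SPoC-style sum pooling per channel
    else:
        desc = fmap.amax(dim=(2, 3))       # MAC-style max pooling per channel
    return F.normalize(desc, dim=1)        # L2-normalize for cosine matching

fmap = torch.randn(2, 2048, 7, 7)
print(aggregate(fmap, "sum").shape)        # torch.Size([2, 2048])
```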

4.2.2 Matching with Local Features

An important aspect of local features is to detect keypoints of instances in an image and then describe the detected keypoints as a set of local descriptors. The overall IIR pipelines can be categorized as detect-then-describe and describe-then-detect.

There are two limitations to using local descriptors for instance retrieval tasks. First, the local descriptors of an image are stored separately and independently, which is memory-intensive and not well suited for large-scale scenarios. Second, estimating the similarity between the query image and a database image relies on cross-matching all pairs of local descriptors, which incurs an additional search cost. Therefore, most instance retrieval systems using local features follow a two-stage paradigm: initial filtering and reranking, as shown in Figure 2. The initial filtering stage uses a global descriptor to select a set of candidate matching images, thereby reducing the solution space; the reranking stage uses local descriptors to reorder the top-ranked candidates.
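
The two-stage paradigm can be summarized in a short sketch. The tensor shapes and the scoring rule (mean of each query descriptor's best match) are assumptions for illustration; real systems typically add geometric verification on top.

```python
import torch

def search(q_global, db_global, q_local, db_local, top_k=100):
    # Stage 1: initial filtering by cosine similarity of L2-normalized globals.
    sims = db_global @ q_global                       # (N,)
    cand = sims.topk(min(top_k, sims.numel())).indices
    # Stage 2: rerank candidates by cross-matching local descriptors;
    # score = mean of each query descriptor's best match in the database image.
    scores = []
    for i in cand.tolist():
        cross = q_local @ db_local[i].T               # (Mq, Md) similarities
        scores.append(cross.max(dim=1).values.mean())
    order = torch.stack(scores).argsort(descending=True)
    return cand[order]                                # reranked candidate indices
```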

4.2.3 Attention Mechanism

The attention mechanism can be regarded as a kind of feature aggregation whose core idea is to highlight the most relevant feature parts, realized by computing an attention map. Methods for obtaining attention maps can be divided into two groups, non-parametric and parametric, as shown in Figure 6. The main difference is whether the importance weights in the attention map are learned.
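
A minimal non-parametric example: use the per-location sum of channel activations as the attention map (no learned weights), then pool with those weights. This is one simple instantiation I chose for illustration, not the only scheme covered by Figure 6.

```python
import torch
import torch.nn.functional as F

def attention_weighted_descriptor(fmap):
    # fmap: (B, C, H, W), assumed post-ReLU (non-negative activations).
    attn = fmap.sum(dim=1, keepdim=True)              # (B, 1, H, W) activation mass
    norm = attn.flatten(1).sum(dim=1).view(-1, 1, 1, 1).clamp(min=1e-8)
    attn = attn / norm                                # normalize to a spatial weighting
    desc = (fmap * attn).sum(dim=(2, 3))              # attention-weighted sum pooling
    return F.normalize(desc, dim=1)

fmap = torch.relu(torch.randn(2, 512, 14, 14))
print(attention_weighted_descriptor(fmap).shape)      # torch.Size([2, 512])
```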

4.2.4 Hashing Embedding

Due to their computational and storage efficiency, hashing algorithms have been widely applied to both global and local descriptors. During hash function training, the hash codes of similar images are drawn as close together as possible, while the hash codes of dissimilar images are pushed as far apart as possible.
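
For intuition, here is a toy hashing sketch using random sign projections (LSH-style). A learned hash function as described above would replace the random projection with a trained network, but matching with Hamming distance works the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((2048, 64))          # 2048-d features to 64-bit codes

def to_code(feats):
    return (feats @ proj > 0).astype(np.uint8)  # sign of random projections

def hamming(a, b):
    return np.count_nonzero(a != b, axis=-1)    # distance = number of differing bits

db = to_code(rng.standard_normal((1000, 2048)))
q = to_code(rng.standard_normal((1, 2048)))
nearest = np.argsort(hamming(q, db))[:10]       # indices of the 10 closest codes
```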

5 RETRIEVAL VIA LEARNING DCNN REPRESENTATIONS

Fine-tuning methods have been extensively studied to learn better retrieval features. DCNNs are pre-trained on source datasets for image classification, making them reasonably robust to inter-class variability; pairwise supervision information is subsequently incorporated into a ranking loss for further fine-tuning of the network. After network fine-tuning, features can be organized as global or local representations to perform retrieval.

5.1 Supervised Fine-tuning


5.1.1 Fine-tuning via Classification Loss

The DCNN can be fine-tuned by optimizing its parameters under a cross-entropy loss, as shown in Fig. 7(a). The fine-tuned network yields superior features on landmark-related datasets. The cross-entropy loss can minimize intra-class distance while maximizing inter-class distance; in essence, it maximizes the mutual information between the retrieved features and the ground-truth labels.
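
A skeletal classification fine-tuning loop might look like the following; `train_loader` and `num_classes` are hypothetical stand-ins for a labeled landmark dataset, and the optimizer settings are illustrative.

```python
import torch
import torchvision.models as models

def finetune_with_classification(train_loader, num_classes, epochs=1):
    # train_loader is assumed to yield (images, class_ids) for instance classes.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # new head
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, class_ids in train_loader:
            opt.zero_grad()
            loss = criterion(model(images), class_ids)
            loss.backward()
            opt.step()
    # At retrieval time the classifier head is discarded; pooled conv
    # activations of the fine-tuned backbone serve as the descriptor.
    return model
```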

5.1.2 Fine-tuning via Pairwise Ranking Loss

The pairwise ranking loss learns an optimal metric that minimizes the distance between matching image pairs and maximizes the distance between non-matching pairs, so as to preserve their similarity relationships.

Fine-tuning with Siamese Networks. The Siamese (contrastive) loss has recently been reaffirmed as a very effective metric in category-level image retrieval and, if implemented carefully, outperforms many more complex losses.

Fine-tuning with Triplet Networks. Triplet networks optimize over similar and dissimilar pairs jointly, pulling the anchor toward the positive sample while pushing it away from the negative.
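
A minimal sketch of both ranking losses follows; the margin values and batch shapes are illustrative, and the triplet variant uses PyTorch's built-in loss rather than any specific paper's formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, is_match, margin=0.7):
    # f1, f2: L2-normalized descriptors; is_match: 1 = same instance, 0 = different.
    d = F.pairwise_distance(f1, f2)
    return (is_match * d.pow(2) +
            (1 - is_match) * F.relu(margin - d).pow(2)).mean()

# Triplet variant via PyTorch's built-in loss.
triplet = torch.nn.TripletMarginLoss(margin=0.1)
anchor, pos, neg = (F.normalize(torch.randn(8, 2048), dim=1) for _ in range(3))
loss = triplet(anchor, pos, neg)   # pulls anchor toward pos, pushes away from neg
```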

5.1.3 Discussion

In some cases, the pairwise ranking loss cannot effectively learn the variation between samples, and its generalization ability remains weak if training samples are not properly selected. Thus, the pairwise ranking loss requires careful sample mining and weighting strategies to obtain the most informative training pairs, especially when working with small batches.

A hard-negative mining strategy is commonly used, but more sophisticated mining strategies have recently been developed. One work computes the pairwise distance matrix over all samples in a mini-batch and selects the closest negative pairs together with an anchor-positive pair to form triplets for fine-tuning. Instead of iterating over all possible pairs or triplets, it is also possible to merge all positive samples into one cluster and treat the remaining samples as negatives. A batch-hard variant is sketched below.
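
Here is one common in-batch ("batch-hard") mining sketch under assumed shapes: from the pairwise distance matrix, each anchor takes its farthest positive and closest negative.

```python
import torch

def batch_hard(feats, labels):
    # feats: (B, D) L2-normalized descriptors; labels: (B,) instance ids.
    # Assumes each batch contains at least two distinct labels.
    dist = torch.cdist(feats, feats)                    # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) same-instance mask
    hardest_pos = dist.masked_fill(~same, float("-inf")).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    margin = 0.1                                        # illustrative margin
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```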

When carefully tuned on a fine-grained classification-level retrieval task, cross-entropy loss can match or even exceed pairwise ranking loss.

5.2 Unsupervised Fine-tuning

Unsupervised fine-tuning methods for image retrieval are highly necessary but less researched. There are two directions: (1) mining correlations between features through manifold learning, and (2) mining samples through clustering techniques. Both are discussed below.

5.2.1 Mining Samples with Manifold Learning

Manifold learning focuses on capturing the intrinsic correlations of the manifold structure to mine or infer matching samples, as shown in Figure 9. The initial similarities between extracted global or local features are used to construct an affinity matrix, which is then re-evaluated and updated using manifold learning.

Capturing the geometry of the deep feature manifold usually involves two steps, collectively called diffusion. First, the affinity matrix (Fig. 9) is interpreted as a weighted k-nearest-neighbor graph, where each feature vector is represented by a node and edges are weighted by the pairwise affinities of the two connected nodes. Then, the pairwise affinity values are re-evaluated by diffusing through the graph, taking all other elements into account.
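
Below is a toy NumPy sketch of graph diffusion in this spirit: kNN sparsification, symmetric normalization, and an iterative update in the style of the classic "ranking on data manifolds" recipe. All parameters are illustrative, and the input is assumed to be L2-normalized features with non-negative affinities.

```python
import numpy as np

def diffuse(feats, k=5, alpha=0.9, iters=20):
    sims = feats @ feats.T                    # cosine affinities (rows L2-normalized)
    A = np.zeros_like(sims)
    for i, row in enumerate(sims):            # keep only the k strongest edges
        nn = np.argsort(row)[-k:]
        A[i, nn] = row[nn]
    A = (A + A.T) / 2                         # symmetrize the kNN graph
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1e-8))
    S = D @ A @ D                             # symmetrically normalized affinities
    F_mat = np.eye(len(feats))                # one diffusion "query" per node
    for _ in range(iters):                    # iterate F <- alpha*S*F + (1-alpha)*I
        F_mat = alpha * S @ F_mat + (1 - alpha) * np.eye(len(feats))
    return F_mat                              # diffused pairwise similarities
```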


5.2.2 Mining Samples by Clustering

Clustering methods are used to explore proximity information and have been studied for instance-level image retrieval. The rationale behind these methods is that samples within a cluster are likely to satisfy a certain degree of similarity.

One approach to deep feature clustering is k-means, as sketched below.
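
A hedged sketch of clustering-based mining: k-means over descriptors yields pseudo-labels, which can then drive the standard supervised losses. The feature matrix and cluster count here are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(1000, 2048).astype(np.float32)   # stand-in descriptors
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

pseudo_labels = KMeans(n_clusters=50, n_init=10).fit_predict(feats)
# Images sharing a cluster id are treated as (noisy) same-instance positives
# and can supervise the classification or ranking losses of Section 5.1.
```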

kNN method. Fine-tuning is performed by minimizing the squared distance between each query feature and the mean of its k nearest features.

There are further techniques for unsupervised retrieval, such as autoencoders, generative adversarial networks (GANs), convolutional kernel networks, and graph convolutional networks. These methods focus on designing new unsupervised frameworks to enable unsupervised learning, rather than on iterative similarity diffusion or clustering over features to refine the feature space.

6 STATE OF THE ART PERFORMANCE

6.1 Datasets

UKBench (UKB) consists of 10,200 object images. The dataset has 2550 sets of images, and each set has 4 images of the same object from different viewpoints or lighting conditions. All images can be used as queries.

Holidays includes 1,491 images collected from personal holiday albums, most of them scene images. The dataset contains 500 groups of similar images, and each group has one query image.

Oxford-5k consists of 5,062 images of 11 Oxford buildings. Each building is associated with five hand-drawn bounding-box queries. Adding another 100,000 unrelated distractor images yields Oxford-105k.

Paris-6k includes 6,412 images, which are divided into 12 groups by building. The images are annotated with the same types of labels used in Oxford-5k. More recently, additional query and distractor images were added to Oxford-5k and Paris-6k, resulting in the revisited Oxford (ROxford) and revisited Paris (RParis) datasets. We also perform partial comparisons on these revisited datasets under the hard evaluation protocol. INSTRE [151] consists of 28,543 images from 250 different object classes, comprising three disjoint subsets: INSTRE-S1, INSTRE-S2, and INSTRE-M.

Google Landmarks Dataset (GLD), consisting of GLD-v1 and GLD-v2. The number of images in GLD-v1 shrinks over time, as images may be removed. GLD-v2 has the advantage of stability, and all images are licensed. GLD-v2 consists of over 5 million images with more than 200,000 distinct instance labels, making it the largest instance recognition dataset to date. It is divided into three subsets: (i) 118k query images with ground-truth annotations, (ii) 4.1M training images labeled with 203k landmarks, and (iii) 762k index images covering 101k landmarks. The training set was further cleaned by removing clutter, yielding the subset "GLD-v2-clean" with 1.6M images of 81k landmarks.

6.2 Evaluation Metrics

Average precision (AP) is the area under the precision-recall (PR) curve.

All query images are evaluated using mean average precision (mAP), the mean of AP over all queries.

The N-S score is a metric used for UKBench: it is the average number of relevant images among the top 4 results, averaged over the dataset.
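
For clarity, here is a minimal AP/mAP implementation over a binary ranked relevance list, consistent with the area-under-PR-curve definition above; the example lists are made up.

```python
import numpy as np

def average_precision(ranked_relevance):
    # ranked_relevance: 1/0 sequence over the ranked list (1 = relevant hit).
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((prec_at_hits * rel).sum() / rel.sum())

# mAP is the mean of AP over all queries.
ap1 = average_precision([1, 0, 1, 0, 0])   # (1/1 + 2/3) / 2 = 0.8333
ap2 = average_precision([0, 1, 1, 0, 0])   # (1/2 + 2/3) / 2 = 0.5833
print(np.mean([ap1, ap2]))
```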

6.3 Performance Comparison and Analysis


Figure 10 summarizes performance on the 6 datasets from 2014 to 2020. Early on, the powerful feature extraction of DCNNs led to rapid improvements. The key ideas that followed were to extract instance-level regional features, reduce image clutter, and improve feature discrimination through feature embedding, feature fusion, feature aggregation, etc. Fine-tuning is an important strategy for improving performance by adapting deep networks specifically to learn instance properties.

Evaluation of single feed-forward pass methods. With appropriate layer selection and aggregation, they can enhance feature discrimination and improve retrieval performance.

Evaluation of multiple feed-forward pass methods. The paper reports results for the approaches in Figure 4. Among them, densely extracted image patches achieve the highest performance on 4 of the datasets, and the rigid-grid scheme is competitive (87.2% mAP on Paris-6k). These two methods consider more patches, and even background information, during feature extraction. Rather than generating patches densely, region proposals and spatial pyramid modeling cover image objects more purposefully and efficiently. Spatial information is better preserved by multi-pass schemes than by a single pass.

Evaluation of supervised fine-tuning. Fine-tuning a deep network generally improves accuracy compared with off-the-shelf models; see Table 3. For example, on Oxford-5k, fine-tuning a pre-trained VGG with a single-margin Siamese loss improves mAP from 66.9% to 81.5%. A similar trend is observed on the Paris-6k dataset. For classification-based fine-tuning, performance can be further improved by feature enhancement methods such as more powerful DCNNs and attention mechanisms.

Evaluation of unsupervised fine-tuning. Compared with supervised fine-tuning, unsupervised fine-tuning methods are relatively unexplored. The difficulty is that mined samples have no "true" label correlations. In general, the performance of unsupervised fine-tuning methods should be expected to be lower than that of supervised methods, and fine-tuning on different datasets may produce different final retrieval performance.

Deeper networks consistently lead to better accuracy due to the extraction of more discriminative features.

Different methods of aggregating the same off-the-shelf DCNN lead to differences in retrieval performance.

High-dimensional features usually capture more semantics and facilitate retrieval.

As the global feature dimensionality and the number of searched regions grow, more background or irrelevant regions are also extracted and used in cross-matching (i.e., many-to-many matching), which negatively impacts performance.

Using global features, after the initial filtering step, reranking further improves retrieval accuracy.


7 CONCLUSIONS AND FUTURE DIRECTIONS

In this review, we surveyed the taxonomy of deep learning methods for retrieval, identified milestone methods, revealed connections among various methods, compared performance, presented representative methods, and discussed their strengths and limitations. Deep learning has clearly made remarkable progress in IIR, but many problems remain unsolved. Some potential future research directions are listed below:

(1) Large-scale instance retrieval. While the IIR field is progressing rapidly, most SOTA methods are tested on very small datasets with limited instance classes.

(2) Specialized and general instance retrieval. There is growing interest in specialized instance retrieval, such as landmark retrieval, pedestrian (person) retrieval, and vehicle retrieval.

(3) Invariant feature representation. One of the main challenges of instance retrieval is large intra-class variability, including changes in viewpoint, scale, lighting, weather conditions, background clutter, etc.

(4) Incremental image retrieval. One direction is to build retrieval models that handle a continuous stream of new instances via incremental learning methods.

(5) Adversarial robustness. In the field of image retrieval, adversarial robustness has received very limited attention and deserves more in the future.

Also worth reading is another good survey: "SIFT Meets CNN: A Decade Survey of Instance Retrieval", published in 2017.

Writing all this up wasn't easy; if you've read this far, why not give it a thumbs up~
This is [Come to a Scallion Cake]; your likes + favorites + follows are the biggest motivation for me to keep going~



Origin: blog.csdn.net/weixin_42784535/article/details/128459432