Cross-modal retrieval paper reading: Cross Modal Retrieval with Querybank Normalization


Overview

Leveraging large-scale training datasets, advances in neural architecture design, and efficient inference, joint embeddings have become the dominant approach to cross-modal retrieval. This paper shows that, despite their effectiveness, state-of-the-art joint embeddings suffer severely from the long-standing "hubness problem", in which a small number of gallery embeddings form the nearest neighbors of many queries. Inspired by the NLP literature, the paper proposes a simple yet effective framework, Querybank Normalization (QB-NORM), that re-normalizes query similarities to account for hubs in the embedding space. Unlike prior work, QB-NORM is shown to work effectively without concurrent access to any test-set queries. Within the QB-NORM framework, the paper also proposes a new similarity normalization method, the Dynamic Inverted Softmax, which is significantly more robust than existing methods. QB-NORM is demonstrated on a range of cross-modal retrieval models and benchmarks, where it consistently enhances strong baselines and surpasses the state of the art.

1 Introduction

Figure 1: Left: the hubness problem. We consider cross-modal retrieval, where queries q1 and q2 are compared against gallery samples x1 and x2. The high-dimensional joint embeddings used by modern cross-modal retrieval methods suffer from hubness: a hub (e.g. x2) is the nearest neighbor of many queries (here q1 and q2), yielding low-quality retrieval results (bottom left).
Right: Querybank Normalization. Using a querybank to normalize similarities reduces the similarity between the hub x2 and the query q1 and improves the retrieval results (bottom right).

The dominant cross-modal embedding paradigm employs deep neural networks to project modality-specific samples into a high-dimensional, real-valued vector space where they can be directly compared with appropriate distance metrics. A key challenge of this approach is the emergence of "hubs" in this high-dimensional space – embedding vectors that appear in the nearest-neighbor sets of many other embedding vectors.

Hubness is prevalent across a range of leading retrieval methods, and if left untreated it causes a significant drop in the ranking quality produced by a retrieval system. One contribution of this work is to show that existing hubness-mitigation methods can be interpreted within a unified conceptual framework, Querybank Normalization (QB-NORM, Figure 1 right), which uses a bank of sample queries at inference time to reduce the influence of hubs. Existing approaches face two challenges: (1) so far they have only been shown to work with concurrent access to multiple test queries, an assumption that is impractical for real-world retrieval systems; (2) they are sensitive to the choice of querybank, and for some querybanks they actively hurt performance (Table 2). To address the first challenge, careful experiments (Table 1) demonstrate that QB-NORM does not require concurrent access to test queries to be effective. To address the second, the paper proposes a new normalization method, the Dynamic Inverted Softmax (DIS), which operates as a module within the QB-NORM framework and provides normalization that is more robust than previous methods.

This paper makes the following contributions:

1. Querybank Normalization (QB-NORM), a simple non-parametric framework that brings significant retrieval improvements without any model fine-tuning;
2. the first demonstration (to the authors' knowledge) that querybank normalization remains effective for cross-modal retrieval when no test queries other than the current one are available;
3. the Dynamic Inverted Softmax, a new normalization method for Querybank Normalization that is more robust than those in the prior literature;
4. evidence that QB-NORM is highly effective across a wide range of tasks, models and benchmarks.

2. Related work


The hubness problem in cross-modal retrieval
Hubness is a phenomenon characteristic of high-dimensional spaces in which a small number of samples ("hubs") appear among the nearest neighbors of a disproportionately large number of other points. In a cross-modal retrieval system, a hub in the gallery is retrieved for many unrelated queries, crowding out correct matches and degrading retrieval accuracy.

Mitigating hubness in cross-modal retrieval typically draws on several families of strategies:
Cluster-based methods: cluster the data samples so that hubs are grouped into separate clusters, reducing their influence on cross-modal similarity computation.
Dimensionality-reduction methods: reduce the dimensionality of the embeddings, making the retrieval system more robust and weakening hub effects.
Regularization-based methods: add regularization terms to the retrieval objective that constrain the weight given to hubs, limiting their influence on similarity computation.
Importance-weighting methods: assign each data sample an importance weight and down-weight hubs to reduce their impact on the retrieval system.
In short, mitigating hubness is a multifaceted problem that usually benefits from combining several of these approaches.
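Hubness can be made concrete by counting k-occurrences: how often each gallery item appears in some query's top-k neighbor list. A toy numpy sketch with random stand-in embeddings (the sizes and k are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 64))   # stand-in query embeddings
gallery = rng.normal(size=(50, 64))    # stand-in gallery embeddings

# L2-normalize so the dot product equals cosine similarity
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

k = 5
sims = queries @ gallery.T                       # (num_queries, num_gallery)
topk = np.argsort(-sims, axis=1)[:, :k]          # top-k gallery ids per query

# k-occurrence: how often each gallery item appears in a query's top-k.
# Items with unusually high counts are the hubs.
k_occurrence = np.bincount(topk.ravel(), minlength=gallery.shape[0])
hubs = np.argsort(-k_occurrence)[:3]
print("most hub-like gallery ids:", hubs, "counts:", k_occurrence[hubs])
```

In high dimensions the k-occurrence distribution becomes strongly skewed, which is exactly the symptom the methods above try to suppress.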


Hubness mitigation
One paradigm focuses on rescaling the similarity space to account for asymmetries in nearest-neighbor relations, which can be achieved through local or global scaling schemes.
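One well-known global scheme, borrowed from the bilingual word-translation literature, is the inverted softmax: each gallery item's score is normalized by how strongly that item responds to all queries, so a hub that is close to everything gets a large denominator. A toy numpy illustration (β is an illustrative temperature, not a value from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
sims = rng.uniform(size=(10, 6))   # raw query-by-gallery similarity matrix
beta = 20.0                        # inverse-temperature hyperparameter

# Inverted softmax: softmax over the *query* axis for each gallery item.
# A gallery item close to many queries (a hub) accumulates a large
# denominator, so its normalized score for any single query shrinks.
exp_sims = np.exp(beta * sims)
norm_sims = exp_sims / exp_sims.sum(axis=0, keepdims=True)
```

Local schemes (e.g. CSLS-style corrections) instead rescale each score by statistics of a point's own neighborhood rather than over the full query set.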

3. Method

QB-NORM is a test-time similarity normalization framework for cross-modal retrieval. Rather than learning a new mapping between modalities, it keeps the trained joint embedding fixed and re-normalizes query–gallery similarity scores using a bank of pre-sampled queries (the querybank, drawn for example from the training set). Gallery items that are highly similar to many querybank queries are likely hubs, and their scores are discounted accordingly.

The QB-NORM procedure is as follows:
1. Construct a querybank B by sampling queries from a pool that does not include the test queries (e.g. training-set queries).
2. Offline, compute the similarity between every querybank query and every gallery item, and cache, for each gallery item x, the normalizer Σ_{b∈B} exp(β s(b, x)).
3. Mark as potential hubs the gallery items that appear among the top-k nearest neighbors of querybank queries (the "activation set").
4. At test time, given a query q, compute the raw similarities s(q, x). If the top-ranked gallery item lies in the activation set, replace each score with the inverted-softmax value exp(β s(q, x)) / Σ_{b∈B} exp(β s(b, x)); otherwise keep the raw scores. This querybank-gated switching is the Dynamic Inverted Softmax.
5. Rank the gallery by the resulting scores.
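The Dynamic Inverted Softmax normalizes a query's scores only when its top raw match looks like a hub according to the querybank. A minimal numpy sketch of this inference path, with random embeddings as stand-ins and illustrative values of β, k, and all sizes (not taken from the paper):

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
bank = l2norm(rng.normal(size=(200, 64)))     # querybank embeddings
gallery = l2norm(rng.normal(size=(50, 64)))   # gallery embeddings
query = l2norm(rng.normal(size=(64,)))        # a single test query

beta, k = 20.0, 5

# Offline: querybank-gallery similarities, per-gallery normalizers,
# and the "activation set" of likely hubs (items in any bank top-k).
bank_sims = bank @ gallery.T                              # (200, 50)
normalizers = np.exp(beta * bank_sims).sum(axis=0)        # one per gallery item
activation_set = set(np.argsort(-bank_sims, axis=1)[:, :k].ravel().tolist())

# Online: apply the inverted softmax only if the raw top-1 is a suspected hub;
# otherwise leave the raw similarities untouched.
raw = gallery @ query
if int(np.argmax(raw)) in activation_set:
    scores = np.exp(beta * raw) / normalizers
else:
    scores = raw
ranking = np.argsort(-scores)
```

Everything involving the querybank is computed once offline, so the per-query overhead is one set lookup plus, at most, an element-wise rescaling.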

Because it requires no retraining or fine-tuning and adds only a cached normalizer and a set lookup at query time, QB-NORM is simple and cheap to deploy while consistently improving cross-modal retrieval accuracy.

Conclusion

This paper introduces the Querybank Normalization framework for hubness mitigation and proposes the Dynamic Inverted Softmax for robust similarity normalization, demonstrating broad applicability across a range of tasks, models, and benchmarks.

Origin blog.csdn.net/zag666/article/details/129811622