The latest high-quality survey of cross-modal retrieval: "Image-text Retrieval: A Survey on Recent Research and Development"

Image-text Retrieval: A Survey on Recent Research and Development
March 2022
In this article, the literature citations have been replaced one by one with the corresponding paper titles, to make them easier to look up and read.

Abstract

This paper provides a comprehensive and up-to-date survey of ITR methods from four perspectives. By dissecting an ITR system into two processes, feature extraction and feature alignment, we summarize recent progress in ITR methods from these two perspectives. On this basis, the efficiency of ITR systems is introduced as a third perspective. To keep pace with the times, we also provide a groundbreaking overview of cross-modal pre-trained ITR methods as a fourth perspective. Finally, we review the common benchmark datasets and evaluation metrics for ITR and compare the accuracy of representative ITR methods. The paper concludes with a discussion of some critical but understudied issues.

1 Introduction

Cross-modal image-text retrieval (ITR) retrieves relevant samples in one modality given a user's query expressed in the other modality, and usually includes two sub-tasks: image-to-text (i2t) and text-to-image (t2i) retrieval. ITR has broad application prospects in search and is a valuable research topic. In the past few years, we have witnessed the huge success of ITR thanks to the boom in deep models for language and vision. For example, with the rise of BERT, transformer-based cross-modal pre-training paradigms have been developed, and their pre-training-then-fine-tuning form has been extended to downstream ITR tasks, accelerating its development.

Limitations of previous reviews: 1) in addition to the ITR task, other multi-modal tasks such as video-text retrieval and visual question answering are also covered, resulting in a less in-depth investigation of ITR; 2) the pre-training paradigm, which is now clearly mainstream, is basically absent from existing reviews. In view of this, we present a comprehensive and state-of-the-art survey of the ITR task, with particular attention to the pre-training paradigm.

An ITR system usually consists of a feature extraction process in the image/text processing branches and a feature alignment process in the integration module. In the context of such an ITR system, we construct a taxonomy from four perspectives to outline ITR methods. Figure 1 shows the classification skeleton of ITR methods.
Figure 1: Classification skeleton illustrating the ITR method from four perspectives

(1) Feature extraction. Existing methods for extracting robust and discriminative image and text features fall into three categories. 1) Visual-semantic-embedding-based methods are committed to learning features independently. 2) In contrast, cross-attention methods learn features interactively. 3) Adaptive methods aim to learn features in a modality-adaptive manner.

(2) Feature alignment. The heterogeneity of multi-modal data makes the integration module crucial for aligning image and text features. Existing methods come in two variants. 1) Global-alignment-driven methods align global features across modalities. 2) In addition, some methods try to find local alignments explicitly at a fine-grained level, the so-called methods involving local alignment.

(3) System efficiency. Efficiency plays a vital role in a good ITR system. In addition to research on improving ITR accuracy, a series of works pursues efficient retrieval systems in three different ways. 1) Hash encoding methods reduce computational cost by binarizing floating-point features. 2) Model compression methods emphasize low energy consumption and lightweight operation. 3) Fast-then-slow methods perform retrieval with a coarse-grained fast stage followed by a fine-grained slow stage.

(4) Pre-training paradigm. To stay at the forefront of research, we also investigate in depth the recently popular cross-modal pre-training methods for the ITR task. Compared with traditional ITR, pre-trained ITR methods benefit from the rich knowledge implicit in large-scale cross-modal pre-trained models and can produce encouraging performance even without complex retrieval engineering. In the context of the ITR task, cross-modal pre-training methods still fit the taxonomy of the above three perspectives. However, to describe the characteristics of pre-trained ITR methods more clearly, we re-classify them along three dimensions: model architecture, pre-training tasks, and pre-training data.

Next, Section 2 summarizes ITR methods based on the taxonomy of the first three perspectives, and Section 3 specifically covers pre-trained ITR methods, the fourth perspective. Section 4 details the common datasets, the evaluation metrics, and accuracy comparisons between representative methods, and Section 5 gives conclusions and future work.

2. Image-Text Retrieval

2.1 Feature extraction

Extracting image and text features is the first and most critical process in an ITR system. As shown in Figure 2, ITR feature extraction techniques are flourishing along three different development trends: visual semantic embedding, cross attention, and adaptation.
Figure 2: Illustration of different feature extraction architectures

Visual Semantic Embedding (VSE). Encoding image and text features independently is an intuitive and straightforward way to do ITR. VSE-based methods have been widely developed, roughly along two lines.

1) In terms of data, a series of works [Learning fragment self-attention embeddings for image-text matching. In ACM MM, 2019; Probabilistic embeddings for cross-modal retrieval. In CVPR, 2021] tries to mine high-order data information for learning powerful features. These works treat all data pairs equally when learning features. In contrast, some researchers [Vse++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2017] proposed weighting informative data pairs more heavily to improve feature discrimination, and others [Learning cross-modal retrieval with noisy labels. In CVPR, 2021] pay more attention to mismatched, noisy correspondences in data pairs during feature extraction. Recently, riding the trend of large-scale cross-modal pre-training, some works [Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021; Wenlan: Bridging vision and language by large-scale multi-modal pre-training, 2021] directly utilize large-scale web data to pre-train image and text feature extractors, showing impressive performance on downstream ITR tasks.
2) In terms of the loss function, a ranking loss is often used in VSE-based methods [Deep visual-semantic embedding model. In NeurIPS, 2013; Vse++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2017] to constrain the cross-modal data relationships of the learned features. In addition, [Learning two-branch neural networks for image-text matching tasks. In TPAMI, 2018] proposes a maximum-margin ranking loss with neighborhood constraints for better feature extraction, and [Dual-path convolutional image-text embeddings with instance loss. In TOMM, 2020] proposes an instance loss that explicitly considers the intra-modal data distribution.
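To make the ranking-loss idea concrete, below is a minimal PyTorch-style sketch of a hinge-based triplet ranking loss with in-batch hardest-negative mining in the spirit of VSE++; the tensor shapes, the margin value, and the variable names are illustrative assumptions rather than the implementation of any particular paper.

```python
import torch

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (B, D) L2-normalized global embeddings, where row i of
    # img_emb and row i of txt_emb form a matched (positive) image-text pair.
    scores = img_emb @ txt_emb.t()                  # (B, B) pairwise cosine similarities
    diagonal = scores.diag().view(-1, 1)            # similarities of the positive pairs

    # Hinge costs against every in-batch negative
    cost_txt = (margin + scores - diagonal).clamp(min=0)       # image query vs. negative texts
    cost_img = (margin + scores - diagonal.t()).clamp(min=0)   # text query vs. negative images

    # Zero out the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # VSE++-style hard-negative mining: keep only the hardest negative per query
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```

Summing over all negatives instead of taking the maximum recovers the plain ranking loss that treats all data pairs equally.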

Thanks to independent feature encoding, VSE-based methods enable an efficient ITR system in which the features of a large number of gallery samples can be precomputed offline. However, they may yield sub-optimal features and limited ITR performance because the interaction between image and text data is barely explored.

Cross Attention (CA). [Stacked cross attention for image-text matching. In ECCV, 2018] was the first attempt to consider dense pairwise cross-modal interactions and produced a huge accuracy improvement at the time. Since then, various CA methods have been proposed to extract features. With the transformer architecture, researchers can simply run a transformer on the concatenation of image and text tokens to learn cross-modal contextual features. This opens up a rich research line of transformer-like CA methods [Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019; Uniter: Universal image-text representation learning. In ECCV, 2020]. In addition, injecting additional content or operations into cross-attention to assist feature extraction is another research direction. [Saliency-guided attention network for image-sentence matching. In ICCV, 2019] adopts a visual saliency detection module to guide cross-modal association. [Rosita: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In ACM MM, 2021] integrates intra-modal and cross-modal knowledge to jointly learn image and text features.
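As a rough illustration of the transformer-like CA idea, the sketch below simply concatenates image-region tokens and word tokens and lets a standard transformer encoder attend across them; the dimensions, the modality embeddings, and the pooling/matching head are illustrative assumptions, not the design of any specific paper.

```python
import torch
import torch.nn as nn

class SingleStreamCrossAttention(nn.Module):
    """Toy single-stream encoder: self-attention over concatenated image and text tokens."""

    def __init__(self, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.modality_embed = nn.Embedding(2, dim)   # tells the encoder which modality a token comes from
        self.match_head = nn.Linear(dim, 1)          # scores how well the pair matches

    def forward(self, region_feats, word_feats):
        # region_feats: (B, R, dim) image region/patch tokens
        # word_feats:   (B, W, dim) word tokens
        tokens = torch.cat([
            region_feats + self.modality_embed.weight[0],
            word_feats + self.modality_embed.weight[1],
        ], dim=1)                                    # (B, R + W, dim)
        ctx = self.encoder(tokens)                   # cross-modal contextualized features
        return self.match_head(ctx.mean(dim=1)).squeeze(-1)   # (B,) pairwise matching score
```

Because the score depends on both inputs jointly, it must be recomputed for every candidate pair at query time, which is exactly the efficiency issue discussed next.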

CA methods narrow the gap of data heterogeneity and tend to obtain high-accuracy retrieval results, but they are cost-prohibitive since each image-text pair must pass through the cross-attention module online.

Adaptive (SA). Unlike VSE- and CA-based methods, which follow a fixed computation flow to extract features, [Dynamic modality interaction modeling for image-text retrieval. In SIGIR, 2021] builds a modality-adaptive interaction network from scratch, in which different image-text pairs can be adaptively routed to different feature extraction mechanisms. It effectively inherits the respective advantages of the above two groups of methods and is classified as an SA method.

2.2 Feature alignment

After feature extraction, cross-modal features need to be aligned to compute pairwise similarities and perform retrieval. Global alignment and local alignment are the two directions.
Figure 3: Illustration of different feature alignment architectures

Global alignment. In global-alignment-driven methods, images and text are matched from a global perspective, as shown in Figure 3(a). Early works [Vse++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2017; Learning two-branch neural networks for image-text matching tasks. In TPAMI, 2018] are usually equipped with a clear and simple two-stream global feature learning network, and pairwise similarity is computed by comparing global features. Later research [Adversarial representation learning for text-to-image matching. In ICCV, 2019; Dual-path convolutional image-text embeddings with instance loss. In TOMM, 2020] focused on improving this dual-stream network structure to better align global features. However, these global-alignment-only methods tend to exhibit limited performance, because text descriptions usually contain finer-grained details of the image, which are easily smoothed out by global alignment. There is one exception: recent global alignment methods under the pre-training-fine-tuning paradigm [Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021] tend to produce satisfactory results, which is attributed to the expansion of the scale of pre-training data.

In summary, applying only global alignment to ITR may lead to inadequate modeling of fine-grained correspondences and is relatively weak for computing reliable pairwise similarities. Complementing global alignment with alignment at other granularities is one solution.

Local alignment. As shown in Figure 3(b), the regions or patches in an image correspond to the words in a sentence, the so-called local alignment. Combining global and local alignment constitutes a complementary solution for ITR and is a popular choice, classified as methods involving local alignment.

Using a vanilla attention mechanism [Stacked cross attention for image-text matching. In ECCV, 2018; Camp: Cross-modal adaptive message passing for text-image retrieval. In ICCV, 2019; Uniter: Universal image-text representation learning. In ECCV, 2020; Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021] is a straightforward way to explore semantic region/patch-word correspondences. However, due to semantic complexity, these methods may not capture the optimal fine-grained correspondences. First, selectively focusing on local components is one way to find an optimal local alignment. [Focus your attention: A bidirectional focal attention network for image-text matching. In ACM MM, 2019] is the first attempt to selectively align the local semantics of different modalities. [Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR, 2020] and [Context-aware attention network for image-text retrieval. In CVPR, 2020] follow closely. The former learns to associate local components using an iterative local alignment scheme. The latter notes that an object or a word may have different semantics in different global contexts, and proposes to adaptively select information-rich local components for local alignment according to the global context. Since then, some methods with the same goal have been proposed, such as designing an alignment-guided masking strategy [Kaleido-bert: Vision-language pre-training on fashion domain. In CVPR, 2021] or developing an attention filtering technique [Similarity reasoning and filtration for image-text matching. In AAAI, 2021]. In addition, fully exploiting local correspondences is another way to approximate the optimal local alignment. [Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In CVPR, 2019] enables different levels of text components to be aligned with image regions. [Step-wise hierarchical alignment network for image-text matching. In IJCAI, 2021] proposes a step-wise hierarchical alignment network that achieves local-to-local, global-to-local, and global-to-global alignment.
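To sketch what region-word local alignment looks like computationally, the snippet below lets each word attend over the image regions and pools the resulting word-level similarities into one image-sentence score, loosely following the stacked-cross-attention idea; the temperature and the mean pooling are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def region_word_similarity(regions, words, temperature=9.0):
    # regions: (R, D) region features of one image
    # words:   (W, D) word features of one sentence
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    attn = words @ regions.t()                       # (W, R) word-region affinities
    attn = F.softmax(temperature * attn, dim=-1)     # each word attends over all regions
    attended = attn @ regions                        # (W, D) attended region context per word

    # Word-level cosine similarities, pooled into a single image-sentence score
    word_scores = F.cosine_similarity(words, attended, dim=-1)   # (W,)
    return word_scores.mean()
```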

In addition, as shown in Figure 3(c), there is another type of local alignment, relationship-aware local alignment, which can promote fine-grained alignment. Some methods [Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. In NeurIPS, 2021; Multi-modality cross attention network for image and sentence matching. In CVPR, 2020] explore intra-modality relationships to facilitate alignment between modalities. Other methods [Visual semantic reasoning for image-text matching. In ICCV, 2019; Ernie-vil: Knowledge enhanced vision-language representations through scene graph. In AAAI, 2021; Similarity reasoning and filtration for image-text matching. In AAAI, 2021] model image/text data as graph structures whose edges convey relationship information, and infer relationship-aware similarities with local and global alignment through graph convolutional networks. Furthermore, [Learning relation alignment for calibrated cross-modal retrieval. In ACL-IJCNLP, 2021] considers relation consistency, i.e., the consistency between visual relations among objects and textual relations among words.

2.3 Retrieval efficiency

The feature extraction in Section 2.1 and the feature alignment in Section 2.2 together form a complete ITR system focused on retrieval accuracy. Beyond accuracy, retrieval efficiency is also key to an excellent ITR system, which has led to a series of efficiency-oriented ITR methods.

Hash encoding. Hash binary codes have advantages in model computation and storage, which has led to growing interest in hash encoding methods for ITR. These studies learn to map sample features into a compact hash code space to achieve efficient ITR. [Pairwise relationship guided deep hashing for cross-modal retrieval. In AAAI, 2017] simultaneously learns real-valued features and binary hash features of images and texts so that they benefit from each other. [Attention-aware deep adversarial hashing for cross-modal retrieval. In ECCV, 2018] introduces an attention module to find focused regions and words to promote binary feature learning. In addition to these supervised methods, unsupervised cross-modal hashing is also a focus. [Self-supervised adversarial hashing networks for cross-modal retrieval. In CVPR, 2018] combines adversarial networks with unsupervised cross-modal hashing to maximize semantic correlation and consistency between the two modalities. [Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing. In AAAI, 2021] designs a graph-neighbor coherence network to explore the neighbor information of samples for unsupervised hashing learning. Hash encoding methods improve efficiency, but they also reduce accuracy because binary codes simplify the feature representation.
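As a minimal illustration of hash-based retrieval, the snippet below binarizes real-valued embeddings with the sign function and ranks the gallery by Hamming distance; the code length and the random features are hypothetical placeholders for the learned hash features described above.

```python
import numpy as np

def binarize(features):
    # Map real-valued embeddings to +/-1 hash codes with the sign function.
    return np.where(features >= 0, 1, -1).astype(np.int8)

def hamming_rank(query_code, gallery_codes):
    # For +/-1 codes of length D, Hamming distance = (D - <q, g>) / 2, so ranking by
    # distance reduces to one integer matrix-vector product over the whole gallery.
    dots = gallery_codes.astype(np.int32) @ query_code.astype(np.int32)
    distances = (gallery_codes.shape[1] - dots) // 2
    return np.argsort(distances)                     # gallery indices, nearest first

# Illustrative usage with random vectors standing in for learned features
gallery_img_emb = np.random.randn(1000, 64)          # hypothetical gallery image embeddings
query_txt_emb = np.random.randn(64)                  # hypothetical query text embedding
ranking = hamming_rank(binarize(query_txt_emb), binarize(gallery_img_emb))
```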

Model compression. With the advent of the cross-modal pre-training era, ITR has made a huge leap in accuracy, but at the expense of efficiency. Pre-trained ITR methods are often characterized by bulky network structures, which gives rise to model compression methods. Some researchers [Playing lottery tickets with vision and language. In AAAI, 2022] introduce the lottery ticket hypothesis to strive for a smaller and lighter network architecture. In addition, based on the consensus that image pre-processing consumes most of the computing resources in pre-training architectures, some researchers [Pixel-bert: Aligning image pixels with text by deep multi-modal transformers, 2020; Seeing out of the box: End-to-end pre-training for vision-language representation learning. In CVPR, 2021] specifically optimize the image pre-processing step to improve retrieval efficiency. However, even with lightweight architectures, most of these methods, which usually use cross-attention for better feature learning, still suffer from long inference time because feature extraction must be executed for every image-text pair.

Fast-then-slow. The above two groups of methods cannot achieve the best compromise between efficiency and accuracy, so a third group of methods has emerged: the combination of fast and slow retrieval. In view of the respective efficiency and accuracy advantages of the VSE and CA methods in Section 2.1, some researchers [Thinking fast and slow: Efficient text-to-visual retrieval with transformers. In CVPR, 2021; Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021] propose to first use fast VSE techniques to filter out a large number of easy negative gallery samples, and then use slow CA techniques to retrieve the positives, achieving a good balance between efficiency and accuracy.
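A minimal sketch of such a two-stage pipeline is shown below: a fast dual-encoder (VSE) pass shortlists candidates with one matrix-vector product, and the slow cross-attention model re-ranks only the shortlist. The encoder and scorer here are hypothetical callables standing in for the fast and slow models, and the shortlist size is an arbitrary example value.

```python
import torch

def retrieve_fast_then_slow(query_text, gallery_images, vse_text_encoder,
                            vse_image_embs, ca_scorer, shortlist_size=50):
    # vse_image_embs: (N, D) gallery image embeddings precomputed offline by the VSE model.
    # vse_text_encoder(text) -> (D,) embedding; ca_scorer(image, text) -> float matching score.

    # Stage 1 (fast): one dot product against the whole gallery
    q = vse_text_encoder(query_text)                         # (D,)
    coarse_scores = vse_image_embs @ q                       # (N,)
    shortlist = torch.topk(coarse_scores, k=shortlist_size).indices.tolist()

    # Stage 2 (slow): run the expensive cross-attention scorer only on the shortlist
    fine_scores = torch.tensor([ca_scorer(gallery_images[i], query_text) for i in shortlist])
    order = torch.argsort(fine_scores, descending=True)
    return [shortlist[j] for j in order.tolist()]            # gallery indices, best match first
```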

3. Pre-trained image-text retrieval

For ITR tasks, the early paradigm was to fine-tune networks that had been pre-trained separately in the fields of computer vision and natural language processing. The turning point came in 2019 with growing interest in developing general cross-modal pre-trained models and extending them to downstream ITR tasks [Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019; Visualbert: A simple and performant baseline for vision and language, 2019]. With powerful cross-modal pre-training techniques, the performance of ITR tasks has grown explosively without any bells and whistles. Currently, most pre-trained ITR methods adopt the transformer architecture as their building block. On this basis, research mainly focuses on model architecture, pre-training tasks, and pre-training data.

Model architecture. One group of works [Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019; Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021] is interested in the dual-stream model architecture, i.e., two independent encoders followed by optional post-interaction between the image and text processing branches. Meanwhile, single-stream architectures, which encapsulate the image and text processing branches into one, are becoming increasingly popular [Visualbert: A simple and performant baseline for vision and language, 2019; Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020; Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021]. Most methods rely heavily on an image pre-processing step, usually involving an object detection module or a convolutional architecture, to extract preliminary visual features as input to the subsequent transformer. The resulting problem is twofold. First, this step consumes more computing resources than the subsequent processes, making the model inefficient. Second, the predefined visual vocabulary from object detection limits the expressive ability of the model, lowering accuracy.

Encouragingly, research on improving the image pre-processing step has recently become popular. In terms of improving efficiency, [Seeing out of the box: End-to-end pre-training for vision-language representation learning. In CVPR, 2021] adopts a fast visual dictionary to learn features of the entire image, and [Pixel-bert: Aligning image pixels with text by deep multi-modal transformers, 2020] directly aligns image pixels with text in the transformer. In addition, [Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021; Fashionbert: Text and image matching with adaptive loss for cross-modal retrieval. In SIGIR, 2020] feed patch-level image features into the transformer, and [Kd-vlp: Improving end-to-end vision-and-language pretraining with object knowledge distillation. In EMNLP, 2021] segments the image into grids aligned with text. In terms of improving accuracy, [Vinvl: Revisiting visual representations in vision-language models. In CVPR, 2021] develops an improved object detection model to enhance visual features. [E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning. In ACL-IJCNLP, 2021] brings together object detection and image captioning tasks to enhance visual learning. [Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. In NeurIPS, 2021] explores visual relationships with a self-attention mechanism when learning image features. Taking all of this into account, [An empirical study of training end-to-end vision-and-language transformers, 2021] thoroughly investigates these model designs and proposes a new end-to-end transformer framework that achieves a win-win in efficiency and accuracy. Advances in cross-modal pre-training model architectures have promoted the performance improvement of ITR.

Pre-training tasks. The pre-training task guides the model to learn effective multi-modal features in an end-to-end manner. Pre-trained models are designed for multiple cross-modal downstream tasks, so various pretext tasks are commonly used. These pretext tasks fall mainly into two categories: image-text matching and masked modeling.

ITR is an important downstream task in the field of cross-modal pre-training, and its related pretext task, image-text matching, is widely used in pre-trained models. Generally, an ITR task-specific head is attached to a transformer-like structure to judge whether the input image-text pair is semantically matched by comparing the global features of the different modalities. It can be regarded as a coarse-grained image-text matching pretext task [Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019; Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020; Uniter: Universal image-text representation learning. In ECCV, 2020; Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021; Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021]. In addition, it has been extended to fine-grained image-text matching pretext tasks: patch-word alignment [Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021], region-word alignment [Uniter: Universal image-text representation learning. In ECCV, 2020], and region-phrase alignment [Kd-vlp: Improving end-to-end vision-and-language pretraining with object knowledge distillation. In EMNLP, 2021]. There is no doubt that the image-text matching pretext task establishes a direct connection with the downstream ITR task, narrowing the gap between the task-agnostic pre-trained model and ITR.
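A minimal sketch of such an image-text matching head is given below: a binary classifier on the pooled cross-modal feature, supervised with matched pairs as positives and randomly mismatched pairs as negatives. The hidden size and the pooling convention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatchingHead(nn.Module):
    """Minimal ITM pretext-task head: binary matched/mismatched classification."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, pooled_feature, labels):
        # pooled_feature: (B, hidden_dim) global cross-modal feature (e.g. a [CLS]-style token)
        # labels: (B,) 1 for semantically matched pairs, 0 for randomly mismatched ones
        logits = self.classifier(pooled_feature)
        return F.cross_entropy(logits, labels)
```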

Inspired by pre-training in natural language processing, the masked language modeling pretext task is often used in cross-modal pre-training models. Correspondingly, a masked visual modeling pretext task also appears in this setting; the two are collectively referred to as masked modeling tasks. In masked language modeling [Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019; Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020; Vinvl: Revisiting visual representations in vision-language models. In CVPR, 2021], the input text follows a specific masking rule that randomly masks out several words in a sentence, and the pretext task then drives the network to predict the masked words from the unmasked words and the input image. In masked visual modeling, the network regresses the embedded features of the masked regions [Uniter: Universal image-text representation learning. In ECCV, 2020], predicts their semantic labels [Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020], or does both [Kd-vlp: Improving end-to-end vision-and-language pretraining with object knowledge distillation. In EMNLP, 2021]. Masked modeling tasks implicitly capture the dependencies between images and text, providing strong support for downstream ITR tasks.
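To make the masking rule concrete, the sketch below randomly masks a fraction of the word tokens and builds the prediction targets; the 15% masking probability follows common BERT-style practice and, like the function itself, is an assumption for illustration rather than the exact rule of any cited model.

```python
import torch

def random_word_mask(token_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    # token_ids: (B, L) word token ids of the input sentences.
    # Returns the corrupted input and the targets; unmasked positions get ignore_index
    # so the prediction loss is computed only on the masked words.
    token_ids = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    targets = torch.full_like(token_ids, ignore_index)
    targets[mask] = token_ids[mask]
    token_ids[mask] = mask_token_id
    # Feed token_ids (together with the image features) to the model and supervise
    # its word predictions with targets.
    return token_ids, targets
```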

Pre-training data. Data-level research is a positive trend in the field of cross-modal pre-training. On the one hand, intra-modal and cross-modal knowledge in image and text data is fully exploited in pre-trained ITR methods [Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020; Rosita: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In ACM MM, 2021]. On the other hand, many studies focus on increasing the size of the pre-training data. In addition to the widely used large-scale out-of-domain datasets built especially for pre-training models [Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020; Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021], in-domain datasets originally used for fine-tuning and evaluating downstream tasks are added to the pre-training data for better multi-modal feature learning [Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020; Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021]. Beyond this, rich unpaired single-modality data can also be added to the pre-training data to learn more general features [Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning, 2020]. In addition, some researchers [Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data, 2020; Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021; Filip: Fine-grained interactive language-image pre-training. In ICLR, 2022] collect new and larger-scale data for pre-training models; such simple and crude operations often lead to excellent performance in various downstream cross-modal tasks, including ITR. Generally speaking, data-level attention has a positive impact on cross-modal pre-trained models, which naturally promotes downstream ITR tasks.

4. Datasets and Evaluation

4.1 Dataset

Researchers have proposed various datasets for ITR. We summarize the most frequently used datasets below. 1) COCO Captions contains 123,287 images from the Microsoft Common Objects in COntext (COCO) dataset, each annotated with five human-written captions. After rare words are removed, the average caption length is 8.7 words. The dataset is divided into 82,783 training images, 5,000 validation images, and 5,000 test images. Researchers evaluate their models on five folds of 1K test images and on the full 5K test images. 2) Flickr30K includes 31,000 images collected from the Flickr website, each with five text descriptions. The dataset is divided into three parts: 1,000 images for validation, 1,000 images for testing, and the rest for training.

4.2 Evaluation metrics

R@K is the most commonly used evaluation metric in ITR. It is short for recall at the K-th position of the ranking list and is defined as the fraction of queries for which a correct match appears in the top K retrieved results.
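The following is a small sketch of how R@K can be computed from a query-gallery similarity matrix; for simplicity it assumes one ground-truth gallery item per query (on COCO Captions and Flickr30K an image query has five matching captions, and any of them counts as a hit).

```python
import numpy as np

def recall_at_k(similarity, gt_index, k):
    # similarity: (Q, G) query-to-gallery similarity scores
    # gt_index:   (Q,) index of the correct gallery item for each query
    top_k = np.argsort(-similarity, axis=1)[:, :k]        # top-K gallery indices per query
    hits = (top_k == gt_index[:, None]).any(axis=1)       # did the ground truth appear in top K?
    return hits.mean()
```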

4.3 Accuracy comparison

We compare the accuracy of representative and state-of-the-art ITR methods in terms of feature extraction and feature alignment.

Feature extraction. We list the comparison results in Table 1. Among VSE-based methods, ALIGN [Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021] far exceeds the other pre-training methods in the amount of pre-training data, being trained on more than 1 billion image-text pairs, and therefore achieves a large accuracy improvement over the other methods. Comparing the CA methods, we can see that their accuracy gradually improves over time. Comparing VSE-based methods with CA methods: 1) SCAN [Stacked cross attention for image-text matching. In ECCV, 2018], as the first attempt at a CA method, achieved a breakthrough in accuracy compared with the then VSE-based method LTBN [Learning two-branch neural networks for image-text matching tasks. In TPAMI, 2018]; 2) overall, except for ALIGN, CA methods hold an overwhelming advantage over VSE-based methods in R@1, which is attributed to the in-depth exploration of cross-modal feature interactions in CA methods. However, as an exception, pre-training VSE-based methods on very large-scale data may offset the performance degradation caused by less exploration of cross-modal interactions, which is strongly supported by the results of ALIGN. Comparing the SA method with VSE- and CA-based methods under the same setting, i.e., traditional ITR, the SA method DIME [Dynamic modality interaction modeling for image-text retrieval. In SIGIR, 2021] outperforms the VSE-based and CA methods on Flickr30K, but is inferior to SAN [Saliency-guided attention network for image-sentence matching. In ICCV, 2019] on COCO Captions. There is room for further development of SA techniques.
Table 1: Accuracy comparison between ITR methods at R@1 from a feature extraction perspective. Methods marked with "*" represent pre-training methods. We show the best results for each method reported in the original paper

Feature alignment. The comparison results are shown in Table 2. Comparing the global-alignment-driven methods, even though ALIGN adopts a basic two-stream structure for global alignment, its R@1 is still above the other methods, including TIMAM [Adversarial representation learning for text-to-image matching. In ICCV, 2019] and PCME [Probabilistic embeddings for cross-modal retrieval. In CVPR, 2021], which have more complex network structures for global alignment. Comparing the methods involving local alignment, ALBEF [Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021] shows excellent performance. It is worth noting that Uniter [Uniter: Universal image-text representation learning. In ECCV, 2020] and ViLT [Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021] produce good results using only a vanilla attention mechanism, whereas SCAN [Stacked cross attention for image-text matching. In ECCV, 2018] and CAMP [Camp: Cross-modal adaptive message passing for text-image retrieval. In ICCV, 2019] adopt a similar mechanism and perform poorly at R@1. Uniter and ViLT perform ITR in the pre-training-fine-tuning form, and the rich knowledge of cross-modal data acquired in pre-training benefits downstream ITR tasks. Comparing global-alignment-driven methods with methods involving local alignment, the latter show better overall performance than the former, indicating the importance of local alignment for high-accuracy ITR.
Table 2: Accuracy comparison between ITR methods at R@1 from feature alignment perspective. Methods marked with "*" represent pre-training methods. Global. and Local. are the abbreviations of global alignment and local alignment respectively. We present the best results for each method reported in the original paper

In addition, we summarize the development trend of ITR from 2017 to 2021 in Figure 4. Over the years, we can see a clear trend of improving accuracy. In particular, the big leap forward in 2020 is thanks to pre-trained ITR techniques. Since then, the accuracy of pre-trained ITR methods has continued to gain momentum. It can be seen that pre-training plays a leading role in promoting the development of ITR, and this is inseparable from the support of an ever-expanding training data scale: with the emergence of pre-trained ITR, the amount of training data increases dramatically.
Figure 4: The development trend of ITR in recent years. The black numbers above the line graph are the R@1 values of the methods, and the red numbers are the amounts of multi-modal training data.

5. Conclusion and future work

In this paper, we provide a comprehensive review of ITR methods from four perspectives: feature extraction, feature alignment, system efficiency, and the pre-training paradigm. We also summarize the widely used datasets and evaluation metrics in ITR, and on this basis we quantitatively analyze the performance of representative methods. The survey concludes that ITR technology has made great progress in the past few years, especially with the advent of the cross-modal pre-training era. However, there are still some under-explored issues in ITR. We make the following observations about possible future developments.
Figure 5: Illustration of noisy data in COCO Captions. (a) From the paired text description marked in orange, it is difficult to capture the content of the image. (b) In addition to the positive image-text pairs with solid arrows, there also appear to be correspondences for the negative image-text pairs with dashed arrows.

Data. Current ITR methods are essentially data-driven; in other words, researchers design and optimize networks to seek the best retrieval scheme based on existing benchmark datasets. First, the heterogeneity of cross-modal data and the ambiguity of semantics inevitably introduce noise into the datasets. For example, as shown in Figure 5, COCO Captions contains elusive text descriptions of images as well as multiple correspondences between images and texts. Therefore, to some extent, the results of current ITR methods on such datasets remain debatable. There have been some explorations of data diversity [Polysemous visual-semantic embedding for cross-modal retrieval. In CVPR, 2019; Probabilistic embeddings for cross-modal retrieval. In CVPR, 2021; Learning cross-modal retrieval with noisy labels. In CVPR, 2021], but they only consider the training data and ignore the test data. In addition, beyond ordinary data information, i.e., images and text, the scene text appearing in images is a valuable clue for ITR that is usually ignored by existing methods. [Stacmr: Scene-text aware cross-modal retrieval. In WACV, 2021] is a pioneering work that explicitly incorporates scene-text information into the ITR model. These studies leave room for further development of ITR at the data level.

Knowledge. Humans have a powerful ability to make semantic connections between vision and language, thanks to their accumulated commonsense knowledge and ability to reason causally. Naturally, incorporating such high-level knowledge into ITR models is valuable for improving their performance. CVSE [Consensus-aware visual-semantic embedding for image-text matching. In ECCV, 2020] is a pioneering work that computes statistical correlations in an image-caption corpus as commonsense knowledge for ITR. However, this commonsense knowledge is limited by the corpus and not fully suitable for ITR. In the future, tailoring commonsense knowledge for ITR and building causal inference models may be promising.

New paradigm. Under the current trend, pre-trained ITR methods have an overwhelming accuracy advantage over traditional ITR methods. Pre-training a large-scale cross-modal model and then fine-tuning it has become the basic paradigm for achieving state-of-the-art retrieval results. However, this paradigm requires a large amount of labeled data in the fine-tuning stage and is difficult to apply to real-world scenarios. It therefore makes sense to seek and develop a new resource-friendly ITR paradigm. For example, the recently emerging prompt-based tuning techniques, with their excellent few-shot capabilities, provide guidance for developing such a new paradigm, the so-called pre-training-prompting paradigm.
