A Survey of Full-Cycle Cross-Modal Retrieval: From a Representation Learning Perspective

Cross-modal retrieval overview

Figure 1. Issues and challenges in cross-modal retrieval

Feature extraction

Feature extraction is the core module of cross-modal retrieval: it encodes the original corpus into embeddings, such as visual embeddings and language embeddings. A range of features can be extracted with deep learning models. Compared with traditional CNNs [11], which focus on pixel-level grid features, many recent methods explore region-level features in images, such as the Faster R-CNN detector proposed in [12]. The Transformer [13] and BERT [14] architectures are widely used examples of the pre-training and fine-tuning paradigm. For example, ViT [15] processes patch features directly, while BERT, UniLM [16], RoBERTa [17], T5 [18], BART [19], the Transformer and ViT can all serve as text encoders. There are likewise many choices for image encoders, including Faster R-CNN, ResNet [20], visual dictionaries [21], the Swin Transformer [22], EfficientNet [23] and linear projection. A minimal sketch of these feature granularities is given below.
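To make the feature granularities above concrete, the following minimal sketch (assumed PyTorch/torchvision usage, not code from any surveyed paper; tensor sizes are illustrative) extracts grid features from a ResNet-50 backbone and patch features through a ViT-style linear projection.

```python
import torch
import torch.nn as nn
import torchvision.models as models

image = torch.randn(1, 3, 224, 224)  # one RGB image tensor (illustrative input)

# Grid features: output of the last convolutional stage of ResNet-50.
resnet = models.resnet50(weights=None)
grid_extractor = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc
grid = grid_extractor(image)                    # (1, 2048, 7, 7) pixel-level grid
grid = grid.flatten(2).transpose(1, 2)          # (1, 49, 2048) as a feature sequence

# Patch features: ViT-style linear projection of non-overlapping 16x16 patches.
patch_proj = nn.Conv2d(3, 768, kernel_size=16, stride=16)
patches = patch_proj(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

print(grid.shape, patches.shape)
```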

Adding pre-trained models

Researchers have added pre-trained models to cross-modal retrieval systems to model the interactions between cross-modal representations. Research shows that, unlike relationships between words, visual concepts in images are both critical and complex for cross-modal representation. By extending the BERT model to images and text, ViLBERT [24] targets region-based object detection, using Faster R-CNN to encode independent sequences of regions. LXMERT [25], similar to ViLBERT, encodes regions into a series of region-of-interest (ROI) features. Beyond regional features, pixel-level grid features are also encoded, as in SOHO [26], CLIP-ViL [27] and Pixel-BERT [28]; these models abandon the time-consuming Faster R-CNN and instead use ResNet to extract grid features. Besides region and grid features, patch projections are also used in many scenarios to represent images. ALBEF [29] directly uses a ViT encoder to process patch features, generating a sequence of flattened two-dimensional patches. OSCAR [30] and ERNIE-ViL [31] exploit additional information to facilitate semantic alignment: OSCAR adds region labels from images as anchors and then implicitly aligns them with words in the text, whereas ERNIE-ViL models a scene graph and focuses on objects with their attributes and relationships. A simple single-stream fusion sketch follows.
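As a hedged illustration of the single-stream idea behind region-based pre-trained models, the sketch below concatenates projected Faster R-CNN-style region features with text token embeddings and feeds them to one shared transformer; the dimensions and layer counts are assumptions, not those of any specific model.

```python
import torch
import torch.nn as nn

text_emb = torch.randn(1, 20, 768)       # 20 text token embeddings (e.g., from BERT)
region_feats = torch.randn(1, 36, 2048)  # 36 detected region features (Faster R-CNN style)

region_proj = nn.Linear(2048, 768)       # project regions into the text embedding space
sequence = torch.cat([text_emb, region_proj(region_feats)], dim=1)   # (1, 56, 768)

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
fused = encoder(sequence)                # jointly contextualized text + region states
print(fused.shape)                       # torch.Size([1, 56, 768])
```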

Feature alignment

Extensive research has been conducted on both image and sentence retrieval to align images and texts that share the same semantics [32, 33, 34, 35, 36, 37]. In early cross-modal alignment research, Reference [32] developed a model using a CNN and a Bi-RNN to relate images and region descriptions: the alignment model combines a CNN over image regions with a bidirectional RNN over sentences, and a structured objective aligns the two modalities through multimodal embeddings. Carvalho et al. [33] exploited both retrieval and class-guided features, formulating a joint objective of a retrieval loss and a classification loss in a shared latent space; this double-triplet scheme brought new ideas for loss functions to cross-modal research (an illustrative triplet ranking loss is sketched below). Other researchers proposed a dynamic routing scheme for interaction between modalities [34], designing a four-unit framework for dynamic alignment of fine-grained fragments. ViLT [35] exploits linear projection for matching and demonstrates improvements over alignment-based pre-trained models, producing embedded images and captions. Inspired by the highlights of OSCAR and ERNIE-ViL, ROSITA [36] enhances alignment by integrating cross-modal and intra-modal knowledge. Furthermore, another study [37] provides an instance-oriented architecture for vision-language tasks, using dot products to align text and images.
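The double-triplet retrieval objective mentioned above can be illustrated with a bidirectional hinge-based ranking loss with hard negatives (a VSE++-style sketch; the margin and embedding sizes are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge loss with hardest negatives over a similarity matrix."""
    scores = img_emb @ txt_emb.t()               # (N, N) cosine similarities
    pos = scores.diag().view(-1, 1)              # similarity of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # image -> text
    cost_i = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> image
    return cost_t.max(1)[0].mean() + cost_i.max(0)[0].mean()

img = F.normalize(torch.randn(8, 512), dim=1)
txt = F.normalize(torch.randn(8, 512), dim=1)
print(triplet_ranking_loss(img, txt))
```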

Workflow

The cross-modal retrieval framework mainly comprises fine-grained components: representation, transformation, alignment, fusion and joint learning. This section introduces the specific design of each important stage. Figure 2 shows the overall architecture of a typical system in this field. In a full-cycle workflow, these modules map onto the following stages: preprocessing, encoder representation, cross-modal attention, and decoder mechanisms. These stages facilitate efficient extraction and retrieval of information from different modalities.
Figure 2: Overview of the cross-modal retrieval process

Preprocessing

Preprocessing reduces noise in the input data and prepares it for subsequent processing. This stage converts image/video and text inputs into visual and textual tokens. Because the modalities differ, preprocessing handles each of them differently. In addition to standard tokenization, the following modules are also used; a minimal example of typical preprocessing is sketched below.
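A minimal preprocessing sketch, assuming HuggingFace Transformers and torchvision as the tooling; the file name example.jpg and the caption string are hypothetical placeholders.

```python
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_inputs = tokenizer("a dog playing with a ball",           # hypothetical caption
                        padding="max_length", max_length=32,
                        truncation=True, return_tensors="pt")

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
image = image_transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file

print(text_inputs["input_ids"].shape, image.shape)   # (1, 32) and (1, 3, 224, 224)
```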

Encoder representation

The second stage represents each modality independently using feature extraction methods. The encoder stage takes the visual and textual tokens as input and generates intermediate states that encode semantic content. After embedding, the most common way to build an encoder is to encode the token sequence with an LSTM, convolutions, or other techniques. For text representation, word embeddings, position embeddings, and segment embeddings are all fed into a BERT encoder (see the sketch below). On the visual side, image features are aligned with the text representation; patch, grid and region features are extracted from the visual domain.
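The text-side input construction described above can be sketched as the sum of word, position, and segment embeddings (illustrative sizes, not a specific model's configuration):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 128, 768
word_emb = nn.Embedding(vocab_size, hidden)   # token embeddings
pos_emb = nn.Embedding(max_len, hidden)       # position embeddings
seg_emb = nn.Embedding(2, hidden)             # segment embeddings

input_ids = torch.randint(0, vocab_size, (1, 16))
positions = torch.arange(16).unsqueeze(0)
segments = torch.zeros(1, 16, dtype=torch.long)

encoder_input = word_emb(input_ids) + pos_emb(positions) + seg_emb(segments)
print(encoder_input.shape)   # torch.Size([1, 16, 768]) -> fed into the transformer layers
```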

Vision-language pre-training models combine feature extraction and feature fusion with pre-training tasks. These components address challenges such as quantizing text and images and feeding them to models for learning, representing cross-modal interactions, and building pre-training tasks that help models learn alignment information. Pre-training on large-scale data can learn semantic associations between modalities, alleviating the problem that human annotations are expensive and hard to obtain. There are two core pre-training options for aggregating information from paired data: fused encoders and dual encoders. The single (fused) encoder mainly modifies the BERT input, while the dual encoder relies on co-/cross-BERT interaction. We surveyed many publications from 2018 to 2022 and classified them into single-stream and dual-stream models according to how they structure the pre-trained model. Table 1 shows a roadmap of fused-encoder and dual-encoder pre-trained models. Research shows that single-stream designs apply self-attention directly over both modalities, ignoring interactions within each modality; therefore, some researchers advocate a two-stream architecture to model cross-modal interactions.
Table 1: Roadmap for pretrained models with fused encoders and dual encoders.

Cross-modal attention

Much research has been devoted to solving the aforementioned representation problem through multimodal interaction modeling. Based on multimodal representations, correlation modeling is used to learn common representations. Cross-modal interaction facilitates exchanges between the two modalities and thereby improves vision-language tasks. We classify attention into top-down attention, bottom-up attention, recurrent attention, cross attention, co-attention, distilled attention, meshed memory attention and X-linear attention; different attention mechanisms fuse cross-modal information to different degrees. Bottom-up and top-down attention [51] has been widely used to achieve fine-grained and even multi-level reasoning: the bottom-up process proposes image regions, each with its own feature vector, while the top-down mechanism assigns feature weights. According to [52], image-text retrieval can use recurrent attention memory to iterate over correspondences between vision and text through repeated alignment stages. This work deepens the understanding of fragment-level correspondence by exploring attention mechanisms; the resulting understanding accommodates intricate semantics and progressively exploits the complex relationships between images and text. Cross-attention conveys information between the encoder and the decoder [14]. Transformer tracking (TransT) [53] avoids falling into local optima of semantic information algorithms: to build a high-precision tracking system, TransT introduces a unique attention-based feature fusion network whose attention mechanism creates long-distance feature connections, allowing the tracker to focus on important information while extracting rich semantic information. The combination of self-attention and guided attention is known as co-attention. The distilled-attention framework [54] is a dual-encoder model that enables faster inference than standard fused encoders, which rely on deep interaction modules. In that work, dual-encoder training is guided by a fused-encoder teacher, and the proposed knowledge distillation includes two stages, pre-training distillation and fine-tuning distillation, which ultimately outperforms other methods. Meshed memory allows the encoder to operate at multiple levels, learning low-level and high-level relationships simultaneously. X-linear attention, developed by Pan et al. [55], enables higher-order feature interactions: bilinear fusion uses spatial and channel-wise bilinear attention distributions to capture second-order interactions between input modalities. Stacked cross-attention is widely used by many researchers to make the most of vision-language features. A minimal cross-attention sketch follows.
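The basic cross-attention pattern underlying several of these variants, where text states attend to visual features, can be sketched with a standard multi-head attention layer (dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

text_states = torch.randn(1, 20, 768)    # contextualized text tokens
image_states = torch.randn(1, 36, 768)   # projected region/grid/patch features

cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
attended, weights = cross_attn(query=text_states,     # text attends to vision
                               key=image_states,
                               value=image_states)
print(attended.shape, weights.shape)     # (1, 20, 768) and (1, 20, 36)
```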

Fine-grained deep learning methods

Fine-grained deep learning methods focus on advanced feature extraction, learning feature representations, and establishing high-dimensional correlations between various modalities. In this section, we critically review and analyze full-cycle approaches used in cross-modal retrieval processes, highlighting their effectiveness and potential for further improvements.

Feature engineering

We divide feature extraction into global features and local features according to granularity, as shown in Figure 3. Studies such as VSE++ [61], ACMR [62] and DSPE [63] exploit global features; in contrast, local features are used in works such as DAN [64], SCAN [56], SCO [65] and PVSE [66].
Figure 3. Classification diagram of VL feature extraction

We further divide feature extraction into visual embedding and text embedding, which are key components of many cross-modal retrieval systems. Visual embedding largely determines retrieval effectiveness, and current research on it is extensive and deep. For text embedding, BERT-like structures are typically used to extract features. Unlike text embedding, visual embedding operates at different levels of extraction, including region, grid, and patch levels. Faster R-CNN, a two-stage object detector, is widely used for region feature extraction based on object detection. For example, ViLBERT and LXMERT use co-attention to combine multimodal information; VisualBERT, VL-BERT and UNITER use merged attention for multimodal fusion, while OSCAR and VinVL require additional image labels. Still, this region-based approach has significant drawbacks: the object detector is usually frozen during training, which limits visual concept recognition and loses contextual information, and it cannot describe relationships among many objects. All of these restrictions stem from region-level feature extraction. CNN-based techniques are another popular way to extract visual features: Pixel-BERT and CLIP-ViL use standard CNNs to obtain grid features, while transformers handle the text. SOHO uses a learnable visual dictionary to discretize grid features, which are then fed into a multimodal module. Because of inconsistent optimizers (the CNN uses SGD while the transformer uses AdamW), this approach performs worse than OD-based methods. Patch projection slices the image into patches to extract features; a common approach, as in ALBEF, uses ViT directly.

Cross-modal interaction

Compared with feature representation alone, image-text matching strategies improve consistency by studying semantic relationships. Cross-modal interaction plays a crucial role in establishing connections between different modal representations; it involves matching each pixel, region, or patch with a specific label. There are three main approaches to cross-modal interaction: vision-language alignment, vision-language reconstruction, and semantic-association-based vision-language embedding.

Vision-language alignment. Vision-language alignment aims to maximize the comparability of image-text pairs by leveraging large-scale contrastive learning in a dual-encoder model (a minimal contrastive objective is sketched after this paragraph). It adopts a sharing strategy to address the cross-modal heterogeneity between the two network branches, and intra-modal similarity is learned through two connected CNN models using samples from the same modality. In traditional research, the interaction patterns used in cross-modal retrieval relied mainly on manual expert knowledge and experience. In contrast, the study in [67] proposed a dynamic interaction mechanism for cross-modal retrieval, DIME, which applies different interaction methods depending on the complexity of the sample; the model comprises a local refinement unit, an intra-modal reasoning unit, a global-local guidance unit, and a modification unit. ViLT [35] is a novel method that obtains visual embeddings through patch projection and performs patch-level matching of image and text information; it effectively improves cross-modal retrieval performance by avoiding time-consuming object detection and convolutions with limited expressive capability. Similarly, ROSITA [36] adopts a pre-training task that enhances fine-grained semantic alignment by suppressing intra-modal contextual interference and eliminating potential noise; the model draws inspiration from OSCAR and ERNIE-ViL. These advances demonstrate the effectiveness of such techniques in overcoming the limitations of traditional cross-modal retrieval methods. Furthermore, a recent study proposed a new alignment model [68] that embeds images and captions into the same subspace, enhancing image-caption retrieval. The ALBEF model [29] adopts an align-before-fuse strategy and uses a transformer-based ViT to extract image features without a CNN; for text it uses BERT, with the first six layers serving as a unimodal text encoder and the last six layers as a multimodal encoder. The model first applies self-attention to the text and then performs cross-attention to fuse the visual features. Some studies have also explored instance-level alignment extensively. For example, X-DETR [37] introduced a versatile architecture for instance-level alignment and found that expensive joint-modal transformers may be redundant for vision-language tasks, while weakly annotated data can be beneficial; X-DETR aligns images and text with dot products. UVLP [69] demonstrates that combining fine-grained image-text alignment with whole-image text alignment according to two key criteria can achieve excellent unsupervised vision-language pre-training without parallel data. The authors construct a weakly aligned paired corpus and granularity-aware alignment pre-training tasks; their unsupervised pre-training strategy builds robust joint representations for unaligned text and images and shows admirable results across a variety of tasks in an unsupervised setting. The above alignment methods impose specific requirements on dataset size, quality, and model granularity that are critical to achieving optimal results, and they emphasize the importance of fine-grained matching in cross-modal retrieval.
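The large-scale contrastive objective used by dual-encoder alignment methods can be illustrated with a symmetric InfoNCE loss (a generic sketch, not any specific paper's implementation; the temperature and embedding sizes are assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives; matched pairs lie on the diagonal."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +          # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2   # text -> image direction

print(contrastive_loss(torch.randn(16, 256), torch.randn(16, 256)))
```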

Vision-language reconstruction. Unlike vision-language alignment, reconstruction pays more attention to global information. DSPE [63] solves the matching problem by learning image-text embeddings; optimizing the loss function improves the distribution of features in the high-dimensional space and thus yields more effective clustering. MASLN [70] addresses the problem that training instances cannot cover all classes. The proposed solution uses a reconstruction sub-network to reconstruct each modality with a conditional autoencoder; the sub-networks exploit information from input to output while minimizing distribution differences. In addition, MASLN introduces an adversarial sub-network to develop semantic representations. The study in [71] investigated two networks, one for embedding and one for similarity computation. The embedding network learns a latent embedding space using novel neighborhood constraints and a maximum-margin ranking loss; compared with ordinary triplet sampling, the authors improved neighborhood sampling to build mini-batches. The similarity network uses element-wise products and is trained with a regression loss to directly predict similarity scores. Extensive experiments show that the network can accurately localize phrases. In recent research, the visual and text retrieval problem has been reformulated as a text-to-visual and visual-to-text transformation task [72]; to solve it, the authors propose a cycle-consistent network. In another related study [73], the attention mechanism was enhanced with a scene-graph structure: a sentence reconstruction network creates a scene graph from the objects, attributes, and relationships extracted by a detection network, and a graph convolutional network then processes the graph to generate word vectors, which are fed into a pre-trained dictionary shared by the encoder-decoder model. This approach makes the visual descriptions in the generated corpus more natural and human-like.
Reconstruction studies overcome the limitations of the embedding space. Reconstruction methods employ deep autoencoders to reduce heterogeneity and improve semantic discrimination. Moreover, compared with cross-modal alignment, cross-modal reconstruction imposes lower requirements on datasets and lower annotation costs, making it suitable for small and medium-sized datasets. A minimal autoencoder sketch is given below.
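A minimal sketch of the reconstruction idea, using a per-modality autoencoder whose latent code can be shared or regularized across modalities (dimensions and architecture are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAutoencoder(nn.Module):
    """Encode a modality-specific feature into a latent code and reconstruct it."""
    def __init__(self, feat_dim=2048, latent_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, feat_dim))

    def forward(self, x):
        z = self.encoder(x)            # latent code shared/regularized across modalities
        return self.decoder(z), z

ae = ModalityAutoencoder()
img_feat = torch.randn(4, 2048)
recon, z = ae(img_feat)
recon_loss = F.mse_loss(recon, img_feat)   # reconstruction term of the objective
print(recon_loss.item(), z.shape)
```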

Vision-language embedding. Joint embedding integrates global and local information into semantic feature embeddings, yielding superior feature discrimination. DSCMR [74] proposed a supervised learning structure that preserves semantic discrimination and modality invariance. It creates two sub-networks with weight-sharing constraints, and the authors minimize the discriminative loss in both the label space and the common representation space, increasing the value of the learned common representations. DSCMR's learning strategy can fully exploit paired labels and classification information and successfully learn common representations of heterogeneous data. PCME [75] matches a picture to numerous captions, or a caption to multiple pictures; the authors argue that the deterministic functions of most existing models are insufficient to capture such one-to-many correspondences. The PCME paradigm maps one-to-many relationships into a joint representation space using probabilistic mappings, without requiring an explicit formulation of many-to-many matching. Uncertainty estimation enables PCME to assess retrieval difficulty and failure probability, providing auxiliary interpretability. Probabilistic models learn a richer embedding space in which set-like relations are also expressible, whereas deterministic spaces capture only similarity relations; probabilistic mapping therefore complements precise retrieval systems. ViSTA [76] proposes a transformer framework that learns aggregated visual representations by directly encoding patch and scene-text embeddings. It introduces a novel aggregation token to fuse image pairs into a shared space, and a bidirectional contrastive learning loss addresses the modality-missing problem of scene text.
This joint embedding strategy focuses on high-level semantics. Rich semantic association methods can successfully resolve polysemous instances. Furthermore, vision-language embedding can improve the accuracy and scalability of image-text matching and also delivers strong retrieval performance. A hedged sketch of a supervised joint-embedding objective follows.
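A hedged sketch of a supervised joint-embedding objective in the spirit of DSCMR, with a shared classifier over a common space plus a simple invariance term (layer sizes and the loss weighting are assumptions, not the original formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, common_dim = 10, 512
img_proj = nn.Linear(2048, common_dim)            # image branch -> common space
txt_proj = nn.Linear(768, common_dim)             # text branch  -> common space
classifier = nn.Linear(common_dim, num_classes)   # one classifier shared by both branches

img_feat, txt_feat = torch.randn(8, 2048), torch.randn(8, 768)
labels = torch.randint(0, num_classes, (8,))      # semantic labels of the paired samples

img_common, txt_common = img_proj(img_feat), txt_proj(txt_feat)
label_loss = (F.cross_entropy(classifier(img_common), labels) +
              F.cross_entropy(classifier(txt_common), labels))   # discriminative term
invariance_loss = F.mse_loss(img_common, txt_common)             # pull paired samples together
print((label_loss + invariance_loss).item())
```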

Pre-training tasks

In cross-modal retrieval, the unstructured input is converted into vector form. Previous research shows that data-driven pre-trained models learn from these vectors and are strongly affected by the choice of pre-training tasks. We classify and summarize pre-training tasks in cross-modal retrieval into text-based tasks, vision-based tasks and cross-modal tasks; Table 2 lists a glossary of pre-training tasks. We show how pre-training tasks are used to train models, which is crucial for universal representation. The main goals of pre-training tasks include sequence completion, pattern matching, and providing temporal/contextual features. A small sketch of the standard masked-language-modeling task appears below.
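As an example of a text-based pre-training task, the standard masked-language-modeling corruption (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% kept) can be sketched as follows; the mask token id 103 corresponds to BERT's vocabulary and is an assumption here:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                 # compute the loss only on selected positions
    # 80% of the selected tokens become [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id
    # half of the remainder (10% overall) become a random token, the rest stay unchanged
    random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    return input_ids, labels

ids = torch.randint(5, 30000, (1, 16))
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522)
print(corrupted, labels)
```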
Table 2. Pre-training task vocabulary list

Unified visual language architecture

This section examines unified architectures, which are critical for learning visual and linguistic information. Based on recent references, we summarize vision-language (VL) architectures into two categories: universal representations and unified generative models. Universal representations aim to learn a single embedding space that can represent multiple modalities, while unified generative models use a single model to build content representations across multiple modalities. Both methods have pros and cons, and the choice depends on the specific requirements. We first provide an overview of both architectures and then assess their strengths and weaknesses.
Universal representation. A universal representation is crucial for efficiently comparing similarities across modalities in cross-modal retrieval. To achieve this goal, the DSCMR model [74] provides a common representation space that allows direct comparison of samples from multiple modalities. The framework adopts supervised cross-modal learning to establish connections between modalities, successfully learning common representations while preserving semantic discrimination and modality invariance. To discover cross-modal correlations, the last layer of the model contains two sub-networks with weight-sharing constraints. A modality-invariance loss in the objective function eliminates cross-modal differences, while a linear classifier categorizes the data in the common representation space. Together, these characteristics make DSCMR a promising cross-modal retrieval method. SDML [77] defines the common space in advance while minimizing the gap between modalities; it is the first model to support an unrestricted number of input modalities. To train modality-specific networks, each input is projected into the predefined subspace, so additional modalities can be trained without learning all modalities simultaneously. UNITER addresses the question of whether a universal vision-language representation can be learned for all VL tasks; its large-scale pre-training enables it to handle various downstream VL tasks and multimodal joint embedding.
In addition to joint representations, universal encoders have also been studied extensively. For example, Unicoder-VL develops a universal visual and language encoder. It adopts three pre-training tasks, MLM, MOC and VLM, which collaborate to create context-aware representations of the input tokens. It also predicts whether an image and a text are related, and performs image-text retrieval without joint pre-training, illustrating that transfer learning can produce excellent results in cross-modal tasks. GPV [78] provides a general, task-independent system: it receives visual features and textual descriptions and generates bounding boxes, confidence scores, and output text. The system can learn and perform a wide range of tasks without changing the network structure. GPV consists of a visual encoder, a text encoder, and a co-attention module; a CNN backbone with a DETR transformer encoder-decoder forms the object detector. It also draws on ViLBERT, which encodes cross-contextual representations from visual and linguistic encoders. Since it is not feasible to collect and annotate task-specific data in all languages, a framework for building universal models across languages is urgently needed. M3P [79] provides a multilingual, multimodal pre-training paradigm that integrates both into a cohesive framework to obtain common representations. It tackles the problem of insufficient supervision for multilingual multimodal data and is inspired by recent achievements in large-scale language modeling and multimodal pre-training.

Unified generative models. Models can be divided into discriminative and generative models, and several studies have investigated general frameworks from the model-development perspective. As cross-modal retrieval has developed, a single-task framework can no longer satisfy the needs of multiple tasks. The study in [80] therefore explores a unified framework based on text generation models that is compatible with multi-task, multimodal learning. The method is conditional text generation: images and text condition the generation of text labels, so knowledge can be shared between tasks. Furthermore, UNICORN [81] connects text and bounding-box formats, aiming at unified vision-language modeling; the framework combines text generation with bounding-box prediction and dynamically designs different heads for different problems. The Pix2Seq model, a general object-detection framework that converts bounding-box locations into discrete token sequences, inspired UNICORN. Generative adversarial networks improve image synthesis by learning the underlying data distribution, but few studies apply image generation to other visual tasks. VILLA is the first technique to integrate large-scale adversarial training to improve model generalization; it is a comprehensive framework that can leverage any pre-trained model, employing adversarial learning in both the pre-training and fine-tuning phases. As a branch of self-supervised learning in deep learning, unified generative models focus on modeling the data-generation process.
Table 3 summarizes the advantages and disadvantages of the VL architectures. Universal representations offer several advantages, such as improved accuracy, generalization, and efficiency, by reducing computational resources and training time across multiple tasks. However, they also face challenges such as increased complexity, possible loss of modality-specific information, and limited interpretability due to the intricate interactions between vision and language. Unified generative models, on the other hand, can generate the output of one modality conditioned on the input of another, and therefore perform well in cross-modal retrieval. However, these models offer limited flexibility, become more complex to train, and carry a higher risk of overfitting, mainly because they generate representations of multiple modalities simultaneously and may require diverse training data to prevent overfitting.
Table 3. Advantages and Disadvantages of VL Architecture

Loss function

The loss function evaluates model performance by comparing the model's predicted output with the expected output, thereby determining the direction of optimization. If the difference between the two is large, the loss value is large; conversely, if the two are very close or roughly equal, the loss value is small. Therefore, when training a model on a dataset, an appropriate loss function is needed to penalize the model correctly. This section defines the main loss functions and performance analysis methods. We summarize innovative examples of loss functions in cross-modal tasks in Figure 4.
Figure 4. Innovative sample of loss function

Evaluation metrics

Various evaluation metrics are used to demonstrate the effectiveness of cross-modal retrieval, and appropriate metrics should be chosen for each scenario. This section covers the main evaluation metrics: precision (P), recall (Recall@K), the precision-recall curve (PR), mean average precision (mAP), F-score (FS) and normalized discounted cumulative gain (NDCG). An illustrative Recall@K computation is sketched below.
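Recall@K, the metric most commonly reported by the surveyed methods, can be computed from an image-text similarity matrix as in the following sketch (it assumes that the ground-truth caption for image i is at index i):

```python
import torch

def recall_at_k(similarity, k):
    """similarity: (N_images, N_texts); the matching text for image i has index i."""
    ranks = similarity.argsort(dim=1, descending=True)       # texts ranked per image
    targets = torch.arange(similarity.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == targets).any(dim=1).float()      # ground truth in the top-k?
    return hits.mean().item()

sim = torch.randn(100, 100)   # a random similarity matrix as a stand-in
print({f"R@{k}": recall_at_k(sim, k) for k in (1, 5, 10)})
```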

Benchmark datasets

Benchmark datasets are often used to evaluate the performance of cross-modal retrieval. Table 4 presents an analysis of classic cross-modal datasets, including each dataset's name, the number of images and texts, and a description.
Table 4. Summary of representative datasets that facilitate cross-modal retrieval

Conclusion

Deep learning research has greatly advanced the development of cross-modal retrieval, providing elegant solutions and driving substantial progress. In this paper, we provide a comprehensive summary and analysis of numerous well-known studies and propose a taxonomy of cross-modal retrieval mechanisms. We also discuss challenges and open questions to guide future research from a representation learning perspective. To provide a holistic understanding of the full-cycle approach, we cover preprocessing, feature engineering, encoding, cross-modal interaction, decoding, model optimization, and evaluation metrics. Tables, figures, and equations are used throughout to enhance the clarity of the surveyed studies.
Despite extensive efforts, achieving optimal results and accuracy in cross-modal retrieval remains an ongoing challenge. Key obstacles include feature representation, complex semantic processing, vision-language alignment, unified architectures, model optimization, performance evaluation metrics, and the development of more comprehensive datasets.

Origin blog.csdn.net/zag666/article/details/132253815