[Paper Interpretation] Multimodal graph learning for generation tasks

1. Brief introduction

Multimodal learning combines multiple data modalities, broadening the type and complexity of data that models can exploit: for example, going from plain text to text-image pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs from two modalities, such as image-caption pairs or audio-text pairs. However, in most real-world situations, entities of different modalities interact in more complex and multifaceted ways that go beyond a one-to-one mapping. The paper proposes to represent these complex relationships as graphs, which can capture any number of modalities and the complex relationships between them, and which can flexibly vary from sample to sample. To this end, the paper proposes Multimodal Graph Learning (MMGL), a general and systematic framework for capturing information from multiple multimodal neighbors with relational structure. In particular, the paper focuses on MMGL for generation tasks, building on pre-trained language models (LMs) and aiming to enhance their text generation with multimodal neighborhood context. The paper studies three research questions raised by MMGL: (1) How can multiple pieces of neighborhood information be injected into the pre-trained LM while avoiding scalability problems? (2) How can graph structure information among multimodal neighbors be injected into the LM? (3) How can the pre-trained LM be fine-tuned to learn from the neighborhood context in a parameter-efficient manner? The paper conducts extensive experiments to answer these three questions and analyzes the empirical results, paving the way for future MMGL research.

2. Research background

Real-world applications involve many data modalities, ranging from common text, images, and videos to time series or domain-specific modalities such as protein sequences. These modalities are collected not in isolation but together with the multifaceted relationships between them. Wikipedia is one of the most popular sources of multimodal web content, providing text, images, and their captions. Meta's recently launched personal-timeline application builds on each user's multimodal data, including their photos, maps, shopping, and music history. Beyond these examples, important industrial and medical decisions are also made by jointly considering multiple modalities, such as images, tables, or audio. Such multimodal data form complex many-to-many relationships between entities of different modalities, which can naturally be represented as graphs, opening up an active research space on how to understand them comprehensively.

With the rise of multimodal datasets, various groundbreaking studies have been conducted on multimodal learning. Earlier multimodal learning focused on new architectures, extending transformers or graph neural networks and training them from scratch on large-scale multimodal datasets. Motivated by the powerful generative capabilities of pre-trained language models (LMs), recent multimodal methods build on pre-trained LMs and focus on generating multimodal content. For example, previous work combines pre-trained image encoders with LMs to generate images or text conditioned on given text or images. However, all existing models assume that a pair of modalities with a clear 1-to-1 mapping is provided as input (e.g., the image-caption pair in Figure 1(a)). Therefore, they cannot be directly applied to multimodal datasets with more general many-to-many mappings between modalities (e.g., the multimodal Wikipedia page in Figure 1(b)).

Here, the paper extends the scope of multimodal learning from 1-to-1 mappings to multimodal graph learning (MMGL), while retaining generative capabilities by building on pre-trained LMs. The paper introduces a systematic framework that describes how MMGL processes multimodal neighborhood information with graph structure and generates free-form text using a pre-trained LM (Figure 2). The MMGL framework extracts neighborhood encodings, combines them with graph structure information, and optimizes the model with parameter-efficient fine-tuning. Accordingly, the paper defines three design spaces corresponding to the three research questions of MMGL:

Research Question 1: How can multiple pieces of multimodal neighborhood information be provided to the LM while avoiding scalability issues?

Research Question 2: How can graph structure information among multimodal neighbors be injected into the LM?

Research Question 3: How can the pre-trained LM be fine-tuned to learn from multimodal neighborhood information in a parameter-efficient manner?

In traditional multimodal learning with a 1-to-1 mapping assumption, usually only one neighbor is provided (e.g., one image for a text caption). In contrast, MMGL needs to handle several neighbors of varying size (e.g., images of different resolutions and text sequences of different lengths), which leads to scalability issues. For Research Question 1, the paper studies three neighborhood encoding models: (1) self-attention with text + embeddings (SA-Text+embedding), which pre-computes image embeddings with a frozen encoder and concatenates them, together with the neighbors' raw text, into the input text sequence; (2) self-attention with embeddings (SA-embedding), which pre-computes embeddings of both text and image neighbors with frozen encoders and concatenates them to the input text; and (3) cross-attention with embeddings (CA-embedding), which feeds the pre-computed text and image embeddings into cross-attention layers of the LM.

For Research Question 2, the paper investigates how to inject graph structure information among multimodal neighbors into the LM (e.g., the section hierarchy and image order in Figure 1(b)). The paper compares sequential position encoding with two graph position encodings widely used in graph transformers: Laplacian eigenvector position encoding (LPE) and graph neural network (GNN) encoding, which runs a GNN over the pre-computed neighbor embeddings using the graph structure before they are fed to the LM.

Research Question 3 aims to improve cost and memory efficiency over fully fine-tuning the LM. In this work, the paper explores three parameter-efficient fine-tuning (PEFT) methods: prefix tuning, LoRA, and Flamingo-style tuning. Which PEFT method applies depends on the neighborhood encoding model: when neighborhood information is concatenated into the input sequence (SA-Text+embedding or SA-embedding), the paper can apply prefix tuning or LoRA; when neighborhood information is fed into cross-attention layers (CA-embedding), the paper applies Flamingo-style tuning, which fine-tunes only the cross-attention layers, stabilized by a gating module.

Based on its MMGL framework, the paper conducts extensive experiments on the recently released multimodal dataset WikiWeb2M. WikiWeb2M unifies the content of each Wikipedia page, including all text, images, and their structure, into a single example. This makes it well suited for studying multimodal content understanding with many-to-many text and image relationships in generative tasks. Here, the paper focuses on the section summarization task, which aims to generate a sentence that summarizes a target section by understanding the multimodal content of the whole Wikipedia page. Through rigorous experiments on WikiWeb2M, the paper provides intuitive empirical answers to the research questions raised by MMGL.

To sum up, the contributions of this paper are:

Multimodal Graph Learning (MMGL): The paper introduces a systematic MMGL framework that processes neighborhood information with multimodal graph structure and generates free-form text using a pre-trained LM.

Principled research questions: The paper introduces three research questions that MMGL needs to answer: (1) how to provide multiple pieces of neighborhood information to pre-trained LMs, (2) how to inject graph structure information into LMs, and (3) how to fine-tune LMs parameter-efficiently. This charts research directions for future MMGL work.

Extensive empirical results: The paper shows empirically that (1) neighborhood context improves generation performance, (2) SA-Text+embedding neighborhood encoding achieves the highest performance at the expense of scalability, (3) GNN embeddings are the most effective graph position encoding, and (4) LoRA with SA-Text+embedding neighborhood encoding and Flamingo-style tuning with CA-embedding neighborhood encoding perform best among the PEFT methods.

3. Multimodal Graph Learning for Generative Tasks

Given a multimodal graph with text or an image on each node, the goal of the paper is to generate text conditioned on each node and its neighboring nodes. More specifically, given the text input on a target node, the pre-trained LM generates free-form text based on that input text and the multimodal context around the target node. In the paper's multimodal graph learning (MMGL) framework, each neighbor's information is first encoded separately by a frozen encoder (Figure 2(b)). The frozen encoders can be a pre-trained ViT or ResNet that maps images to embeddings, and a pre-trained LM that maps text to embeddings (and likewise for other modalities). Then, the graph structure around the target node is encoded with graph position encodings (Figure 2(c)). Finally, the encoded neighbor information, together with the graph position encodings, is fed into the LM alongside the input text to generate text conditioned on the multimodal context (Figure 2(d)).
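To make this pipeline concrete, here is a minimal sketch of the framework as described above, assuming hypothetical module names (`text_encoder`, `image_encoder`, the linear `mapper`, and the graph positional encodings); it is an illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MMGLContextEncoder(nn.Module):
    """Sketch of the pipeline in Figure 2: encode each multimodal neighbor with a
    frozen encoder, add graph positional encodings, and hand the result to the LM.
    All names here are hypothetical placeholders, not the paper's code."""

    def __init__(self, text_encoder, image_encoder, enc_dim, pe_dim, lm_dim):
        super().__init__()
        self.text_encoder = text_encoder      # frozen pre-trained text encoder (e.g., CLIP text tower)
        self.image_encoder = image_encoder    # frozen ViT / ResNet / CLIP image encoder
        self.mapper = nn.Linear(enc_dim, lm_dim)     # aligns neighbor embeddings with the LM text space
        self.pe_mapper = nn.Linear(pe_dim, lm_dim)   # maps graph positional encodings (Research Question 2)

    @torch.no_grad()
    def encode_neighbor(self, neighbor: dict) -> torch.Tensor:
        # Each neighbor is encoded separately by a frozen, modality-specific encoder (Figure 2(b)).
        if neighbor["modality"] == "image":
            return self.image_encoder(neighbor["pixels"]).squeeze(0)   # (enc_dim,)
        return self.text_encoder(neighbor["text"]).squeeze(0)          # (enc_dim,)

    def forward(self, neighbors: list, graph_pe: torch.Tensor) -> torch.Tensor:
        # graph_pe: (num_neighbors, pe_dim) positional encodings from the graph structure (Figure 2(c))
        embs = torch.stack([self.encode_neighbor(n) for n in neighbors])   # (N, enc_dim)
        ctx = self.mapper(embs) + self.pe_mapper(graph_pe)                 # (N, lm_dim)
        # ctx is concatenated with the target node's token embeddings and fed to the LM (Figure 2(d)).
        return ctx
```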

This framework leaves three design spaces for the paper: (1) How should neighborhood information be provided to the LM? (2) How should graph structure information among multimodal neighbors be injected into the LM? (3) How should the pre-trained LM be fine-tuned to learn from the neighborhood context in a parameter-efficient manner? In the rest of this section, the paper examines each question and discusses the possible approaches.

3.1 Research Question 1: Neighborhood Encoding

Unlike existing multimodal learning, which assumes a single image (corresponding to the input text) as input, multimodal graph learning takes an arbitrary number of neighbor images and texts as input; scalability is therefore the first problem that must be solved when learning from multiple multimodal neighbors. In vision-text models, the standard approach is to first process the images into image embeddings using an image encoder (e.g., ViT, ResNet), then map the embeddings into the text-only LM space, and finally feed them into the LM. Two popular ways to feed image embeddings into LMs are full self-attention over the modalities concatenated along the sequence dimension, and cross-attention layers over the other modality.

Based on these two methods, the paper proposes the following three neighborhood encoding methods:

Self-attention with text + embeddings (SA-Text+embedding): text neighbors are concatenated as raw text, while neighbors of other modalities are first processed by a frozen encoder (e.g., ViT for images) and their embeddings are then concatenated to the input sequence. The paper adds a linear mapper that aligns the pre-computed embeddings with the text space of the LM.

Self-attention with embeddings (SA-embedding): the same as SA-Text+embedding, except that text neighbors are also processed by a separate frozen encoder and their embeddings are concatenated to the input sequence. The text encoder can be the same as or different from the base LM.

Cross-attention with embeddings (CA-embedding): all neighbors are processed by separate frozen encoders, mapped to the text space through a linear mapper, and then fed into cross-attention layers.

In general, when text embeddings are provided instead of raw text, the amount of information the LM can exploit is limited by the pre-computed embeddings. However, since the attention mechanism of the LM requires O(T²) computation for sequence length T, feeding the raw text introduces scalability issues. There is therefore a trade-off between the information available to the LM and scalability. For SA-Text+embedding and SA-embedding, the only extra parameters are the mappers located outside the LM, while CA-embedding inserts additional cross-attention layers into the pre-trained LM and trains them from scratch. This means that CA-embedding may suffer from an unstable initial state, because the pre-trained LM layers are affected by the randomly initialized cross-attention layers. In Section 4.4, the paper explores these three methods and discusses their empirical results.
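To make the trade-off concrete, here is a rough sketch of how the LM input could be assembled under SA-Text+embedding versus SA-embedding; `lm_embed` (the LM's token-embedding layer), `mapper`, and the frozen encoders are hypothetical stand-ins rather than the paper's code.

```python
import torch

def build_sa_te_input(lm_embed, tokenizer, mapper, image_encoder,
                      target_text, neighbor_texts, neighbor_images):
    """SA-Text+embedding: neighbor text stays as raw tokens (long sequence),
    while images become pre-computed embeddings mapped into the LM text space."""
    raw = " ".join(neighbor_texts + [target_text])
    tok = tokenizer(raw, return_tensors="pt", truncation=True, max_length=1024)
    text_embs = lm_embed(tok["input_ids"])                              # (1, T, d), T can be large
    with torch.no_grad():
        img = torch.stack([image_encoder(i) for i in neighbor_images])  # (N_img, enc_dim)
    img_embs = mapper(img).unsqueeze(0)                                 # (1, N_img, d)
    return torch.cat([img_embs, text_embs], dim=1)                      # full self-attention over everything

def build_sa_e_input(lm_embed, tokenizer, mapper, text_encoder, image_encoder,
                     target_text, neighbor_texts, neighbor_images):
    """SA-embedding: every neighbor (text or image) is compressed into one
    pre-computed embedding, so the final sequence is much shorter."""
    tok = tokenizer(target_text, return_tensors="pt", truncation=True, max_length=512)
    target_embs = lm_embed(tok["input_ids"])                            # (1, T_target, d)
    with torch.no_grad():
        nbrs = [text_encoder(t) for t in neighbor_texts] + \
               [image_encoder(i) for i in neighbor_images]              # one vector per neighbor
    nbr_embs = mapper(torch.stack(nbrs)).unsqueeze(0)                   # (1, N, d)
    return torch.cat([nbr_embs, target_embs], dim=1)
```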

3.2 Research Question 2: Graph Structure Encoding

Given the neighborhood information, the paper could simply concatenate it as raw text or embeddings and process it as a sequence. But there is structure among the neighbors: for example, in WikiWeb2M, sections form a hierarchy and images belong to particular sections (Figure 1(b)). To encode this graph structure among the neighbors, the paper borrows two popular graph position encodings from graph transformers and compares them with sequential position encoding.

Laplacian position encoding (LPE): the paper uses Laplacian eigenvectors computed from the graph structure among neighbors as their position encodings.

Graph neural network (GNN) encoding: the paper first computes neighbor embeddings with the frozen encoders and runs a GNN over these embeddings using the graph structure. The output GNN embeddings, which encode the graph structure, are then used as position encodings.

LPE uses an additional 1-layer MLP mapper to map the Laplacian eigenvectors to the text space of the LM. The parameters used for graph structure encoding (e.g., the LPE mapper or the GNN parameters) are trained end-to-end during LM fine-tuning. In Section 4.5, the paper explores how these different position encodings bring additional graph structure information among neighbors into the LM and improve performance.
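For reference, one standard way to compute Laplacian eigenvector position encodings for a small neighborhood graph, assuming an undirected graph given as an adjacency matrix, looks like this (a generic recipe, not the paper's code):

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """Return k non-trivial Laplacian eigenvectors of a small undirected graph
    as positional encodings (one row per node). Assumes no isolated nodes."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = deg ** -0.5
    # Symmetrically normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)        # eigenvalues in ascending order
    # Drop the trivial eigenvector associated with eigenvalue ~0 and keep the next k.
    return eigvecs[:, 1:k + 1]

# Toy example: a 4-node path graph, e.g., sections appearing in order on a page.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(adj, k=2)   # shape (4, 2); fed through the 1-layer MLP mapper into the LM
```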

3.3 Research Question 3: Parameter-Efficiency

Although the pre-trained LM must be fine-tuned for the specific task and the newly added neighborhood information, full fine-tuning incurs high computational cost and is also inconvenient for sharing MMGL modules across users who each decide how to use neighborhood information. In recent years, various parameter-efficient fine-tuning (PEFT) methods have been proposed that fine-tune only a small number of parameters while matching full fine-tuning performance. The paper selects three PEFT methods suited to the three neighborhood encoding methods described above.

Prefix tuning: when SA-Text+embedding or SA-embedding is chosen as the neighborhood encoding, no new parameters are added inside the LM; therefore, the paper can readily apply prefix tuning, which keeps the language model parameters frozen and optimizes a continuous, task-specific sequence of vectors that is prepended to the activations at every layer.

LoRA: like prefix tuning, low-rank adaptation (LoRA) works with SA-Text+embedding or SA-embedding neighborhood encoding. LoRA injects trainable low-rank decomposition matrices into each layer while freezing the original parameters.

Flamingo: for CA-embedding neighborhood encoding, the paper can directly apply Flamingo-style tuning, which fine-tunes only the newly added cross-attention layers, gated with tanh gates so that the pre-trained LM is left intact at initialization, improving stability and performance.

In Section 4.6, the paper explores how well these PEFT methods preserve full fine-tuning performance while tuning only a small number of parameters.
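As an illustration of the LoRA option, a minimal sketch with the HuggingFace peft library might look as follows; the rank, dropout, and target modules are illustrative choices rather than the paper's reported configuration.

```python
from transformers import OPTForCausalLM
from peft import LoraConfig, get_peft_model

# Base generator LM (the paper builds on OPT-125m).
model = OPTForCausalLM.from_pretrained("facebook/opt-125m")

# Inject trainable low-rank matrices into the attention projections and freeze the rest.
lora_config = LoraConfig(
    r=8,                                   # rank of the decomposition (illustrative choice)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # OPT attention query/value projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction of weights are trainable
```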

4. Experiment

4.1 WikiWeb2M Dataset

The WikiWeb2M dataset was built for research on general multimodal content understanding with many-to-many text and image relationships. WikiWeb2M is built on top of the WIT dataset and includes page titles, section titles, section text, images and their captions, as well as structural indices for each section, such as its parent and child sections.

In this work, the paper focuses on the section summarization task, which generates a single sentence highlighting the content of a specific section. The summary is generated given all images and (non-summary) text that appear in the target and context sections. The paper randomly selects 600k Wikipedia pages from WikiWeb2M for the section summarization task. Overall, the training/validation/test set sizes for the section summarization task are 680k/170k/170k, respectively.

4.2 Experimental setup

From WikiWeb2M, the paper can obtain four types of information: (1) section text, (2) section images, (3) text from the rest of the page, and (4) images from the rest of the page. The paper gradually provides more information to the LM to study the effectiveness of multimodal neighborhood information: (1) section text, (2) all section content (text + images), (3) page text (all text from the Wikipedia page the input section belongs to), and (4) all page content (all text and images from the Wikipedia page).

The paper uses the Open Pre-trained Transformer (OPT-125m) as the base LM to read the input section text and generate the summary. For the text and image encoders that produce neighbor embeddings, the paper uses the text/image encoders from CLIP. The paper fine-tunes OPT for 10,000 steps with a batch size of 125 and a learning rate of 1e-4; the text/image encoders are frozen in all experiments. The paper reports BLEU-4, ROUGE-L, and CIDEr scores on the validation set. All experiments are run on 4 Nvidia RTX 3090 GPUs with 24GB memory.
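Under the setup described above, the base LM and the frozen CLIP encoders can be loaded roughly as follows with HuggingFace transformers; the exact checkpoint names are assumptions for illustration, not taken from the paper.

```python
import torch
from transformers import AutoTokenizer, OPTForCausalLM, CLIPModel, CLIPProcessor

# Base generator (fine-tuned): OPT-125m.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
lm = OPTForCausalLM.from_pretrained("facebook/opt-125m")

# Frozen CLIP text/image encoders for neighbor embeddings.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def encode_text_neighbor(text: str) -> torch.Tensor:
    inputs = clip_processor(text=[text], return_tensors="pt", truncation=True)
    return clip.get_text_features(**inputs)     # (1, 512) for this checkpoint

@torch.no_grad()
def encode_image_neighbor(image) -> torch.Tensor:
    inputs = clip_processor(images=image, return_tensors="pt")
    return clip.get_image_features(**inputs)    # (1, 512)
```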

4.3 Validity of neighborhood information

The paper first studies the effectiveness of multimodal neighborhood information. As described in Section 4.2, the paper gradually provides more information to the base LM: (1) section text, (2) all section content (text + images), (3) page text, and (4) all page content (all text and images). Here, the paper uses self-attention with text + embeddings (SA-Text+embedding) as the neighborhood encoding across the different input types. For images, the paper first computes image embeddings with the frozen CLIP image encoder and concatenates them after the text of the section each image belongs to, so as to preserve the structure. The results in Table 1 show that more multimodal neighborhood information helps: BLEU-4, ROUGE-L, and CIDEr scores improve significantly when going from section content to page text, and improve further when all page content is provided.

Discussion: missing modalities. Despite adding section images, the performance with all section content drops slightly compared with section text alone. On Wikipedia, not every section has a corresponding image; therefore, in the all-section-content setting, the LM input is inconsistent, with some samples having both text and images while others have only text. This points to an important unsolved missing-modality problem that is common in the real world but rarely encountered in traditional 1-to-1 multimodal settings, and emphasizes the importance of developing MMGL methods that can cope with missing modalities.

4.4 Neighborhood encoding

The paper encodes the multiple multimodal neighbors with the three neighborhood encodings: self-attention with text + embeddings (SA-TE), self-attention with embeddings (SA-E), and cross-attention with embeddings (CA-E). SA-E and CA-E use frozen encoders to encode all modalities, including text, into embeddings, whereas SA-TE encodes text neighbors by concatenating their raw text to the input sequence. SA-TE therefore requires a longer input sequence length (1024) to hold the additional text, leading to potential scalability issues. SA-E and CA-E, by contrast, need only one token per text neighbor, improving scalability through a shorter input length (512). The results in Table 2 show a trade-off between scalability and performance: SA-TE consistently outperforms SA-E and CA-E across the different input types, but at the cost of longer input sequences.

Discussion: information loss. In traditional multimodal learning with a 1-to-1 mapping, raw text is usually injected directly into the input (as in SA-TE), while image inputs are pre-computed into embeddings by a frozen encoder. Such methods successfully generate text conditioned on input images, showing that image embeddings are effective inputs to pre-trained LMs.

However, the performance gap between SA-TE and SA-E in Table 2 shows that text embeddings can cause information loss for the LM. This may be because the 1-layer MLP mapper that projects the pre-computed text embeddings into the text space is not expressive enough, or because the neighbor texts are much longer than the short texts used in traditional multimodal learning (e.g., one-sentence captions), making it hard for the LM to extract information from a single pre-computed text embedding. From a practical perspective, the results illustrate the trade-off between scalability and performance. At the same time, the findings highlight the need for more MMGL research on the challenging problem of information loss when textual information is compressed into embeddings.

4.5 Graph structure encoding

In addition to the modality of each neighbor, multimodal graphs also contain graph structure information among the neighbors. The paper encodes this structure with sequential position encoding (sequence), graph neural network embeddings (GNN), and Laplacian position encoding (LPE). The computed position encodings are first mapped to the text space of the LM via a 1-layer MLP, added to the input token/text/image embeddings, and fed into the LM. In Table 3, GNN embeddings show the best performance. In particular, the improvement over sequential position encoding demonstrates the importance of graph-aware structure encoding in MMGL.
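A GNN position encoding of this kind can be sketched as a single mean-aggregation message-passing layer in plain PyTorch; this is an illustrative stand-in, not the specific GNN architecture used in the paper.

```python
import torch
import torch.nn as nn

class MeanGNNPositionEncoder(nn.Module):
    """One mean-aggregation message-passing layer: each neighborhood node mixes
    its own (frozen-encoder) embedding with the average of its graph neighbors.
    The outputs serve as graph position encodings added to the LM inputs."""

    def __init__(self, dim: int):
        super().__init__()
        self.self_proj = nn.Linear(dim, dim)
        self.neigh_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, dim) pre-computed neighbor embeddings
        # adj: (N, N) 0/1 adjacency matrix of the neighborhood graph
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = adj @ x / deg
        return torch.relu(self.self_proj(x) + self.neigh_proj(neigh_mean))
```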

4.6 Efficient fine-tuning of parameters

Fully fine-tuning a pre-trained LM incurs high computational cost. To achieve parameter-efficient fine-tuning for MMGL, the paper studies prefix tuning and LoRA for the self-attention with text + embeddings (SA-TE) and self-attention with embeddings (SA-E) neighborhood encodings. For the cross-attention with embeddings (CA-E) neighborhood encoding, the paper uses Flamingo-style fine-tuning, which tunes only the newly added cross-attention layers with gating modules.

The results in Table 4 show that for SA-TE and SA-E neighborhood encodings, LoRA outperforms prefix tuning while using more fine-tuned parameters (7-9% of parameters for prefix tuning versus 26-33% for LoRA). However, prefix tuning with SA-TE neighborhood encoding still shows performance comparable to LoRA while using nearly 4x fewer parameters. Flamingo-style tuning with CA-E neighborhood encoding performs similarly to LoRA with a comparable number of parameters (82M for LoRA versus 90M). Note that SA-E and CA-E neighborhood encodings have more parameters than SA-TE because they include a (frozen) text encoder for processing text neighbors.

In Table 2 (without PEFT), CA-E neighborhood encoding lags significantly behind SA-TE neighborhood encoding. However, with Flamingo-style tuning, the gating modules ensure that the pre-trained LM is not disrupted by the randomly initialized cross-attention layers at initialization, which improves the performance of CA-E, as shown in Table 4 (with PEFT). This emphasizes the critical role of a careful initialization strategy when introducing additional neighborhood-encoding modules into pre-trained LMs in MMGL.

5. Summary

In this work, the paper extends traditional multimodal learning with one-to-one mappings between pairs of modalities to multimodal graph learning (MMGL) with many-to-many relationships among multiple modalities. The paper's MMGL framework is structured around three key components: (1) neighborhood encoding, (2) graph structure encoding, and (3) parameter-efficient fine-tuning. Through rigorous experiments on the WikiWeb2M dataset, the paper explores different options for each component: (1) three neighborhood encodings, self-attention with text + embeddings (SA-Text+embedding), self-attention with embeddings (SA-embedding), and cross-attention with embeddings (CA-embedding), emphasizing the balance between scalability and performance; (2) three graph position encodings, sequential, LPE, and GNN; and (3) three PEFT methods, prefix tuning, LoRA, and Flamingo-style tuning, and their trade-offs between parameter efficiency and performance. The paper's in-depth analysis and findings aim to lay a foundation for future MMGL research and to trigger further exploration in this field.
