A Taste of Papers | Multimodal Knowledge Graph Completion Based on Interactive Modal Fusion


Notes compiled by: Zhang Yichi, master's student at Zhejiang University; research direction: multimodal knowledge graphs

Link: https://arxiv.org/abs/2303.10816

Motivation

Multimodal knowledge graph completion needs to integrate information from multiple modalities (such as images and text) into the structural representations of entities to achieve better link prediction. However, existing methods usually project all modalities into a unified space with the same relations in order to capture their commonality, which may fail to preserve modality-specific information. As a result, they cannot effectively model the complex interactions between modalities, and their performance is therefore limited.

Contributions

To address the issue mentioned above, the authors propose a novel Interactive Multimodal Fusion (IMF) model for multimodal link prediction on knowledge graphs. IMF can learn knowledge independently within each modality and jointly model the complex interactions between different modalities through two-stage fusion.

In the multimodal fusion stage, the authors employ a bilinear fusion mechanism together with contrastive learning to fully capture the complex interactions between multimodal features. For the base link prediction model, the authors use relational information as context and make triple predictions in each modality. In the final decision fusion stage, the predictions from the different modalities are integrated, and their complementary information is exploited to produce the final prediction. The contributions of this paper are summarized as follows:

  • The authors propose IMF, a two-stage fusion model that effectively integrates complementary information from different modalities for link prediction.

  • The authors design an effective multimodal fusion module that captures bilinear interactions and jointly models the commonality and complementarity of modalities through contrastive learning.

  • The authors conduct extensive experiments on four widely used multimodal link prediction datasets, demonstrating the effectiveness and generality of IMF.

Method

The overall architecture of the proposed method is shown in the figure below. The method mainly consists of a modality information fusion module and a decision fusion module (a joint reasoning module).

[Figure: overall architecture of IMF]

In the modality information fusion module, the authors design a fusion mechanism based on the Tucker tensor decomposition model. The three modality representations produced by the different modality feature encoders (the structural, image, and text representations) are first projected into a new representation space, and the multimodal representation vector of each entity is then obtained through an element-wise tensor product. This process can be expressed as:

[Equation: Tucker-style multimodal bilinear fusion]
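Since the original equation is only available as an image, the following is a minimal PyTorch-style sketch of this kind of projection-then-element-wise-product fusion. The class name, dimensions, and the use of plain linear projections are illustrative assumptions; the paper's exact Tucker-based formulation (core tensor, normalization, dropout, etc.) may differ.

```python
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    """Sketch: project each modality into a shared space, then fuse the
    projections with an element-wise (Hadamard) product to obtain the
    multimodal entity representation."""
    def __init__(self, d_struct, d_img, d_txt, d_fused):
        super().__init__()
        self.proj_s = nn.Linear(d_struct, d_fused)  # structural projection
        self.proj_v = nn.Linear(d_img, d_fused)     # image projection
        self.proj_t = nn.Linear(d_txt, d_fused)     # text projection

    def forward(self, e_s, e_v, e_t):
        # element-wise product of the projected modality representations
        return self.proj_s(e_s) * self.proj_v(e_v) * self.proj_t(e_t)
```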

The authors then apply contrastive learning to the three modalities in a pairwise manner, so that the different modalities interact fully and their mutual information is maximized. The contrastive learning objective can be expressed as:

[Equation: pairwise contrastive learning loss]
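As a rough illustration only (the paper's exact contrastive objective is in the image above), an InfoNCE-style loss over one modality pair could look like the sketch below; the temperature value and the in-batch negative sampling are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Sketch of a contrastive loss for one modality pair (e.g. structure vs. image).
    The two representations of the same entity are positives; other entities in
    the batch act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # pairwise similarities
    labels = torch.arange(z_a.size(0), device=z_a.device) # matching indices are positives
    return F.cross_entropy(logits, labels)

# Total contrastive term over the three modality pairs (assumed combination):
# loss_cl = info_nce(z_s, z_v) + info_nce(z_s, z_t) + info_nce(z_v, z_t)
```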

Meanwhile, for each modality k, the authors design a contextual relational model that uses a relation projection matrix to inject the contextual information of the triple into the entity representation. The resulting representation is scored against candidate entities by similarity, and the cross-entropy loss is used as the training objective. This process can be expressed as:

[Equation: relation-aware scoring and cross-entropy training objective]
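A hedged sketch of how such a relation-aware scorer with a cross-entropy objective is commonly implemented. The per-relation projection matrices and the inner-product similarity here are assumptions based on the description above, not the paper's exact score function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalScorer(nn.Module):
    """Sketch: transform the head entity with a relation-specific projection,
    score it against all candidate tails by inner product, train with cross-entropy."""
    def __init__(self, num_relations, dim):
        super().__init__()
        # one projection matrix per relation (illustrative parameterization)
        self.rel_proj = nn.Parameter(torch.randn(num_relations, dim, dim) * 0.01)

    def forward(self, h, r_idx, all_entities):
        # h: (B, d); all_entities: (N, d)
        context = torch.einsum('bd,bde->be', h, self.rel_proj[r_idx])  # relation-aware head
        return context @ all_entities.t()                              # scores over candidates

    def loss(self, h, r_idx, tail_idx, all_entities):
        scores = self.forward(h, r_idx, all_entities)
        return F.cross_entropy(scores, tail_idx)  # true tail index as label
```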

In the decision fusion stage, the authors compute a weighted sum of the prediction losses of the individual modalities using a set of learnable parameters, and add the aforementioned contrastive learning loss. This process can be expressed as:

[Equation: overall training objective combining per-modality losses and the contrastive loss]
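An illustrative sketch of such a joint objective, assuming learnable, softmax-normalized weights over the per-modality losses and a fixed coefficient on the contrastive term; the number of modalities and the exact weighting scheme are assumptions.

```python
import torch
import torch.nn as nn

class DecisionFusionLoss(nn.Module):
    """Sketch: weighted sum of per-modality link-prediction losses
    plus the contrastive learning loss."""
    def __init__(self, num_modalities=4, cl_weight=1.0):
        super().__init__()
        # learnable fusion weights (e.g. structural, image, text, fused multimodal)
        self.logits = nn.Parameter(torch.zeros(num_modalities))
        self.cl_weight = cl_weight

    def forward(self, modal_losses, contrastive_loss):
        w = torch.softmax(self.logits, dim=0)              # normalized weights
        pred_loss = (w * torch.stack(modal_losses)).sum()  # weighted prediction loss
        return pred_loss + self.cl_weight * contrastive_loss
```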

In the inference stage, the model uses the learned weights to combine the scores of the different modalities and performs the final link prediction. This process can be expressed as:

[Equation: weighted fusion of modality scores at inference time]
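A minimal sketch of this decision-level fusion at inference, assuming the same softmax-normalized weights as in training; names and shapes are illustrative.

```python
import torch

def fused_prediction(score_list, weight_logits):
    """Sketch: combine per-modality candidate scores with learned weights
    and return the highest-scoring candidate entity for each query."""
    w = torch.softmax(weight_logits, dim=0)
    combined = sum(wi * s for wi, s in zip(w, score_list))  # (B, num_entities)
    return combined.argmax(dim=-1)                          # predicted entity indices
```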

Experiments

In the experiments, the authors evaluate the method on four multimodal knowledge graph datasets (DB15K, FB15K, YAGO15K, and FB15K-237) and compare it with multiple baselines, including several unimodal and multimodal models. The experimental results are as follows:

[Tables: link prediction results on the four datasets compared with the baseline models]

The experimental results show that the proposed method achieves substantial improvements over existing models. In addition, ablation studies show that the modality fusion module, the decision fusion module, and the contrastive learning module each contribute significantly to the final performance, with the modality fusion module bringing the most obvious gain.

[Figure: bar chart of results when applying the fusion to different scoring functions]

In addition, the authors verify the generality of the method by applying the proposed interactive modal feature fusion to different scoring functions; the results of this part are shown in the bar chart above. The authors also present an interesting visualization that projects the four modality representations of players from several teams into a two-dimensional space, as shown in the figure below:

[Figure: 2-D visualization of the four modality representations of players from different teams]

The visualization shows that before fusion, the structural, image, and text representations of different players exhibit no obvious pattern, whereas after modality fusion the multimodal representation vectors do: players from the same team are close to each other, while players from different teams are far apart, illustrating the effect of the modality fusion.
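For readers who want to reproduce this kind of qualitative analysis, here is a minimal sketch using t-SNE on placeholder data; the choice of t-SNE and all variable names are assumptions, not necessarily what the authors used.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-ins for learned entity representations and team labels (hypothetical data).
embeddings = np.random.randn(60, 256)   # e.g. fused representations of 60 player entities
team_ids = np.repeat(np.arange(6), 10)  # 6 teams, 10 players each

# Project to 2-D and color points by team to inspect clustering.
xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
plt.scatter(xy[:, 0], xy[:, 1], c=team_ids, cmap="tab10")
plt.title("2-D projection of entity representations")
plt.show()
```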

Summary

This paper studies link prediction on multimodal knowledge graphs, aiming specifically to improve the interaction between different modalities. To this end, the authors propose the two-stage IMF framework, which (i) exploits bilinear fusion to fully capture the complementarity between modalities and uses contrastive learning to strengthen the correlation among the different modality representations of the same entity, thus enabling effective fusion of multimodal information; and (ii) employs an ensemble objective that jointly considers the predictions from the multimodal representations. Experimental results on several benchmark datasets demonstrate the effectiveness of the proposed model. The authors further conduct in-depth analyses to illustrate the generality of the method and its potential for practical applications.


OpenKG

OpenKG (Chinese Open Knowledge Graph) aims to promote the openness, interconnection, and crowdsourcing of knowledge graph data centered on Chinese, and to promote the open-source release and openness of knowledge graph algorithms, tools, and platforms.




Source: blog.csdn.net/TgqDT3gGaMdkHasLZv/article/details/130355065