Large Language Models and Knowledge Graphs: Opportunities and Challenges

Large language models and knowledge graphs are two complementary ways of expressing knowledge, and how to combine them has become a focus of recent attention in industry and academia. This post gives a brief introduction to the recent position paper "Large Language Models and Knowledge Graphs: Opportunities and Challenges" by researchers from the University of Edinburgh and other institutions, which discusses how large models and knowledge graphs can reinforce each other. It is well worth reading.

Source: Zhuanzhi (专知)



Large language models (LLMs) have taken the field of knowledge representation, and the world at large, by storm. This inflection point marks a shift from explicit knowledge representation toward a renewed focus on hybrid representations that combine explicit and parametric knowledge. In this position paper, we discuss some common points of debate in the community concerning LLMs (parametric knowledge) and knowledge graphs (explicit knowledge), and speculate on the opportunities, visions, research topics, and challenges that this renewed focus brings.


Large language models (LLMs) have taken knowledge representation (KR), and the world at large, by storm, because they have demonstrated human-level performance on a wide range of natural language tasks, including some that require human knowledge. As a result, people have gradually come to accept the possibility that knowledge can reside in the parameters of a language model. The advent of LLMs marks the beginning of an era of knowledge computing, in which the notion of reasoning within KR is broadened to many computational tasks over a variety of knowledge representations.

This is a huge step for the field of knowledge representation. For a long time, the field focused on explicit knowledge: knowledge embedded in text, sometimes referred to as unstructured data, and knowledge that exists in structured form, such as in databases and knowledge graphs (KGs) [123]. Historically, people used text to pass knowledge from one generation to the next, until around the 1960s, when researchers began to study knowledge representation in order to better understand natural language and developed early systems such as MIT's ELIZA [180]. In the early 2000s, the knowledge representation and Semantic Web communities collaborated to standardize widely used knowledge representation languages, such as RDF [121] and OWL [55], at web scale; using them, large-scale knowledge bases, now broadly referred to as KGs [123], support both logical reasoning and graph-based learning thanks to their useful graph structure. The current inflection point, brought about by the advent of LLMs, marks a paradigm shift from explicit knowledge representation to a renewed focus on hybrid representations of both explicit and parametric knowledge.

As a popular approach to explicit knowledge representation, KGs are now widely studied in combination with Transformer-based LLMs, including pretrained masked language models (PLMs) such as BERT [39] and RoBERTa [104], and, more recently, generative LLMs such as the GPT series [23] and LLaMA [165]. Some works augment KGs with LLMs, e.g., for knowledge extraction, KG construction, and refinement, while others augment LLMs with KGs, e.g., for training, prompt learning, or knowledge augmentation. In this paper, by considering both directions, LLMs for KGs and KGs for LLMs, we aim to provide a better understanding of the transition from explicit knowledge representation to a renewed focus on hybrid representations of both explicit and parametric knowledge.

A related survey paper [204] provides a comprehensive review of KG construction and reasoning with LLMs, while our work offers a deeper perspective on this turning point, considering not only relational KGs but also KGs that use ontologies as their schema, as well as structured knowledge in other forms, including tabular data [183] and numerical values [122]. Other research at the intersection of LLMs and KGs overlaps slightly with the topics covered in our paper, for instance comparisons of GPT-4, ChatGPT, and SOTA fine-tuning methods on knowledge-related tasks such as entity, relation, and event extraction, link prediction, and KG question answering [204]. Overall, none of these papers delves into the application-specific implications of this turning point. To this end, this paper summarizes common points of controversy within the community, surveys the state of the art on a range of topics in the integration of KGs and LLMs, and proposes opportunities and challenges.

Knowledge Graphs and Large Language Models

Given the opportunities and visions created by the availability of both parametric and explicit knowledge, in this section we classify, summarize, and present recent developments in the combined use of LLMs and KGs along four distinct themes.

4.1 LLMs for KGs: Knowledge Extraction and Normalization

KG construction is a complex task that requires the collection and integration of information from a wide range of sources including structured, semi-structured, and unstructured data. Traditional approaches usually rely on modules specially designed to handle each data type and face difficulties when the content is diverse and the structure is heterogeneous. However, LLMs are powerful NLP models trained on a wide range of information sources, making them well suited for knowledge extraction tasks. This section presents work on knowledge extraction from various sources using LLMs.

Entity Resolution and Matching 

Entity resolution (also known as entity matching, entity linking, or entity alignment) is the process of linking pieces of information that appear in multiple heterogeneous datasets and refer to the same entity [46, 50, 126]. Past research has mainly focused on developing methods and similarity measures for entities represented as flat, structured data. Entity resolution over semi-structured data for KGs, however, is a relatively new topic that has received significantly less attention. Methods for entity alignment can be divided into general methods and embedding-based methods. Among general methods, CG-MuAlign [203] uses graph neural networks (GNNs) to perform multi-type entity alignment, exploiting neighborhood information and generalizing to unlabeled types, while REA [129] addresses multilingual entity alignment over noisily labeled input data by combining adversarial training with GNNs. Embedding-based entity alignment methods map entities into a vector space, turning symbolic similarity between graph entities into geometric similarity, which removes the heterogeneity of graph components and facilitates reasoning [156]. Specifically, [156] cross-compares 23 representative embedding-based alignment methods in terms of performance, but also shows that they require substantial supervision during the labeling stage. Unsupervised methods, and methods capable of handling large-scale KGs, are therefore welcome directions for future research.

LLMs can be used in several ways for entity resolution and linking in KGs [7]. First, LLMs can help label training data, which is usually a resource-intensive and time-consuming step that hinders entity alignment over KGs. Similar to the use of generative adversarial networks (GANs) in [146] to reduce labeling effort, we argue that LLMs can provide labeled samples for KGs and thereby improve the performance of the embedding-based methods above. Furthermore, LLMs can help build a robust corpus of entity matching rules, provided that a declarative, formalized logic language L is defined in the graph setting. The training data for this logical language should be provided as input to LLMs, similar to SQL statements available for consumption in a text corpus. However, prompt engineering is required to produce rule corpora that are meaningful for practical large-scale KGs such as DBpedia [9] and Wikidata [169]. It is conceivable to provide entity matching rule logs for these practical large-scale KGs, analogous to the query logs already available for them [18, 19].

In conclusion, entity alignment and matching are necessary preprocessing steps for full knowledge reasoning. Combining general entity linking approaches with embedding-based approaches, and leveraging LLM-driven construction of rules and labeled data, can better integrate LLMs with knowledge reasoning [66]. Integrating LLMs with knowledge reasoning can also improve performance, make model outputs explainable and interpretable, and bridge the gap between symbolic and statistical AI.
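To make the labeling idea above concrete, here is a minimal sketch, not taken from the paper, of using an LLM to pseudo-label candidate entity pairs; such labels could then seed an embedding-based aligner. The `call_llm` function, the prompt wording, and the sample records are all illustrative assumptions rather than a specific API or dataset.

```python
# Hedged sketch: LLM-assisted pseudo-labeling of candidate entity pairs.
# `call_llm` is a hypothetical stand-in for any chat-completion endpoint.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError

def describe(entity: dict) -> str:
    # Flatten an entity's label and attributes into a single line of text.
    attrs = ", ".join(f"{k}={v}" for k, v in entity.get("attributes", {}).items())
    return f'{entity["label"]} ({attrs})'

def pseudo_label(e1: dict, e2: dict) -> bool:
    prompt = (
        "Do the following two records refer to the same real-world entity? "
        "Answer strictly 'yes' or 'no'.\n"
        f"Record A: {describe(e1)}\nRecord B: {describe(e2)}"
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# Hypothetical candidate pair produced by an earlier blocking step:
pair = (
    {"label": "Edinburgh", "attributes": {"country": "UK", "type": "city"}},
    {"label": "City of Edinburgh", "attributes": {"country": "Scotland, UK"}},
)
# labels = [pseudo_label(a, b) for a, b in candidate_pairs]  # feed to the aligner
```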

Knowledge Extraction from Tabular Data

Extracting knowledge from tabular data, such as databases, web tables, and CSV files, is a common way to construct KGs. For tables with known semantics (meta-information), heuristic rules can be defined to transform their data into KG facts. However, real-world tables often have ambiguous semantics, with important meta-information such as table names and column headers left undefined. At the same time, raw data often needs to be retrieved, explored, integrated, and curated before the desired knowledge can be extracted.

In recent years, Transformer-based LMs have been studied for processing tables, especially their textual content. They can be used to learn vector representations of tables that serve as the basis for downstream prediction tasks [168]. TURL [38] is a typical table representation learning method built on BERT [39], and it has been applied to multiple tasks such as cell filling, column type annotation, and relation extraction. Similarly, RPT [162] uses BERT and GPT to pre-train table representation models. Starmie [47] uses templates to convert columns into sequences and fine-tunes BERT within a contrastive learning framework, using related and unrelated column pairs as positive and negative samples.

Across table processing tasks, semantic table annotation, which matches table data to KG components (e.g., table columns to KG classes, table cells to KG entities, inter-column relations to KG properties), can be directly applied to extract knowledge for KG construction and population [103, 76]. There have been several attempts to use LLMs for these tasks. Doduo [155] serializes tables into a sequence of tokens and trains BERT to predict column types and inter-column relationships. Korini et al. [86] prompt ChatGPT to annotate semantic column types; when task-specific examples are rare or non-existent, ChatGPT performs similarly to a RoBERTa model.
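As a rough illustration of the prompting approach in the spirit of Korini et al. [86], the sketch below serializes a sample of a column's cells and asks a model to pick a semantic type. The candidate type list, prompt wording, and `call_llm` stub are assumptions for illustration, not the setup used in the cited work.

```python
# Hedged sketch: zero-shot semantic column-type annotation via prompting.

def call_llm(prompt: str) -> str:  # placeholder for any chat-completion API
    raise NotImplementedError

CANDIDATE_TYPES = ["Person", "Location", "Organization", "Date", "Other"]

def annotate_column(header: str, cells: list[str]) -> str:
    sample = "; ".join(cells[:10])  # a small cell sample keeps the prompt short
    prompt = (
        f"Column header: {header}\nSample values: {sample}\n"
        f"Choose the best semantic type from {CANDIDATE_TYPES}. "
        "Reply with the type only."
    )
    return call_llm(prompt).strip()

# annotate_column("birthplace", ["Paris", "Lagos", "Kyoto"])  -> "Location" (ideally)
```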

Although attention has been paid to using LLMs for tabular data processing and KG construction, there is still much room for research, especially regarding the following challenges:

  • Transforming table content into a sequence: A table, or a table element together with its structured context, needs to be converted into a sequence before it can be fed to an LLM (a minimal serialization sketch follows this list). Different conversion methods are required for different LLM usage scenarios, such as fine-tuning LLMs, prompt-based LLM inference, and instruction tuning of LLMs.

  • Representing and utilizing non-textual tabular data: Tables often contain not only long and short text but also other data types such as numbers and dates. So far, little work has considered these data.

  • Extracting tabular knowledge: LLMs have mainly been used to process and understand tables, but are rarely applied in the final step of knowledge extraction. OntoGPT [25] is known to use ChatGPT to extract instances from text to populate ontologies, but there is no corresponding tool for tables. Beyond instances, extracting relational facts is even more challenging.
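As referenced in the first challenge above, the table-to-sequence step can be as simple as linearizing each column into a text segment that a PLM such as BERT can consume. The separator tokens below are illustrative; actual systems (e.g., Doduo [155]) define their own serialization schemes.

```python
# Hedged sketch: linearizing a table into a token sequence for a PLM.

def serialize_table(table: dict[str, list[str]]) -> str:
    parts = []
    for header, cells in table.items():
        # One segment per column: header followed by its cell values.
        parts.append(f"[COL] {header} [VAL] " + " [VAL] ".join(cells))
    return " ".join(parts)

table = {
    "city": ["Edinburgh", "Glasgow"],
    "population": ["506,520", "635,640"],
}
print(serialize_table(table))
# [COL] city [VAL] Edinburgh [VAL] Glasgow [COL] population [VAL] 506,520 [VAL] 635,640
```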

Knowledge Extraction from Text

Extracting knowledge from text typically involves automatically extracting entities and the relations between them, with traditional pipelines processing large numbers of sentences and documents. This process transforms raw text into actionable knowledge, which supports applications such as information retrieval, recommender systems, and KG construction. The language understanding capabilities of LLMs have enhanced this process, for example:

  1. Named Entity Recognition (NER) and Entity Linking: As described in Section 4.1.1, this involves identifying and classifying named entities (such as people, organizations, and places) in text and linking them (see Section 4.2.1 for more) to KGs.

  2. Relation Extraction: Focuses on identifying and classifying relations between entities, leveraging the zero-shot and few-shot in-context learning abilities of LLMs [178, 93].

  3. Event extraction: Aims at detecting and classifying events mentioned in text, including their actors and attributes [170, 194].

  4. Semantic Role Labeling (SRL): involves identifying the roles played by entities in sentences, such as subject, object, and predicate [148, 199].

These methods allow LLMs to extract information from text without extensive explicit training in a specific domain, increasing their versatility and adaptability. Furthermore, LLMs have demonstrated the ability to extract knowledge from languages other than English, including low-resource languages, paving the way for cross-lingual knowledge extraction and enabling LLMs to be used in multilingual settings [89].
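The few-shot, in-context style of relation extraction mentioned above can be sketched as follows: a couple of worked examples are placed in the prompt and the model is asked to emit (head, relation, tail) triples for a new sentence. The example sentences, output format, and `call_llm` stub are our illustrative assumptions, not taken from the cited works.

```python
# Hedged sketch: few-shot in-context relation extraction via prompting.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any chat-completion API

FEW_SHOT = (
    "Sentence: Marie Curie was born in Warsaw.\n"
    "Triples: (Marie Curie, born_in, Warsaw)\n\n"
    "Sentence: Tim Berners-Lee founded the W3C.\n"
    "Triples: (Tim Berners-Lee, founder_of, W3C)\n\n"
)

def extract_triples(sentence: str) -> str:
    prompt = FEW_SHOT + f"Sentence: {sentence}\nTriples:"
    return call_llm(prompt).strip()

# extract_triples("Ada Lovelace worked with Charles Babbage.")
```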

Moreover, prompting LLMs introduces new paradigms and possibilities in NLP. LLMs can generate high-quality synthetic data, which can then be used to fine-tune smaller, task-specific models. This approach, known as synthetic data generation, addresses the challenge of limited training data and improves model performance [77, 163]. In addition, instruction tuning has emerged as a powerful technique, in which LLMs are trained on datasets described by explicit instructions, enabling more precise control and tailoring of their behavior to specific tasks [178, 174].
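A minimal, assumption-laden sketch of the synthetic data generation idea is shown below: an LLM is prompted to produce labeled examples (here, sentences expressing a given relation), which are collected to fine-tune a smaller task-specific model. The JSON output contract and the `call_llm` stub are hypothetical.

```python
# Hedged sketch: generating synthetic relation-extraction training data.

import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any chat-completion API

def generate_examples(relation: str, n: int = 5) -> list[dict]:
    prompt = (
        f"Write {n} short sentences expressing the relation '{relation}' "
        "between two entities. Return a JSON list of objects with keys "
        "'sentence', 'head', 'tail'."
    )
    return json.loads(call_llm(prompt))

# synthetic = generate_examples("founder_of")
# ...then fine-tune a small relation classifier on `synthetic`.
```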

Moreover, when constructing domain-specific KGs the stakes are higher, so review of the generated text (by experts) is necessary. This is still an improvement, however, since human annotation is less expensive than human text generation. Beyond the massive computational resources required to train and use these LLMs, there are various challenges, including those mentioned in Section 2. More specifically, the following future directions remain open:

  • Efficient extraction from very long documents: Current LLMs cannot process very long documents, such as novels, in one pass. In this regard, modeling long-range dependencies and performing corpus-level information extraction can be further improved (a rough chunking sketch follows this list).

  • High-coverage information extraction: Almost all extraction pipelines focus on high precision, while high recall has been overlooked or under-explored [152]. Building knowledge extractors with both high precision and high recall would be a giant leap toward lifelong information extraction.
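As referenced in the first bullet above, one common workaround for documents that exceed a model's context window is to split the text into overlapping chunks, extract per chunk, and merge the partial results. The sketch below is our illustration of that idea, with `extract_entities` left as a placeholder for any LLM- or PLM-based NER call; chunk sizes are arbitrary.

```python
# Hedged sketch: corpus-level entity extraction over a long document.

from collections import Counter

def chunk(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    # Overlapping character windows, so entities on chunk borders are not lost.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def extract_entities(chunk_text: str) -> list[str]:
    """Placeholder for an LLM- or PLM-based NER call on one chunk."""
    raise NotImplementedError

def corpus_level_entities(document: str) -> Counter:
    counts: Counter = Counter()
    for c in chunk(document):
        counts.update(extract_entities(c))
    return counts  # merged mentions across the whole document
```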

4.2 LLMs for KGs: Knowledge Graph Construction

We highlight the important role of LLMs in improving knowledge graph construction, focusing on current trends, issues, and open questions in this field. We first discuss link prediction, a method for deriving new facts from an existing knowledge graph. Next, we examine inductive link prediction, which predicts triples involving unseen relations. Our focus then shifts to a more recent approach in which triples are extracted directly from the LLM's parametric knowledge. We conclude this section by discussing the challenges of LLM-based knowledge graph construction, which involve long-tail entities, numerical values, and the precision of these methods.
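The "triples from parametric knowledge" idea can be sketched roughly as follows: a subject and relation are turned into a cloze-style question, and the model's answer is treated as a candidate object for a new triple, which would still need validation before entering a KG. The templates and `call_llm` stub are illustrative assumptions, not the method of any specific cited system.

```python
# Hedged sketch: probing an LLM's parametric knowledge for candidate triples.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any chat-completion API

TEMPLATES = {
    "capital_of": "What is the capital of {subject}? Answer with the name only.",
    "author_of": "Who wrote {subject}? Answer with the name only.",
}

def probe_triple(subject: str, relation: str) -> tuple[str, str, str]:
    answer = call_llm(TEMPLATES[relation].format(subject=subject)).strip()
    return (subject, relation, answer)

# probe_triple("Scotland", "capital_of")  -> ("Scotland", "capital_of", "Edinburgh")
```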

4.3 LLMs for KGs: Ontology Schema Construction

A knowledge graph is usually equipped with an ontology as its schema (including rules, constraints, and ontologies) to ensure quality, make knowledge access easier, support reasoning, and so on. At the same time, an independent ontology, which usually represents conceptual knowledge and sometimes logic, can itself be regarded as a knowledge graph. In this section, we introduce the application of LLMs to learning ontology schemas and managing ontologies.
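As one small, illustrative piece of ontology schema learning, a model can be asked whether one concept should be modeled as a subclass of another; such judgments would only be suggestions for a human modeler. The prompt wording and `call_llm` stub are assumptions, not a method from the paper.

```python
# Hedged sketch: LLM-suggested subclass-of judgments for schema construction.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any chat-completion API

def is_subclass(child: str, parent: str) -> bool:
    prompt = (
        f"In a general-purpose ontology, should '{child}' be modeled as a "
        f"subclass of '{parent}'? Answer strictly 'yes' or 'no'."
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# is_subclass("ElectricCar", "Vehicle")  -> True (ideally)
```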

4.4 KGs for LLMs: Training and Access

In Sections 4.1 to 4.3, we discussed three different aspects of using LLMs to support KGs. In this section, we look at the opposite direction: using KGs to power LLMs. There are several dimensions to this. First, KGs can be used as training data for LLMs. Second, triples from KGs can be used to construct prompts. Last but not least, KGs can serve as sources of external knowledge for retrieval-augmented language models.
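The prompt construction and retrieval dimensions can be combined in a simple way: triples about entities mentioned in a question are verbalized and prepended to the prompt as context. The tiny in-memory KG, string-matching retriever, and `call_llm` stub below are illustrative assumptions, not a real retrieval stack.

```python
# Hedged sketch: KG-augmented prompting for question answering.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any chat-completion API

KG = [
    ("Edinburgh", "capital_of", "Scotland"),
    ("Edinburgh", "population", "506,520"),
]

def retrieve(question: str) -> list[tuple[str, str, str]]:
    # Naive retrieval: keep triples whose subject is mentioned in the question.
    return [t for t in KG if t[0].lower() in question.lower()]

def answer_with_kg(question: str) -> str:
    facts = "\n".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in retrieve(question))
    prompt = f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

# answer_with_kg("How many people live in Edinburgh?")
```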

4.5 Applications 

Integrating KGs and LLMs into a unified approach has great potential, as the two can enhance and complement each other in valuable ways. For example, KGs provide highly accurate and unambiguous knowledge, which is crucial for applications such as healthcare, whereas LLMs have been criticized for hallucinating and producing inaccurate facts due to a lack of factual knowledge. Second, LLMs lack interpretability, whereas KGs can produce interpretable results thanks to their symbolic reasoning capabilities. On the other hand, constructing KGs from unstructured text is difficult and the results are often incomplete; LLMs can be used to address these challenges through their text processing abilities. Various applications have adopted this approach of combining LLMs with KGs, such as medical assistants, question answering systems [188] and chatbots, and sustainability applications.

Conclusion

In recent years, the progress of large language models (LLMs) has marked an important turning point for knowledge graph (KG) research. Although the important question of how best to combine their strengths remains open, it presents exciting opportunities for future research. The community has rapidly adjusted its research focus: new forums such as the KBC-LM workshop [79] and the LM-KBC challenge [151] have emerged, and resources have shifted heavily toward hybrid approaches to knowledge extraction, integration, and use. We make the following recommendations:

  1. Don't discard KGs because of the paradigm shift: For a range of reliability- or safety-critical applications, structured knowledge remains indispensable, and we have outlined multiple ways in which KGs and LLMs can enhance each other. KGs are here to stay; don't ditch them just to follow fashion.

  2. Kill your darlings: LLMs have greatly advanced many tasks in KG and ontology construction pipelines, and have even made some tasks obsolete. Even the most established pipeline components should be rigorously reviewed and continually compared with the latest LLM-based techniques.

  3. Be curious, but be critical: LLMs are undoubtedly the most impressive product of AI research in the past few years. Nonetheless, there are plenty of exaggerated claims and expectations in both the public discourse and the research literature, and one should maintain a dose of critical reflection. In particular, no fundamental solution to the problem of so-called hallucinations has emerged yet.

  4. The past is over; start a new journey: The advances triggered by LLMs have disrupted the field in unprecedented ways and enabled important shortcuts into it. Now is an excellent time to start new work in areas related to knowledge computing. Although the direction of the current shift remains broadly open, as researchers continue to explore the potential and challenges of hybrid methods, we can expect new breakthroughs in the representation and processing of knowledge, with profound implications for fields ranging from knowledge computing to NLP, AI, and beyond.



