Graph technology in the era of LLMs: knowledge-graph-driven large language model applications with Llama Index

LLMs have been in full swing for more than half a year, and various large models and related frameworks have matured to the point where anyone can apply them in business practice. Along the way, we often run into one class of question: how can our existing data be better connected with an LLM? For example, for the knowledge graphs many of us already maintain, how can they deliver more value together with a large model?

In this article, I will share how to use knowledge graphs to build better in-context learning applications on top of large language models.

This article was originally written in English, and I asked ChatGPT to translate it into Chinese for me. The following is the translation prompt:

“In this thread, you are a Chinese Tech blogger to help translate my blog in markdown from English into Chinese, the blog style is clear, fun yet professional. I will paste chapters in markdown to you and you will send back the translated and polished version.”

Paradigms for Applying LLMs

As a major breakthrough in cognitive intelligence, LLMs have transformed many industries, automating, accelerating, and enabling things in ways we didn't expect. New LLM applications are created every day, and we are still discovering new ways and use cases to harness this magic.

One of the most typical patterns for introducing an LLM into a workflow is to ask the LLM to understand things based on proprietary/domain-specific knowledge. Currently, there are two paradigms for adding such knowledge to an LLM: fine-tuning and in-context learning.

Fine-tuning means additionally training the LLM to incorporate the extra knowledge, while in-context learning adds the extra knowledge to the prompt of a query.

It has been observed that in-context learning is currently preferred over fine-tuning, mainly because it is simpler; this phenomenon is discussed in this paper: https://arxiv.org/abs/2305.16938 .

Below, I share the work NebulaGraph has done on the in-context learning approach.

Llama Index: Interface between data and LLM

In-context learning

The basic idea of in-context learning is to use an existing LLM (without updating its weights) to handle a specific task over a specific knowledge dataset.

For example, to build an application that can answer any question about a person, or even act as that person's digital avatar, we can apply in-context learning to the person's autobiography together with an LLM. In practice, the application constructs a prompt from the user's question plus some information "searched" from the book, and then queries the LLM for the answer.

┌───────┐         ┌─────────────────┐         ┌─────────┐
│       │         │ Docs/Knowledge  │         │         │
│       │         └─────────────────┘         │         │
│ User  │─────────────────────────────────────▶   LLM   │
│       │                                     │         │
│       │                                     │         │
└───────┘                                     └─────────┘

In this search-based approach, one of the most effective ways to retrieve task-specific information from the documents/knowledge (the book in the example above) is to leverage embeddings.

Embedding

Embedding generally refers to a method of mapping real-world things to vectors in a multidimensional space. For example, we can map images into a (64 x 64)-dimensional space, and if the mapping is good enough, the distance between two image vectors reflects their similarity.

Another example of embedding is the word2vec algorithm, which maps each word into a vector. If the embeddings are good enough, we can perform addition and subtraction on them and might get results like the following:

vec(apple) + vec(pie) ≈ vec("apple pie"), or, equivalently, the vector vec(apple) + vec(pie) - vec("apple pie") is close to 0:

|vec(apple) + vec(pie) - vec("apple pie")| ≈ 0

Similarly, "pear" should be closer to "apple" than "dinosaur": |vec(apple) - vec(pear)| < |vec(apple) - vec(dinosaur)|

With this foundation, we could theoretically search for book fragments that are more relevant to a given question. The basic process is as follows:

  • Split the book into small pieces, create embeddings for each piece and store them
  • When there is a question, compute the embedding of the question
  • Find the top K most similar embeddings to book fragments by computing distances
  • Build prompts using questions and book fragments
  • Query LLM using prompts
                  ┌────┬────┬────┬────┐                  
                  │ 1  │ 2  │ 3  │ 4  │                  
                  ├────┴────┴────┴────┤                  
                  │  Docs/Knowledge   │                  
┌───────┐         │        ...        │       ┌─────────┐
│       │         ├────┬────┬────┬────┤       │         │
│       │         │ 95 │ 96 │    │    │       │         │
│       │         └────┴────┴────┴────┘       │         │
│ User  │─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─▶   LLM   │
│       │                                     │         │
│       │                                     │         │
└───────┘    ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐  └─────────┘
    │          ┌──────────────────────────┐        ▲     
    └────────┼▶│  Tell me ....., please   │├───────┘     
               └──────────────────────────┘              
             │ ┌────┐ ┌────┐               │             
               │ 3  │ │ 96 │                             
             │ └────┘ └────┘               │             
              ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
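
To make the five steps above concrete, here is a minimal sketch of this retrieve-then-prompt flow, assuming sentence-transformers for the embeddings and a placeholder call_llm() for the final LLM call; Llama Index, introduced next, wraps this whole flow for you:

# A minimal sketch of embedding-based retrieval plus prompt building.
# sentence-transformers and the call_llm() stub are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

def call_llm(prompt: str) -> str:
    # Placeholder: call your favorite LLM API here.
    raise NotImplementedError

chunks = ["book fragment 1 ...", "book fragment 2 ...", "book fragment 96 ..."]  # pre-split pieces
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)          # 1. embed each piece and store

def answer(question: str, k: int = 3) -> str:
    q_vec = model.encode([question], normalize_embeddings=True)[0]    # 2. embed the question
    scores = chunk_vecs @ q_vec                                       # 3. cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]                              #    take the top K fragments
    context = "\n".join(chunks[i] for i in top_k)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"  # 4. build the prompt
    return call_llm(prompt)                                           # 5. query the LLM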

Llama Index

Llama Index is an open source toolkit that helps us do in-context learning with best practices:

  • It provides various data loaders to serialize documents/knowledge in a unified format, such as PDF, Wikipedia, Notion, Twitter, etc., so that we don't need to handle preprocessing, splitting data into fragments, etc. by ourselves.
  • It also helps us create embeddings (and other forms of indexing) and store embeddings in memory or in vector databases with one line of code.
  • It has built-in prompts and other engineering practices, so we don't need to create and research them from scratch, for example, "create a chatbot on existing data with 4 lines of code" (see the sketch below).
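
For reference, here is roughly what those few lines look like, based on the Llama Index quickstart at the time of writing (import paths and class names may differ in newer versions):

# A sketch of the "chatbot over your own data in a few lines" quickstart.
# Assumes an OpenAI API key is configured and documents are placed under ./data.
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))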

Problems with document segmentation and embedding

Embeddings and vector search work well in many cases, but in some scenarios there are still challenges, such as the loss of global context / cross-node context.

Imagine the query "Please tell me about the author and foo". Suppose that in the book the fragments numbered 1, 3, 6, 19-25, 30-44, and 96-99 all touch on the topic foo. In that case, simply searching for the top-k fragments by embedding similarity is not very effective, because only the few most similar fragments (say k = 3) are considered, and a lot of contextual information is lost.

┌────┬────┬────┬────┐
│ 1  │ 2  │ 3  │ 4  │
├────┴────┴────┴────┤
│  Docs/Knowledge   │
│        ...        │
├────┬────┬────┬────┤
│ 95 │ 96 │    │    │
└────┴────┴────┴────┘

Within Llama Index, the way to solve or at least mitigate this problem is to create composite indexes on top of the basic ones.

The vector store (VectorStore) index is only one of them. Besides it, we can define a summary index, a tree index, and so on, and route different types of questions to different indexes, so that the global context is not missed when it is needed.
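
A minimal sketch of that idea, assuming the vector and list (summary) indexes from Llama Index at the time of writing and a toy hand-written routing rule; Llama Index also ships proper router and composable-index machinery for this, whose exact API varies by version:

# A sketch: build two indexes over the same data and route questions between them.
# The routing heuristic here is a toy example, not the Llama Index router.
from llama_index import GPTListIndex, GPTVectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
vector_index = GPTVectorStoreIndex.from_documents(documents)    # good for pointed, local questions
summary_index = GPTListIndex.from_documents(documents)          # good for global/summary questions

def ask(question: str):
    if "summarize" in question.lower() or "overall" in question.lower():
        return summary_index.as_query_engine(response_mode="tree_summarize").query(question)
    return vector_index.as_query_engine().query(question)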

However, with knowledge graphs, we can take a more interesting approach:

Knowledge graph

The term knowledge graph was first coined by Google in May 2012 as part of its practice of enhancing search results to provide users with more contextual information. Knowledge graphs aim to understand the relationships between entities and provide answers to queries directly, rather than just returning a list of related web pages.

A knowledge graph is a way to organize and connect information in the form of a graph structure, where nodes represent entities and edges represent relationships between entities. Graph structures allow users to efficiently store, retrieve, and analyze data.

The original post includes a figure here illustrating this structure: entities as nodes, relationships as edges.
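
As a stand-in for that figure, the toy example below shows what such a graph looks like as data; the entities and relations are made up for illustration:

# A toy knowledge graph as a list of (subject, predicate, object) triples.
# The entities and relations are invented examples, not taken from the original figure.
triples = [
    ("Guardians of the Galaxy Vol. 3", "directed_by", "James Gunn"),
    ("James Gunn", "born_in", "St. Louis"),
    ("Guardians of the Galaxy Vol. 3", "produced_by", "Marvel Studios"),
]

# Nodes are the entities, edges are the relationships between them.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
print(nodes)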

Now the question is: as mentioned above, a knowledge graph can help with the problems of document segmentation and embedding. So how exactly does it help us?

Combining embeddings and knowledge graphs

The basic idea here is that a knowledge graph, as a refined form of information, cuts data into a finer and smaller granularity than our manual splitting into fragments. Combining the fine-grained data of the knowledge graph with the coarser, manually processed chunks, we can better serve queries that require global/cross-node context.

Let's walk through a question. Look at the diagram below, and suppose the question is related to x, and that 20 of all the data fragments are highly related to x. Now, in addition to fetching the top 3 document fragments of the main context (say, the fragments numbered 1, 2 and 96), we also perform a two-hop query on the knowledge graph, so that the complete context includes:

  • Question: "Tell me things about the author and x"
  • The original text from document fragments number 1, 2, and 96. In Llama Index, they are called Node 1, Node 2, and Node 96.
  • The 10 triples in the knowledge graph that contain "x", obtained by a graph traversal of depth 2 starting from "x":
    • x -> y (from node 1)
    • x -> a (from node 2)
    • x -> m (from node 4)
    • x <- b -> c (from node 95)
    • x -> d (from node 96)
    • n -> x (from node 98)
    • x <- z <- i (from node 1 and node 3)
    • x <- z <- b (from node 1 and node 95)
┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ .─.       .─.    │  .─.       .─.   │            .─.   │  .─.       .─.   │
│( x )─────▶ y )   │ ( x )─────▶ a )  │           ( j )  │ ( m )◀────( x )  │
│ `▲'       `─'    │  `─'       `─'   │            `─'   │  `─'       `─'   │
│  │     1         │        2         │        3    │    │        4         │
│ .─.              │                  │            .▼.   │                  │
│( z )◀────────────┼──────────────────┼───────────( i )─┐│                  │
│ `◀────┐          │                  │            `─'  ││                  │
├───────┼──────────┴──────────────────┴─────────────────┼┴──────────────────┤
│       │                      Docs/Knowledge           │                   │
│       │                            ...                │                   │
│       │                                               │                   │
├───────┼──────────┬──────────────────┬─────────────────┼┬──────────────────┤
│  .─.  └──────.   │  .─.             │                 ││  .─.             │
│ ( x ◀─────( b )  │ ( x )            │                 └┼▶( n )            │
│  `─'       `─'   │  `─'             │                  │  `─'             │
│        95   │    │   │    96        │                  │   │    98        │
│            .▼.   │  .▼.             │                  │   ▼              │
│           ( c )  │ ( d )            │                  │  .─.             │
│            `─'   │  `─'             │                  │ ( x )            │
└──────────────────┴──────────────────┴──────────────────┴──`─'─────────────┘

Obviously, the (possibly valuable) refined information about the topic x that lives in other nodes and spans across nodes can now, because we introduced the knowledge graph, be included in the prompt for in-context learning, thus overcoming the problem described above.
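
In Llama Index, this combined retrieval is what the knowledge graph index is for; below is a rough sketch based on its API around the time of writing (class names and parameters may differ across versions):

# A sketch: build a knowledge graph index whose query engine returns both
# the matching triples and the source text chunks they came from.
# Assumes llama_index imports of the 0.6/0.7 era and an OpenAI API key.
from llama_index import KnowledgeGraphIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=10,       # how many (subject, predicate, object) triples to extract per chunk
)
query_engine = kg_index.as_query_engine(
    include_text=True,               # also return the underlying text chunks, not just triples
    response_mode="tree_summarize",
)
print(query_engine.query("Tell me things about the author and x"))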

Advances in Knowledge Graphs in Llama Index

Initially, William FH introduced the knowledge graph as a first-class concept in Llama Index, where the triples of the knowledge graph are associated with keywords and stored in in-memory documents; later, Logan Markewich added an embedding for each triple.

Over the last few weeks, I've been working with the Llama Index community on bringing a "GraphStore" storage context to Llama Index, which introduces external storage for knowledge graphs. The first external graph store connects to the open-source distributed graph database NebulaGraph, and I have implemented it.

During the implementation, options for the number of hops to traverse in the graph and for collecting more key entities among the top-k nodes were also introduced, so that the search in the knowledge graph yields more global context. These changes are still in progress.
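
For the curious, using the NebulaGraph-backed GraphStore looks roughly like the sketch below, based on the work-in-progress API at the time; the space name, edge types, and tags shown here are placeholder values:

# A sketch of plugging NebulaGraph in as the external graph store for the knowledge graph index.
# Assumes a running NebulaGraph cluster whose connection details are configured via environment variables.
from llama_index import KnowledgeGraphIndex, SimpleDirectoryReader, StorageContext
from llama_index.graph_stores import NebulaGraphStore

graph_store = NebulaGraphStore(
    space_name="llamaindex",             # placeholder graph space name
    edge_types=["relationship"],
    rel_prop_names=["relationship"],
    tags=["entity"],
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

documents = SimpleDirectoryReader("data").load_data()
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,     # triples are persisted to NebulaGraph instead of in-memory
    max_triplets_per_chunk=10,
)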

With GraphStore introduced, in-context learning can also be performed on top of an existing knowledge graph and combined with other indexes, which is very promising, because knowledge graphs are considered to have a higher information density than other structured data.

This article is only the beginning, describing the relationship between knowledge graphs and LLMs. In follow-up articles, I will share concrete, hands-on practices of building LLM applications with knowledge graphs.

--

Thank you for reading this article (///▽///)

You are welcome to read the source code of NebulaGraph on GitHub, or try using it to solve your business problems~ GitHub address: https://github.com/vesoft-inc/nebula If you want to exchange ideas about graph technology, please come to the forum: https://discuss.nebula-graph.com.cn/
