Technology News | LangChain has added Cypher Search

Reprinted from the public account | RowanFYI


By using the LangChain library, you can easily generate Cypher queries to efficiently retrieve information from Neo4j.

If you have developed or plan to implement any solution that uses large language models (LLMs), chances are you have heard of the LangChain library. LangChain is the most widely known Python library for developing LLM-based applications. It is designed to be modular, allowing us to use any LLM with any of the available modules, such as chains, tools, memory, or agents.

A month ago, I spent a week researching and implementing a solution that would allow anyone to retrieve information from Neo4j directly through the LangChain library and use it in their LLM application. I learned a lot about the internals of the LangChain library and documented my experience in a blog post.

A colleague of mine showed me a LangChain feature request in which a user asked for my work on retrieving information from a Neo4j database to be added directly to the LangChain library as a module, so that no extra code or external modules would be required to integrate Neo4j into a LangChain application. Since I was already familiar with the internals of LangChain, I decided to try to implement the Cypher search functionality myself. I spent a weekend researching and coding the solution and making sure it met the contribution criteria so it could be added to the library. Fortunately, the maintainers of LangChain are very active and open, and Cypher search has been added in the latest version of the LangChain library. A big thank you to the maintainers for maintaining this excellent library and being very responsive to new ideas.

In this blog post, I will show you how to retrieve information from a Neo4j database using the newly added Cypher search functionality in the LangChain library.

What is a Knowledge Graph

LangChain is already integrated with vector and SQL databases, so why do we need an integration with graph databases like Neo4j?

d7ed48ba84b61dd64e0f9081dfa29d07.png

Knowledge graphs are ideal for storing heterogeneous and highly connected data. For example, the image above contains information about people, organizations, movies, websites, etc. While the ability to visually model and store various data sets is incredible, I think the main benefit of using graphs is the ability to analyze data points through the relationships between them. Graphs allow us to discover connections and correlations in context around individual data points that traditional database and analytics methods tend to overlook.

The power of graph databases really shines when dealing with complex systems, where interdependencies and interactions are critical to understanding the system.

They allow us to go beyond individual data points and delve deeper into the complex relationships that define their context. This provides a deeper, more comprehensive view of the data, facilitating better decision-making and knowledge discovery.

Set up the Neo4j environment

If you already have an existing Neo4j database, you can use it to try out the newly added Cypher search functionality. The Cypher search module uses graph schema information to generate Cypher statements, which means you can plug it into any Neo4j database.

If you don't have a Neo4j database yet, you can use Neo4j Sandbox, which provides a free cloud instance of a Neo4j database. You need to register and instantiate one of the available pre-populated databases. In this blog post, I will be using the ICIJ Paradise Papers dataset, but you can use any other dataset. The dataset was provided by the International Consortium of Investigative Journalists (ICIJ) as part of its Offshore Leaks Database.

6ff05332adf63b6f5e74f8894d05abf8.png

The graph contains four types of nodes:

  • **Entity** - Offshore legal entities. This can be a company, trust, foundation or other legal entity created in a low tax jurisdiction.

  • **Officer** - An individual or company that plays a role in the offshore entity, such as beneficiary, director or shareholder. The relationships shown in the diagram are a sample of existing relationships.

  • **Intermediary** - A go-between, usually a law firm or a middleman, that connects someone seeking an offshore company with an offshore service provider and asks that provider to create the offshore company.

  • **Address** - Registered address as shown in the original database obtained by ICIJ.

Knowledge Graph Cypher Search

The name Cypher Search comes from Cypher, which is a query language for interacting with graph databases such as Neo4j.

dac85f8a59159acf75ab3a3cc5e3dc16.png

Knowledge graph Cypher chain workflow

In order for LangChain to retrieve information from the graph database, I implemented a module that can convert natural language into Cypher statements, use it to retrieve data from Neo4j, and return the retrieved information to the user in natural language. This two-way conversion process between natural language and database language not only enhances the overall accessibility of data retrieval, but also greatly improves user experience.
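The two-way conversion described above can be sketched in plain Python. This is only an illustration of the idea, not the actual code inside LangChain: the prompt templates, the `cypher_qa` function, and the `fake_llm` and `fake_run_query` stand-ins are all hypothetical, and the one-line schema (including the relationship direction) is an assumption made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates; the prompts used by the real chain may differ.
CYPHER_PROMPT = (
    "Schema:\n{schema}\n\n"
    "Write a Cypher statement that answers the question.\n"
    "Question: {question}"
)
ANSWER_PROMPT = (
    "Question: {question}\n"
    "Database results: {context}\n"
    "Answer in natural language."
)


def cypher_qa(
    question: str,
    schema: str,
    llm: Callable[[str], str],
    run_query: Callable[[str], List[Dict]],
) -> str:
    """Natural language -> Cypher -> database rows -> natural language."""
    # Step 1: ask the model for a Cypher statement grounded in the schema.
    cypher = llm(CYPHER_PROMPT.format(schema=schema, question=question))
    # Step 2: run the generated statement against the database.
    rows = run_query(cypher)
    # Step 3: ask the model to phrase the returned rows as an answer.
    return llm(ANSWER_PROMPT.format(question=question, context=rows))


# Stand-ins so the sketch runs without OpenAI or Neo4j (both hypothetical).
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Schema:"):
        return "MATCH (e:Entity) RETURN count(e) AS entities"
    return "There are 3 entities."


def fake_run_query(cypher: str) -> List[Dict]:
    return [{"entities": 3}]


answer = cypher_qa(
    "How many entities are there?",
    "(:Officer)-[:OFFICER_OF]->(:Entity)",  # assumed schema fragment
    fake_llm,
    fake_run_query,
)
print(answer)
```

Passing the model and the database as plain callables keeps the two LLM calls and the single database call easy to see, and easy to replace with real clients.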

The beauty of the LangChain library is its simplicity. We only need a few lines of code to get information from Neo4j using natural language.

from langchain.chat_models import ChatOpenAI
from langchain.chains import GraphCypherQAChain
from langchain.graphs import Neo4jGraph

# Connect to the Neo4j instance (replace with your own connection details)
graph = Neo4jGraph(
    url="bolt://54.172.172.36:7687",
    username="neo4j",
    password="steel-foreheads-examples",
)

# Build the chain; verbose=True shows the generated Cypher and the retrieved data
chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    verbose=True,
)

Here, we use OpenAI's gpt-3.5-turbo model to generate Cypher statements. The statements are generated based on the graph schema, which means you can theoretically plug the Cypher chain into any Neo4j instance and it should be able to answer natural language questions. Unfortunately, I have not tested the ability of other LLM providers to generate Cypher statements, as I do not have access to any of them. However, if you're willing to give it a try, I'd love to hear how other LLMs perform at generating Cypher statements. Of course, if you want to avoid depending on a cloud LLM provider, you can also fine-tune an open-source LLM to generate Cypher statements.

Let's start with a simple test.

chain.run("""Which intermediary is connected to most entities?""")

result

b4994a61940266dd40e9026506a11938.png

generated answer

We can observe the Cypher statements generated and the information retrieved from Neo4j used to form the answer. This is a very simple setup. Let's move on to the next example.

chain.run("""Who are the officers of ZZZ-MILI COMPANY LTD.?""")

result

814a634852a4a3a25a53c7593b240082.png

generated answer

Now that we're working with graphs, let's frame a problem that takes full advantage of graph databases.

chain.run("""How are entities SOUTHWEST LAND DEVELOPMENT LTD. and Dragon Capital Markets Limited related?""")

At first glance, the generated Cypher statement seems fine. The problem, however, is that it uses variable-length path syntax and treats relationships as undirected. As a result, this type of query is poorly optimized and can cause an explosion in row counts.
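To get a feel for why undirected variable-length matching blows up, here is a small pure-Python sketch on a toy graph. Everything in it is hypothetical (the graph, the function names, the sizes), and it counts node-unique paths, which is a simplification of how Cypher actually matches variable-length patterns. It only illustrates the gap between "every path" and "one shortest path".

```python
from collections import deque
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]


def layered_graph(layers: int, width: int) -> Graph:
    """Toy undirected graph: start -> fully connected layers -> end."""
    adj: Graph = {}

    def link(a: str, b: str) -> None:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    previous = ["start"]
    for layer in range(layers):
        current = [f"n{layer}_{i}" for i in range(width)]
        for a in previous:
            for b in current:
                link(a, b)
        previous = current
    for a in previous:
        link(a, "end")
    return adj


def count_paths(adj: Graph, src: str, dst: str, max_hops: int) -> int:
    """Count all simple paths of at most max_hops, ignoring direction."""

    def walk(node: str, visited: frozenset, hops: int) -> int:
        if node == dst:
            return 1
        if hops == max_hops:
            return 0
        return sum(
            walk(nxt, visited | {nxt}, hops + 1)
            for nxt in adj[node]
            if nxt not in visited
        )

    return walk(src, frozenset([src]), 0)


def shortest_path(adj: Graph, src: str, dst: str) -> List[str]:
    """Breadth-first search: returns one shortest path, or [] if none."""
    parents: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            cursor: Optional[str] = node
            while cursor is not None:
                path.append(cursor)
                cursor = parents[cursor]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return []


toy = layered_graph(layers=3, width=4)
all_paths = count_paths(toy, "start", "end", max_hops=4)
one_path = shortest_path(toy, "start", "end")
print(all_paths, "candidate paths vs. one shortest path of", len(one_path) - 1, "hops")
```

Even in this tiny graph, the number of candidate paths grows multiplicatively with each layer, while a shortest-path search stops at the first path it finds.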

The nice thing about gpt-3.5-turbo is its ability to act on the prompts and instructions we provide in the input. For example, we can ask it to only find the shortest paths.

chain.run("""How are entities SOUTHWEST LAND DEVELOPMENT LTD. and Dragon Capital Markets Limited connected? Find a shortest path.""")

072a38cdbf36556b39e12b90c125a1ef.png

generated answer

Now that we specify that only the shortest path should be retrieved, we no longer run into the cardinality explosion problem. However, I noticed that the LLM sometimes does not give the best results when a path object is returned. The generated Cypher statement returns the following visualization in the Neo4j Browser.

eaff8d384d5b00a3ddc3ae88326c33ec.png

graphic visualization

The generated natural language response doesn't really mention that the two companies are registered at the same address. Instead, the model made up its own shortest path based on the node properties. However, we can also fix this by instructing the model which information to use.

chain.run("""How are entities SOUTHWEST LAND DEVELOPMENT LTD. and Dragon Capital Markets Limited connected? Find a shortest path. Return only name properties of nodes and relationship types""")

result

b78140e354235467ec4785618e720d1d.png

generated answer

Now we get a better and more appropriate response. The more hints you give the LLM, the better results you can expect. For example, you can also indicate which relationships it can traverse.
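Since the refinement so far has just been appending instructions to the same question, one way to keep this tidy is a small helper that joins a base question with a list of hints. The `with_hints` function below is hypothetical and not part of LangChain. It only builds the string that you would then pass to the chain.

```python
from typing import Iterable


def with_hints(question: str, hints: Iterable[str]) -> str:
    """Join a question and extra instructions into one prompt string."""
    parts = [question.strip()] + [h.strip() for h in hints if h.strip()]
    # Make sure every part ends with punctuation so the sentences stay separate.
    return " ".join(p if p.endswith((".", "?")) else p + "." for p in parts)


prompt = with_hints(
    "How are entities SOUTHWEST LAND DEVELOPMENT LTD. and "
    "Dragon Capital Markets Limited connected?",
    [
        "Find a shortest path",
        "Return only name properties of nodes and relationship types",
    ],
)
print(prompt)
```

The resulting string can be passed to the chain in the same way as the hand-written questions in this post.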

chain.run("""How are entities SOUTHWEST LAND DEVELOPMENT LTD. and Dragon Capital Markets Limited connected? Find a shortest path and use only officer, intermediary, and connected relationships. Return only name properties of nodes and relationship types""")

result

9006df48892f1502d6fd1756085cd505.png

generated answer

You can see that the generated Cypher statement only allows traversal of the OFFICER_OF, INTERMEDIARY_OF, and CONNECTED_TO relationships. The same Cypher statement produces the following graph visualization.

5c21b978f35b1c31ae3195825596339f.png

graphic visualization
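For readers who cannot see the screenshots, a statement of roughly this shape would express the same constraints. This is my own sketch, not the exact output of the model: the `shortest_path_query` helper is hypothetical, and it assumes both endpoints carry the Entity label and a `name` property.

```python
from typing import List


def shortest_path_query(rel_types: List[str]) -> str:
    """Build a shortest-path Cypher statement limited to given relationship types."""
    # The types are inlined into the query text, so validate them first.
    for rel in rel_types:
        if not rel or not rel.replace("_", "").isalnum():
            raise ValueError(f"Unexpected relationship type: {rel!r}")
    rel_filter = "|".join(rel_types)
    return (
        "MATCH (a:Entity {name: $source}), (b:Entity {name: $target})\n"
        f"MATCH p = shortestPath((a)-[:{rel_filter}*]-(b))\n"
        "RETURN [n IN nodes(p) | n.name] AS names,\n"
        "       [r IN relationships(p) | type(r)] AS rel_types"
    )


query = shortest_path_query(["OFFICER_OF", "INTERMEDIARY_OF", "CONNECTED_TO"])
params = {
    "source": "SOUTHWEST LAND DEVELOPMENT LTD.",
    "target": "Dragon Capital Markets Limited",
}
print(query)
```

Returning only node names and relationship types mirrors the instruction given to the model earlier, and passing the entity names as query parameters keeps them out of the query text.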

Summary

Graph databases are great tools for retrieving or analyzing connections between various entities such as people and organizations. In this blog post, we looked at a simple shortest path use case where the number of relationships and the order of relationship types are not known in advance. These types of queries are nearly impossible in vector databases, and can be very complex in SQL databases as well.

I'm very excited about the addition of Cypher Search to the LangChain library. Please test it out and let me know how it works for you, especially if you test it with other LLMs or have exciting use cases. Also, stay tuned, as I have prepared some more blog posts that will explore the Cypher Search functionality in the LangChain library.



Origin blog.csdn.net/TgqDT3gGaMdkHasLZv/article/details/131496282