What are the challenges of graph databases?

No matter which database you consider, if performance is not its headline strength, why bother studying it in depth? This is especially true for graph databases, because the core challenge they solve is the exponential performance degradation, and the corresponding growth in query time, that traditional databases suffer when handling deep relationships within the data.

We won’t go into detail here about why graph databases are needed. In short, their emergence, development, and continuous iteration are an inevitable part of enterprise IT becoming intelligent, and of the inevitable progression from SQL to NoSQL to graph and deep-data processing.

However, not all graph databases are alike; so far we have seen several distinct camps:

If we distinguish them along a four-quadrant (magic quadrant) chart, they are distributed across the quadrants as shown in the figure above. Clearly, the first quadrant highlights two characteristics:

1. High performance;

2. Native graph support.

Some vendors dismiss the native graph as a gimmick, which only shows that their developers are vague about the concept. Native graph implies native graph storage, i.e., support from a high-performance storage engine, which can also be described as an index-free adjacency storage architecture in which every vertex reaches its neighbors directly. Just imagine: if a graph computing engine still relies on a relational database or another NoSQL database (as JanusGraph does) for its underlying storage, can you really expect it to deliver high concurrency and low latency?
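To make index-free adjacency concrete, here is a minimal sketch in Python. It assumes a plain in-memory adjacency-list layout; the `Vertex` and `Graph` names are illustrative only, not any vendor's actual storage format or API:

```python
# A minimal sketch of index-free adjacency: each vertex holds direct
# references to its neighbors, so traversal is pure pointer chasing.

class Vertex:
    def __init__(self, vid):
        self.vid = vid
        self.out_edges = []   # direct references to neighbor Vertex objects

class Graph:
    def __init__(self):
        self.vertices = {}    # id -> Vertex; used only for entry-point lookup

    def add_vertex(self, vid):
        if vid not in self.vertices:
            self.vertices[vid] = Vertex(vid)
        return self.vertices[vid]

    def add_edge(self, src, dst):
        # Store a direct pointer to the neighbor: later traversal needs
        # no index scan or join, just an O(1) hop per neighbor.
        self.add_vertex(src).out_edges.append(self.add_vertex(dst))

g = Graph()
g.add_edge("a", "b")
g.add_edge("a", "c")
# Expanding a vertex's neighborhood touches no index at all:
print([n.vid for n in g.vertices["a"].out_edges])  # ['b', 'c']
```

The point is that no B-tree or external index sits between a vertex and its edges; contrast this with a table-backed design, where every hop is another index lookup or join.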

Back to performance. We should be clear that many systems claim millisecond-level response, which sounds fast. But the author would point out that modern CPUs work at the nanosecond level, and one millisecond is already a million nanoseconds' worth of operations. Moreover, the millisecond response some graph database systems claim refers to the most basic operations (such as looking up vertices and edges, or at most a 1-hop path or K-neighbor operation). Shouldn't operations this simple respond at the microsecond level?
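To put these units in perspective, here is a small, self-contained timing sketch (ordinary Python on commodity hardware, not a benchmark of any particular database):

```python
# 1 ms = 1,000 µs = 1,000,000 ns. A single in-memory hash lookup
# typically completes in well under a microsecond, which is why a
# "millisecond" basic vertex/edge lookup should raise eyebrows.
import time

d = {i: i for i in range(1_000_000)}   # stand-in for an in-memory index

N = 100_000
start = time.perf_counter_ns()
for i in range(N):
    _ = d[i]                           # one O(1) in-memory lookup
elapsed_ns = time.perf_counter_ns() - start

print(f"avg lookup: {elapsed_ns / N:.0f} ns "
      f"({elapsed_ns / N / 1000:.3f} µs)")
# On typical hardware this prints tens of nanoseconds per lookup,
# i.e. several orders of magnitude faster than a 1 ms response.
```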

Let’s look at a performance comparison between the Ultipa real-time graph database and other systems, measured in a production environment on a real financial scenario:

For example, 1,000 vertex and edge queries:

Note that the figure above shows 1,000 query operations on vertices and edges. Ultipa's average query time is 110-140 microseconds (~0.1 millisecond), roughly 10 times faster than Neo4j.

The figure above compares Ultipa, Neo4j, and ArangoDB on 1-, 3-, 5-, and 10-hop K-neighbor and path queries, with and without filter conditions. Note that on the right side of the figure, for a path query of depth 5, Ultipa's average latency is 2 ms, while the other systems are thousands or even ten thousand times slower.

The test datasets in the two screenshots above are Alimama's public dataset (100 million edges) and a national business-registration dataset (300 million edges, with attributes).

The following figure shows test results on the Twitter dataset, which every graph database company is likely to encounter:

A characteristic of the anonymized Twitter dataset is its typical supernodes: 42M vertices, ~1.5 billion edges, and some vertices with an out-degree above 1 million (the so-called KOL "big V" accounts). The Twitter dataset is not only unevenly distributed in topology; the overall connectivity of the data is also very high. However, it carries no vertex or edge attributes, so a benchmark on Twitter only measures brute-force computing power and remains far from the needs of real industrial (e.g., financial) scenarios. To state the gap more precisely: when your system has to better support real business scenarios, for example keeping data and transactions consistent, it may have to sacrifice some performance. This is why many systems become slower and slower as their core code grows more bloated; systems that can truly sustain ultimate performance are very few.

Regarding high-concurrency systems, there is another point many people misunderstand: they assume any distributed system is automatically highly concurrent. That understanding is quite inaccurate. High concurrency has two main dimensions:

  1. High concurrency on the server side;
  2. High concurrency on the client side.

How should we understand these two points? One is how many users your system can support accessing the graph database at the same time (the client side); the other is how you achieve high concurrency once each request reaches the graph database server, and this is the trickiest part. Take a 5-hop computation request as an example: a highly concurrent server will dynamically match parallel computing resources to the characteristics of the data and the request, using BFS (breadth-first search) over an index-free adjacency data structure in which each atomic operation has O(1) complexity. Starting from the original vertex, it recursively expands 5 hops and returns the deduplicated set of all vertices whose shortest-path distance from the start is exactly 5. If you observe CPU utilization in real time during this process, it may reach the physical limit (if no cap is placed on the degree of parallelism). On the Twitter dataset this is easy to see: a single request, say a 6-hop computation from vertex 12, can push a single 16-core (32-hyper-thread) CPU to 3200%...

And here is the problem: apart from Ultipa, we have not found any other system with this kind of high-density concurrency. This capability is not the same as having 32 instances in a distributed cluster, each running a single thread, jointly completing the 6-hop operation: 32 machines with 1 thread each are far less efficient (exponentially so) than 32 threads on 1 machine. Because the data in a graph database is highly interconnected, if you crudely partition the graph, your performance on deep traversal queries is bound to be extremely poor.
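To make the traversal concrete, here is a minimal single-threaded sketch of such a K-hop query with deduplication, using a plain dict-based adjacency list. A production engine would additionally split the frontier loop across many threads, which is exactly where the high-density concurrency described above comes from:

```python
# A minimal K-hop (K-neighbor) query over an adjacency list.
# BFS expands level by level, so the first time a vertex is seen
# is along a shortest path from the start vertex.

def k_hop(adj, start, k):
    """Return all vertex ids whose shortest-path distance
    from `start` is exactly k."""
    visited = {start}          # dedup: every vertex is expanded at most once
    frontier = {start}
    for _ in range(k):
        next_frontier = set()
        for v in frontier:     # engines split this loop across workers
            for n in adj.get(v, ()):   # O(1) per neighbor, no index scan
                if n not in visited:
                    visited.add(n)
                    next_frontier.add(n)
        frontier = next_frontier
    return frontier

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"]}
print(k_hop(adj, "a", 2))   # {'d', 'e'}
```

Note that the `visited` set is shared state across the whole traversal, which is why naively scattering the graph across machines hurts: every hop across a partition boundary turns a pointer dereference into a network round trip.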

Only after achieving high concurrency on a single instance does it make sense to iterate toward the ultra-high concurrency of a multi-instance distributed system. What many Internet vendors like most is to claim their systems are horizontally distributed; however, as of today, those systems have neither high-density concurrency nor low latency. The essence of the problem is that underneath every high-performance system lies a great deal of work on storage and computing (as well as the network layer), with extensive performance optimization and functional adaptation to the characteristics of graph datasets. This challenge is not at all the same thing as having built flash-sale ordering systems for a few years; it requires reworking the storage and computing logic of traditional databases from the bottom up. SQL is storage-oriented, while graph databases are computing-oriented; confusing the two will lead to many detours.

There is much more about graph databases worth discussing, and I hope we can keep exchanging ideas.


Source: blog.csdn.net/Ultipa/article/details/132335141