Graph database knowledge point 8: Why are the graph databases you encounter unreliable?

Let's state the thesis up front and stick to it throughout.

The question we want to discuss today is: why are the graph databases you (and I, and everyone else) have encountered so unreliable, and not as powerful as advertised?

This problem has two levels:

  1. The expectation gap
  2. You get exactly what you asked for ("seek benevolence and you shall receive benevolence")

Regarding the expectation gap: you want a graph database that is "better, faster, and cheaper" all at once - specifically, one that can store unlimited amounts of data, run incredibly fast, be easy to use, and cost next to nothing. I don't need to waste words on this: do you really believe something that wonderful exists?

You might ask: aren't there plenty of free, open-source (traditional, relational) databases, such as MySQL and PostgreSQL, and aren't they quite capable?

Then we need to talk about the history of database development. When did MySQL and PostgreSQL appear? In the mid-1990s, which already feels like a long time ago. But the earliest relational databases appeared in the 1970s, and Oracle's predecessor (Software Development Laboratories) was founded in 1977. It took roughly a quarter century of development before usable open-source relational databases existed; before that, commercial products were the only option.

On the other hand, any new technology, especially a disruptive one, needs healthy profit margins in its early stage for the effort to be worthwhile (leading, genuinely advanced technology has to command a premium - otherwise who would bother to innovate?). Otherwise the market collapses into a brutal race to the bottom where, in the end, nobody makes money. That is precisely the state of China's technology industry today: very little real innovation, everyone doing custom development on top of free open source. Guess who makes the money in the end, and who owns the core technology?

If a new technology truly had core value and could create commercial value, its creator would have commercialized it and sold it long ago instead of building an open-source community. Think about it: if you don't know how to commercialize something yourself, can an open-source community commercialize it for you? That is a joke, a big one - it runs against human nature. Especially in low-level, hard-core, deep technology, anyone who takes the pure open-source route falls into one of two categories (or both): 1. it is a scam; 2. it carries huge hidden risks.

Okay, you will retort: Linux (and Linus) is enormously successful - so where are the hidden risks?

Haha, Linus himself is the biggest hidden risk. Can you believe that the kernel of the world's largest open-source project still passes through Linus's personal review? Isn't he the biggest bottleneck? When he is gone, how will the Linux project move forward?

Or put it another way: MongoDB is the most successful NoSQL database - so what is its problem?

Its problem is that it is still running at a huge loss today. Isn't that a hidden risk? So-called "Internet thinking" is, at its core, a business scam.

You get exactly what you asked for: this hardly needs explanation - there is no such thing as simultaneously faster, easier, and cheaper. It is the same logic as dreaming every day that a pie will fall from the sky; the odds are about the same as an asteroid landing on your head. If, as the buyer (Party A), all you think about is getting the vendor (Party B) to work for free - squeezing and exploiting them, with no sense of social responsibility - then in that kind of business environment there is no room for innovation.

Now let's return to the question itself: why does the graph database you encountered fall short?

Let's analyze it along two dimensions: product and technology.

Look carefully at the characteristics of the products on the market today that call themselves graph databases - open source or not, it makes no difference. Here are a few typical product traits:

  • Query language: both Cypher and Gremlin are supported. From this fact alone you can tell the system is neither fish nor fowl - multi-query-language support is even touted as a "first" for Chinese graph databases, and several vendors claim the capability. It sounds wonderfully compatible, but you can read it another way: the underlying engine has no character of its own. Precisely because there is no self-developed core, it hardly matters which query languages are bolted on top - anyone can do that. In other words, the performance of such a database is bound to be poor.
  • How to have the fish and the bear's paw at once: massive data storage and real-time deep traversal. This is genuinely a world-class problem, yet the Chinese "graph database" vendors claim to have "easily solved" it. Recall the GPU+CPU supercomputing platform that NVidia's Jensen Huang unveiled a few weeks ago. Its cascaded architecture is essentially an HPC supercomputing architecture: it stacks terabytes of RAM (yes, more memory than the "massive" external storage of most companies) and combines GPU and CPU concurrency (GPU concurrency is easily 10x that of a CPU, though the two have different strengths - the GPU is more parallel but cannot handle the general-purpose operations the CPU excels at). Strictly speaking, Huang's architecture is, in spirit, the mainframe of half a century ago. Do you really believe NVidia does not understand open-source big-data distributed frameworks, or has never studied the problem? Meanwhile, essentially all of the so-called massively distributed storage architectures from graph database vendors can store but cannot compute - and by compute I mean relationship-oriented computation and analytics. "Can store but cannot compute" means distributed storage paired with centralized computing, and there is a huge trap hidden in it: between the stored data and the computation there has to be a data-migration or data-mapping step, somewhat like Apache Spark's ETL. That is, once the data has been persisted to disk, running a graph query or graph algorithm requires first migrating or mapping the data somewhere else (typically a server with a lot of memory), and depending on the data volume that step can be painfully slow. The funniest part is that on the surface the data has been "loaded into the graph" (persisted), but at computation time everything degenerates into centralized, single-machine processing - the architecture cannot compute where it stores at all. We could argue about this for three days and nights, but the summary is simple: every distributed-storage-plus-centralized-computing design has severe scenario limitations - it cannot serve real-time computing requests and is only fit for certain batch-processing workloads. And back to the whole point of a graph database: isn't it supposed to accelerate the batch workloads of data warehouses and relational databases into real time? So we end up back at batch processing? That is why this kind of architecture will inevitably disappoint you. (A minimal sketch of this migrate-then-compute pattern follows right after this list.)
  • Exact computation or sampled computation? This is a broad question, and in graph computing both modes have their place. Some graph algorithms were designed precisely to solve, through estimation, approximation, and sampling, problems that could never be computed exactly before. Take the global average (shortest-path) distance: computing it exactly on a large graph (tens of millions to billions of vertices and edges) is prohibitively expensive - you can safely assume existing hardware and software cannot do it exhaustively. Hence HyperANF, whose core is the ANF algorithm (Approximate Neighbourhood Function): it estimates the average distance of an arbitrarily large graph quickly through sampling, and it is genuinely practical. On the other hand, some graph queries must be exact and exhaustive, such as K-hop neighbor queries. Across finance, healthcare, supply chains and many other scenarios, you must determine the full set of neighbors within 3 hops, 6 hops, or deeper of a vertex, and the count must be exact. Yet some domestic vendors fall back on sampling precisely because they cannot compute exactly (insufficient computing power, flawed data structures, flawed architecture). What is the point of sampling there? It puts the cart before the horse. (See the second sketch after this list for the contrast between an exact K-hop count and a sampling-based estimate.)

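To make the "distributed storage, centralized computing" trap in the second bullet concrete, here is a minimal, purely illustrative Python sketch. Everything in it is hypothetical - `storage_nodes` and `fetch_partition` stand in for whatever partition-reading interface a store-only cluster might expose and are not any vendor's real API - but it shows where the time goes: the data has to be migrated and reassembled on one machine before a single hop of the traversal can run.

```python
import time

def khop_on_store_only_cluster(storage_nodes, start_vertex, k):
    """K-hop query against a cluster that can only *store* the graph.

    Step 1 - the hidden cost: pull every partition's edges over the network
    and rebuild the full adjacency structure on one analysis server. This
    migration step scales with total data volume, not with the size of the
    answer, which is exactly why "real-time" requests stall here.
    """
    t0 = time.time()
    adjacency = {}
    for node in storage_nodes:                   # network-bound migration
        for src, dst in node.fetch_partition():  # hypothetical storage API
            adjacency.setdefault(src, set()).add(dst)
            adjacency.setdefault(dst, set()).add(src)
    print(f"data migration took {time.time() - t0:.1f}s")

    # Step 2: only now does the actual graph traversal begin, and it runs
    # centralized on this single machine - the cluster contributed nothing
    # to the computation itself.
    visited, frontier = {start_vertex}, {start_vertex}
    for _ in range(k):
        frontier = {nbr for v in frontier
                    for nbr in adjacency.get(v, ()) if nbr not in visited}
        visited |= frontier
    return visited - {start_vertex}
```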
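And to make the third bullet's contrast concrete, here is a minimal sketch, again in plain Python with the graph held as a dict of adjacency sets (no product API assumed): an exact K-hop neighborhood that must enumerate everything, next to a sampling-based estimate of the average shortest-path distance, the kind of metric where approximation is legitimate. The estimator below is deliberately far cruder than ANF/HyperANF (which use probabilistic counters); it only illustrates why sampling makes sense for one problem and not the other.

```python
import random
from collections import deque

def khop_neighbors_exact(adj, v, k):
    """Exact K-hop neighborhood via plain BFS: every neighbor within k hops
    is enumerated, which is what regulatory / risk-control scenarios demand."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        cur = queue.popleft()
        if dist[cur] == k:          # do not expand beyond k hops
            continue
        for nxt in adj.get(cur, ()):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return {u for u, d in dist.items() if 1 <= d <= k}

def avg_distance_sampled(adj, samples=100, seed=42):
    """Estimate the average shortest-path distance by running BFS from a
    random sample of source vertices instead of from every vertex."""
    rng = random.Random(seed)
    vertices = list(adj)
    total, pairs = 0, 0
    for src in rng.sample(vertices, min(samples, len(vertices))):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj.get(cur, ()):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(d for d in dist.values() if d > 0)
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```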
Now let's look at the technical side. There are two to three times as many domestic graph database vendors as overseas ones - reportedly around 40 of them today, a number rivaling the large-model vendors. Apparently no new technology trend poses any barrier to entry for Chinese companies, at least by their own account. In practice, these vendors assemble their products from building blocks in a few ways:

  1. Repackage the Neo4j Community Edition wholesale. A certain graph database, for instance - never mind, let's not name names. Let me reiterate: Neo4j has never been an open-source graph database. Not a single line of its core code, such as the graph engine, has been open-sourced! Then why is it called the "Community Edition"? The core source code has never been released - everyone using Neo4j for free has never read or changed a single line of the underlying code. So tell me: how do you have the nerve to claim you have "disrupted" Neo4j? What you probably did was swap the icon and restyle the front-end CSS.
  2. Hack together JanusGraph or ArangoDB. The former is a textbook example of an architecture stitched from open-source components - it can store but cannot compute. The latter is a contorted multi-model database that seems to have been fading over the past two years, and it has obvious logic and implementation errors in certain graph query scenarios. For example, when querying the shortest paths between two vertices, its "accelerated" mode (an incorrect implementation) uses matrix operations and finds only one shortest path - the result of a flawed understanding of BFS (breadth-first search). Let me ask: if there are more than 1,000 shortest paths between two vertices and you return just one, is that acceptable? Think of HD's boss Xu and BGY's boss Yang: if there are more than 1,000 shareholding paths linking the two and you show the regulator only one of them, that takes real nerve. (See the sketch after this list for a BFS that enumerates all shortest paths instead of just one.)
  3. Graph algorithms are one of the core capabilities of a graph database. To evaluate a graph database, look at its graph algorithm support: the richness of the algorithm library, efficiency (speed), resource consumption, scalability, ease of use, the soundness of the documentation, whether algorithms are hot-pluggable, and so on. But nothing can stop the "micro-innovation" of Chinese companies - micro-innovation by appropriation. One vendor has not a single graph-algorithm document on its official website yet claims to support all manner of algorithms. Breathtaking audacity.

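As promised in point 2 above, here is a minimal sketch (plain Python, adjacency as a dict of sets, no reference to any product's API) of what a correct shortest-path query owes you: a BFS that keeps every equal-length predecessor and therefore reconstructs all shortest paths between two vertices, not just one.

```python
from collections import deque, defaultdict

def all_shortest_paths(adj, source, target):
    """Enumerate *every* shortest path between source and target.

    A correct BFS records, for each vertex, all predecessors that reach it
    at the minimal distance; the paths are then rebuilt by walking those
    predecessor lists backwards. Returning only one path, as criticized
    above, means this predecessor bookkeeping was dropped.
    """
    dist = {source: 0}
    preds = defaultdict(list)
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        if cur == target:            # all shorter layers are fully expanded
            break
        for nxt in adj.get(cur, ()):
            if nxt not in dist:                   # first time this vertex is seen
                dist[nxt] = dist[cur] + 1
                preds[nxt].append(cur)
                queue.append(nxt)
            elif dist[nxt] == dist[cur] + 1:      # another equally short way in
                preds[nxt].append(cur)
    if target not in dist:
        return []                                 # unreachable

    paths = []
    def backtrack(v, suffix):
        if v == source:
            paths.append([source] + suffix)
            return
        for p in preds[v]:
            backtrack(p, [v] + suffix)
    backtrack(target, [])
    return paths

# Tiny usage example: a diamond graph has two shortest paths from A to D.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
print(all_shortest_paths(adj, "A", "D"))  # [['A', 'B', 'D'], ['A', 'C', 'D']]
```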
Okay, I have said a lot. Let me borrow a classic exchange from a young Steve Jobs. Asked what determined the character of the products he designed, he thought for a few seconds and answered: my taste. That is what I want to leave you with - your taste determines what kind of graph database you end up with.

If anything is still unclear and you need to ask, the most tasteful answers are in this series.

If you don’t find the answer you’re looking for in this series, it’s in our minds.

May the best graph be with you, with love from Ultipa.

Origin blog.csdn.net/Ultipa/article/details/132278750