Ultipa | Learn a little bit about graph databases in one article

This article includes the following content points:
· Main technical classifications of databases
· What is a graph?
· Graph models
· Graph database vs. relational database
· Graph database vs. other NoSQL databases
· Not all graph databases are the same!

According to a Gartner prediction, “By 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision-making across the enterprise.”

The figure is the author's simple classification of the main technical tracks of databases. It can be read together with my previous article [Ying Figure | One article to understand what are the milestones in the development of the database? - Zhihu (zhihu.com)]. Over more than 50 years of development, database technology has shown vigorous vitality, and new databases have repeatedly challenged traditional ones. The main driver is the urgent need, in both industry and academia, for new architectures that are efficient, flexible, and high-dimensional, able to cope with the rapid growth of data volume (Volume), the diversity of data types (Variety), the rapid increase in the speed of data generation (Velocity), and people's growing attention to the value of data (Value). This helps explain why GQL has become the only international database query language standard to follow SQL, which was standardized in the 1980s. It also strongly illustrates the influence and importance of graph database technology for the future.

Figure: Database engine types ranked by data complexity

Table 1: Analysis of 5 types of mainstream database products

Classification	Performance	Scalability	Flexibility	Complexity
Key-value database	High	High	High	None
Document database	High	Variable	High	Low
Column-store database	High	Variable	Moderate	Low
Graph database	Variable	High	High	High
Relational database	Variable	Variable	Low	Moderate

A graph database is a NoSQL database built on graph theory. It stores both the attributes of entities and the relationships between entities, and it offers simple modeling, strong performance, rich query capabilities, and good scalability.

A graph is composed of vertices and the edges that connect pairs of vertices:

Vertex: also called a point or node, and sometimes an entity.

Edge: a connection between two vertices. In the knowledge graph field it is also often called a relationship (relation).

For example, when you look up the relationship between Leonardo da Vinci and the Louvre, you can link people and objects into a very simple graph. The "six degrees of separation" theory is described with graphs of the same kind.

[Renaissance - (representative figure) - Leonardo da Vinci - (representative work) - Mona Lisa - (collection) - Louvre Museum - glass pyramid entrance - (architect) - I.M. Pei - Francois I - (collection) - Mona Lisa (painting) - (influence) - Renaissance]
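As a rough illustration, the chain above can be written as a list of labeled edges and searched with a breadth-first search. This is only a minimal sketch: the names `edges`, `adjacency`, and `shortest_path` are made up for this example, and the "entrance" label between the Louvre and the glass pyramid is an assumption, since the chain gives no label for that link.

```python
from collections import deque

# Labeled edges taken from the example chain: (source, relationship, target).
# The "entrance" label is an assumption made for this sketch.
edges = [
    ("Renaissance", "representative figure", "Leonardo da Vinci"),
    ("Leonardo da Vinci", "representative work", "Mona Lisa"),
    ("Mona Lisa", "collection", "Louvre Museum"),
    ("Louvre Museum", "entrance", "Glass Pyramid"),
    ("Glass Pyramid", "architect", "I.M. Pei"),
    ("Francois I", "collection", "Mona Lisa"),
]

# Build an undirected adjacency list so edges can be followed both ways.
adjacency = {}
for source, _label, target in edges:
    adjacency.setdefault(source, set()).add(target)
    adjacency.setdefault(target, set()).add(source)


def shortest_path(start, goal):
    """Breadth-first search; returns a list of vertices, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(adjacency.get(path[-1], ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path("Leonardo da Vinci", "I.M. Pei")
print(" - ".join(path))
```

The search only ever looks at the neighbors of the vertex it is standing on, which is the basic operation a graph database is built around.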

Another example is the subway, which many of us use to commute almost every day. If each station is treated as a vertex and every two adjacent stations are connected by an edge, the subway map becomes a typical graph.

By connecting node to node, we can extend a graph as far as our thinking goes, and model real-world attributes and relationships directly as graph data [for further reading, see the library | What is a graph?].

There are three main graph models: the property graph, the hypergraph, and the triple. The distinction matters because graph data must be stored in a specific graph database before it is finally written to data files, and that process depends on which model is used to represent the graph data. Ultipa Graph, like Neo4j, uses the property graph model, which is easier to understand and can describe most graph use cases.
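To make the difference between a property graph and triples more concrete, here is a small sketch. The `Node` and `Edge` classes and the `to_triples` helper are hypothetical names invented for this illustration; they are not the data model of Ultipa Graph, Neo4j, or any other product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)


# A tiny property graph: both vertices and edges carry key-value properties.
nodes = [
    Node("n1", "Person", {"name": "Leonardo da Vinci"}),
    Node("n2", "Painting", {"name": "Mona Lisa"}),
]
edges = [Edge("n1", "n2", "PAINTED", {"role": "artist"})]


def to_triples(nodes, edges):
    """Flatten the property graph into (subject, predicate, object) triples.

    Edge properties are dropped here: a plain triple has no place to attach
    them, which is one reason property graphs are often easier to work with.
    """
    triples = []
    for node in nodes:
        triples.append((node.id, "type", node.label))
        for key, value in node.properties.items():
            triples.append((node.id, key, value))
    for edge in edges:
        triples.append((edge.source, edge.label, edge.target))
    return triples


triples = to_triples(nodes, edges)
for triple in triples:
    print(triple)
```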

Why are the advantages of graph databases becoming more and more prominent? In a traditional relational database, once a query joins multiple tables, the amount of computation can grow in proportion to the Cartesian product of the table sizes. The more data and the more joined tables, the more complex the query and the lower its efficiency, because each join uses foreign keys to search for matching primary key records in the referenced table. A many-to-many relationship additionally requires an intermediate table that stores the foreign key pairs of the two participating tables, which further increases the cost of the join.
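The join cost described here can be seen in a small sketch using Python's built-in `sqlite3` module. The `person` and `knows` tables and their rows are made up for illustration; `knows` plays the role of the intermediate table for a many-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
-- Intermediate table holding the foreign key pairs of the relationship.
CREATE TABLE knows (
    person_id INTEGER REFERENCES person(id),
    friend_id INTEGER REFERENCES person(id)
);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid'), (4, 'Dee');
INSERT INTO knows VALUES (1, 2), (2, 3), (3, 4);
""")

# Friends of friends: every extra hop adds another join on the
# intermediate table, plus a join back to person for the names.
two_hop_sql = """
SELECT DISTINCT p2.name
FROM person AS p0
JOIN knows AS k1 ON k1.person_id = p0.id
JOIN knows AS k2 ON k2.person_id = k1.friend_id
JOIN person AS p2 ON p2.id = k2.friend_id
WHERE p0.name = ?
"""
two_hop = [row[0] for row in conn.execute(two_hop_sql, ("Ann",))]
print(two_hop)
```

A three-hop query would need a third join on `knows`, a four-hop query a fourth, and so on, which is where the cost starts to add up on large tables.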

A graph database is very flexible. It expresses the relationships in the data concisely through vertices and edges, and it computes queries by following nearest-neighbor associations, which has low computational complexity and can be exponentially more efficient on deep queries. See the figure below.

Figure: Architectural differences between graph databases and relational databases

For example, if a relational database and a graph database both run a deep traversal from layer 2 to layer 5 (that is, 2 to 5 hops), the performance gap between them grows exponentially. At layer 1 there may be no essential difference between the two. From layer 2 onward the gap changes exponentially (by more than 10 times), and at deeper layers the relational database may no longer return any result at all: the query has exceeded what the machine can compute and has stopped. (Interested readers can read in detail: The difference between graph databases and relational databases⁴.)

Judging by current market share, relational databases are still the mainstream, but that position was established at a time when alternatives were lacking. As more and more scenarios appear that relational databases cannot handle, the inherent advantages of graph databases will become their means of overtaking.
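A multi-layer (multi-hop) traversal of this kind can be sketched as a layer-by-layer breadth-first expansion over an adjacency list. The function name `k_hop_neighbors` and the sample graph are assumptions made for this illustration; a real graph database adds storage, indexing, and query optimization on top of the same idea.

```python
def k_hop_neighbors(adjacency, start, max_hops):
    """Return {hop: set of vertices first reached at exactly that hop}."""
    visited = {start}
    frontier = {start}
    layers = {}
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for vertex in frontier:
            # Only the direct neighbors of the current layer are examined.
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        if not next_frontier:
            break
        layers[hop] = next_frontier
        frontier = next_frontier
    return layers


# A small directed sample graph stored as an adjacency list.
adjacency = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": [],
    "F": [],
}

layers = k_hop_neighbors(adjacency, "A", 3)
for hop, vertices in layers.items():
    print(hop, sorted(vertices))
```

Each additional layer only costs as much as the edges leaving the previous layer, rather than another join over whole tables.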

Table 2: Comparison of mainstream graph databases

	Neo4j	JanusGraph	Ultipa Graph
Reputation	Highest	High	Moderate
Open source ecosystem	The community edition is open source but has more restrictions; the commercial edition is closed source	Open source; compatible with the Apache TinkerPop ecosystem; cloud services mainly provided by AWS and IBM	Closed source; cloud services mainly provided by Ultipa Cloud
Graph query language	Cypher	Gremlin	UQL
Supported data scale	Community edition: billion scale; enterprise edition: over 100-billion scale	Ten-billion scale or above	Over 100-billion scale
Large-scale data writing performance	Slow online import	Slower	Fast online import
Large-scale data query performance	Fast and relatively stable	Relatively fast	Fast and very stable
Functional completeness	Complete	Complete	Complete
Data import tool	Supports online CSV import; supports a rich set of formats	Not provided	Ultipa Transporter runs on all platforms, supports a variety of formats, and provides import from sources such as TSV, CSV, MySQL, and BigQuery, as well as CSV export
Visual interface	Supported; feature rich; supports visual data modeling, import, analysis, etc.	Not supported; users need to integrate a third-party interface	Supported; feature rich; supports 2D and 3D switching; supports visual data modeling, import, analysis, etc.
Built-in common graph algorithms	Provides an installable algorithm package with a wealth of basic graph algorithms	Not supported	Provides installable algorithm packages and a rich algorithm library, which can be delivered to users as independent algorithm packages
Basic functions (create, delete, query, and update on property graphs, plan maintenance, metadata, transactions, caching, query optimization, incremental graph updates, etc.)	Supported	Supported	Supported
ACID transactions	Supported	Partially supported, depending on the backend storage	Supported
Schema constraints	Supported in the commercial edition; also supports schema-free	Supported; also supports schema-free	Supported; also supports schema-free
Graph storage type	Supports local storage, distributed storage, and cloud managed storage	Supports local storage and distributed storage
Graph partitioning	Supported	Supported	Supported
High availability (HA)	Supported in the commercial edition	Not provided	Supported

As noted above, NoSQL databases have become popular because they can handle the challenges of diverse data types, large-scale data sets, and so on. But how do they compare with one another? Here we briefly look at key-value and document databases.

Document stores are hierarchical, so data can easily be stored as a tree. For the same reason, they can only express top-down, subordinate relationships, whereas in a graph database a tree is just one possible shape and far richer structures can be expressed. In addition, a tree-shaped store embeds redundant copies of the same data in multiple places, which makes updates harder and makes data consistency difficult to guarantee.
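The redundancy problem can be shown with plain Python dictionaries. This is a toy sketch, not the API of any document or graph database; the documents, vertex ids, and the `WRITTEN_BY` edge label are invented for the example.

```python
import copy

# Document style: the same author is embedded (copied) into every post.
author = {"name": "Ann", "city": "Paris"}
documents = [
    {"title": "Post 1", "author": copy.deepcopy(author)},
    {"title": "Post 2", "author": copy.deepcopy(author)},
]

# Updating only one embedded copy leaves the data inconsistent.
documents[0]["author"]["city"] = "Lyon"
document_cities = {doc["author"]["city"] for doc in documents}
print(sorted(document_cities))

# Graph style: the author is a single vertex that both posts point to.
vertices = {
    "ann": {"name": "Ann", "city": "Paris"},
    "post1": {"title": "Post 1"},
    "post2": {"title": "Post 2"},
}
edges = [("post1", "WRITTEN_BY", "ann"), ("post2", "WRITTEN_BY", "ann")]

# One update is seen through every edge that reaches the vertex.
vertices["ann"]["city"] = "Lyon"
graph_cities = {vertices[target]["city"] for _source, _label, target in edges}
print(sorted(graph_cities))
```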

Key-value databases are better suited to applications with few relationships in the data, because data is organized, indexed, and stored as key-value pairs. In that case they effectively reduce the number of disk reads and writes and deliver high performance. Conversely, once the relationships become numerous, a graph clearly expresses the complex relationships in the data better.

Finally, it is important to note that not all graph databases are created equal! Some graph databases only provide storage and lack computing capability, while others can compute but are very inefficient at data migration. Some are implemented on NoSQL or MapReduce architectures without deep optimization for the characteristics of graph computing, with the end result that the more they scale out horizontally, the lower their efficiency. Some vendors blindly load all data into memory, which causes memory usage to surge and leads to frequent OOM (out-of-memory) errors and the resulting downtime. The right implementation path is "distribution + integrated storage and computing + multi-level storage optimization + deep optimization of graph queries". Graph databases involve many knowledge points and challenges. On how to design and implement a truly high-performance distributed graph database, interested readers can refer to How to implement a high-concurrency graph database system?³ [Text/Nezha Emma]
