HBase Basics | Introduction to HGraphDB

One, HGraphDB overview

Graphs are everywhere. Social networks and e-commerce generate large volumes of entity-relationship data every day, and this data is usually described with property graphs (also called attribute graphs): vertices and edges, each carrying rich properties. As of 2018, social network and e-commerce data can form very large entity graphs, with billions of vertices and tens of billions of edges. Traditional relational databases struggle with data at this scale.

image


Graph databases emerged because the relational model is not expressive enough for heavily connected data. The SQL becomes very complex and often requires cascading queries across multiple tables, whereas a graph data structure maps more closely to the real world.


HGraphDB stores its data in HBase, which makes horizontal scaling straightforward. It is implemented on top of TinkerPop 3, so it integrates with the full TinkerPop 3 software stack and supports the Gremlin language. HGraphDB is also an OLTP graph database: it supports schemas, creating, deleting, and updating vertices and edges, and graph traversal.

image


HGraphDB supports low-level data access: its data layer can be reached through file imports and through real-time external interfaces. Data is persisted in HBase, and real-time recommendations and K-hop queries are implemented as Gremlin graph queries. Through integration with Spark, the data in HBase can also be analyzed directly to run common graph computations such as PageRank and connected subgraphs.

image


In the full HGraphDB stack, because the product leans toward OLTP, the core is Gremlin-server. It contains HGraphDB's core driver layer, which implements the underlying data model and handles data access to HBase. Because many users arrive with existing data, Graph-loader parses a number of file formats and imports the data into HBase in batches. Gremlin-server also provides HTTP and WebSocket services, so clients can access the graph database through the SDK, HTTP, or WebSocket. Graph traversal and computation can also be run from the command line.

image



Two, in-depth analysis of HGraphDB

Data model: global ordered index table
Because HGraphDB provides a property graph, its data consists of vertices, edges, and their properties. HBase is a key-value table store, so persisting the property sets of vertices and edges is straightforward. One-hop and two-hop queries over the graph are harder. HGraphDB solves this with a global ordered index table: the index serves as a secondary index that locates the relevant vertices and edges, and their full property sets are then looked up from it.

image


The following figure shows the actual schema of the underlying tables. The In and Out edges of a vertex are stored next to each other, so all vertices and edges related to a vertex can be retrieved in a single operation. In the Vertex table and the Edge table, the RowKey is the ID, and the properties are stored as columns of a wide row (a wide table).

image
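The idea of a sorted index table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries stand in for HBase tables, and the index row layout `(vertex ID, direction, label, edge ID)` is a hypothetical simplification, not HGraphDB's actual row key encoding. Because rows are kept sorted, as HBase keeps row keys sorted, all In and Out entries of a vertex sit next to each other and one prefix scan finds them.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for the Edge table: RowKey (edge ID) -> properties.
edge_table = {
    "e1": {"from": "v1", "to": "v2", "label": "follows"},
    "e2": {"from": "v3", "to": "v1", "label": "follows"},
    "e3": {"from": "v2", "to": "v3", "label": "follows"},
}

def build_index(edges):
    """Build a sorted index with one OUT row and one IN row per edge."""
    rows = []
    for edge_id, props in edges.items():
        rows.append((props["from"], "OUT", props["label"], edge_id))
        rows.append((props["to"], "IN", props["label"], edge_id))
    rows.sort()  # global ordering: rows sharing a vertex ID become adjacent
    return rows

def incident_edges(index, vertex_id):
    """Prefix scan: seek to the first row of the vertex, read until the prefix changes."""
    start = bisect_left(index, (vertex_id,))
    result = []
    for row in index[start:]:
        if row[0] != vertex_id:
            break
        result.append(row)
    return result

def neighbors(index, edges, vertex_id):
    """One hop: scan the index, then look up each edge by ID to find the other end."""
    found = []
    for _, direction, _, edge_id in incident_edges(index, vertex_id):
        props = edges[edge_id]
        found.append(props["to"] if direction == "OUT" else props["from"])
    return found

index = build_index(edge_table)
one_hop = neighbors(index, edge_table, "v1")
```

A two-hop query repeats `neighbors` for every vertex returned by the first hop, so each hop costs one short scan per vertex rather than a scan over the whole table.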


Gremlin plays a role for graphs similar to that of SQL for relational data. The figure below shows Gremlin's DSL (domain-specific language); TinkerPop includes the Gremlin interpreter.

image


The execution plan of the Gremlin execution engine is shown below. The three steps form a pipeline that is evaluated by lazy iteration, with the bottom step reading from the HBase data tables. Rows are read only when a downstream step asks for them, which keeps execution efficient.

image
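Lazy iteration in a step pipeline can be illustrated with Python generators. This is a sketch, not TinkerPop's implementation: the step names `start`, `out`, and `limit` and the `OUT_EDGES` data are hypothetical, and the `reads` list simply records which vertices would have triggered a storage read.

```python
from itertools import islice

# Hypothetical adjacency data standing in for rows read from HBase.
OUT_EDGES = {
    "v1": ["v2", "v3"],
    "v2": ["v4"],
    "v3": ["v4", "v5"],
}

reads = []  # vertices whose edges were actually fetched

def start(vertex_ids):
    """Step 1: emit the starting vertices one at a time."""
    for vertex_id in vertex_ids:
        yield vertex_id

def out(traversers):
    """Step 2: expand each incoming vertex to its out-neighbours on demand."""
    for vertex_id in traversers:
        reads.append(vertex_id)  # simulated storage read
        for neighbor in OUT_EDGES.get(vertex_id, []):
            yield neighbor

def limit(traversers, n):
    """Step 3: stop pulling from upstream once n results have been produced."""
    return islice(traversers, n)

# Nothing runs until list() starts pulling results through the pipeline.
pipeline = limit(out(start(["v1", "v2", "v3"])), 2)
results = list(pipeline)
```

Because the last step stops after two results, only the edges of `v1` are fetched; `v2` and `v3` are never read. An eager design that materialized each step in full would have read all three.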


Three, HGraphDB graph analysis

The OLTP capability of a graph database is not enough on its own; graph analysis is also needed, and HGraphDB provides it through Spark GraphFrames. GraphFrames handles common analyses such as filter and aggregation computations (count, sum, and so on). At the bottom layer these computations can be converted into GraphX-based graph computations, which makes it possible to run PageRank and shortest paths, and to implement recommendations, minimum spanning trees, connected components, and k-core computations. GraphFrames therefore covers both general summary statistics and the iterative computation needed for testing and machine learning.
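To make one of these analyses concrete, here is a minimal single-machine sketch of connected components. It uses union-find in plain Python rather than GraphFrames or GraphX, so it shows only what the computation produces, not how Spark distributes it; the function name and sample data are illustrative assumptions.

```python
def connected_components(vertices, edges):
    """Group vertices into connected components using union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    # Sort for a deterministic result order.
    return sorted(groups.values(), key=lambda group: sorted(group))

components = connected_components(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
```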


HGraphDB supports GAS decomposition, a programming framework for graph computing. GAS decomposition is similar in spirit to MapReduce. It is a vertex-centric programming model that comes with a computation engine, so you only write the logic you want to run, split into three parts: Gather, Apply, and Scatter. HGraphDB has many partitions, and a vertex can appear in several of them. In the first step, Gather, each partition computes a partial result locally, and the partial results are then combined at the master copy of the vertex. The second step, Apply, updates the value held by the master. Once the value has changed, the Scatter step distributes it to the mirror vertices, updates their values, and notifies adjacent vertices to recompute.

image
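The three phases can be sketched with PageRank as the vertex program. This is a simplified, single-process illustration: partitions are plain edge lists, mirrors are per-partition copies of the vertex values, every vertex recomputes on every round instead of being notified selectively, and the graph is assumed to have no vertex without outgoing edges. The function and variable names are hypothetical and do not come from HGraphDB.

```python
def pagerank_gas(partitions, vertices, damping=0.85, iterations=30):
    """PageRank written as Gather, Apply, Scatter over edge partitions."""
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for edges in partitions:
        for src, _ in edges:
            out_degree[src] += 1

    master = {v: 1.0 / n for v in vertices}       # authoritative values
    mirrors = [dict(master) for _ in partitions]  # one local copy per partition

    for _ in range(iterations):
        # Gather: each partition sums contributions over its local edges,
        # then the partial sums are combined at the master.
        totals = {v: 0.0 for v in vertices}
        for edges, local in zip(partitions, mirrors):
            partial = {}
            for src, dst in edges:
                partial[dst] = partial.get(dst, 0.0) + local[src] / out_degree[src]
            for v, value in partial.items():
                totals[v] += value

        # Apply: the master updates its value from the gathered total.
        for v in vertices:
            master[v] = (1.0 - damping) / n + damping * totals[v]

        # Scatter: push the new master values out to every mirror copy.
        for local in mirrors:
            local.update(master)

    return master

ranks = pagerank_gas(
    partitions=[[("a", "b"), ("b", "c")], [("c", "a"), ("a", "c")]],
    vertices=["a", "b", "c"],
)
```

In this sample graph, `c` receives rank from both `a` and `b`, so it ends up ranked above `b`, and the ranks sum to 1.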


The following figure shows an example of a graph analysis computation using Spark GraphFrames.

image


HGraphDB is now available on Alibaba Cloud. Several technical options were evaluated while building it, including a comparison with the more popular JanusGraph. HGraphDB was chosen as the base because its code base is relatively small and its functionality is clearly scoped, which makes it practical to rewrite and extend. JanusGraph, by contrast, could fail to import data or import it extremely slowly, edge data in particular, and its performance sometimes dropped sharply. HGraphDB supports user-specified IDs, while JanusGraph does not. JanusGraph has to support many underlying table storage systems, so it adds a layer of abstraction and treats HBase as a black box, which hurts performance; HGraphDB is optimized directly for HBase.

image

Four, use scenarios and future work

Main use scenarios
User 360: Because HGraphDB provides a property graph, it is easy to retrieve the subgraph of a user's attributes.
Personalized recommendation: Graph-based personalized recommendation is easy to implement.
Fraud detection: Some vertices can be manually marked as blacklisted. Vertices that are too close to a blacklisted vertex may also be problematic.
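The fraud detection scenario amounts to a bounded breadth-first search from the marked vertices. The sketch below is a plain-Python illustration under assumptions: the adjacency dictionary, the vertex names, and the two-hop threshold are made up for the example, and a production system would run the equivalent K-hop traversal against the graph database instead.

```python
from collections import deque

def within_hops(adjacency, seeds, max_hops):
    """Return {vertex: distance} for every vertex within max_hops of any seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return distance

# Hypothetical undirected graph: "bad" is a manually blacklisted vertex.
adjacency = {
    "bad": ["u1"],
    "u1": ["bad", "u2"],
    "u2": ["u1", "u3"],
    "u3": ["u2"],
}
suspects = within_hops(adjacency, ["bad"], max_hops=2)
```

Here `u1` and `u2` fall within two hops of the blacklisted vertex and would be flagged for review, while `u3` is three hops away and is left out.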


Future directions for improvement
HGraphDB is a standard implementation of Gremlin's DSL, and Gremlin provides many hooks that let implementers optimize the executor themselves. At present HGraphDB transfers a large amount of data and has limited computing power of its own, so full table scans are expensive and there is considerable room to optimize the executor. In addition, operators such as aggregation can be pushed down, and graph analysis capabilities can be enhanced by using Spark as an embedded analysis engine to implement the TinkerPop OLAP specification.


Thoughts: In 2018, social network and e-commerce data can form very large entity graphs. Traditional relational databases struggle with data at this scale, so graph databases are needed. HGraphDB is built on HBase, supports the full TinkerPop 3 software stack and the Gremlin language, and supports OLTP, which makes it a strong tool for graph computing problems.






Origin blog.51cto.com/15060465/2677037