[In Depth] Evolution of Huawei Cloud Graph Database GES Technology

This article is shared from the Huawei Cloud Community post "[In Depth] Evolution of Huawei Cloud Graph Database GES Technology", author: Chenyi.

1 Background

Large-scale graph data is ubiquitous, and graph query, graph analysis, and graph representation learning have become core parts of big data and AI. With the development of knowledge graphs and graph neural networks in particular, graphs are becoming a foundation for future AI.

[Figure: various kinds of graph data]

Looking ahead, graph databases face new challenges in data scale, multidimensional relationships, spatiotemporal dynamics, and heterogeneous computing:

1. Graph data keeps growing, and ultra-large-scale graphs with trillions of edges are becoming common, which places new demands on product performance and scalability.

2. Spatiotemporal graphs and heterogeneous, multi-relational graphs are increasingly common in government affairs, security, finance, knowledge graphs, and other fields, which places new demands on the product's graph data model and storage.

3. The rise of graph representation learning, such as graph neural networks, requires support from new computing frameworks and creates new opportunities to integrate traditional deep learning frameworks with graph computing frameworks.

4. Heterogeneous computing systems built on GPUs, FPGAs, and graph accelerators bring new demands and opportunities for the graph engine.

Graph is an important part of a big data analysis platform, providing more advanced analysis capabilities on top of traditional batch and stream analysis. It is mainly divided into two capabilities, the graph database and the graph computing engine:

  • Graph databases provide graph storage and computing, support transactions, data updates, and query languages, and lean toward transaction processing (TP) scenarios. They are used where real-time requirements are high and the logic is relatively simple. For example: find the shortest path between two merchants; find the transfer path of a card suspected of money laundering.
  • The graph computing engine focuses on complex queries and global computation, uses graph analysis algorithms, and leans toward analytical processing (AP) scenarios. It is used where real-time requirements are low and data volumes are large. For example: generate a cardholder relationship network, and output cash-out cards in batches according to a cash-out model.
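The "shortest path between two merchants" query mentioned above can be sketched with a breadth-first search. This is a minimal illustration in plain Python, not GES code: the function name `shortest_path`, the adjacency-dictionary layout, and the sample merchant and card identifiers are all assumptions made for the example.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one fewest-hop path from source to target, or None if unreachable.

    `adjacency` is a hypothetical in-memory layout: vertex -> list of neighbors.
    """
    if source == target:
        return [source]
    parent = {source: None}          # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == target:
                # Walk the parent pointers back to the source.
                path = [target]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Toy transfer graph (made-up data): merchants connected through cards.
transfers = {
    "merchant_a": ["card_1", "card_2"],
    "card_1": ["merchant_a", "card_3"],
    "card_2": ["merchant_a", "merchant_b"],
    "card_3": ["card_1", "merchant_b"],
    "merchant_b": ["card_2", "card_3"],
}
path = shortest_path(transfers, "merchant_a", "merchant_b")
print(path)
```

A query like this touches only a small neighborhood of the graph, which is why it fits the low-latency TP side; the batch cash-out analysis in the second bullet would instead scan the whole graph.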

Generally speaking, the core capabilities of Graph are "deep relationship mining", "efficient relationship query", "efficient community analysis", and "visual display of paths".

[Figure: an example of using Graph to analyze epidemic spread]

1.1 Trend 1: As data becomes massive and diverse, data analysis grows more complex and graph technologies are spreading rapidly

Gartner has mentioned the importance of graph technology in several analyst reports:

  • Gartner lists graph and related technologies among its top 10 data and analytics technology trends for 2021.
  • Gartner predicts that by 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021.
  • "By 2023, graph technologies will facilitate rapid decision-making in 30% of organizations worldwide. Whether a graph is needed is no longer a question; it is a necessity."

1.2 Trend 2: Fragmented query languages hinder the adoption of graph databases, and GQL is expected to become the unified language

Historically, graph databases had no standard query language. There were only de facto standards such as Cypher and Gremlin (that is, query languages used by a wide range of products), and new products kept introducing query languages of their own. The inconsistent syntax raised the barrier to entry and held back adoption in this field.

GQL is led by WG3 (the ISO/IEC working group that has been responsible for the SQL standards since 1987). GQL will be built on top of openCypher Morpheus (which brings Cypher to Apache Spark), combined with inspiration from LDBC's G-CORE, to give users a composable graph query language that supports all of those features, which will make GQL conceptually equivalent to SQL.

2 Technical Insights

2.1 Technical Analysis of Graph Database Mainstream Systems

The table above summarizes our analysis of mainstream graph database systems. Our observations are:

  • Compared with relational databases, graph databases lag behind in development (they lack multi-tenancy and cloud-native capabilities, and their query optimization is generally weak)
  • Mainstream graph databases handle mixed workloads poorly under high concurrency
    • No performance isolation for mixed-workload queries under high concurrency
    • No QoS (Quality of Service) across multiple concurrent queries
    • Almost no targeted optimization for mixed workloads (a mix of large and small graph queries)
  • Mainstream graph databases provide hardly any optimization for fusion data analysis (mixed query types)
    • The vast majority of systems have no fusion analysis capability
    • A few systems have some rudimentary fusion analysis capability, but no end-to-end optimization of fusion queries
  • Mainstream graph databases have poor cloud-native support
    • Only AWS Neptune is optimized for cloud-native environments

2.2 Graph Analysis and Graph Learning Mainstream System Technology Analysis

The table above summarizes our analysis of mainstream graph analysis and graph learning systems. Our observations are:

  • Mainstream graph analysis systems target large-scale graphs, are mainly based on a distributed in-memory architecture, and are mostly positioned for offline graph computation. They do not support real-time data updates and offer weak support for complex OLAP-style interactive queries.
  • Mainstream graph learning systems are built on existing deep learning systems, mainly PyTorch, and their distributed training performance is mediocre.
  • Mainstream graph analysis systems and graph learning systems are separate and generally exchange data through files. There is room for further optimization in unified graph sampling and in joint scheduling of graph and neural network (NN) workloads.
  • The exploration of heterogeneous systems such as GPUs and FPGAs is still at an early stage.

To sum up, graph databases and graph engines serve a wide range of usage scenarios, and the capabilities they provide must develop from basic to advanced. The items marked in red in the technical insights are the differentiating competitiveness required to provide those advanced capabilities.

3 Evolution of Huawei Cloud Graph Database Technology

Since its launch in 2018, the Huawei Cloud graph database (GES) has gone through three periods: 2018 to 2021 was the 1.0 era, 2022 to the present is the 2.0 era, and it will evolve into the 3.0 era in the future.

The figure below shows the technical architecture and corresponding features of each version of GES, which will be analyzed in detail in the following sections.

3.1 GES 1.0: query analysis integration, high performance

GES 1.0 is based on a distributed in-memory architecture and focuses on query-analysis integration and high-performance query and analysis. Because only a single copy of the data is stored, it serves graph query tasks and graph analysis tasks equally well. For example, inserts, deletes, and updates are immediately visible to queries and can quickly take part in subsequent computing tasks, with no data synchronization between separate systems. The trade-off is that this architecture keeps the full data set in distributed memory, so it costs more than a persistent-storage solution and fault recovery takes longer in extreme cases. Overall, though, it comfortably handles processing and analysis of graph data at the scale of tens of billions.

3.2 GES 2.0: large scale, persistence, DSL, dynamic graph

GES 2.0 is the technical path on which current product development is focused. It targets graph data at scales from hundreds of billions to trillions, reducing cost through persistent storage while balancing query efficiency, computing performance, and ease of use. Here we decouple the graph database from the graph computing engine so that each component can evolve independently. Data synchronization on top of the unified storage is built into the system and is transparent to users, which keeps the user experience consistent when migrating from 1.0 to 2.0.

In addition, we are developing DSL and dynamic graphs as key features: the DSL provides custom algorithm capability, and dynamic graphs provide temporal analysis capability.

DSL : GES provides a flexible and controllable GraphDSL that helps users design and run algorithms and queries at low cost, especially complex queries and customized computing tasks such as customized PageRank and repeat queries. No installation, compilation, or version upgrade is required along the way, and existing usage habits are respected: the DSL combines the writing style and computation model of Cypher and Gremlin.

[Customized PageRank sample]
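The GraphDSL sample itself is not reproduced in this text. As a rough stand-in, the following plain-Python sketch shows what "customizing" PageRank might mean in practice: a caller-supplied damping factor and teleport (personalization) weights. It is not GraphDSL syntax; the function name `pagerank`, its parameters, and the sample edges are assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, teleport=None, iterations=50):
    """Power-iteration PageRank over directed (src, dst) edges.

    `teleport` maps vertex -> weight for the random-jump distribution;
    None means uniform. Rank held by vertices without out-edges is
    redistributed according to the same teleport distribution.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    if teleport is None:
        teleport = {n: 1.0 for n in nodes}
    total = sum(teleport.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("teleport weights must sum to a positive value")
    teleport = {n: teleport.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        # Random jump plus the redistributed dangling mass.
        new_rank = {n: (1 - damping + damping * dangling) * teleport[n]
                    for n in nodes}
        # Each vertex splits its damped rank evenly over its out-edges.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)                         # standard, uniform teleport
biased = pagerank(edges, teleport={"d": 1.0})   # customized: always jump to d
print(ranks, biased)
```

Changing only the teleport weights turns the same loop into a personalized ranking around chosen vertices, which is the kind of small variation a DSL lets users express without rebuilding the engine.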

Dynamic graph : The world is constantly changing, and important information lies behind those changes (for example, the temporal effects in the spread of an epidemic, or the order of transfer relationships). Traditional graph analysis mainly takes a static, single-perspective approach that considers only the static structure and ignores change, which makes it hard to support accurate reasoning and decision-making. Dynamic graph analysis considers changes along the time dimension, models and analyzes dynamic and static information together, and supports more accurate decisions.

[Dynamic graph diagram: modeling, dynamic graph algorithm, visualization]
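To illustrate why the order of transfer relationships matters, here is a minimal sketch of time-respecting reachability on a temporal graph. It is plain Python rather than a GES dynamic graph API; the function name `earliest_arrival`, the `(src, dst, time)` edge format, and the sample accounts are assumptions made for the example. A path counts only if each hop happens strictly later than the previous one.

```python
def earliest_arrival(temporal_edges, source, start_time=float("-inf")):
    """Earliest time each vertex can be reached from `source`.

    `temporal_edges` holds (src, dst, time) tuples. An edge can be used
    only strictly after its source vertex has been reached, so paths must
    have increasing timestamps. One pass over the edges in time order is
    enough under that rule.
    """
    arrival = {source: start_time}
    for src, dst, t in sorted(temporal_edges, key=lambda e: e[2]):
        if src in arrival and arrival[src] < t and t < arrival.get(dst, float("inf")):
            arrival[dst] = t
    return arrival


# Made-up transfers: a -> b happens at t=5, but b -> c already happened at t=3.
transfers = [
    ("acct_a", "acct_b", 5),
    ("acct_b", "acct_c", 3),
    ("acct_b", "acct_d", 8),
]
arrival = earliest_arrival(transfers, "acct_a")
print(arrival)
```

In a static view, `acct_c` looks reachable from `acct_a` through `acct_b`. Once timestamps are respected it is not, because the transfer into `acct_c` happened before any funds from `acct_a` arrived at `acct_b`.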

3.3 GES 3.0: Embrace large models and build Graph+AI engines

Looking ahead, GES 3.0 will be built as a Graph+AI engine. On the one hand, it combines with large models to improve AI capabilities; on the other hand, it integrates multi-source data to fit better into the big data ecosystem. At the same time, ease of use and ecosystem compatibility (GQL) run through the whole design.

The core concepts of GES 3.0:

  • Composable : All system components can be combined and replaced, so the system remains upgradeable over the long term
  • Unified : Provides query capabilities across multiple data types (Table, Graph)
  • AI-Centric : Deeply integrated with large models, with AI at the center, to empower GES

  • Integrate large models to automatically extract entity relationships and build knowledge graphs
  • Support joint reasoning across LLM, GNN, and DL within graph queries
  • Use Graph's strong multi-hop reasoning and its ability to store real-time data and events to help correct the hallucination and timeliness problems of LLMs
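One way to picture the last point is multi-hop fact retrieval: collect the facts within a few hops of an entity in a knowledge graph and hand them to the model as grounding context. The sketch below is a plain-Python illustration under stated assumptions, not a GES or LLM API; `k_hop_facts`, the triple format, and the sample data are hypothetical.

```python
from collections import defaultdict


def k_hop_facts(triples, start, hops):
    """Collect (subject, predicate, object) facts within `hops` hops of `start`.

    Traversal follows edges from subject to object only.
    """
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    frontier, seen, facts = {start}, {start}, []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):      # sorted for a stable output order
            for s, p, o in by_subject[entity]:
                facts.append((s, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return facts


# Made-up knowledge graph for illustration.
triples = [
    ("alice", "works_at", "acme"),
    ("acme", "located_in", "shenzhen"),
    ("shenzhen", "part_of", "guangdong"),
    ("bob", "works_at", "globex"),
]
facts = k_hop_facts(triples, "alice", hops=2)
# The retrieved facts could be prepended to a prompt as context.
context = "\n".join(f"{s} {p} {o}" for s, p, o in facts)
print(context)
```

Because the graph can be updated as events arrive, the retrieved context reflects current data rather than whatever the model saw during training.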

Origin my.oschina.net/u/4526289/blog/10102130