Open source! Get to know Alibaba's one-stop graph computing platform GraphScope in one article

1. What is graph computing

Graph data models a set of objects (vertices) and their relationships (edges), and can intuitively and naturally represent all kinds of real-world entities and the relationships among them. In big data scenarios, social networks, transaction data, knowledge graphs, transportation and communication networks, supply chains, and logistics plans are all typical examples of data modeled as graphs. Figure 1 shows Alibaba's graph data in the e-commerce scenario, which has various types of vertices (consumers, sellers, items, and devices) and edges (representing purchase, view, comment, and other relationships). In addition, each vertex carries rich attribute information.

Figure 1: An example of Alibaba's e-commerce graph data

Graph data in real-world scenarios like this usually contains billions of vertices and trillions of edges. Besides the large scale, such graphs also update continuously and very quickly, sometimes with nearly a million updates per second. As the application scale of graph data has kept growing in recent years, exploring the internal relationships of graph data and computing over it have received more and more attention. According to the goals of graph computing, tasks can be roughly divided into three types: interactive query, graph analysis, and graph-based machine learning.

1 Interactive query of graphs

 

Figure 2: Left, an example of financial anti-fraud; right, an example of graph-based machine learning.

In graph computing applications, the business often needs to explore graph data interactively in order to locate problems in time and dig into deeper information. The (simplified) graph model in Figure 2 (left) can be used for financial anti-fraud detection (illegal cash-out of a credit card). Using a fake identity, a "criminal" (vertex 2) obtains short-term credit from a bank (vertex 4). He attempts to cash out by making a fake purchase (edge 2->3) with the help of a merchant (vertex 3). Once the merchant receives the payment (edge 4->3) from the bank (vertex 4), it returns the money (via edges 3->1 and 1->2) to the "criminal" through multiple accounts under its control. This pattern eventually forms a closed loop on the graph (2->3->1...->2). In a real scenario, the online graph data may contain billions of vertices (for example, users) and hundreds of billions to trillions of edges (for example, payment transactions), and the whole fraud process may involve many entities and dynamic transaction chains subject to various constraints, which can only be identified well with complex real-time interactive analysis.
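For intuition, such a closed-loop pattern can be expressed directly as a graph traversal. The sketch below uses a generic Gremlin query (the query language that appears later in this article) held in a Python string; the 'account' label and 'transfer' edge name are hypothetical, and a production query would also constrain the timestamps and amounts involved.

# Illustrative only: find length-3 cycles such as 2 -> 3 -> 1 -> 2.
# 'account' and 'transfer' are hypothetical labels, not from a real schema.
cycle_query = """
g.V().hasLabel('account').as('a').
  repeat(out('transfer').simplePath()).times(2).
  where(out('transfer').as('a')).
  path()
"""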

2 Graph analysis

Research on graph analysis has been going on for decades and has produced many graph analysis algorithms. Typical ones include classic graph algorithms (for example, PageRank, shortest path, and maximum flow), community detection algorithms (for example, maximal clique/bi-clique, connected components, Louvain, and label propagation), and graph mining algorithms (for example, frequent subgraph mining and graph pattern matching). Because of the diversity of graph analysis algorithms and the complexity of distributed computing, distributed graph analysis algorithms usually need to follow a certain programming model. Current programming models include the vertex-centric model ("think like a vertex"), the matrix-based model, and the subgraph-based model. Under these models, various graph analysis systems have emerged, such as Apache Giraph, Pregel, PowerGraph, Spark GraphX, and GRAPE.

3 Graph-based machine learning

Classic graph embedding techniques, such as Node2Vec and LINE, have been widely used in various machine learning scenarios. Graph neural networks (GNNs), proposed in recent years, combine the structure and attribute information of the graph with features from deep learning. A GNN can learn a low-dimensional representation for any graph structure (for example, a vertex, an edge, or the entire graph), and the resulting representations can be used by many downstream graph-related machine learning tasks, such as classification, link prediction, and clustering. Graph learning techniques have shown convincing performance on many graph-related tasks. Unlike traditional machine learning tasks, graph learning tasks involve both graph operations and neural-network operations (see Figure 2, right): each vertex uses graph operations to select its neighbors and aggregates the neighbors' features with neural-network operations.
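As a minimal illustration of these two kinds of operations (plain NumPy, not GraphScope code), one round of neighbor aggregation can be sketched as follows; the mean-pooling and ReLU choices are just examples.

# Minimal sketch: a graph operation gathers each vertex's neighbors, then a
# neural-network operation (linear transform + ReLU) aggregates their features.
import numpy as np

def aggregate_layer(features, adjacency, weight):
    """features: [N, d] array; adjacency: dict vertex -> list of neighbor ids; weight: [d, d'] array."""
    out = np.zeros((features.shape[0], weight.shape[1]))
    for v, nbrs in adjacency.items():
        gathered = features[nbrs].mean(axis=0) if nbrs else features[v]  # graph operation: collect neighbors
        out[v] = np.maximum(gathered @ weight, 0.0)                      # neural operation: transform + nonlinearity
    return out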

2. Graph computing: the cornerstone of next-generation artificial intelligence

Not only at Alibaba: graph data and graph computing technology have been hot topics in academia and industry in recent years. In particular, over the past decade the performance of graph computing systems has improved by 10 to 100 times, and systems keep becoming more efficient, which makes it possible to accelerate AI and big data tasks with graph computing. In fact, graphs can express all kinds of complex data very naturally and provide abstractions for common machine learning models. Compared with dense tensors, graphs offer richer semantics and more comprehensive optimization opportunities. Moreover, graphs are a natural representation of sparse high-dimensional data, and a growing body of work on graph convolutional networks (GCN) and graph neural networks (GNN) has shown that graph computing is an effective complement to machine learning and will play an increasingly important role in explainability, deep reasoning, causality, and so on.

Figure 3: Graph computing has broad application prospects in various fields of AI

It is foreseeable that graph computing will play an important role in various applications of next-generation artificial intelligence, including anti-fraud, smart logistics, city brain, bioinformatics, public safety, public health, urban planning, anti-money laundering, infrastructure, recommendation systems, fintech, supply chain, and other fields.

3. The current state of graph computing

After years of development, many systems and tools have emerged for various graph computing needs. For example, for interactive queries there are graph databases such as Neo4j, ArangoDB, and OrientDB, as well as distributed systems and services such as JanusGraph, Amazon Neptune, and Azure Cosmos DB; for graph analysis there are systems such as Pregel, Apache Giraph, Spark GraphX, and PowerGraph; for graph learning there are DGL, PyTorch Geometric, and others. Nevertheless, in the face of rich graph data and diversified graph scenarios, effectively using graph computing to improve business results still faces huge challenges:

  • Real-world graph computing scenarios are diverse and usually very complex, involving multiple types of graph computation. Existing systems are mainly designed for specific types of graph computing tasks, so users must decompose a complex task into multiple jobs across many systems, which introduces large additional overheads such as integration, I/O, format conversion, network, and storage between systems.
  • It is difficult to develop applications for large-scale graph computing. To develop a graph computing application, users usually start with small-scale graph data on a single machine using simple, easy-to-use tools (such as NetworkX in Python and TinkerPop). However, for ordinary users it is extremely difficult to extend a single-machine solution to a parallel environment to process large-scale graphs. Existing distributed systems for large graphs usually follow different programming models and lack the rich out-of-the-box algorithm libraries found in single-machine libraries such as NetworkX. This makes the barrier to distributed graph computing too high.
  • The scale and efficiency of processing large graphs are still limited. For example, due to the high complexity of traversal patterns, existing interactive graph query systems cannot execute Gremlin queries in parallel. For graph analysis systems, the traditional vertex-centric programming model makes existing graph-level optimization techniques unavailable. In addition, many existing systems have essentially no compiler-level optimization.

Let's look at the limitations of existing systems through a concrete example.

1 Example: Paper classification prediction

The ogbn-mag dataset comes from Microsoft Academic Graph. It has four types of vertices, representing papers, authors, institutions, and fields of study, and four types of edges representing the relationships between them: an author "writes" a paper, a paper "cites" another paper, an author "is affiliated with" an institution, and a paper "belongs to" a field of study. This data can naturally be modeled as a graph.

A user wants to run a classification task over the "paper" vertices published in 2014-2020 on this graph: classify each paper and predict its subject category based on the structural attributes of the data graph, the paper's own topic features, and cohesion metrics such as k-core and triangle counting. This is in fact a very common and meaningful task: by taking both the citation relationships and the topics of papers into account, such predictions can help researchers better discover potential collaborations and research hotspots in their field.

Let's break down this computing task: first, we need to filter the papers and their related vertices and edges by year; then we need to run whole-graph analytics such as k-core and triangle counting on this subgraph; finally, we need to combine these two metrics with the original features on the graph and feed them into a machine learning framework for classification training and prediction. We found that no existing system can solve this problem well end to end; we can only operate by stitching multiple systems into a pipeline:

Figure 4: Workflow composed of multiple systems for paper classification and prediction
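In code, such a stitched-together pipeline might look roughly like the sketch below (illustrative only; the file names, column names, and single-machine tools are hypothetical stand-ins for the systems in Figure 4):

# Illustrative sketch of the fragmented multi-system pipeline (not GraphScope code).
import networkx as nx
import pandas as pd

# Step 1: filter papers by year in one system, then dump the result to files.
papers = pd.read_csv("papers.csv")
papers = papers[(papers.year >= 2014) & (papers.year <= 2020)]
cites = pd.read_csv("cites.csv")
cites = cites[cites.src.isin(papers.id) & cites.dst.isin(papers.id)]

# Step 2: load the dump into a single-machine analytics library for whole-graph metrics.
g = nx.from_pandas_edgelist(cites, "src", "dst")
kcore = nx.core_number(g)     # k-core number of each vertex
tri = nx.triangles(g)         # triangle count of each vertex

# Step 3: join the metrics back onto the features and hand them off to an ML framework.
papers["kcore"] = papers.id.map(kcore)
papers["tc"] = papers.id.map(tri)
papers.to_csv("train_features.csv", index=False)   # yet another on-disk handoff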

The task seems solvable this way, but many problems hide behind the pipeline approach. For example, the systems are independent and isolated from each other, and intermediate data is frequently written to external storage to move data between them; graph analysis programs are not written in a declarative language and have no fixed paradigm; the scale of the graph limits the efficiency of the machine learning framework; and so on. These are problems we often encounter in real-world graph computing scenarios. They can be summarized as the following three points:

  • Graph computing problems are complex, the computation patterns are diverse, and the solutions are fragmented.
  • Graph computing is hard to learn, costly to develop, and has a high barrier to entry.
  • Graphs are large, data volumes are huge, the computation is complex, and efficiency is low.

To solve the above problems, we designed and developed a one-stop open source graph computing system: GraphScope.

4. What is GraphScope

GraphScope is a one-stop graph computing platform developed and open-sourced by the Intelligent Computing Laboratory of Alibaba DAMO Academy. Relying on Alibaba's massive data and rich scenarios, as well as DAMO Academy's leading research, GraphScope aims to provide a one-stop, efficient solution to the challenges that graph computing faces in real production, as described above.

GraphScope provides a Python client that can easily connect to upstream and downstream workflows. It is one-stop, easy to develop with, and extremely fast. It features efficient cross-engine memory management, is the first in the industry to support distributed compilation and optimization of Gremlin, and supports automatic parallelization of algorithms and automatic incremental processing of dynamic graph updates, delivering the performance required by enterprise-level scenarios. In applications inside and outside Alibaba, GraphScope has been proven to deliver significant new business value in several key Internet domains (such as risk control, e-commerce recommendation, advertising, network security, and knowledge graphs).

GraphScope incorporates a number of DAMO Academy's research results. Its core technologies have won the SIGMOD 2017 Best Paper Award, the VLDB 2017 Best Demo Award, a VLDB 2020 Best Paper nomination, and the SAIL award of the World Artificial Intelligence Conference. The paper on GraphScope's interactive query engine has also been accepted by NSDI 2021 and will be published soon. More than a dozen research results around GraphScope have been published in top conferences and journals in the field, such as TODS, SIGMOD, VLDB, and KDD.

1 Architecture introduction

Figure 5: GraphScope system architecture diagram

At the bottom of GraphScope is the distributed in-memory data management system vineyard [1], which is also an open source project of ours. Vineyard provides efficient, rich I/O interfaces for interacting with underlying file systems, offers efficient high-level data abstractions (including but not limited to graphs, tensors, and vectors), supports management of data partitions and metadata, and provides native zero-copy data access to upper-layer applications. This is what underpins GraphScope's one-stop capability: across engines, graph data lives in vineyard as partitions and is managed uniformly by vineyard.
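As a rough sketch of what this shared-memory layer looks like from Python (the socket path is deployment-specific, and the exact client API should be checked against the vineyard documentation):

# Rough sketch of sharing data through vineyard's Python client.
import numpy as np
import vineyard

client = vineyard.connect("/var/run/vineyard.sock")  # connect to the local vineyard instance (path varies)
object_id = client.put(np.random.rand(1000, 16))     # place an object into shared memory
shared = client.get(object_id)                       # other processes/engines fetch it by id, zero-copy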

In the middle is the engine layer, composed of the interactive query engine GIE, the graph analysis engine GAE, and the graph learning engine GLE, which we will introduce in detail in later sections.

The top layer consists of development tools and algorithm libraries. GraphScope provides a variety of commonly used analysis algorithms, including connectivity computations, community detection, and numerical computations such as PageRank and centrality. The algorithm library will continue to expand and will provide NetworkX-compatible analysis capabilities on very large graphs. In addition, GraphScope provides a rich set of graph learning algorithms, with built-in support for GraphSAGE, DeepWalk, LINE, Node2Vec, and others.

2 Solving the problem: paper classification prediction

With GraphScope as a one-stop computing platform, we can solve the problem in the previous example in a much simpler way.

GraphScope provides a Python client that lets data scientists complete all graph computing tasks in an environment they are familiar with. In Python, we first need to create a GraphScope session.

import graphscope
from graphscope.dataset.ogbn_mag import load_ogbn_mag

# create a session (which allocates the cluster resources) and load the ogbn-mag graph
sess = graphscope.session()
g = load_ogbn_mag(sess, "/testingdata/ogbn_mag/")

In the above code, we created a GraphScope session and loaded the graph data.

GraphScope is designed to be cloud-native: behind each session is a set of Kubernetes (k8s) resources, and the session is responsible for requesting and managing all the resources of that session. Specifically, behind this single line of user code, the session first requests a pod for the backend coordinator, which handles all communication with the Python client. After the coordinator finishes initializing, it launches a group of engine pods. Each pod in this group runs a vineyard instance, and together these instances form the distributed memory management layer; each pod also runs the three engines GIE, GAE, and GLE, whose start and stop are subsequently managed on demand by the coordinator. Once this group of pods is up, has established stable connections with the coordinator, and has passed health checks, the coordinator reports back to the client that the session has been launched successfully and the resources are ready for loading graphs or running computations.
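For reference, a session can also be launched with explicit cluster resources. Treat the sketch below as an assumption about the client's keyword arguments rather than a definitive reference, and check graphscope.session's documentation for the exact options.

# Assumed keyword argument; consult the graphscope.session docstring for the exact names.
sess = graphscope.session(num_workers=2)   # request a cluster of two engine pods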

interactive = sess.gremlin(g)

# count the number of papers two authors (with id 2 and 4307) have co-authored
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()

First, we create an interactive query object, interactive, on graph g. This object launches a set of GIE interactive query engine instances in the engine pods. What follows is a standard Gremlin query: the user wants to see the papers co-authored by two specific authors in this dataset. This Gremlin statement is sent to the GIE engine to be decomposed and executed.

The GIE engine is composed of core components such as a parallel compiler, memory and scheduling management, an operator runtime, adaptive traversal strategies, and a distributed dataflow engine. When an interactive query arrives, the compiler first splits it and compiles it into multiple operators, which are then driven and executed on the distributed dataflow model. In this process, each compute node holding a data partition runs a copy of the dataflow, processes its partition's data in parallel, and exchanges data with other nodes on demand, so that Gremlin queries are executed in parallel.

Given Gremlin's complex syntax, the traversal strategy is very important: it affects the parallelism of a query, and its choice directly affects resource consumption and query performance. Simply relying on BFS or DFS cannot meet real-world needs; the optimal traversal strategy usually has to be adjusted and selected dynamically based on the specific data and query. The GIE engine provides adaptive traversal strategy configuration, choosing the strategy according to the queried data, the decomposed operators, and a cost model, in order to maximize operator execution efficiency.
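For example, the following (illustrative) three-hop traversal over the ogbn-mag schema loaded above produces intermediate results whose size, and therefore the preferable traversal strategy, depends heavily on the data:

# author 2 -> their papers -> papers they cite -> papers those cite in turn
q = ("g.V().has('author', 'id', 2)"
     ".out('writes').out('cites').out('cites')"
     ".dedup().count()")
print(interactive.execute(q).one())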

# extract a subgraph of publication within a time range
sub_graph = interactive.subgraph("g.V().has('year', inside(2014, 2020)).outE('cites')")

# project the subgraph to a simple graph with only 'paper' vertices and 'cites' edges.
simple_g = sub_graph.project_to_simple(v_label="paper", e_label="cites")

ret1 = graphscope.k_core(simple_g, k=5)
ret2 = graphscope.triangles(simple_g)

# add the results as new columns to the citation graph
sub_graph = sub_graph.add_column(ret1, {"kcore": "r"})
sub_graph = sub_graph.add_column(ret2, {"tc": "r"})

After a series of interactive queries for exploratory, point-by-point inspection, the user moves on to graph analysis tasks with the statements above.

First, the subgraph operator extracts a subgraph from the original graph according to the filter condition. Behind this operator, the interactive engine GIE executes a query and writes the resulting graph into vineyard.

Then, on this new graph, the user selects the vertices labeled "paper" and the edges labeled "cites" to produce a homogeneous simple graph, and runs GAE's built-in k-core and triangle-counting algorithms on it as whole-graph analytics. Once the results are produced, they are added back to the graph as vertex attributes. Here, thanks to vineyard's metadata management and high-level data abstraction, the new sub_graph is generated by adding new columns to the original graph, without reconstructing all the data of the entire graph.

The core of the GAE engine is inherited from GRAPE, the system that won the SIGMOD 2017 Best Paper Award [2]. It consists of a high-performance runtime, automatic parallelization components, and multi-language SDKs, among other components. The example above uses GAE's built-in algorithms; in addition, GAE lets users write their own algorithms very simply and plug them in. Users write algorithms with the subgraph-centric PIE programming model, or reuse existing sequential graph algorithms, without worrying about distributed details: GAE performs the parallelization automatically, which greatly lowers the barrier of distributed graph computing for users. Currently GAE lets users write their algorithm logic in C++ and Python (with Java support planned) and run it plug-and-play in a distributed environment. GAE's high-performance runtime is based on MPI and is carefully optimized for communication, data layout, and hardware features to achieve extreme performance.
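Conceptually, the PIE model asks the user only for a partial evaluation function (PEval) and an incremental evaluation function (IncEval) over subgraphs, and the runtime handles the distribution. The sketch below is a plain-Python illustration of that control flow, not the actual GAE SDK; the fragment and message structures are hypothetical.

# Conceptual sketch of the PIE model: PEval runs a complete sequential algorithm on each
# fragment (subgraph) in parallel; IncEval then refines the partial results incrementally
# as messages arrive from other fragments, until no new messages are produced.
def run_pie(fragments, peval, inceval):
    # partial[fid] holds each fragment's current partial result;
    # messages is a list of (border_vertex, value) pairs exchanged between fragments.
    partial, messages = {}, []
    for fid, frag in fragments.items():              # executed in parallel by the real runtime
        partial[fid], out = peval(frag)              # PEval: sequential algorithm on the local fragment
        messages.extend(out)
    while messages:                                  # iterate until no fragment emits new messages
        next_messages = []
        for fid, frag in fragments.items():
            inbox = [(v, m) for v, m in messages if v in frag.border_vertices]
            if inbox:
                partial[fid], out = inceval(frag, partial[fid], inbox)   # IncEval: incremental refinement
                next_messages.extend(out)
        messages = next_messages
    return partial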

# define the features for learning
paper_features = []
for i in range(128):
    paper_features.append("feat_" + str(i))

paper_features.append("kcore")
paper_features.append("tc")

# launch a learning engine.
lg = sess.learning(sub_graph, nodes=[("paper", paper_features)],
                  edges=[("paper", "cites", "paper")],
                  gen_labels=[
                      ("train", "paper", 100, (1, 75)),
                      ("val", "paper", 100, (75, 85)),
                      ("test", "paper", 100, (85, 100))
                  ])

Next we use the graph learning engine to classify the papers. First, we configure the 128-dimensional features of the paper vertices, together with the kcore and tc attributes computed in the previous step, as the training features. Then we launch the graph learning engine GLE from the session. When building the graph lg in GLE, we configure the graph data and feature attributes, specify the edge types, and split the vertex set into a training set, a validation set, and a test set.

from graphscope.learning.examples import GCN
from graphscope.learning.graphlearn.python.model.tf.trainer import LocalTFTrainer
from graphscope.learning.graphlearn.python.model.tf.optimizer import get_tf_optimizer

# supervised GCN.

def train_and_test(config, graph):
    def model_fn():
        return GCN(graph, config["class_num"], ...)

    trainer = LocalTFTrainer(model_fn,
                             epoch=config["epoch"]...)
    trainer.train_and_evaluate()

config = {...}

train_and_test(config, lg)

Then, with the code above, we choose the model, set the training-related parameters, and conveniently use GLE to start the paper classification task.

The GLE engine consists of two parts, Graph and Tensor, each composed of various operators. The Graph part connects graph data with deep learning, providing operations such as batch iteration, sampling, and negative sampling, and supports both homogeneous and heterogeneous graphs. The Tensor part is composed of various deep learning operators. In the computation module, a graph learning task is decomposed into operators, which are executed in a distributed fashion at runtime. To further optimize sampling performance, GLE caches remote neighbors, frequently accessed vertices, attribute indexes, and so on, to speed up lookups of vertices and their attributes in each partition. GLE uses an asynchronous execution engine with support for heterogeneous hardware, which allows it to effectively overlap a large number of concurrent operations such as I/O, sampling, and tensor computation. GLE abstracts heterogeneous computing hardware into resource pools (for example, a CPU thread pool and a GPU stream pool) and schedules fine-grained concurrent tasks cooperatively.
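As a conceptual illustration of the sampling path described above (plain Python, not the GLE API), a cached neighbor sampler might look like this, where fetch_remote stands in for the cross-partition lookup that the asynchronous engine overlaps with other work:

# Conceptual sketch: batched neighbor sampling with a cache of frequently accessed remote vertices.
import random

class CachedNeighborSampler:
    def __init__(self, local_adj, fetch_remote, cache_capacity=100_000):
        self.local_adj = local_adj            # adjacency lists of vertices owned by this partition
        self.fetch_remote = fetch_remote      # callback: pull a vertex's neighbors from another partition
        self.cache = {}                       # cache of remote neighbor lists
        self.cache_capacity = cache_capacity

    def _neighbors(self, v):
        if v in self.local_adj:
            return self.local_adj[v]
        if v not in self.cache:
            nbrs = self.fetch_remote(v)
            if len(self.cache) < self.cache_capacity:
                self.cache[v] = nbrs
            return nbrs
        return self.cache[v]

    def sample(self, batch, fanout):
        # For each vertex in the batch, draw up to `fanout` neighbors uniformly at random.
        return {v: random.sample(self._neighbors(v), min(fanout, len(self._neighbors(v))))
                for v in batch}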

5. Performance

GraphScope not only solves the one-stop usability problem of graph computing, but also pushes performance to the limit to meet enterprise-level needs. We used the LDBC benchmarks to evaluate and compare GraphScope's performance.

As shown in Figure 6, on the interactive query benchmark LDBC SNB, GraphScope deployed on a single node is more than an order of magnitude faster than the open source system JanusGraph; in a distributed deployment, GraphScope's interactive queries achieve nearly linear scalability.

Figure 6: GraphScope interactive query performance

On the graph analysis benchmark LDBC Graphalytics, GraphScope was compared with PowerGraph and other state-of-the-art systems and leads on almost all combinations of algorithms and datasets. On some algorithms and datasets it is at least five times faster than the other platforms. Part of the data is shown in the figure below.

Figure 7: GraphScope graph analysis performance

For the experimental setup, reproduction steps, and the complete performance comparison, please refer to the interactive query performance report [3] and the graph analysis performance report [4].

6. Embracing open source

GraphScope's white paper and code are open source at http://github.com/alibaba/graphscope [5], under the Apache License 2.0. Everyone is welcome to star the project, try it out, and get involved in graph computing, and contributions of code toward building the best graph computing system in the industry are also welcome. We will keep updating the project and continuously improve its functional completeness and system stability. You can also follow http://graphscope.io for the latest news about the project.

 

