[EuroSys 2023 Best Poster] An extremely low-latency GNN inference sampling service for dynamic graphs

Author: Shen Wenting

GraphLearn is a large-scale, industrial graph neural network training framework jointly built by the PAI team of the Alibaba Cloud machine learning platform and the graph computing team of the Intelligent Computing Laboratory of DAMO Academy. It also serves as the graph learning engine of the one-stop graph computing platform GraphScope. GraphLearn has recently open-sourced DGS, a real-time sampling service for GNN online inference on dynamic graphs. DGS can process high-throughput graph updates in real time while guaranteeing low-latency, high-concurrency processing of inference sampling queries, and both its graph update and sampling query performance scale linearly in a distributed environment. Recently, "Dynamic Graph Sampling Service for Real-time GNN Inference at Scale", jointly published by the GraphLearn team and Zhejiang University, was selected as the best poster of EuroSys 2023.


Poster address: https://2023.eurosys.org/docs/posters/eurosys23posters-final40.pdf
Open source project addresses: GraphLearn, GraphScope

Background

GNN models capture high-order neighborhood information through the graph structure. In large-scale industrial deployments, a common training approach is to use neighborhood sampling to reduce communication and computation overhead and obtain distributed scalability. At the same time, in real business scenarios such as recommendation and financial anti-fraud, the structure and attributes of the graph often change dynamically over time, and the GNN model must be able to sample and represent this dynamic neighborhood information in real time.

Because online learning easily causes model jitter, model deployment in production usually goes through a complex release pipeline, so models are generally deployed in a near-line fashion. For the GNN model to represent neighborhood information in real time, inference must therefore be performed on graph structure and attributes that are sampled in real time.

To guarantee user experience, this real-time inference task has extremely strict latency requirements, leaving very little latency budget for sampling queries. At the same time, the data scale of industrial graphs and the QPS of online inference services often exceed the storage and computing capacity of a single machine. We therefore need a real-time sampling service for GNN model inference that guarantees extremely low latency (P99 within 20 milliseconds) on large-scale dynamic graphs and can scale linearly in a distributed environment.

Challenges

An intuitive approach to a real-time graph sampling service is to maintain a dynamic graph storage and query module: when an inference request arrives, perform neighbor sampling and attribute collection for the requested vertex, and feed the resulting samples to the model service as input for inference. However, the distribution of graph data and the load characteristics of inference sampling make it difficult for this intuitive approach to achieve stable low-latency sampling on distributed dynamic graphs. Specifically, there are the following challenges:

  1. Neighbor sampling needs to traverse all neighbors, and since the neighbors keep changing as the graph evolves, it is hard to guarantee low latency for complex sampling computations; the existence of super-vertices also makes the latency unstable.
  2. Because graph data is unevenly distributed, the storage and computing loads across graph partitions are unbalanced, leading to unstable sampling latency and making distributed scale-out difficult.
  3. Inference sampling is generally multi-hop sampling, and dynamic attributes on vertices or edges also need to be collected. On a distributed graph, the network and local I/O overhead caused by multi-hop sampling and attribute access has a great impact on latency.

Key Design

Unlike the workload of a general-purpose graph database, when a dynamic graph inference sampling service serves the online inference of a given model, the corresponding graph sampling follows a fixed pattern. Take, for example, a GraphSAGE model on a common User-Item / Item-Item bipartite graph. The sampling pattern is generally: for the requested user ID (feed_id), sample two of the user's most recently purchased items using the timestamp as the sampling probability, and for each of these two items, sample the two items with the highest correlation coefficients. This query can be expressed with the GSL (Graph Sampling Language) provided by GraphLearn, as shown below:


Figure 1: Two-hop sampling query
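As a rough illustration of what such a two-hop GSL query can look like (the figure shows the actual query), the sketch below uses illustrative vertex/edge type names ("user", "u2i", "i2i") and sampling-strategy names; these are assumptions for readability, not the exact GraphLearn API, and `g` is assumed to be an already-constructed GraphLearn graph object.

```python
# Hedged sketch of the two-hop sampling pattern described above, written in a
# GSL-like style. Type and strategy names are illustrative placeholders.
query = (
    g.V("user", feed=request_ids)                 # the requested user IDs
     .outV("u2i").sample(2).by("timestamp")       # 2 most recently purchased items
     .outV("i2i").sample(2).by("edge_weight")     # 2 most correlated items per item
     .values()
)
```

The important point is the fixed shape of the traversal: user, then two recent items, then two correlated items per item, which DGS can exploit by installing the query ahead of time.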

This fixed query pattern provides the opportunity to serve large-scale dynamic graph sampling with stable low latency.

Key points of DGS system design:

  1. Storage-compute separation and Query-aware Cache

DGS separates graph storage from sampling computation. Sampling computation generally means random sampling, latest-neighbor sampling (top-k by timestamp), or sampling from a probability distribution defined by edge weights (or edge timestamps). The time complexity of the first two is O(1). Probability-distribution sampling is usually implemented with the Alias Method, but on a dynamic graph the alias table has to be rebuilt repeatedly as the distribution changes, and both its time and space complexity are O(N), where N is the (constantly changing) number of neighbors of the vertex. Unlike the simple reads and writes of graph storage, graph sampling involves both storage access and complex computation. Therefore, we first separate storage and computation, and on the computation side the system caches in advance the data that the service's installed queries need to access (a query-aware cache), improving the spatial locality of graph sampling computation.
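To illustrate why probability-distribution sampling is costly on a dynamic graph, the generic sketch below (not DGS code) builds an alias table over a vertex's out-edge weights with Vose's method; every neighbor insertion or weight change invalidates the table and forces this O(N) construction to be redone, while a single draw stays O(1).

```python
import random

def build_alias_table(weights):
    """Build an alias table (Vose's method) for O(1) weighted sampling.
    Construction is O(N) in time and space, so on a dynamic graph it must be
    rebuilt whenever the neighbor list or the edge weights change."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:          # numerical leftovers get probability 1
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """Draw one neighbor index in O(1)."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```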

  2. Event-driven pre-sampling

To speed up the response to sampling requests, DGS moves the sampling computation for each vertex from the moment a request arrives to the moment a graph update event occurs, trading space for time, so that only result enumeration remains to be done when a sampling request arrives. At the same time, to reduce the staleness between a graph update event and sample generation, DGS adopts streaming sampling: using a weighted reservoir sampling algorithm, each arriving update is sampled in a streaming fashion according to the pre-installed query. This graph-update-event-driven pre-sampling makes the storage space and computation time per vertex a constant O(K), where K is the size of the reservoir. By pre-storing the results of graph sampling computation in the cache, the problem in Challenge 1 is solved.
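A minimal sketch of one possible weighted reservoir sampler, in the style of the Efraimidis-Spirakis A-Res algorithm, is shown below. It illustrates the O(K) per-vertex state and per-update cost the text refers to; it is not the exact algorithm used inside DGS.

```python
import heapq
import itertools
import random

class WeightedReservoir:
    """Streaming weighted reservoir of size K (A-Res style): each arriving
    edge is assigned the key u**(1/w), and the K edges with the largest keys
    are kept. Per-vertex state is O(K); each update costs O(log K)."""

    def __init__(self, k):
        self.k = k
        self._heap = []                # min-heap of (key, seq, edge)
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def update(self, edge, weight):
        key = random.random() ** (1.0 / weight)
        item = (key, next(self._seq), edge)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif key > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)

    def samples(self):
        return [edge for _, _, edge in self._heap]

# Example: keep 2 purchased items per user, weighted by purchase timestamp.
r = WeightedReservoir(k=2)
for item_id, ts in [(101, 5.0), (102, 9.0), (103, 12.0)]:
    r.update(edge=item_id, weight=ts)
print(r.samples())
```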

  3. Multi-hop decomposition and lazy stitching

So far, DGS solves real-time one-hop sampling for input vertices. However, DGS mainly serves multi-hop sampling. Taking two-hop sampling as an example, after the first-hop result of an input vertex is updated, the corresponding second-hop results (and the collected attributes) also need to be updated. With more hops, the exponentially growing read and write overhead caused by this chain reaction has a huge impact on sampling-request latency. DGS solves this problem by decomposing the graph sampling into per-hop one-hop queries according to the pre-installed query; for each hop, all vertices of the corresponding vertex type in the graph are pre-sampled and their results stored. For example, the query in Figure 1 can be decomposed as shown in Figure 2; combined with event-driven pre-sampling, the samples for each vertex are stored and updated in reservoirs, as shown in Figure 3.

In addition, DGS postpones the stitching of multi-hop samples until the moment the corresponding inference sampling request arrives (lazy stitching), avoiding the continual re-computation that would be required if the results were stitched in advance.


Figure 2: Decomposition of the two-hop sampling query


Figure 3: Event-driven update
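Combining per-hop decomposition with lazy stitching, a heavily simplified, hypothetical sketch of the request path could look like the following: each hop's pre-sampled results are kept fresh independently by update events, and the two-hop result is only assembled when a request arrives. The in-memory dictionaries and names here are illustrative, not the DGS storage layout.

```python
# hop1[user_id] holds pre-sampled items for a user (first-hop reservoir output);
# hop2[item_id] holds pre-sampled items for an item (second-hop reservoir output).
# Both are refreshed by graph update events, never by inference requests.
hop1 = {}  # user_id -> list of item_ids
hop2 = {}  # item_id -> list of item_ids

def serve_two_hop(user_id):
    """Assemble the two-hop sample only at request time (lazy stitching), so
    hop updates never trigger cascading re-computation of joined results."""
    first_hop = hop1.get(user_id, [])
    second_hop = {item: hop2.get(item, []) for item in first_hop}
    return {"user": user_id, "hop1": first_hop, "hop2": second_hop}
```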

  4. Subscribe-publish mechanism

We delay multi-hop stitching until the request arrives. However, the multi-hop results are often stored on different shards, and cross-machine access would introduce substantial network communication overhead. Therefore, DGS designs a subscribe-publish mechanism: the requested IDs are routed to the corresponding serving machine according to a specific sharding algorithm, and that machine subscribes to the updates of these IDs and their multi-hop neighbors. As neighbor relationships change, the subscription table is updated accordingly.
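A hedged sketch of how such a subscription table might be maintained is shown below; it is a simplification, not the actual DGS implementation. When a vertex enters or leaves the one-hop sample of a source vertex, the Serving Workers interested in the source are added to or removed from the new neighbor's subscriber list, so its future one-hop updates get published to them.

```python
from collections import defaultdict

# subscribers[v] = set of Serving Worker ids that need v's one-hop results.
# A real implementation would need reference counting, since several source
# vertices may subscribe the same worker to the same neighbor.
subscribers = defaultdict(set)

def on_hop1_change(added, removed, serving_workers_of_source):
    """Called when the one-hop sample of a source vertex changes: newly sampled
    neighbors are subscribed to by the source's Serving Workers, dropped
    neighbors are unsubscribed."""
    for v_j in added:
        subscribers[v_j] |= serving_workers_of_source
    for v_j in removed:
        subscribers[v_j] -= serving_workers_of_source

def publish_hop1_result(v_j, result, send):
    """Push v_j's refreshed one-hop sample to every subscribed Serving Worker."""
    for worker in subscribers[v_j]:
        send(worker, v_j, result)
```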

  5. Read-write isolation

With the above design, when a sampling request arrives, DGS routes it to the designated worker, which runs a local lookup to obtain the multi-hop sampling results. To prioritize read latency while still bounding write staleness, DGS schedules read and write tasks with different priorities. In addition, at the architecture level, the tasks that frequently compute and update storage (writes) and the tasks that respond to sampling requests (reads) are placed on different machines, isolating reads from writes.
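As a schematic illustration of the scheduling idea only (not DGS's scheduler), one simple way to prioritize reads over writes within a worker is a priority queue:

```python
import itertools
import queue

READ, WRITE = 0, 1          # smaller number = higher priority
_seq = itertools.count()    # tie-breaker so task payloads are never compared
tasks = queue.PriorityQueue()

def submit_read(request):
    tasks.put((READ, next(_seq), request))   # inference sampling reads go first

def submit_write(update):
    tasks.put((WRITE, next(_seq), update))   # storage/cache updates yield to reads

def worker_loop(handle):
    while True:
        _, _, task = tasks.get()
        handle(task)
        tasks.task_done()
```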

System Architecture

The core component architecture of the DGS system is shown in the figure below; the main components are the Sampling Workers and the Serving Workers.


Figure 4: DGS system core architecture

Graph updates are partitioned by key (such as vertex ID) and sent to the corresponding partitions of the Sampling Workers. Each Sampling Worker is responsible for a specific partition: it performs one-hop pre-sampling and sends the results to the Serving Workers. Each Serving Worker caches the results of the K one-hop queries received from the Sampling Workers and responds to inference sampling requests for the vertices of a specific partition of the whole graph.
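For illustration only, key-based routing could be as simple as hash partitioning; the concrete sharding algorithm and worker interface (`EdgeUpdate`, `apply`) below are hypothetical, since the text only requires that routing be deterministic per key.

```python
from dataclasses import dataclass

@dataclass
class EdgeUpdate:
    src_id: int     # partition key: the source vertex of the updated edge
    dst_id: int
    attrs: dict

def route(vertex_id: int, num_partitions: int) -> int:
    """Deterministic key-based sharding; plain hash/modulo used purely as an example."""
    return hash(vertex_id) % num_partitions

def dispatch(update: EdgeUpdate, sampling_workers):
    """Send the update to the Sampling Worker owning the source vertex's partition."""
    owner = sampling_workers[route(update.src_id, len(sampling_workers))]
    owner.apply(update)  # hypothetical worker interface: pre-sample, then publish
```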

Sampling Workers and Serving Workers can be scaled elastically and independently to cope with load changes in graph updates and inference requests. To minimize the latency of producing complete K-hop sampling results, DGS pushes all K-hop sampling results of a vertex $V_{i}$ in advance to the Serving Worker that serves inference requests for $V_{i}$, so that K-hop graph sampling is turned into operations that only access the local cache on the Serving Worker. To achieve this, each Sampling Worker maintains a subscription table for each one-hop query, recording the list of Serving Workers that subscribe to the results of that one-hop query. For example, the addition or removal of a vertex $V_{j}$ in the one-hop sample of $V_{i}$ triggers a message to the Sampling Worker of the partition containing $V_{j}$, which updates the subscription information of $V_{j}$ accordingly.

Through this design, DGS exhibits very stable latency under high-concurrency inference sampling load.

Performance

Experiments on real Alibaba e-commerce datasets show that DGS keeps the P99 latency of inference requests (two-hop random sampling queries) within 20 ms, with a single Serving Worker sustaining about 20,000 QPS and scaling linearly. Graph update throughput reaches 109 MB/s and also scales linearly.


Figure 5: Experimental configuration and performance data

Conclusion

This article gives a technical interpretation of DGS and introduces the design ideas behind its core model. In fact, as a service, DGS also includes a service bootstrap module, a high-availability module, a data loading module, and a client that connects to the model service. With DGS, users can run inference on the graph structure and features as they change in real time and obtain the most up-to-date graph representations. We provide an end-to-end tutorial covering GraphLearn-based training, model deployment, and DGS-based online inference. Welcome to try it out! For more details, please refer to the source code and technical documentation.
