[ASPLOS 2023] uGrapher: a unified graph operator abstraction for graph neural networks that greatly improves computing performance

Author: Zhou Yangjie, Shen Wenting

Overview

Recently, the paper "uGrapher: High-Performance Graph Operator Computation via Unified Abstraction for Graph Neural Networks", co-authored by the Alibaba Cloud machine learning platform PAI team and Prof. Jingwen Leng's group at Shanghai Jiao Tong University, was accepted by ASPLOS 2023.

To address the performance problems of the static kernels that current graph neural network frameworks use for different graph operators on different graph data, uGrapher abstracts all graph operators into a unified intermediate representation, thereby decoupling their computation from their scheduling. On top of this abstraction, it defines a design space for optimizing graph operators on the GPU and adaptively generates parallel execution strategies for dynamically varying graph operators and graph data, providing high-performance computing support for graph operators in graph neural networks. Compared with DGL [1], PyG [2], and GNNAdvisor [3], uGrapher achieves an average performance improvement of 3.5x.

Background

In recent years, Graph Neural Networks (GNNs) have attracted extensive attention from academia and industry due to their powerful ability to learn and reason over graph structures in non-Euclidean spaces. GNNs combine DNN-based feature transformation with graph-based operations to propagate and aggregate information along the graph structure. Existing GNN frameworks such as DGL and PyTorch-Geometric (PyG) extend DNN frameworks (such as TensorFlow and PyTorch) and introduce the concept of "messages", intermediate values of the feature vectors associated with each edge. Any operation on the graph \(G=(V,E)\) can be divided into three stages according to the data it touches and the direction of data movement, namely message creation, message aggregation, and feature update, formulated as follows:

\[
\begin{aligned}
\text{message creation:}\quad & m_e \leftarrow \psi(h_u,\ h_v,\ m_e), && \forall\, (u, e, v) \in E\\
\text{message aggregation:}\quad & a_v \leftarrow \rho(\{\, m_e \mid (u, e, v) \in E \,\}), && \forall\, v \in V\\
\text{feature update:}\quad & h_v \leftarrow \phi(h_v,\ a_v), && \forall\, v \in V
\end{aligned}
\]

Here, \(u\) and \(v\) are vertex indices and \(e\) is the index of the edge between \(u\) and \(v\); \(h_v\) is the feature vector of vertex \(v\), \(m_e\) is the message on edge \(e\), and \(a_v\) is the aggregation result at vertex \(v\).

uGrapher defines operators that need to traverse the input graph structure as graph operators. Graph operators fall into three categories: "message creation", "message aggregation", and "fused aggregation". "Fused aggregation" means that when the message creation step is a simple copy, it can be fused with message aggregation to avoid redundant memory accesses; DGL and PyG both adopt this fusion optimization.

Taking the GAT model as an example, it contains several graph operators with different computation patterns. The first is a "message creation" operation and is very lightweight: it adds the features of the source and destination vertices of each edge to form the message used to compute the attention weight. The second is a "fused aggregation" operation: it first copies the features from the source vertex, then multiplies them edge by edge with the attention weights, and finally aggregates the transformed edge messages into new vertex features. The second operation is far more computationally intensive than the first.
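To make the two operators concrete, the following is a minimal sequential C++ sketch (illustrative only, not uGrapher's or DGL's actual implementation), assuming a COO edge list `src`/`dst`, row-major vertex features `h` of width `F`, and a per-edge scalar attention weight `att`:

```cpp
#include <cstddef>
#include <vector>

// Operator 1: "message creation" -- add source and destination vertex features
// per edge; the result feeds the attention-weight computation.
void gat_message_creation(const std::vector<int>& src, const std::vector<int>& dst,
                          const std::vector<float>& h, int F,
                          std::vector<float>& msg /* num_e * F */) {
    for (std::size_t e = 0; e < src.size(); ++e)
        for (int f = 0; f < F; ++f)
            msg[e * F + f] = h[src[e] * F + f] + h[dst[e] * F + f];
}

// Operator 2: "fused aggregation" -- copy the source feature, scale it by the
// per-edge attention weight, and sum the results into the destination vertex.
void gat_fused_aggregation(const std::vector<int>& src, const std::vector<int>& dst,
                           const std::vector<float>& h, const std::vector<float>& att,
                           int F, std::vector<float>& out /* num_v * F, zero-initialized */) {
    for (std::size_t e = 0; e < src.size(); ++e)
        for (int f = 0; f < F; ++f)
            out[dst[e] * F + f] += att[e] * h[src[e] * F + f];
}
```

The key difference is that the fused aggregation accumulates into destination-vertex rows shared by many edges, which is the source of the irregular memory behavior discussed below.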

Due to the irregular memory behavior caused by the graph structure, coupled with the complex arithmetic calculations in these graph operators, high-performance computing of graph operators in graph neural networks becomes an important challenge.

Existing graph neural network frameworks rely on handwritten static kernels to implement graph operator computation. However, as graph neural network algorithms evolve, the variability and complexity of graph operators keep increasing, making it difficult for static kernels to maintain good performance. This paper therefore explores how to optimize graph operator computation across changing graph data and graph models.

Challenges

(1) Graph neural networks introduce two major characteristics, the complexity of graph operators and the variability of graph data, which make graph operator computation and optimization difficult.

The table below categorizes the 160 graph operators supported by DGL by their input and output data types. Even with the same input or output data types, graph operators can perform different computation patterns. This complexity makes it difficult for any static approach to provide high-performance support for all graph operator computations.

[Table: the 160 graph operators supported by DGL, categorized by input and output data types]

Real-world graph datasets vary greatly. The scale of the graph (the number of vertices and edges), the balance of the graph (the standard deviation of the number of non-zeros per row of the adjacency matrix), and the feature and class sizes all vary significantly across graphs. These differences affect the memory footprint and computational complexity of graph operators.
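As a small illustration, the balance metric mentioned above (the exact definition used in the paper may differ slightly) can be computed from a CSR row pointer as follows:

```cpp
#include <cmath>
#include <vector>

// Sketch: "balance" of a graph as the standard deviation of the number of
// non-zeros per row of the adjacency matrix, i.e. of the vertex degrees.
double graph_balance(const std::vector<int>& row_ptr) {
    int n = static_cast<int>(row_ptr.size()) - 1;        // number of vertices
    double mean = static_cast<double>(row_ptr[n]) / n;   // average degree
    double var = 0.0;
    for (int v = 0; v < n; ++v) {
        double d = (row_ptr[v + 1] - row_ptr[v]) - mean;
        var += d * d;
    }
    return std::sqrt(var / n);
}
```

A large value indicates a skewed degree distribution, which is exactly the case where a single static kernel tends to leave the GPU underutilized.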

(2) Due to the lack of systematic optimization methods, the underlying CUDA kernels used by existing GNN frameworks suffer from inefficiency and inflexibility.

DGL implements the message-passing programming interface above by calling static CUDA kernels, and these static kernels cannot adapt to changing computing scenarios. For example, on unbalanced graphs, low GPU utilization wastes hardware resources; on small graphs, GPU performance is often limited by parallelism, while on large graphs memory bandwidth becomes the bottleneck due to poor locality. These metrics also vary across different graph operators.

[Figure: performance of DGL's static kernels across different graphs and graph operators]

Breakthrough

uGrapher uses nested loops as the scheduling representation of graph operators, and lets users customize the input tensors and the functions applied at different stages to express different graph operators.

The figure below shows the details of uGrapher's unified abstraction for graph operators in graph neural networks.

[Figure: uGrapher's unified nested-loop abstraction for graph operators]

edge_op provides the functional representation of the memory accesses and computation on each edge, and gather_op provides the functional representation of the edge-to-vertex reduction. In addition, there are three input tensors, each of which can be the source vertex embedding tensor (Src_V), the destination vertex embedding tensor (Dst_V), the edge embedding tensor (Edge), or NULL. The tensor type also determines the addressing mode used in the loop computation (lines 10 to 12 in the figure).
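As an illustration of this abstraction (a sketch with assumed names, not uGrapher's actual code), the nested loops can be written in C++ with edge_op and gather_op passed in as callables and the graph stored in CSR form:

```cpp
#include <functional>
#include <vector>

// CSR graph: incoming edges of destination vertex v are row_ptr[v] .. row_ptr[v+1]-1,
// and col_idx[e] gives the source vertex of edge e.
struct CSRGraph {
    std::vector<int> row_ptr, col_idx;
    int num_v = 0;
};

// edge_op: computes one feature element of the per-edge message from the
// (possibly NULL) Src_V / Dst_V / Edge inputs.
using EdgeOp = std::function<float(const float* src_v, const float* dst_v,
                                   const float* edge, int f)>;
// gather_op: reduces a per-edge value into the per-vertex accumulator.
using GatherOp = std::function<float(float acc, float msg)>;

void graph_op(const CSRGraph& g, int F,
              const float* src_v, const float* dst_v, const float* edge,
              EdgeOp edge_op, GatherOp gather_op, float init,
              float* out /* num_v * F */) {
    for (int v = 0; v < g.num_v; ++v) {                              // destination vertices
        for (int f = 0; f < F; ++f) {                                // feature dimension
            float acc = init;
            for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {  // incoming edges of v
                int u = g.col_idx[e];
                // The addressing mode depends on which input tensors are non-NULL.
                float msg = edge_op(src_v ? src_v + u * F : nullptr,
                                    dst_v ? dst_v + v * F : nullptr,
                                    edge  ? edge  + e * F : nullptr, f);
                acc = gather_op(acc, msg);
            }
            out[v * F + f] = acc;
        }
    }
}
```

For instance, a plain sum aggregation corresponds to gather_op(acc, msg) = acc + msg with init = 0, while GAT's fused aggregation uses an edge_op that multiplies the source feature by the per-edge attention weight.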

The following formula formally defines the unified abstraction of uGrapher, where \(\psi\) is the edge_op function and \(\rho\) is the gather_op function. This abstraction captures the full semantics of graph operators, including their computation and memory movement patterns.

\[
\forall\, v \in V:\quad h_v = \rho\big(\{\, \psi(h_u,\ h_v,\ m_e) \mid (u, e, v) \in E \,\}\big)
\]

Based on this unified abstraction of graph operators, uGrapher constructs a design space for operator optimization to achieve high-performance graph operator execution.

uGrapher uses locality, parallelism, and work efficiency as the performance metrics of graph operators on the GPU. Applying tiling or blocking techniques to the nested loops can improve locality; launching more threads, warps, or thread blocks can improve parallelism; work efficiency is defined as the reciprocal of overhead: different execution strategies of the same operator may introduce extra computation, such as address calculation, and edge-parallel computation on shared vertices may require atomic instructions.

There are two classic parallelization strategies in existing graph processing systems: thread-vertex and thread-edge parallelism. The former reduces parallelism but improves output data reuse and locality; the latter reduces work efficiency because atomic update operations may be required.

Since vertex/edge features in GNNs are vectors, GNNs add parallelization strategies along the feature dimension, namely warp-vertex and warp-edge. Compared with the thread-vertex/edge strategies, these launch more warps and thereby increase parallelism; however, they also hurt locality because the cache capacity available to each warp is reduced.
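To make the trade-off concrete, here is a minimal CUDA sketch (illustrative only, assuming a sum aggregation out[v] += h[u] over incoming edges) of the two thread-level strategies; the warp-vertex/warp-edge variants additionally spread the feature loop across the lanes of a warp:

```cuda
// thread-vertex: one thread handles all incoming edges of one destination vertex
// (CSR layout). No atomics are needed (good work efficiency and output reuse),
// but parallelism is limited and load is unbalanced on skewed graphs.
__global__ void agg_thread_vertex(const int* row_ptr, const int* col_idx,
                                  const float* h, float* out, int num_v, int F) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_v) return;
    for (int f = 0; f < F; ++f) {
        float acc = 0.f;
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            acc += h[col_idx[e] * F + f];
        out[v * F + f] = acc;
    }
}

// thread-edge: one thread handles one edge (COO layout). Parallelism is high and
// well balanced, but edges sharing a destination vertex must accumulate with
// atomics, which lowers work efficiency.
__global__ void agg_thread_edge(const int* src, const int* dst,
                                const float* h, float* out, int num_e, int F) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_e) return;
    for (int f = 0; f < F; ++f)
        atomicAdd(&out[dst[e] * F + f], h[src[e] * F + f]);
}
```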

Therefore, no single strategy can improve all three metrics at the same time. Based on the unified IR above, uGrapher designs a unified high-performance computing interface to explore the optimization space and make performance trade-offs. The overall architecture is shown in the figure below.

[Figure: overall architecture of uGrapher]

The design of uGrapher's unified high-performance computing interface for graph operators is shown in the figure below.

[Figure: uGrapher's unified high-performance computing interface for graph operators]

The uGrapher interface takes three parameters: graph_tensor, which represents the graph data; op_info, which passes the computation information such as edge_op, gather_op, and the input tensors; and parallel_info, which specifies the parallelization strategy.
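A hypothetical sketch of what such a call could look like is shown below; every name here is an assumption made for illustration, not uGrapher's real API (see the paper for the actual interface):

```cpp
// Hypothetical C++ sketch of the three-parameter interface (all names assumed).

// Per-edge function and edge-to-vertex reduction, as in the nested-loop sketch above.
typedef float (*EdgeOpFn)(const float* src_v, const float* dst_v, const float* edge, int f);
typedef float (*GatherOpFn)(float acc, float msg);

enum class ParallelStrategy { ThreadVertex, ThreadEdge, WarpVertex, WarpEdge };

struct GraphTensor {                    // graph data: topology plus feature tensors
    const int *row_ptr, *col_idx;       // CSR topology
    const float *src_v, *dst_v, *edge;  // input tensors; any of them may be nullptr
    int num_v, num_e, feat_dim;
};

struct OpInfo {                         // computation information of the operator
    EdgeOpFn edge_op;
    GatherOpFn gather_op;
};

struct ParallelInfo {                   // parallelization strategy and launch parameters;
    ParallelStrategy strategy;          // may be left unspecified so the learned decision
    int block_size;                     // model (see below) picks a strategy automatically
};

// Entry point: selects and launches a kernel template specialized for the strategy.
void ugrapher_compute(const GraphTensor& graph_tensor, const OpInfo& op_info,
                      const ParallelInfo& parallel_info, float* out);
```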

uGrapher's interface design separates operator computation, graph data, and parallelization strategy, so users can choose execution strategies manually or via their own heuristics for different operators and graph structures. When the user does not specify a parallelization strategy, uGrapher uses a decision model trained with LightGBM [4] to select the best strategy from the parallelization space, providing dedicated, near-optimal computation scheduling for all graph operators in graph neural networks across different GPU architectures and graph datasets. uGrapher implements a CUDA kernel template for each parallelization strategy, reserves device-function interfaces for each graph operator, and performs end-to-end code generation, including operator fusion and device-function generation, to achieve both flexibility and efficiency. Please read our ASPLOS 2023 paper for more details.
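As a rough illustration of the kernel-template idea (assumed structure only, not the code uGrapher actually generates), a strategy-specific kernel template can take the operator's device functions as template parameters, so one template serves many graph operators:

```cuda
// Example device functors that a generated graph operator might plug in.
struct CopyMulEdgeOp {        // e.g. GAT fused aggregation: per-edge weight * source feature
    __device__ float operator()(const float* src_v, const float* edge, int f) const {
        return edge[0] * src_v[f];
    }
};
struct SumGatherOp {          // accumulate into the destination vertex with an atomic add
    __device__ void operator()(float* dst, float val) const { atomicAdd(dst, val); }
};

// thread-edge kernel template; other parallelization strategies get their own templates.
template <typename EdgeOp, typename GatherOp>
__global__ void thread_edge_template(const int* src, const int* dst,
                                     const float* h, const float* edge_feat,
                                     float* out, int num_e, int F,
                                     EdgeOp edge_op, GatherOp gather_op) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_e) return;
    const float* s  = h + src[e] * F;
    const float* ef = edge_feat + e;      // per-edge scalar in this example
    float* o = out + dst[e] * F;
    for (int f = 0; f < F; ++f)
        gather_op(o + f, edge_op(s, ef, f));
}

// Host-side launch for this strategy, e.g.:
// thread_edge_template<<<grid, block>>>(src, dst, h, att, out, num_e, F,
//                                       CopyMulEdgeOp{}, SumGatherOp{});
```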

Currently, Alibaba Cloud is integrating the key designs of uGrapher into GraphLearn, PAI's self-developed large-scale graph neural network framework, to bring performance acceleration to industrial-scale graph neural network applications.

PAI is recruiting interns on a long-term basis. If you are interested in distributed deep learning training frameworks, distributed graph neural network training frameworks, or computation and communication optimization, please send your resume to [email protected] or baole. [email protected]

Paper information

  • Paper title:

uGrapher: High-Performance Graph Operator Computation via Unified Abstraction for Graph Neural Networks

  • Paper authors:

Zhou Yangjie, Leng Jingwen, Song Yaoxu, Lu Shuwen, Wang Mian, Li Chao, Guo Minyi, Shen Wenting, Li Yong, Lin Wei, et al.

  • Paper pdf link:

https://dl.acm.org/doi/10.1145/3575693.3575723

  • References:

[1] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai et al., “Deep graph library: A graph-centric, highly-performant package for graph neural networks,” arXiv preprint arXiv:1909.01315, 2019.

[2] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch geometric,” arXiv preprint arXiv:1903.02428, 2019.

[3] Y. Wang, B. Feng, G. Li, S. Li, L. Deng, Y. Xie, and Y. Ding, “GNNAdvisor: An adaptive and efficient runtime system for GNN acceleration on GPUs,” in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 515–531.

[4] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems 30, 2017.
