Tencent's Angel Graph team sets a new record on the OGB leaderboard for graph neural networks!


Introduction

Recently, in the internationally recognized OGB (Open Graph Benchmark) graph learning challenge, the Tencent Big Data Angel Graph team, together with the Peking University-Tencent Collaborative Innovation Lab, took first place on the three largest OGB node classification datasets: ogbn-papers100M, ogbn-products, and ogbn-mag!


OGB is currently the most authoritative benchmark for evaluating the general performance of graph learning methods. It was established and open-sourced by Professor Jure Leskovec's team at Stanford University, and has attracted participation from top universities and technology companies including Stanford, Cornell, Facebook, NVIDIA, Baidu, Alibaba, and ByteDance. Its datasets come from a wide range of sources, covering fields such as biological networks, molecular graphs, academic networks, and knowledge graphs, as well as the basic graph learning tasks of node prediction, edge prediction, and graph prediction. The data is realistic and challenging. Known as the "ImageNet" of graph neural networks, OGB has become the touchstone on which graph learning researchers worldwide test their methods.

1. Background of the problem

Graph-structured data is ubiquitous in real life. As shown in Figure 1, graphs can represent the interactions between users and products in a recommender system, the relationships between users or entities in social networks and knowledge graphs, and the structure of various drugs and new materials. With the rise of deep learning, researchers began applying it to graph data, driving vigorous development of research in this field. As a key technique for graphs, the graph neural network (GNN) has become an important class of algorithms for solving graph problems and has received broad attention from both academia and industry.


Figure 1 Common graph data types

2. Technical difficulties

Although GNNs have achieved great success in many application scenarios, two problems remain when applying them to large-scale industrial datasets:

1. Low scalability

1.1 Single-machine storage problem (High Memory Cost)

A traditional GNN layer performs two operations: feature propagation and nonlinear transformation. Feature propagation involves a large multiplication between the sparse adjacency matrix and the feature matrix, which is time-consuming; moreover, if GPU acceleration is used during training, the sparse adjacency matrix must reside in GPU memory. For large graphs this requirement cannot be met. For example, the sparse adjacency matrix of ogbn-papers100M, the largest dataset in the OGB node classification track, exceeds 50GB; currently only an A100 with 80GB of memory could hold it, and that does not even account for the space needed to store features and the model. Therefore, to apply traditional GNN models such as GCN [1] on very large graphs, the only feasible approach is distributed training.
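To make the memory pressure concrete, the sketch below (an illustration with assumed toy sizes, not the Angel Graph implementation or the real OGB data) builds a random sparse adjacency matrix in CSR format and performs one round of feature propagation as a single sparse-dense multiplication:

```python
# Minimal sketch of GCN-style feature propagation as a sparse-dense product.
# Sizes are assumptions for illustration; real graphs are loaded from disk.
import numpy as np
import scipy.sparse as sp

num_nodes, num_edges, feat_dim = 100_000, 2_000_000, 128  # assumed toy sizes

rows = np.random.randint(0, num_nodes, size=num_edges)
cols = np.random.randint(0, num_nodes, size=num_edges)
vals = np.ones(num_edges, dtype=np.float32)
adj = sp.csr_matrix((vals, (rows, cols)), shape=(num_nodes, num_nodes))

features = np.random.randn(num_nodes, feat_dim).astype(np.float32)

# One feature propagation step; for GPU training both `adj` and `features`
# would have to fit in device memory at the same time.
propagated = adj @ features

# Rough footprint of the CSR adjacency alone (values + indices + indptr).
adj_bytes = adj.data.nbytes + adj.indices.nbytes + adj.indptr.nbytes
print(f"adjacency ~{adj_bytes / 2**20:.1f} MiB, features ~{features.nbytes / 2**20:.1f} MiB")
```

On ogbn-papers100M the same CSR structure exceeds 50GB, which is why the adjacency cannot simply be moved onto a single GPU.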

1.2 Communication problem of distributed training (High Communication Cost)


Figure 2 Speedup and training bottleneck of two-layer GraphSAGE when increasing the number of workers

Each feature propagation step of a GNN needs to pull neighbor features. For a k-layer GNN, the number of neighbors within k hops that each node must pull grows exponentially with the number of layers. For densely connected graphs, each node may need to pull information from almost the entire graph in every training step, leading to massive communication overhead. Some sampling algorithms alleviate this problem, but they do not solve it well. Take two-layer GraphSAGE [2] as an example: Figure 2(a) shows that as the number of workers increases, the speedup is far from ideal. This is because when the graph nodes are partitioned across multiple workers, the frequency of pulling neighbor features from other workers also increases significantly. As shown in Figure 2(b), as the number of workers grows, the communication overhead becomes significantly larger than the computation overhead. How to support distributed training on very large graphs has therefore long been a research hotspot.
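A quick way to see this fan-out is to count k-hop neighborhoods on a small synthetic graph. The sketch below uses an assumed toy regular graph (not OGB data) purely to illustrate the growth:

```python
# Counting the k-hop neighborhood a single node must pull, on a toy graph.
# The graph, its degree, and the chosen node are assumptions for illustration.
import networkx as nx

g = nx.random_regular_graph(d=10, n=100_000, seed=0)  # assumed average degree 10
node = 0
frontier = {node}
visited = {node}
for hop in range(1, 5):
    frontier = {nbr for u in frontier for nbr in g.neighbors(u)} - visited
    visited |= frontier
    print(f"{hop}-hop neighborhood size: {len(visited)}")
# The neighborhood grows nearly exponentially with the number of hops; in a
# distributed setting most of these nodes live on other workers, so every
# extra GNN layer multiplies the features that must be pulled over the network.
```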

2. Low Flexibility

2.1 The nonlinear transformation depth is constrained to equal the feature propagation depth

As shown in Figure 3, a common coupled graph neural network forces each layer to perform a nonlinear transformation followed by feature propagation; if feature propagation is removed, GCN degenerates into an MLP. Our previous evaluation of deep GNNs [3] showed that a graph neural network has two depths: the feature propagation depth $D_p$ and the nonlinear transformation depth $D_t$, which lead to the over-smoothing and model degradation problems, respectively. That work also pointed out that $D_p$ should be increased when the graph is sparse (sparse labels, features, or edges), and $D_t$ should be increased when the graph is large. If a graph is sparse but small, we need a large $D_p$ and a small $D_t$; forcing the constraint $D_p = D_t$ then leads to sub-optimal results.
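For reference, the coupled design can be written as the standard GCN layer from [1], which makes this constraint explicit:

$$H^{(l+1)} = \sigma\big(\hat{A}\, H^{(l)}\, W^{(l)}\big)$$

where $\hat{A}$ is the normalized adjacency matrix, $W^{(l)}$ the layer weights, and $\sigma$ a nonlinearity. Dropping $\hat{A}$ leaves exactly the MLP layer $H^{(l+1)} = \sigma(H^{(l)} W^{(l)})$ of Figure 3, and stacking $L$ such coupled layers forces $D_p = D_t = L$.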


Figure 3 Schematic diagram of the relationship between GCN and MLP

2.2 Node specificity is not considered


Figure 4 Example of receptive field expansion speed


Figure 5 The relationship between the prediction accuracy of different nodes and the number of feature propagation steps

We illustrate this problem with a simple example. In Figure 4, the blue node lies in a relatively dense region of the graph, while the green node lies in a relatively sparse region. After the same two feature propagation operations, the receptive field of the blue node in the dense region is already quite large, while the receptive field of the green node in the sparse region contains only 3 nodes. This example vividly shows that the receptive fields of different nodes expand at very different rates. In a GNN, the size of a node's receptive field determines how much neighborhood information it can capture. If the receptive field is too large, the node may absorb information from many irrelevant nodes; if it is too small, the node cannot capture enough neighborhood information to obtain a high-quality representation. Therefore, for node features produced by different numbers of feature propagation steps, the GNN model needs to assign weights adaptively per node. Otherwise it cannot accommodate nodes in dense regions and nodes in sparse regions at the same time, and high-quality node representations cannot be obtained.

To verify this conjecture, we randomly selected 20 nodes from the validation and test sets of the commonly used Citeseer dataset and ran node classification with the SGC [4] model under different numbers of feature propagation steps, recording over 100 consecutive runs the fraction of runs in which each node was predicted correctly. The experimental results are shown in Figure 5. They show that the prediction accuracy of different nodes follows different trends as the number of feature propagation steps increases. Some nodes, such as node 10, reach their maximum prediction accuracy after a single propagation step, after which accuracy keeps decreasing; for other nodes, such as node 4, accuracy keeps increasing as the number of propagation steps grows. This strong difference validates our conjecture: when node features with different propagation steps are fused by weighted summation, different nodes should be given different weights.

3. Scalable GNN research

Sampling: Sampling is the most common approach to scaling graph neural networks and is widely used in several distributed graph neural network systems such as PyG [5], DGL [6], and AliGraph [7]. Sampling algorithms can be divided into three types: node-wise sampling (e.g., GraphSAGE [2] and VR-GCN [8]), layer-wise sampling (e.g., FastGCN [9] and AS-GCN [10]), and graph-wise (subgraph) sampling (e.g., Cluster-GCN [11] and GraphSAINT [12]). Since sampling is orthogonal to the direction optimized in this work, we do not discuss it further here.

Model decoupling: Recent studies [4] have pointed out that the performance of GNNs comes mainly from the feature propagation operation in each layer rather than from the nonlinear transformation. This gave rise to a series of decoupled GNNs, which separate feature propagation from nonlinear transformation in the model design and complete all feature propagation in advance during preprocessing, so that no time is spent propagating features during model training. In terms of scalability, as long as the sparse-dense matrix multiplications of feature propagation can be completed during preprocessing, the subsequent model training can easily run on a single machine with a single GPU. Because feature propagation is done in preprocessing, no GPU acceleration is needed and the matrix multiplications can run entirely on CPU. In our experiments, feature propagation on the 100-million-node dataset ogbn-papers100M requires about 250GB of host memory, a requirement much easier to satisfy than 50GB of GPU memory. Decoupling the feature propagation and nonlinear transformation of traditional GNNs therefore greatly improves both the scalability and the efficiency of the model.
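The sketch below illustrates this precomputation idea in the spirit of SGC [4]; it is a minimal sketch, not the Angel Graph implementation, and the adjacency normalization follows the standard GCN form:

```python
# Decoupled preprocessing: run all feature propagation once, on CPU, so that
# training afterwards only needs cheap dense MLP operations.
import numpy as np
import scipy.sparse as sp

def normalize_adj(adj: sp.csr_matrix) -> sp.csr_matrix:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used by GCN/SGC."""
    adj = adj + sp.eye(adj.shape[0], format="csr")
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(np.power(deg, -0.5))
    return d_inv_sqrt @ adj @ d_inv_sqrt

def precompute_propagated_features(adj: sp.csr_matrix, features: np.ndarray, k: int):
    """Return [X^(0), X^(1), ..., X^(k)]: node features after 0..k propagation steps."""
    norm_adj = normalize_adj(adj)
    outs, x = [features], features
    for _ in range(k):
        x = norm_adj @ x          # one sparse-dense multiplication per step
        outs.append(x)
    return outs
```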

However, most decoupled GNNs do not account for the specificity of different nodes during training. For example, a k-step SGC [4] lets every node use only the features after the k-th propagation step. S2GC [13] and SIGN [14] do consider node features from different propagation steps, but they only average or concatenate them, so over-smoothing still appears when the number of steps is large. In the current SOTA decoupled GNN, GBP [15], the authors first obtain feature matrices with different propagation steps and then combine them with a hand-designed weighted sum. Although GBP makes better use of node features from different propagation steps, it does not consider the differences between nodes and simply gives all nodes the same weights. Closest to our idea, SAGN [16] also proposes an attention mechanism that gives different nodes different weights when fusing the propagated features of each step, but we additionally account for model degradation and for the effect of different attention mechanisms. There is also a class of decoupled models that apply the nonlinear transformation first and then propagate features, such as APPNP [17], AP-GCN [18], and DAGNN [19]. Compared with APPNP, AP-GCN and DAGNN do consider the specificity of different nodes, but they must pull the representations of neighbor nodes after the nonlinear transformation in every epoch, so their feature propagation cannot be precomputed as in the earlier decoupled methods and they still suffer from low scalability.

4. Graph Attention MLP

Our model consists of two branches. (1) As shown in Figure 6, the first branch obtains node features with different numbers of feature propagation steps during preprocessing, uses an attention mechanism to assign different weights to different nodes, and finally obtains the fused feature matrix $H$ by weighted summation. In addition, an initial residual connection is used to alleviate the model degradation problem caused by a deep MLP. (2) The second branch applies a label propagation algorithm to the training-set node labels during preprocessing, spreading the training-set label information to the full graph to obtain the label matrix $\widehat{Y}^{(k)}$ after $k$ propagation steps. The outputs of the two branches, $H$ and $\widehat{Y}^{(k)}$, are fed into two different MLPs, and the results of the two MLPs are added to obtain the model's final low-dimensional node embedding $\widetilde{H}$. For the loss function, our model uses the cross-entropy loss commonly used for node classification.


Figure 6 Schematic diagram of GAMLP feature branch

4.1 Three different attention mechanisms

To enable each node in the graph to adaptively select features from different receptive fields, we design three types of attention mechanisms that adaptively assign, for each node, different weights to node features with different numbers of feature propagation steps. They differ mainly in the choice of reference vector:

  • Smoothing attention mechanism (Smoothing): The node feature after an infinite number of propagation steps, $X^{(\infty)}_i$, is used as the reference vector $E_i$. In practice there is a closed-form expression that gives this value directly, so no actual infinite propagation is needed. This attention mechanism aims to learn the distance between node features with different propagation steps and $X^{(\infty)}_i$, and uses this distance to guide the weight selection.

  • Recursive attention mechanism (Recursive): Using a recursive computation, when calculating the weight assigned to the node feature $X^{(l)}_i$ after $l$ propagation steps, the fused node feature of the previous $(l-1)$ steps is used as the reference vector $E_i$. This attention mechanism aims to learn the information gain of $X^{(l)}_i$ over the fused feature of the previous $(l-1)$ steps, and uses this gain to guide the weight selection.

  • Jumping Knowledge attention mechanism (JK): As shown in Figure 7, we concatenate all the node features obtained through feature propagation in the preprocessing stage along the feature dimension, pass this vector through an MLP, and take the MLP output as the reference vector $E_i$. The reference vector $E_i$ thus contains information from all $K$ propagated node feature matrices. This attention mechanism aims to learn the importance of node features with different propagation steps relative to this aggregate vector $E_i$, and uses this importance to guide the weight selection.

After obtaining the reference vector $E_i$, a conventional attention computation is used to calculate the weight assigned to the node feature $X^{(l)}_i$ after $l$ propagation steps:

[Equation: attention weight $w_i^{(l)}$ computed from $X^{(l)}_i$ and the reference vector $E_i$]

where $s$ is a learnable vector and $\delta$ in our model is the sigmoid function.

After obtaining each node's weights for the different feature propagation steps, we obtain the final fused feature matrix $H$ by weighted summation.
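The sketch below shows this per-node attention fusion; it is a hedged sketch, not the official GAMLP code. The exact scoring form may differ from the paper; here we follow the generic additive form suggested by the text (learnable vector $s$, sigmoid $\delta$, reference vector $E$), and the softmax normalization over steps is an assumption:

```python
# Per-node attention over propagated features, then weighted summation to H.
import torch

def fuse_propagated_features(xs: list, ref: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """xs: [X^(0), ..., X^(K)], each (N, d); ref: E of shape (N, d);
    s: learnable vector of shape (2*d,). Returns the fused matrix H of shape (N, d)."""
    scores = []
    for x in xs:
        # delta(s^T [X^(l)_i || E_i]) for every node i  -> shape (N,)
        scores.append(torch.sigmoid(torch.cat([x, ref], dim=1) @ s))
    # Normalize per node across the K+1 propagation steps (assumed softmax form).
    w = torch.softmax(torch.stack(scores, dim=1), dim=1)          # (N, K+1)
    return sum(w[:, l: l + 1] * xs[l] for l in range(len(xs)))    # weighted sum -> H
```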


Figure 7 Illustration of knowledge jump attention mechanism

4.2 Using training set label information

To make full use of the information of the training-set nodes, we not only use their labels for the loss function during training, but also let them participate in the model's forward pass as an additional node feature. Specifically, we use the Label Propagation algorithm to spread the label information of the training-set nodes over the whole graph, obtaining the node label matrix $\widehat{Y}^{(k)}$ after $k$ steps of label propagation.
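A minimal sketch of this label propagation step is shown below; the one-hot encoding of labels and the reuse of the normalized adjacency from the earlier preprocessing sketch are assumed details for illustration:

```python
# Spread training-set labels over the graph with k propagation steps.
import numpy as np
import scipy.sparse as sp

def propagate_labels(norm_adj: sp.csr_matrix, labels: np.ndarray,
                     train_mask: np.ndarray, num_classes: int, k: int) -> np.ndarray:
    """Return the propagated label matrix (\\widehat{Y}^{(k)} in the text)."""
    y = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    y[train_mask, labels[train_mask]] = 1.0   # one-hot labels only on the training set
    for _ in range(k):
        y = norm_adj @ y                      # same sparse propagation as for features
    return y
```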

4.3 Model training

During model training, we pass the fused feature matrix $H$ and the propagated label matrix $\widehat{Y}^{(k)}$ through two different MLPs and add the results to obtain the final low-dimensional node embedding $\widetilde{H}$. The MLP for the fused feature matrix is relatively deep, and a skip connection to the original node features $X^{(0)}$ is added to reduce the negative effect of model degradation in deep networks; the MLP for the label matrix $\widehat{Y}^{(k)}$ is shallow, and its main function is to map $\widehat{Y}^{(k)}$ into the space of the fused feature matrix $H$. During training, we use the conventional cross-entropy loss for node classification.
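The following sketch shows how the two branches could be combined; the layer sizes and the exact placement of the initial residual are assumptions for illustration, not the official architecture:

```python
# Two-branch head: deep MLP on H with a skip from X^(0), plus a shallow MLP on Y^(k).
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        # Deep MLP for the fused feature matrix H.
        self.feat_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, num_classes))
        self.skip = nn.Linear(feat_dim, num_classes)       # initial residual from X^(0)
        # Shallow MLP mapping the propagated label matrix into the same space.
        self.label_mlp = nn.Linear(num_classes, num_classes)

    def forward(self, h, x0, y_prop):
        return self.feat_mlp(h) + self.skip(x0) + self.label_mlp(y_prop)
```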

4.4 RLU (Reliable Label Utilization)

To further improve model performance, we divide the whole training process into several consecutive stages; the prediction results of an earlier stage provide additional input features and supervision signals for later stages. Reliable label utilization first requires determining a set of high-confidence nodes. We introduce a hyperparameter $\epsilon$ as a threshold: if, in the predictions of the previous training stage, the predicted probability of a node's class exceeds $\epsilon$, the node is added to the set of high-confidence nodes.

Note that we only consider nodes in the validation and test sets during this selection, because training-set nodes are already guided by strong ground-truth label information during training. In our model, the high-confidence node predictions from the previous training stage are used in two ways.
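A small sketch of the selection step is shown below; `probs`, `eval_mask`, and the default threshold value are illustrative assumptions:

```python
# Select validation/test nodes whose previous-stage prediction is confident.
import torch

def select_confident_nodes(probs: torch.Tensor, eval_mask: torch.Tensor,
                           epsilon: float = 0.9) -> torch.Tensor:
    """probs: soft predictions of the previous stage (num_nodes, num_classes);
    eval_mask: boolean mask of validation/test nodes; returns a boolean mask."""
    confident = probs.max(dim=1).values > epsilon
    return confident & eval_mask   # only validation/test nodes, never training nodes
```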

The first part is enhancement of the input features of the model's second branch. Starting from the second stage, we fill the label matrix $\widehat{Y}^{(0)}$ with the soft predictions of the high-confidence nodes before running label propagation. In this way, the prediction information these nodes acquired in the previous training stage is injected into $\widehat{Y}^{(k)}$ and the input features are enhanced.

The second part is enhancement during model training. Starting from the second stage, the forward pass takes both the training-set nodes and the high-confidence nodes of the previous training stage. The training-set nodes only participate in the cross-entropy loss, while for the high-confidence nodes we compute the KL divergence between the current model's output and the soft predictions produced for them in the previous training stage.

The added KL divergence loss distills the prediction information of these high-confidence nodes from the previous training stage into the current model, strengthening the current model's predictions on these nodes and improving its accuracy on the validation and test sets. The results on the three large datasets show that this reliable label utilization mechanism effectively improves performance on both the validation and test sets.
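A hedged sketch of the combined objective from the second stage onward is given below; the weighting factor `alpha` between the two terms is an assumed detail:

```python
# Cross-entropy on training nodes plus KL distillation toward the previous
# stage's soft predictions on high-confidence validation/test nodes.
import torch
import torch.nn.functional as F

def staged_loss(logits, labels, train_mask, prev_soft_preds, confident_mask,
                alpha: float = 1.0):
    ce = F.cross_entropy(logits[train_mask], labels[train_mask])
    kl = F.kl_div(F.log_softmax(logits[confident_mask], dim=1),
                  prev_soft_preds[confident_mask], reduction="batchmean")
    return ce + alpha * kl   # alpha balancing the two terms is an assumption
```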

5. Experimental results

5.1 Dataset Description

[Table: statistics of the three OGB node classification datasets]

This time, Tencent participated in the node property prediction task, the most challenging and competitive of the three OGB tracks. ogbn-papers100M, ogbn-products, and ogbn-mag are the three largest of the five datasets in this track and are of great business value. Among them, ogbn-papers100M is the largest known open-source graph dataset, with more than 110 million nodes and 1.6 billion edges, and ogbn-mag is the largest heterogeneous graph in the track.

5.2 Analysis of experimental results

[Leaderboard results on ogbn-papers100M, ogbn-products, and ogbn-mag]

As the results show, our proposed GAMLP+RLU achieves first place on all three of the largest OGB classification datasets. In particular, on the heterogeneous graph ogbn-mag our result outperforms the second place by a significant margin of 1.5%.

6. Application prospects

6.1 Solving the large-scale graph training problem

Similar to SGC, our proposed GAMLP allows a single machine to train on very large graphs with more than 100 million nodes and over a billion edges. In distributed training there is no communication overhead for pulling features; only gradient information needs to be exchanged. As the number of workers increases, a speedup close to that of training a plain MLP can be achieved, which makes GAMLP well suited to ultra-large-scale industrial graph applications.

6.2 Solving the graph sparsity problem


Figure 8 Performance of SGC at different depths with feature/edge/label sparse

Our proposed GAMLP is well suited to the graph sparsity scenarios that are widespread in the real world. As shown in Figure 8 (from our previous deep GNN evaluation work [3]), when the graph has sparse features (e.g., new users in a recommender system with no feature information), sparse labels (many nodes are unlabeled), or sparse edges (very few edges in the graph), deeper graph structural information is needed. By assigning per-node weights to node features with different feature propagation steps, GAMLP can indirectly adjust the most suitable depth for each node.

7. About us

7.1 Angel Graph graph computing team


Figure 9 Angel Graph system architecture

Angel Graph is a high-performance graph computing framework developed by Tencent. It combines the advantages of the Angel parameter server, Spark, and PyTorch, unifying traditional graph computing, graph representation learning, and graph neural networks into a single large-scale distributed graph computing framework that is high-performance, highly reliable, and easy to use.

Angel Graph has the following core capabilities:

  • Complex heterogeneous networks. Industrial graph data is complex and diverse, often with billions of vertices and tens or even hundreds of billions of edges. Angel Graph uses Spark on Angel or PyTorch for distributed training and can easily support large-scale graph computation with billions of vertices and hundreds of billions of edges.

  • End-to-end graph computation. The big-data ecosystem in industry is built mostly on Spark and Hadoop. Angel Graph is based on the Spark on Angel architecture and connects seamlessly with Spark, leveraging Spark's ETL capabilities to support end-to-end graph learning.

  • Traditional graph mining. Supports traditional graph algorithms on billions of vertices and hundreds of billions of edges, such as K-core and PageRank for analyzing node importance and Louvain for community detection. Provides vertex-level metric analysis and rich graph features for use in downstream business models such as machine learning, recommendation, and risk control.

  • Graph representation learning. Supports graph embedding algorithms on billions of vertices and hundreds of billions of edges, such as LINE and Word2Vec.

  • Graph neural networks. Supports graph neural network algorithms on billions of vertices and tens of billions of edges, leveraging the rich attribute information on vertices and edges for deep learning.

7.2 Peking University-Tencent Collaborative Innovation Lab

The Peking University-Tencent Collaborative Innovation Lab was established in 2017. It mainly carries out cutting-edge exploration and talent training in the fields of artificial intelligence and big data, and builds an internationally leading school-enterprise cooperative scientific research platform and an industry-university-research base.

Through cooperative research, the Collaborative Innovation Lab has made important achievements and progress in theoretical and technological innovation, system research and development, and industrial applications, publishing more than 20 papers in top international academic conferences and journals. In addition to jointly developing Angel, the lab has also independently developed several open-source systems, such as:

Black box optimization system OpenBox

https://github.com/PKU-DAIR/open-box

AutoML System Mindware

https://github.com/PKU-DAIR/mindware

Distributed deep learning system Hetu

https://github.com/PKU-DAIR/Hetu

To learn more about GAMLP, please visit the link below↓

GAMLP paper:

https://arxiv.org/abs/2108.10097

GAMLP code:

https://github.com/PKU-DAIR/GAMLP

Angel project code:

https://github.com/Angel-ML/angel

This work will be deployed on Angel Graph soon, and everyone is welcome to try it!

References:

[1] Thomas N. Kipf, and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.

[2] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. NIPS 2017, 1024–1034.

[3] Zhang W, Sheng Z, Jiang Y, et al. Evaluating Deep Graph Neural Networks[J]. arXiv preprint arXiv:2108.00955, 2021.

[4] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. ICML 2019.

[5] Matthias Fey and Jan E. Lenssen. Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop 2019.

[6] Wang M, Yu L, Zheng D, et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs[J]. 2019.

[7] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. AliGraph: A comprehensive graph neural network platform. Proc. VLDB Endow., vol. 12, no. 12, pp. 2094–2105, Aug. 2019.

[8] J. Chen, J. Zhu, and L. Song. Stochastic training of graph convolutional networks with variance reduction. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018.

[9] J. Chen, T. Ma, and C. Xiao. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In 6th International Conference on Learning Representations, ICLR 2018.

[10] W. Huang, T. Zhang, Y. Rong, and J. Huang. Adaptive sampling towards fast graph representation learning. In NIPS 2018.

[11] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In SIGKDD 2019.

[12] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. K. Prasanna. Graphsaint: Graph sampling based inductive learning method. In ICLR 2020.

[13] H. Zhu and P. Koniusz. Simple spectral graph convolution. In ICLR, 2021.

[14] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti. Sign: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.

[15] Ming Chen, Zhewei Wei, Bolin Ding, Yaliang Li, Ye Yuan, Xiaoyong Du, Ji-Rong Wen. Scalable Graph Neural Networks via Bidirectional Propagation. NeurIPS 2020.

[16] Sun C, Wu G. Scalable and Adaptive Graph Neural Networks with Self-Label-Enhanced training[J]. arXiv preprint arXiv:2104.09376, 2021.

[17] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In  ICLR 2019.

[18] I. Spinelli, S. Scardapane, and A. Uncini. Adaptive propagation graph convolutional network. IEEE Transactions on Neural Networks and Learning Systems, 2020.

[19] M. Liu, H. Gao, and S. Ji. Towards deeper graph neural networks. In SIGKDD 2020.
