Paper Reading: TLPGNN: A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on GPU



Article address: https://dl.acm.org/doi/10.1145/3502181.3531467

Background

This is a paper published at HPDC 2022 by the team of Qiang Fu, Yuede Ji, and H. Howie Huang. It mainly discusses the strategy of mapping GNN vertices to warps inside the GPU.
Training a graph neural network is essentially a process of repeatedly updating vertex features, and different models adopt different update strategies.
In this paper, all operations before the activation function are collectively referred to as graph convolution. Graph convolution accounts for most of the computation time of a GNN, so the paper focuses on optimizing how this stage is computed.

Graph neural network


1. A brief introduction to graph convolutional neural networks

The main operations in a GCN layer are (1) neighborhood aggregation (Aggregation), (2) the neural network operation (NN Operation), and (3) weighted summation (Reduction).
[Figure: the official GCN formulation]
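For reference (a standard formulation, not reproduced from the original note), a GCN layer updates the vertex features as

$$
H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}\,W^{(l)}\right),\qquad \tilde{A} = A + I_N,
$$

where $\tilde{A}$ is the adjacency matrix with self-loops, $\tilde{D}$ its degree matrix, $H^{(l)}$ the vertex feature matrix at layer $l$, $W^{(l)}$ a learnable weight matrix, and $\sigma$ the activation. Roughly, the sparse multiplication by the normalized adjacency matrix corresponds to the aggregation/reduction steps, and the dense multiplication by $W^{(l)}$ to the NN operation.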

2. Motivation

1. Brief description

GNN computation differs from traditional graph processing in several ways, such as the size of the vertex feature dimension, so the computation patterns of traditional graph processing cannot be directly applied to graph convolution.
There are many related works, such as DGL and PyG, large-scale open-source GNN frameworks that design optimization strategies for the various computation patterns above, but their optimizations do not fully account for the hardware (e.g., the GPU), which leads to problems such as low resource utilization.
Therefore, this paper first selects metrics and profiles the existing computation patterns to identify the main factors that affect the performance of graph convolution operators, and distills several principles that should be followed when designing them. Based on these findings, a new graph convolution operator is designed.

2. Profiling of CUDA C

Atomic operation: in CUDA programs, atomic operations are used to avoid the races caused by multiple threads writing to the same memory address at the same time. An atomic operation guarantees that only one thread updates that address at a time (similar to a lock).

The atomic add primitive has the signature `int atomicAdd(int* address, int val);`. It reads the value at `address`, adds `val`, and writes the result back as one indivisible step, so no other thread can touch that address in the middle of the update.
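As a minimal illustration (a hypothetical kernel, not code from the paper), the following uses `atomicAdd` so that concurrent additions to a single counter are not lost:

```cuda
// Hypothetical example: every thread adds its vertex's degree into one counter.
// Without atomicAdd, the read-modify-write sequences of different threads could
// interleave and some additions would be lost.
__global__ void sum_degrees(const int *degrees, int num_nodes, int *total) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < num_nodes) {
        atomicAdd(total, degrees[v]);  // hardware serializes updates to *total
    }
}
```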
In a GNN written in CUDA C, there are four common schemes for updating vertices (a sketch of the pull and push styles follows the list):

  • Push: each vertex writes its feature to all of its neighbors in parallel
  • Edge-centric: each edge writes the feature of its source vertex to its destination vertex in parallel
  • GNNAdvisor: uses atomic operations to process batches of nodes
  • Pull: each vertex reads the features of all of its neighbors and then processes them in parallel
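As a sketch under simple assumptions (CSR arrays `rowptr`/`colidx` and a scalar feature per vertex; this is illustrative, not the paper's code), pull lets every destination be written by exactly one thread, while push forces atomic updates:

```cuda
// Pull: one thread per destination vertex gathers from its in-neighbors.
__global__ void pull_aggregate(const int *rowptr, const int *colidx,
                               const float *x, float *out, int num_nodes) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_nodes) return;
    float acc = 0.0f;
    for (int e = rowptr[v]; e < rowptr[v + 1]; ++e)
        acc += x[colidx[e]];              // read neighbors' features
    out[v] = acc;                         // v is the only writer: no atomics
}

// Push: one thread per source vertex scatters to its out-neighbors.
__global__ void push_aggregate(const int *rowptr, const int *colidx,
                               const float *x, float *out, int num_nodes) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_nodes) return;
    for (int e = rowptr[v]; e < rowptr[v + 1]; ++e)
        atomicAdd(&out[colidx[e]], x[v]); // many writers per destination
}
```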

The following table compares the cost of the four schemes:
[Table: cost comparison of the four vertex-update schemes]

3. Profiling of Coalesced Memory Access

Coalesced memory access: in CUDA programs, coalesced access means that the threads of one warp (32 threads) access consecutive memory addresses. On the GPU, the minimum granularity for reading or writing global memory is 32 bytes = 1 sector.
Aligned, consecutive 32-byte reads and writes can be combined into a single memory transaction issued to the GPU's LD/ST units.
| Access pattern | Sectors touched | 32-byte memory transactions | Efficiency |
| --- | --- | --- | --- |
| Consecutive addresses (coalesced access) | 4 | 4 | 100% |
| Non-consecutive addresses (strided access) | 32 | 32 | 12.5% |

In traditional graph processing, each vertex's feature is a scalar (an int or a float), the graph structure is irregular, and memory accesses are highly random, so coalesced access is hard to achieve.
In GNNs, the vertex feature dimension is relatively large and each vertex's feature is a tensor stored contiguously in memory, which opens the opportunity for a more coalesced memory access pattern.
[Table: runtime comparison of different feature access patterns]
The table shows that letting a single warp access a vertex's feature (coalesced) yields a much smaller runtime. Coalescing memory accesses improves performance, so global memory accesses should be coalesced whenever possible.
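The following sketch (illustrative names; the row-major layout `x[v * feat_dim + d]` is an assumption) shows the access pattern this implies: the 32 lanes of a warp read 32 consecutive floats of one vertex's feature per step, which the hardware can coalesce into a few transactions:

```cuda
// One warp copies one vertex's feature vector; lane i handles dimensions
// i, i+32, i+64, ... so each step touches 32 consecutive floats
// (128 bytes = 4 sectors) and is fully coalesced.
__global__ void copy_features(const float *x, float *y,
                              int num_nodes, int feat_dim) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32; // one warp per vertex
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_nodes) return;
    for (int d = lane; d < feat_dim; d += 32)
        y[warp_id * feat_dim + d] = x[warp_id * feat_dim + d];  // coalesced read/write
}
```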

4. Profiling of Kernel Launch

Kernel launch: more kernels mean higher kernel-launch overhead and more memory usage. In CUDA C, a function declared with `__global__` and launched from the host is a kernel. Existing large-scale open-source frameworks (such as DGL) launch a large number of kernels because they are built on PyTorch and rely on third-party libraries such as cuSPARSE. Kernel launches have the following characteristics:

  • Kernel launch time ≈ runtime − GPU time; the more kernels there are, the longer the total launch time
  • A later kernel often needs the results computed by an earlier kernel
  • The more kernels, the more intermediate results must be stored, and the more GPU memory is used

[Figure: comparison of kernel launch counts across frameworks]
Therefore, to speed up GNN computation, graph convolution should be implemented with as few kernels as possible.

3. Design

1. Each vertex is assigned to an entire warp

[Figure: candidate work-assignment designs]

  • One thread per edge: avoids workload imbalance, but introduces atomic operations and their large overhead
  • One thread per vertex: introduces if-else branches and hence warp divergence, and its strided (non-coalesced) memory accesses make memory access inefficient
  • Multiple warps per vertex: introduces extra synchronization overhead between the warps, reducing computational efficiency

Design point: map each vertex to one warp, so that one warp processes the graph convolution of one vertex in parallel. This avoids inter-warp synchronization overhead, eliminates warp divergence, and makes coalesced memory access easy to achieve (see the index sketch below).
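A minimal skeleton of this mapping (assuming a 1-D grid of 1-D blocks whose size is a multiple of 32; the per-vertex work is filled in in the next subsection):

```cuda
// Vertex-to-warp mapping only: warp i is responsible for vertex i.
__global__ void warp_per_vertex_skeleton(int num_nodes) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32; // one warp <-> one vertex
    int lane_id = threadIdx.x % 32;                             // thread's slot inside its warp
    if (warp_id >= num_nodes) return;  // all 32 lanes exit together: no divergence in the warp
    // ... the 32 lanes cooperate on vertex `warp_id` using lane_id ...
    (void)lane_id;
}
```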

2. Warp Divergence

[Figure: problems caused by conditional branches]
As the figure shows, the threads that do not take a branch simply sit idle, which lowers resource utilization. The design therefore optimizes away conditional branches through feature parallelism (the parallel pattern of the threads within a warp).
There are two candidate loop schemes: (1) loop over feature dimensions, where each iteration processes the same feature dimension of different neighbor nodes in parallel; or (2) loop over neighbor nodes, where each iteration processes different feature dimensions of a single neighbor in parallel.
[Figure: two-level parallelism]
Looping over feature dimensions (scheme 1) is not used, because:

  • The threads of a warp always process the same feature dimension at the same time, so they always update the same output dimension, i.e., write to the same address
  • This requires atomic operations such as atomicAdd
  • The threads of a warp read data scattered across non-contiguous global-memory addresses
  • The resulting access pattern is not coalesced

Chosen loop scheme: loop over neighbor nodes, and in each iteration process different feature dimensions of one neighbor in parallel.
[Figure: the second level of parallelism]
With the loop over neighbor nodes (scheme 2):

  • The threads of a warp process different feature dimensions at any given time, so they always update different dimensions of the output feature
  • No atomic operations are needed, so their cost is avoided
  • All threads of a warp read consecutive dimensions of a single neighbor's feature
  • The memory access pattern is coalesced (see the kernel sketch after this list)
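Putting the two levels together, a sketch of the aggregation kernel under the same CSR and row-major-feature assumptions as before (illustrative code, not the paper's implementation):

```cuda
// Warp-per-vertex aggregation with feature parallelism: the warp loops over
// the neighbors of its vertex; at each step the 32 lanes read 32 consecutive
// feature dimensions of that single neighbor (coalesced) and each lane
// accumulates its own dimensions in a register, so no atomics are needed.
// Feature vectors longer than 32 are processed in chunks of 32 dimensions.
__global__ void tlp_aggregate(const int *rowptr, const int *colidx,
                              const float *x, float *out,
                              int num_nodes, int feat_dim) {
    int v    = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per vertex
    int lane = threadIdx.x % 32;
    if (v >= num_nodes) return;
    for (int d = lane; d < feat_dim; d += 32) {               // feature parallelism across lanes
        float acc = 0.0f;                                     // partial sum kept in a register
        for (int e = rowptr[v]; e < rowptr[v + 1]; ++e) {
            int u = colidx[e];                                // loop over neighbors
            acc += x[u * feat_dim + d];                       // lanes read consecutive dims of u
        }
        out[v * feat_dim + d] = acc;                          // coalesced, exclusive write
    }
}
```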

3. Hybrid Workload Balancing

Pure node parallelism can lead to an unbalanced workload distribution.
Hardware-based assignment: each warp is responsible for exactly one node, so the total number of warps equals the number of vertices of the input graph.
Problem: this creates a huge number of warps and blocks and depends entirely on the block scheduler (closed-source hardware), which is opaque; it also cannot handle very large graphs, because the maximum number of threads is limited by the hardware.
Software-based assignment: similar to a task pool, each warp processes a fixed number of nodes at a time and keeps taking work as long as the pool is not empty. The total number of warps can be smaller than the number of nodes, and no warp sits idle.
Heuristic hybrid assignment (a sketch of both modes and the selection rule follows):
  • Software-based: if the graph has more than 1M vertices or an average degree above 50
  • Hardware-based: otherwise
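A sketch of the two assignment modes and the heuristic described above (the thresholds come from the text; `process_vertex` and the launch shapes are illustrative placeholders):

```cuda
// Placeholder for the warp-per-vertex work on one vertex.
__device__ void process_vertex(int v) { (void)v; /* aggregation for vertex v */ }

// Hardware-based: launch exactly one warp per vertex and let the block
// scheduler balance the work.
__global__ void hw_assign(int num_nodes) {
    int v = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (v < num_nodes) process_vertex(v);
}

// Software-based: launch a fixed number of warps; each warp strides through
// the vertex pool until it is empty.
__global__ void sw_assign(int num_nodes) {
    int warp_id   = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int num_warps = (gridDim.x * blockDim.x) / 32;
    for (int v = warp_id; v < num_nodes; v += num_warps)
        process_vertex(v);
}

// Host-side heuristic: software scheduling for large or dense graphs.
bool use_software_assignment(long long num_nodes, long long num_edges) {
    double avg_degree = (double)num_edges / (double)num_nodes;
    return num_nodes > 1000000LL || avg_degree > 50.0;
}
```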

[Figure: hybrid workload balancing]

4. Kernel fusion and register caching
[Figure: the implementation uses only two kernels]
As shown in the figure, the whole graph convolution is implemented with only two kernels.

[Figure: avoiding cache false sharing]
Register caching is used to avoid cache false sharing (see the fused-kernel sketch below).
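As a toy illustration of the fusion idea (not the paper's actual kernel split), the aggregation from the earlier sketch and an element-wise bias + ReLU epilogue can be fused into one kernel, removing one launch and the intermediate buffer; the partial sum stays in a register the whole time:

```cuda
// Fused aggregation + bias + ReLU (toy example): the intermediate aggregated
// value never leaves the register `acc`, so no second kernel or temporary
// global buffer is needed.
__global__ void aggregate_bias_relu(const int *rowptr, const int *colidx,
                                    const float *x, const float *bias,
                                    float *out, int num_nodes, int feat_dim) {
    int v    = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (v >= num_nodes) return;
    for (int d = lane; d < feat_dim; d += 32) {
        float acc = 0.0f;                                 // register-cached partial result
        for (int e = rowptr[v]; e < rowptr[v + 1]; ++e)
            acc += x[colidx[e] * feat_dim + d];           // aggregation
        acc += bias[d];                                   // fused NN bias
        out[v * feat_dim + d] = fmaxf(acc, 0.0f);         // fused ReLU
    }
}
```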

4. Experimental results

[Figures: performance comparison with the baseline frameworks]
Compared with the baselines, TLPGNN naturally achieves a good speedup.



Original post: blog.csdn.net/weixin_43934886/article/details/130163251