Technology Trends | PGL: PaddlePaddle's Graph Learning Framework for Large-Scale Model Training

Reprinted from the public account | DataFunSummit


This article introduces PGL (Paddle Graph Learning), PaddlePaddle's framework for training large-scale graph models.

The talk revolves around the following three points:

1. Development trend of graph learning framework

2. GPU Acceleration of Graph Learning Framework

3. Industrial-scale graph representation learning with PGLBox

Speaker | Huang Zhengjie, Senior R&D Engineer, Baidu

Editor | Wang Chao, Zhaopin

Produced by | DataFun community


01

Development Trend of Graph Learning Framework

First, I would like to share some background on graph learning.

1. What is a graph, and where do graphs come from?


Graphs are a universal language for describing the complex world. Social networks on the Internet are graphs of connections between people. In the recently popular field of biological computing, chemical molecules have graph structures. A knowledge graph connects entities through relations, and a recommender system captures associations between users and items.

2. The development trend of graph neural network


The concept of graph neural networks has become popular in recent years. Earlier work on modeling graphs was mostly spectral: Spectral-based methods first transform the graph signal into the frequency domain via the graph Fourier transform and then operate there. Because the formulation was relatively complicated and scaled poorly to large graphs, comparatively few people worked in this direction at the time.

Around 2017, after graph convolutional networks and Message Passing-style networks appeared, the community turned to simpler, Spatial-based methods, which understand graph neural networks in the spatial domain. The analogy is two-dimensional image convolution: a pixel's representation depends on the pixels around it. In a graph neural network, a node's attributes and representation usually depend only on its neighbors, and by computing over those neighbors we obtain the representation of the central node.

3. Spatial-based graph neural network


In spatial-domain graph neural networks we usually focus on two questions: first, how to aggregate neighbors to obtain the representation of a center node; second, given a large graph, how to aggregate node representations into a representation of the whole graph.

4. Graph neural network based on message passing


Most mainstream graph neural networks can be described in a form called Message Passing. Popular graph neural network frameworks such as DGL and PyG, as well as our PGL, all use this message-passing paradigm to define graph neural networks.

In Message Passing we only need to specify two things: how a source node sends a message along an edge to a target node, and what the target node does with the messages it receives, for example a weighted sum or a mean. With this framework we can implement most graph neural networks, including the Transformer: if the Transformer is viewed as a graph neural network over a fully connected graph, its structure can also be realized in this message-passing form.
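
As a concrete illustration, here is a minimal NumPy sketch of one round of message passing. The function name and the sum/mean reducers are illustrative choices of ours, not the API of PGL, DGL, or PyG:

```python
import numpy as np

def message_passing(edges, node_feat, reduce="mean"):
    """One round of message passing.

    edges: int array of (src, dst) pairs; node_feat: [num_nodes, dim].
    Send: each source node sends its feature along its out-edges.
    Recv: each target node aggregates incoming messages (sum or mean).
    """
    num_nodes, dim = node_feat.shape
    msgs = node_feat[edges[:, 0]]              # send: gather source features
    out = np.zeros((num_nodes, dim), dtype=node_feat.dtype)
    np.add.at(out, edges[:, 1], msgs)          # recv: sum messages per target
    if reduce == "mean":
        deg = np.zeros(num_nodes)
        np.add.at(deg, edges[:, 1], 1.0)
        out /= np.maximum(deg, 1.0)[:, None]   # divide by in-degree
    return out

edges = np.array([[0, 1], [1, 2], [2, 0]])     # a tiny directed triangle
h = np.eye(3, dtype=np.float32)
print(message_passing(edges, h))               # each node receives its neighbor's feature
```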

5. Paddle Graph Learning framework: PGL


The figure above shows our graph learning framework, PGL, released about two years ago. It is built on top of the PaddlePaddle deep learning framework, with a graph engine and graph-network programming interfaces on top, plus many preset models ready to use. Interested readers can visit our GitHub, install the framework with pip install, and run the examples.
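
A minimal usage sketch, based on PGL 2.x's documented interface (exact names may differ across versions):

```python
# pip install pgl   (PaddlePaddle is required as the backend)
import numpy as np
import pgl

# A tiny 3-node graph with 16-dimensional node features.
edges = [(0, 1), (1, 2), (2, 0)]
feat = np.random.randn(3, 16).astype("float32")
graph = pgl.Graph(edges=edges, num_nodes=3, node_feat={"h": feat})
graph.tensor()                       # convert graph storage to paddle tensors

conv = pgl.nn.GCNConv(16, 8)         # one preset graph-convolution layer
out = conv(graph, graph.node_feat["h"])
print(out.shape)                     # [3, 8]
```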

6. Standard process for training large-scale graph neural networks


There is now a fairly mature standard pipeline for large-scale training of graph neural networks. For large graphs, we usually follow the 2017 GraphSAGE paper: training on the labeled nodes of the large graph, we sample the neighbors of each node and then aggregate them, layer by layer through message passing, to obtain the representation of the central node.

A framework mainly has to solve three problems: first, how the graph is stored; second, how subgraphs are sampled, including how features are extracted from the graph; and finally, how the high-level message-passing algorithms of the graph network are defined.
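
To make the sampling step concrete, here is a minimal Python sketch of GraphSAGE-style multi-hop neighbor sampling; the function and its fanout parameter are illustrative, not any framework's API:

```python
import random

def sample_subgraph(adj, seed_nodes, fanouts):
    """Multi-hop neighbor sampling in the GraphSAGE style.

    adj: dict mapping node -> list of neighbors.
    fanouts: e.g. [10, 5] samples up to 10 neighbors per seed at hop 1
    and up to 5 per frontier node at hop 2.
    Returns the sampled edges and the set of visited nodes.
    """
    edges, frontier, visited = [], list(seed_nodes), set(seed_nodes)
    for k in fanouts:
        nxt = []
        for u in frontier:
            nbrs = adj.get(u, [])
            picks = random.sample(nbrs, k) if len(nbrs) > k else nbrs
            for v in picks:
                edges.append((v, u))      # message flows neighbor -> center
                if v not in visited:
                    visited.add(v)
                    nxt.append(v)
        frontier = nxt
    return edges, visited

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(sample_subgraph(adj, seed_nodes=[0], fanouts=[2, 2]))
```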


Unlike large models, most of the cost in graph networks lies in graph storage and feature extraction. Compared with the parameter counts of the large models just mentioned, the pure network module is relatively small, yet the GPU utilization of graph neural network training is still very low. The biggest bottlenecks are sampling on the CPU and the data transfer from CPU to GPU.

Because the graph is very large, it usually cannot fit on the GPU. We therefore implement the graph and the sampling algorithms in host memory and, after sampling, move the result to the GPU for computation.

7. Hardware changes: GPU memory keeps growing


As hardware develops, there is a clear trend: GPU memory keeps growing. With eight 80 GB A100s we can assemble 640 GB of GPU memory in a single machine. What does that mean? Take the largest heterogeneous graph dataset in Open Graph Benchmark, MAG240M: a paper-author citation network with roughly 240 million nodes and more than a billion edges.

Its largest component is the roughly 310 GB of features: text features extracted by BERT, 768 dimensions per node, around 300 GB in total. Even at that size, eight 80 GB A100s together already exceed the dataset's footprint. So can we do something with this? Since GPU memory has grown so large, can we move the entire pipeline of the graph algorithm onto the GPU and cover most scenarios? That is why we started thinking about GPU acceleration for our graph learning framework.

02

GPU Acceleration for Graph Learning Frameworks

1. GPU Acceleration of Graph Learning Framework


How do we speed things up? With large graphs it is hard, so first consider the case where the graph is very small. Then a very direct approach is to put the entire graph structure into GPU memory. With graph sampling operators implemented on the GPU, all graph storage and the sampled data stream move onto the GPU, achieving maximum speed.

The biggest problem is that in practice our graphs are still quite large. With a graph structure of around 200 GB, a single GPU can no longer hold it.

2. GPU Acceleration of Graph Learning Framework: UVA Mode

A compromise is to use GPU sampling operators to access graph data that lives in host memory, so the graph can scale up. Around June or July 2021, the Torch-Quiver team did exactly this: treat host memory as if it were GPU memory through a mode called UVA, and let the GPU sampling operator access host memory directly. The result is much faster than pure CPU sampling and slower than storage in GPU memory, but the graph can already become very large.


Their specific approach is to store the graph structure in page-locked host memory. Through Unified Virtual Addressing (UVA), host memory is mapped into the same address space as GPU memory, so the GPU sampling kernel can dereference a unified address directly. They reported that sampling with UVA is nearly 20x faster than pure CPU sampling.
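
A small sketch of the zero-copy idea using Numba's mapped (page-locked, device-mapped) arrays; this illustrates UVA generally and is not Torch-Quiver's implementation:

```python
import numpy as np
from numba import cuda

# Allocate page-locked host memory that is mapped into the device address
# space (zero-copy): GPU kernels dereference it directly via UVA.
neighbors = cuda.mapped_array(1_000_000, dtype=np.int64)
neighbors[:] = np.random.randint(0, 100_000, size=neighbors.shape)

@cuda.jit
def gather(src, idx, out):
    i = cuda.grid(1)
    if i < idx.size:
        out[i] = src[idx[i]]      # reads host memory through unified addressing

idx = cuda.to_device(np.random.randint(0, neighbors.size, 4096))
out = cuda.device_array(4096, dtype=np.int64)
gather[(4096 + 255) // 256, 256](neighbors, idx, out)
```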


3. GPU Acceleration of Graph Learning Framework: Subgraph Feature Extraction

Beyond single-card UVA with host memory, with multiple cards the eight A100s mentioned earlier give us 640 GB. So we considered using NVLink, the high-bandwidth interconnect between GPUs, to replace host-to-GPU transfers with GPU-to-GPU transfers.


Here we store the largest piece, the graph features, column-sharded across cards. For example, the 300 GB of graph features on the left are split so that each card holds about 40 GB, forming one large shared Tensor. The whole MAG240M graph structure plus features then fits on eight A100s.


In actual training (illustrated here with two GPUs), each GPU is responsible for a subset of nodes under data parallelism. After sampling its subgraph node IDs, each GPU performs an All-gather so that every card knows the node IDs of every subgraph, looks up the requested rows in its local column shard of the features, and then uses All-to-all to exchange the column slices so each card recovers the full-width features for its own subgraph. In this way the features stay sharded and all transfers run over the GPU interconnect. With this we can finish one epoch of R-GAT training on MAG240M in about half a minute, much faster than the previous CPU setups and DGL's CPU mode.
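
A conceptual sketch of this exchange using torch.distributed collectives (the same collectives exist in paddle.distributed); it assumes an initialized NCCL process group and, for simplicity, that every rank samples the same number of node IDs:

```python
import torch
import torch.distributed as dist

def fetch_full_features(local_ids, feat_shard):
    """All-gather sampled IDs, look up the local column shard,
    then all-to-all the column slices back to their owners.

    feat_shard: this rank's column shard, shape [num_nodes, dim / world].
    """
    world = dist.get_world_size()
    # 1) All-gather the sampled node IDs of every rank.
    all_ids = [torch.empty_like(local_ids) for _ in range(world)]
    dist.all_gather(all_ids, local_ids)
    # 2) Look up each rank's requested rows in the local column shard.
    send = [feat_shard[ids] for ids in all_ids]
    # 3) All-to-all: this rank receives its nodes' slice from every rank.
    recv = [torch.empty_like(send[r]) for r in range(world)]
    dist.all_to_all(recv, send)
    # 4) Concatenate the column slices into full-width features.
    return torch.cat(recv, dim=1)
```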


In the KDD Cup 2021 competition we proposed a method called R-UniMP, with many optimizations for heterogeneous graph sampling and for the use of labels in the graph. The model was very complicated: at the time it took us about 40 hours to train 100 epochs and reach the SoTA score. This year, after a series of optimizations, we can reproduce last year's best model in 1.1 hours on eight A100s, which also gave us a lot of room for algorithm research in this year's NeurIPS 2022 competition.

03

Industrial-Scale Graph Representation Learning with PGLBox

1. The application of graph representation learning in Internet products

The scenarios above are about training large graphs in academia. Academic work usually focuses on node classification, where the data is very clean and does not require large-scale node features.


In industrial scenarios, especially Internet product applications, recommendation requires modeling our users and items. Two algorithms are commonly used in recommendation: Item-based collaborative filtering, which measures the similarity between two items, and User-based collaborative filtering, which measures the similarity between two users.
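
For reference, a toy NumPy sketch of both flavors of collaborative filtering over a small user-item interaction matrix (the matrix is made up for illustration):

```python
import numpy as np

# Made-up interaction matrix: rows = users, columns = items, 1 = interaction.
R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=np.float32)

def cosine_sim(M):
    """Pairwise cosine similarity between the columns of M."""
    Mn = M / (np.linalg.norm(M, axis=0, keepdims=True) + 1e-8)
    return Mn.T @ Mn

item_sim = cosine_sim(R)     # item-based CF: how similar are two items
user_sim = cosine_sim(R.T)   # user-based CF: how similar are two users
```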


Most applications of graph learning on the Internet build on this setting: from massive logs we construct a large graph of users and items, including social relations, user behaviors, and associations between items. Graph representation learning then assigns each node a vector, so that distance in the vector space measures the similarity between items and users.


Traditional algorithms such as matrix factorization can also be expressed as such vectors, and algorithms like DeepWalk and Meta-Path-based methods from a few years ago obtain graph representations through random walks plus the Word2vec algorithm. A popular approach this year is graph contrastive learning: through graph data augmentation and contrastive learning between two subgraph views, it extracts representations of user and item behavior.
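
A minimal sketch of the random-walk half of DeepWalk; the walks would then be fed to a skip-gram model such as gensim's Word2Vec to produce one embedding vector per node:

```python
import random

def random_walks(adj, walks_per_node, walk_len):
    """Uniform random walks in the DeepWalk style.

    adj: dict mapping node -> list of neighbors.
    Returns one list of node IDs per walk, treated like sentences
    by a downstream skip-gram model.
    """
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append(walk)
    return walks

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walks(adj, walks_per_node=2, walk_len=5)[:3])
```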

2. A graph learning framework must meet the needs of business research

As a graph learning framework, the most important thing is to meet the needs of business research. Surveying most current graph learning methods, we found they can basically be abstracted into the following four steps.


First, the choice of graph differs. An item session graph is a homogeneous graph; a user-item graph may be bipartite; and a heterogeneous graph may mix click, preference, and purchase relations in the same graph, which is the complex heterogeneous form.

Second, given the graph, self-supervised or unsupervised tasks usually generate training samples through random walk methods.

Third, with samples in hand, each node is represented in different ways. For example, different methods sample neighbors differently: the well-known PinSage algorithm selects neighbors via random walks, and each algorithm has its own definition.

Finally, once the data is formed, there is the structure of the graph network itself: which architecture we choose to model the neighbors and the node's own representation.

At present, basically all unsupervised graph learning methods can be abstracted into these four steps.

3. Configuration-driven industrial-grade graph representation learning


In 2021 we released a tool inside PGL called Graph4Rec. Through five configuration items (the graph data, the random-walk settings, the sampling algorithm, the positive/negative sample generation method, and the network choice), it automatically generates a complete representation learning training pipeline and produces high-quality embeddings for every node. That version, however, was based on distributed CPU training.
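
A hypothetical configuration sketch of these five items, written as a Python dict; the actual keys of the released Graph4Rec tool may differ, this only illustrates the shape of such a configuration:

```python
# Hypothetical Graph4Rec-style configuration covering the five choices.
config = {
    "graph_data": {                       # 1) which graph to build
        "edge_files": ["user2item.txt", "item2item.txt"],
        "symmetry": True,
    },
    "walk": {                             # 2) random-walk sample generation
        "walk_len": 24, "win_size": 3, "walk_times": 10,
    },
    "sampling": {                         # 3) neighbor sampling strategy
        "type": "pinsage", "neighbor_num": 10,
    },
    "pair_generation": {"neg_num": 5},    # 4) positive/negative samples
    "model": {                            # 5) network selection
        "type": "graphsage", "embed_size": 64, "num_layers": 2,
    },
}
```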

4. Pain points in large-scale application scenarios


Large-scale application scenarios come with pain points like these. We usually have sparse features on the order of hundreds of billions, including the IDs of users, images, and videos, as well as various discrete attributes. This requires a large-scale parameter server, and training involves a huge number of embedding lookups and pushes against that parameter server.

In 2021 we were still using a CPU parameter server plus a CPU graph engine with a CPU MPI cluster for training. Its biggest problem is that modeling across different modalities is converging on similar structures, including Transformer-style structures, and this setup cannot support such complex models.

Second, when we sample many neighbors, pulling embeddings and graph samples generates heavy communication. We also found that with, say, 20 machines training together, the job easily fails as soon as one machine drops out.

One might ask whether a CPU parameter server plus graph engine could feed GPUs for training, to support the complex models just mentioned. But this turns out to be IO-intensive: the parameter-server communication keeps overall GPU utilization very low.

5. PGLBox: a fully GPU-based large-scale graph learning engine

b779753d2341e744a8cf68a2dbe15aeb.png

Continuing the earlier thread: a current GPU server with eight A100s has 640 GB of GPU memory, and host memory may already reach 3 TB. Can we build a fully GPU-based large-scale graph learning engine? The idea is to put the graph structure, the graph feature table, and the parameter server into the GPUs of one machine, sharded across cards, and eliminate all the cross-machine communication just mentioned. 640 GB can already support businesses of a fairly large order of magnitude. So we implemented a large-scale graph learning engine optimized end to end for GPU, also for the convenience of our business users.

Unlike academia, where data is very clean, business users typically arrive with pairs of raw IDs from which we must build graphs or run queries. So for ease of use we built GPU hash tables to support hash-based graph construction and hash-based graph queries, and we use the GPU to accelerate sampling of such complex heterogeneous graphs. We also implemented many gradient-related optimization strategies for the GPU parameter server to reduce GPU memory usage.
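
Conceptually, the hash table maps arbitrary raw node IDs to contiguous indices so features can live in dense arrays. A plain-Python CPU analogue is below; a real GPU version would use open addressing with atomic compare-and-swap in CUDA:

```python
def build_id_map(raw_ids):
    """Map arbitrary raw node IDs (e.g. 64-bit user/item hashes) to
    contiguous indices [0, n) so features can live in dense arrays.
    CPU analogue of the GPU hash table; illustrative only.
    """
    id2idx = {}
    for rid in raw_ids:
        if rid not in id2idx:
            id2idx[rid] = len(id2idx)
    return id2idx

raw_edges = [(90210017, 55001234), (55001234, 12345678)]
id2idx = build_id_map(i for e in raw_edges for i in e)
# Graph construction then translates every edge's endpoints:
edges = [(id2idx[s], id2idx[d]) for s, d in raw_edges]
```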

6. PGLBox: supporting tens of billions of nodes and edges through hierarchical storage and pipelining


If the business grows further, the graph may exceed ten billion nodes or tens of billions of edges. For that we provide a hierarchical storage strategy: the full graph lives on SSD and in host memory, and whatever a training step needs is moved into GPU memory.

Graph representation learning training is usually quite localized: a stretch of training concentrates on a subgraph within a certain region. So we introduce a concept called a Pass. One epoch still has to walk the entire graph and traverse all nodes, but we split the epoch into several small Passes. Within a Pass we do the walk-based sampling and prepare the needed parameter-server features; once prepared, this slice fits comfortably in GPU memory, where we train N mini-batches. After a Pass finishes training, we synchronize its embeddings and updated parameters back to the SSD-and-memory parameter server.

Across Passes we run a parallel pipeline. For example, while one Pass is pulling its embeddings from host memory, the next Pass's walk sampling and sample generation can already run. With this Pass-level parallelism, even though hierarchical storage is introduced, speed matches the pure GPU-memory mode.
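
A schematic of the Pass-level pipeline as a producer-consumer pattern; `prepare` and `train` are hypothetical stand-ins for PGLBox's internal stages, not its actual API:

```python
import queue
import threading

def train_with_pass_pipeline(passes, prepare, train):
    """Overlap Pass preparation (walks, sampling, embedding pulls from
    host memory/SSD) with GPU training of the previous Pass."""
    ready = queue.Queue(maxsize=1)       # at most one prefetched Pass

    def producer():
        for p in passes:
            ready.put(prepare(p))        # runs on CPU while the GPU trains
        ready.put(None)                  # sentinel: no more Passes

    threading.Thread(target=producer, daemon=True).start()
    while (pass_data := ready.get()) is not None:
        train(pass_data)                 # GPU: N mini-batches per Pass
```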

7. Graph Representation Learning Accelerated by PGLBox


With PGLBox as the framework for representation learning on large graphs, we have upgraded many businesses inside the company. An original DeepWalk model that took 480 minutes now finishes within an hour. For more complex models like GraphSAGE, computation grows exponentially with the sampling depth, and as the subgraph grows, feature extraction and communication volume grow with it. On a single machine with eight GPUs, that communication can mostly be optimized away, and a complex GraphSAGE-style network achieves a 28x speedup.

It also follows the trend of the times: with everything on GPU we can support complex models. Our application scenarios involve various complex relations, and the nodes themselves are multi-modal, containing users, text, and images. A large-scale parameter server models long-term behavioral IDs, while pre-trained models provide understanding of the cross-modal content.

Because the whole architecture computes on GPU, we can place large pre-trained models and these large-scale parameter servers under the same framework and train them end to end, simultaneously tuning the parameters of the pre-trained large model and the large-scale ID-based parameter server.

With the current PGLBox framework, we have achieved graph representation learning at the tens-of-billions scale on a single machine with eight GPUs, 28x faster than the distributed CPU solution.

8. Paddle Graph Learning framework: PGL


PGLBox is open source on GitHub at the following address:

https://github.com/PaddlePaddle/PGL/tree/main/apps/PGLBox

04

Q&A

Q1: What is the difference between a GPU hash table and a normal CPU hash table?

A1: GPU hash tables are not yet as mature as CPU ones, so mainstream open-source frameworks such as DGL and PyG do not do this.

Functionally, our hash table is not much different from a CPU one; its main purpose is to simplify graph construction on the GPU.

Q2: Does the cache move part of the data to the GPU in advance, or is everything put on the GPU?

A2: With caching, we move part of the data to the GPU in advance. A heterogeneous graph like MAG240M adds up to more than 300 GB, because each node may have 768-dimensional features.

If it is not that big (in practice we often learn 100-dimensional representations, and in some earlier recommender-system scenarios people even used 8 dimensions), the graph can be made very large. We basically tune it so that one Pass just fills GPU memory while leaving room for the model's matrix computation, which is generally enough.

Q3: Will pipeline parallelism between Passes cause stale gradients?

A3: In the current pipeline, gradient-related pulls from the CPU stay serial: we guarantee that the next Pass's embedding pull happens only after the current Pass's push. So there is no problem here.

What is actually parallelized is the walk sampling and the sample generation, which also take a lot of time.

Q4: How about scaling to multiple machines?

A4: Within a single 8-GPU machine, we can rely on technologies like NVLink between cards. Across machines, embedding exchange is involved, so problems such as network bandwidth come into play.

That's all for today's sharing, thank you all.


Speaker Introduction


Huang Zhengjie


Baidu


Senior R&D Engineer


Huang Zhengjie is a senior R&D engineer in Baidu's Natural Language Processing Department. A graduate of Sun Yat-sen University, he has long worked on semantic representation computing and graph learning, with applications landed in core products such as commercial advertising. He won two track championships in the KDD Cup 2021 graph learning competition, has published multiple papers at top AI conferences such as KDD and IJCAI, and holds a number of related patents.


OpenKG

OpenKG (Chinese Open Knowledge Graph) aims to promote the openness, interconnection, and crowdsourcing of Chinese-centered knowledge graph data, and the open sourcing of knowledge graph algorithms, tools, and platforms.


