HET, a sparse large-model training acceleration framework jointly developed by Tencent and Peking University, has been accepted by the top international conference VLDB

Recently, the machine learning team of Tencent's TEG Data Platform Department and the Peking University-Tencent Collaborative Innovation Lab jointly developed HET, a new training acceleration solution for sparse large models. The research paper "HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework" has been accepted by the top international conference VLDB 2022. HET proposes a novel training method based on Embedding caching, which significantly reduces communication overhead during distributed training of sparse large models and improves overall training efficiency.

HET is now officially open source: https://github.com/PKU-DAIR/Hetu

Sparse large models are becoming increasingly common, and communication bottlenecks can become a "fatal" problem for training efficiency

Figure 1 Scale development of deep learning models

Sparse large models are one of the most important classes of deep learning models and are widely used in scenarios such as search, advertising, recommendation, and graph representation learning. In recent years, as data scale has grown, industrial sparse models have become increasingly large, with parameter counts reaching the trillions. As shown in Figure 1, the DLRM recommendation model supported by ZionEX [1], the system Facebook proposed this year, has exceeded 10 trillion parameters, far surpassing the 1.6-trillion-parameter Switch Transformer [2] previously released by Google.

The sparse parameters of such a model, namely its Embedding parameters, can account for more than 99% of the total. Compared with other models, this type of model has lower computational density and a much larger scale, which poses serious challenges for distributed deep learning systems. How to improve the training efficiency of sparse large models has therefore become a hot topic in both academia and industry in recent years.

For a trillion-parameter model, the model parameters alone require 3.7 TB of memory. Because the sparse parameters are so large, the industry generally adopts a parameter server (Parameter Server, PS) based solution that partitions the Embeddings evenly across servers. During training, each computing node dynamically pulls the Embedding vectors it needs from the parameter server via sparse communication, and pushes the Embedding gradients back after completing the current round of computation. Although this approach scales the model flexibly, it faces a serious communication bottleneck: taking the mainstream deep learning framework TensorFlow as an example, in measurements on real data, communication accounts for more than 80% of total training time. Most current improvements optimize the engineering implementation of the parameter server, for example by fully exploiting the hardware to raise overall system throughput. However, the large communication volume of sparse parameters remains unsolved at its root, and communication is still the core pain point of such systems. A solution that addresses the communication problem at its source is therefore needed.
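The pull/compute/push cycle described above can be sketched as follows. This is a minimal illustration of the general parameter-server pattern, not HET's actual implementation; the `ParameterServer` class, its lazy zero-initialization, and the SGD learning rate are all assumptions made for the example.

```python
import numpy as np

class ParameterServer:
    """Holds a shard of the Embedding table as an id -> vector map."""
    def __init__(self, dim):
        self.dim = dim
        self.table = {}

    def pull(self, ids):
        # Lazily initialize unseen rows, then return the requested vectors.
        return np.stack([self.table.setdefault(i, np.zeros(self.dim)) for i in ids])

    def push(self, ids, grads, lr=0.1):
        # Apply sparse gradient updates only to the rows touched by this batch.
        for i, g in zip(ids, grads):
            self.table[i] = self.table[i] - lr * g

ps = ParameterServer(dim=4)
vecs = ps.pull([3, 7])            # worker pulls the Embeddings its batch needs
ps.push([3, 7], np.ones((2, 4)))  # worker pushes gradients back after the step
```

Every training step repeats this round trip for every Embedding the batch touches, which is exactly why sparse communication dominates the training time.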

HET: A Sparse Large-Model Training System Based on Embedding Caching

Core idea

Figure 2 Embedding access frequency distribution on three commonly used public datasets

Observations from business scenarios show that the input feature data of high-dimensional sparse large models is typically skewed, following a power-law distribution (as shown in Figure 2), which leads to highly imbalanced access to Embedding vectors during training. Taking the recommendation dataset Criteo as an example, about 10% of the Embedding vectors account for 90% of all Embedding accesses in the dataset. These high-frequency Embeddings are pulled and pushed frequently during training and become the main communication load.
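This kind of skew is easy to reproduce with synthetic data. The sketch below draws lookups from a Zipf distribution as a stand-in for real feature ids (the distribution parameter and sample size are arbitrary choices for illustration, not measurements from Criteo) and checks what share of accesses the hottest 10% of ids receive.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
# Simulated Embedding lookups drawn from a Zipf (power-law) distribution,
# standing in for the skewed feature ids observed in datasets like Criteo.
accesses = rng.zipf(1.5, size=100_000)
counts = np.sort(np.array(list(Counter(accesses).values())))[::-1]

top10 = max(1, int(0.1 * len(counts)))
share = counts[:top10].sum() / counts.sum()
print(f"top 10% of Embeddings receive {share:.0%} of all accesses")
```

Under a power law, a small hot set dominates the traffic, which is the property the Embedding cache exploits.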

We exploit this property and propose the idea of Embedding caching: if these high-frequency Embeddings can be cached in the limited memory of the computing nodes, a large number of remote Embedding accesses can be avoided, alleviating the communication bottleneck. Based on this idea, we propose HET, a new-generation sparse large-model training framework built on Embedding caching.

Technical point 1: A hybrid communication architecture supporting Embedding parameter caching

Figure 3 HET system architecture

Since the parameters of a sparse large model contain both sparse and dense parts, HET adopts a hybrid communication architecture combining a parameter server (PS) with global reduction (AllReduce) to exploit the advantages of both, as shown in Figure 3. AllReduce is well suited to synchronizing dense parameters and can fully utilize inter-GPU bandwidth with communication libraries such as NCCL, while the parameter server naturally supports sparse communication and offers high flexibility in synchronization protocols. In addition, we design a Cache Embedding Table structure on each computing node to cache the most frequently accessed Embedding parameters.
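The routing decision at the heart of this hybrid design can be sketched as a simple dispatcher. This is a hypothetical illustration, not HET's code: `allreduce` and `ps_push` stand in for the real backends (NCCL and the PS client), and the name-prefix test is an assumed convention for telling sparse parameters from dense ones.

```python
import numpy as np

def sync_gradients(named_grads, allreduce, ps_push, num_workers):
    """Dispatch each gradient to the communication path suited to it: dense
    gradients are averaged across workers via AllReduce, while sparse
    Embedding gradients are pushed to the parameter server."""
    for name, grad in list(named_grads.items()):
        if name.startswith("embed"):          # sparse part -> parameter server
            ps_push(name, grad)
        else:                                 # dense part -> AllReduce
            named_grads[name] = allreduce(grad) / num_workers
    return named_grads

# Toy usage: allreduce is mocked as a sum over 4 identical workers.
grads = {"embed.table": np.ones(3), "dense.w": np.ones(3)}
pushed = []
out = sync_gradients(grads, allreduce=lambda g: g * 4,
                     ps_push=lambda name, g: pushed.append(name), num_workers=4)
```

Keeping the two paths separate lets each kind of parameter use the synchronization protocol that fits it best, which is the point of the hybrid architecture.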

Using a Cache Embedding Table on each computing node saves a great deal of traffic, but it introduces a new problem: copies of a given Embedding may exist in the caches of several computing nodes at the same time. If consistency between replicas is ignored, model training may diverge and fail to converge. To address this, we further propose a bounded asynchronous protocol based on fine-grained Embedding clocks to synchronize these Embedding copies across nodes.

Technical point 2: A bounded asynchronous protocol based on fine-grained Embedding clocks

Figure 4 Cache Embedding Table structure in HET

Embedding parameters are generally organized as tables to support sparse access. To measure the consistency between Embedding copies, we augment the conventional key-value structure with a Lamport clock for each Embedding vector to record its state. By comparing Embedding clocks during training, we can tell how far a replica lags behind or runs ahead.

Figure 5 Cache read and write operations in HET

For the Embedding cache table, we allow both reads of stale Embeddings and delayed write-back of gradient updates from the cache. To preserve training quality while fully exploiting the cache, we bound the clock difference between each Embedding copy and the global Embedding by a preset threshold, so that no copy runs too far ahead of or behind its peers.
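The staleness-bounded read/write-back behavior can be sketched as follows. This is an illustrative toy, not HET's actual code: the data layout, the `fetch_fn` remote-pull callback, and the SGD-style local update are all assumptions; only the clock-difference check against a preset threshold reflects the protocol described above.

```python
import numpy as np

STALENESS_BOUND = 100  # preset clock-difference threshold

class ClockedCache:
    """Each cached Embedding carries a Lamport-style clock; a copy may be
    read locally only while it lags the global clock by at most the bound."""
    def __init__(self):
        self.entries = {}  # id -> [vector, local_clock]

    def read(self, eid, global_clock, fetch_fn):
        entry = self.entries.get(eid)
        if entry is not None and global_clock - entry[1] <= STALENESS_BOUND:
            return entry[0]                      # fresh enough: serve from cache
        vec = fetch_fn(eid)                      # too stale or missing: remote pull
        self.entries[eid] = [vec, global_clock]  # refresh the copy and its clock
        return vec

    def update(self, eid, grad, lr=0.1):
        # Gradients are applied to the cached copy and written back lazily;
        # advancing the local clock marks the pending update.
        vec, clock = self.entries[eid]
        self.entries[eid] = [vec - lr * grad, clock + 1]
```

A read within the bound costs no communication at all; only when the copy falls too far behind does the node pay for a remote pull, which is where the traffic savings come from.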

Globally, the sparse and dense parts of the model use different synchronization modes: dense parameters are synchronized with a fully synchronous protocol, while sparse parameters use the bounded asynchronous protocol based on fine-grained Embedding clocks. Through theoretical analysis, we further prove that this protocol guarantees convergence comparable to that of the fully synchronous protocol (see the paper for details).

Experimental results

We compared HET against TensorFlow, which uses the traditional parameter server architecture, and Parallax [3], which also combines a parameter server with global reduction. The models and datasets include the recommendation models Wide&Deep (WDL), DeepFM (DFM), and Deep&Cross (DCN) on the Criteo dataset, which has more than 30 million sparse features (when the Embedding dimension is expanded to 4K, the parameter count reaches the trillion scale), as well as the graph learning model GraphSAGE on the Reddit, Amazon, and ogbn-mag datasets (ogbn-mag belongs to the Open Graph Benchmark (OGB), one of the most authoritative graph learning benchmarks).

End-to-end comparison

Figure 6 Convergence effect comparison

Figure 7 End-to-end convergence speed comparison

Figures 6 and 7 show that with the clock-difference threshold set to 100, HET achieves a 6.37-20.68x speedup over TensorFlow and Parallax without significantly affecting model convergence. Within HET itself, the fine-grained Embedding cache contributes a 4.36-5.14x speedup and reduces sparse-parameter communication by up to 88%.

Cache effect comparison:

Figure 8 Cache miss rate under different cache sizes

As Figure 8 shows, a small cache, e.g. 15% of the total parameter size, achieves a cache hit rate of about 97%, meaning 97% of Embedding accesses are served from the local cache without any communication. We also observe that different cache eviction policies behave slightly differently: LFU captures long-term access tendencies, so its miss rate is lower than LRU's.
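The qualitative LRU-versus-LFU comparison can be reproduced with a tiny cache simulator on a skewed synthetic trace. This is an illustrative sketch, not HET's cache code; the Pareto trace parameters and cache capacity are arbitrary assumptions, and the LFU variant here uses a global running frequency count for simplicity.

```python
import random
from collections import Counter, OrderedDict

def miss_rate(trace, capacity, policy):
    """Tiny cache simulator comparing eviction policies on an access trace."""
    cache, freq, misses = OrderedDict(), Counter(), 0
    for key in trace:
        freq[key] += 1
        if key in cache:
            cache.move_to_end(key)           # refresh recency for LRU
        else:
            misses += 1
            if len(cache) >= capacity:
                if policy == "lru":
                    cache.popitem(last=False)        # evict least recently used
                else:
                    victim = min(cache, key=lambda k: freq[k])
                    cache.pop(victim)                # evict least frequently used
            cache[key] = True
    return misses / len(trace)

random.seed(0)
# Pareto-distributed trace: a few hot Embedding ids dominate, as in Figure 2.
trace = [int(random.paretovariate(1.2)) for _ in range(20_000)]
lru_rate = miss_rate(trace, 64, "lru")
lfu_rate = miss_rate(trace, 64, "lfu")
print(f"LRU miss rate: {lru_rate:.3f}, LFU miss rate: {lfu_rate:.3f}")
```

On a heavily skewed trace both policies keep the hot set resident, so even a small cache absorbs almost all accesses; frequency-based eviction tends to hold the long-term hot keys more reliably than recency-based eviction.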

Scalability:

Figure 9 Convergence effect under different parameter scales

We scale the model to 32 nodes and set the Embedding dimension to 4096, at which point the total parameter count reaches the trillion scale. As Figure 9 shows, HET's execution time remains significantly better than that of the baselines, demonstrating its effectiveness at this scale.

Machine Learning Team of Tencent TEG Data Platform Department:

The team develops Angel, Tencent's distributed machine learning platform, which targets the training of high-dimensional models on sparse data. Born in Tencent's big data ecosystem, Angel integrates big data, traditional machine learning, and deep learning into an end-to-end machine learning platform whose functions cover traditional machine learning, graph mining, graph learning, deep learning, and privacy-preserving computation. Within Tencent, Angel is widely used in advertising recommendation, financial risk control, user profiling, and short-video recommendation. Beyond serving internal business, Angel was open-sourced in 2017 and is the first top-level project of the LF AI Foundation from China.

To address the performance and scalability challenges that come with growing model scale, Angel platform engineers cooperated with the Peking University-Tencent Collaborative Innovation Lab to build HEAP, a sparse large-model training framework, and applied it across the advertising recommendation pipeline, driven by Tencent's internal business needs, to accelerate the training of models of various scales. Several forward-looking studies, including Embedding-cache-based training of a new generation of sparse large models, trillion-scale Embedding model training based on a hierarchical parameter server, and performance optimization of multi-GPU distributed training, have been deployed in the training of business models such as fine ranking, coarse ranking, pre-ranking, and recall, increasing cumulative GMV across Tencent's business lines by about 4%. The published HET work is a new exploration within this framework.

Peking University-Tencent Collaborative Innovation Lab:

The Peking University-Tencent Collaborative Innovation Lab was established in 2017. It mainly carries out cutting-edge exploration and talent training in the fields of artificial intelligence and big data, and builds an internationally leading school-enterprise cooperative scientific research platform and an industry-university-research base.

Through cooperative research, the laboratory has made important achievements in theoretical and technological innovation, system research and development, and industrial application, publishing more than 20 papers at top international academic conferences and in journals. Besides jointly developing Angel, the laboratory has also independently developed several open-source systems, such as:

Distributed deep learning system Hetu

https://github.com/PKU-DAIR/Hetu

Black-box optimization system OpenBox

https://github.com/PKU-DAIR/open-box

In August this year, the laboratory announced that its self-developed deep learning framework Hetu will be integrated into the Angel ecosystem. The Peking University and Tencent teams will jointly build Angel 4.0, a new-generation distributed deep learning platform for training scenarios with massive data and large model parameters, bringing new large-scale deep learning solutions to the industry.

To learn more about HET, please visit the link below:

Paper address (preprint):

https://github.com/Hsword/Het/blob/main/vldb2021_het.pdf

Project address:

https://github.com/PKU-DAIR/Hetu

References:

[1] Mudigere D, Hao Y, Huang J, et al. High-performance, distributed training of large-scale deep learning recommendation models[J]. arXiv preprint arXiv:2104.05158, 2021.

[2] Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. arXiv preprint arXiv:2101.03961, 2021.

[3] Kim S, Yu G I, Park H, et al. Parallax: Sparsity-aware data parallel training of deep neural networks[C]//Proceedings of the Fourteenth EuroSys Conference 2019. 2019: 1-15.
