Train a Hundred-Million-Scale Knowledge Graph in Half an Hour: Amazon AI Open-Sources DGL-KE, a Knowledge Graph Embedding Framework

Produced by | AI Tech Base Camp (ID: rgznai100)

Knowledge graphs (Knowledge Graph) have become an important technology in recent years and are widely used in information retrieval, natural language processing, recommendation systems, and other fields. Knowledge graph embedding (Knowledge Graph Embeddings) is an unsupervised method that generates node features (node features) from the structure of a knowledge graph; the resulting features can then be used in a variety of machine learning tasks. For example, the embeddings can be used to predict whether a link exists between two nodes (link prediction).
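As a concrete illustration (not part of DGL-KE itself), the minimal sketch below scores a candidate link with a TransE-style rule: a triple (head, relation, tail) is plausible when head + relation ≈ tail in embedding space. The entity names and embedding values here are made up for the example.

```python
import numpy as np

# Toy embeddings; in practice these come from a trained model such as DGL-KE's output.
entity_emb = {
    "Seattle":    np.array([0.1, 0.9, 0.3]),
    "Washington": np.array([0.5, 1.0, 0.8]),
}
relation_emb = {
    "located_in": np.array([0.4, 0.1, 0.5]),
}

def transe_score(head, relation, tail):
    """TransE plausibility: a smaller ||h + r - t|| means a more likely link."""
    return -np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

print(transe_score("Seattle", "located_in", "Washington"))
```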

However, as the typical scenarios behind graph data, such as social networks and recommendation systems, keep developing, the scale of knowledge graphs keeps growing. In real industrial settings, engineers often have to work with graph data containing tens of millions or even hundreds of millions of nodes. How to train embedding representations quickly and efficiently on knowledge graphs of this scale is currently a real challenge.

Recently, following DGL, the Amazon AI team has open-sourced DGL-KE, a new training framework designed specifically for large-scale knowledge graph embedding. It aims to let researchers and industry users conveniently and quickly run machine learning training tasks on large-scale knowledge graph data.

 

GitHub address: https://github.com/awslabs/dgl-ke

 

Compared with existing open-source frameworks, DGL-KE's highlights are as follows:

 

  • Supports all mainstream knowledge graph representation learning algorithms, including TransE, ComplEx, DistMult, TransR, RESCAL, RotatE, and more;

  • The only existing open-source knowledge graph embedding framework that supports multi-core CPU training, multi-GPU training, CPU-GPU hybrid training, and distributed training;

  • Easy to use: users can feed knowledge graph data directly as input without writing any code (see the usage sketch after this list);

  • High performance and scalability. According to the benchmark DGL-KE released on the Freebase dataset (more than 86 million nodes and 300 million edges), on the AWS EC2 platform training completes within 100 minutes on a single p3.16xlarge instance (8 GPUs), and within 30 minutes on four r5dn.24xlarge instances (4 × 48 CPUs), achieving near-linear speedup. This is about 2 to 5 times faster than the fastest comparable systems (such as Facebook's PyTorch-BigGraph).
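To give a rough idea of what "no code needed" means, the sketch below launches one of DGL-KE's command-line training entry points (dglke_train) from Python; in practice you would simply run the same command in a shell. The flag names follow the examples in the DGL-KE README at the time of writing and may differ between versions, so treat them as an assumption.

```python
import subprocess

# Minimal sketch: train a TransE model on the built-in FB15k benchmark with one GPU.
# Flag names are taken from the DGL-KE README examples and may change between releases.
subprocess.run([
    "dglke_train",
    "--model_name", "TransE_l2",   # any supported model: ComplEx, DistMult, RotatE, ...
    "--dataset", "FB15k",          # built-in dataset; custom triple files also work
    "--batch_size", "1000",
    "--neg_sample_size", "200",
    "--hidden_dim", "400",
    "--gamma", "19.9",
    "--lr", "0.25",
    "--max_step", "500",
    "--gpu", "0",
], check=True)
```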

 

Figure 1: DGL-KE system architecture

DGL-KE achieves this level of performance mainly because it introduces a number of innovations in system design and algorithm optimization:

 

(A) Distributed training based on the METIS graph partitioning algorithm

 

For training on large-scale graph data, distributed training is essential. Its main idea is to divide the original large graph into different subgraphs; each machine runs stochastic gradient descent training on one subgraph, and a parameter server (Parameter Server) synchronizes the model parameters among all machines. The architecture is shown below:

Figure 2: DGL distributed architecture

 

However, if the large graph is simply cut at random, there will be a huge amount of communication between the training machines and the parameter server (a local machine has to request the model data it needs from remote machines), creating a network bottleneck. To solve this problem, DGL-KE pre-partitions the raw graph data with the METIS graph partitioning algorithm before training.

METIS is an efficient graph partitioning algorithm proposed in 1995 by computer scientist George Karypis, who is also one of the authors of the DGL-KE project. METIS places strongly connected nodes of a large graph into the same partition (partition) as much as possible, so that most of the network communication overhead turns into local in-memory copies on each machine, greatly improving the speed of distributed training.
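To make the locality idea concrete, the sketch below partitions a tiny graph with the standalone pymetis bindings. DGL-KE integrates METIS internally, so this is only an illustration of the effect; the toy graph and partition count are made up.

```python
import pymetis

# Adjacency list of a toy 6-node graph: two triangles joined by a single bridge edge.
adjacency = [
    [1, 2],        # node 0
    [0, 2],        # node 1
    [0, 1, 3],     # node 2  (bridge to node 3)
    [2, 4, 5],     # node 3
    [3, 5],        # node 4
    [3, 4],        # node 5
]

# Ask METIS for 2 partitions; it tries to minimize the number of cut edges.
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
print(n_cuts)       # edges crossing partitions (ideally just the bridge edge)
print(membership)   # e.g. [0, 0, 0, 1, 1, 1] -> each triangle stays on one machine
```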

On the Freebase training dataset used in practice, the METIS algorithm saves almost 90% of the network bandwidth needed for model synchronization, allowing distributed training to reach near-linear speedup. Distributed training in DGL-KE uses the DGL-KVStore component. DGL-KVStore is a parameter server module custom-developed for the DGL system to synchronize sparse model parameters over the network. The component is implemented on top of C++ sockets and message queues, has serialization optimized for sparse data, and integrates seamlessly with the METIS graph partitioning algorithm.
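The following is a minimal, self-contained sketch of the sparse-synchronization idea behind a parameter server like DGL-KVStore: only the embedding rows touched by a mini-batch are pulled and pushed, rather than the whole table. The class and method names are invented for illustration and are not the DGL-KVStore API.

```python
import numpy as np

class ToyParamServer:
    """Illustrative sparse parameter server holding the full embedding table."""
    def __init__(self, num_entities, dim):
        self.emb = np.random.uniform(-0.1, 0.1, size=(num_entities, dim))

    def pull(self, ids):
        # Send back only the rows a worker needs for its mini-batch.
        return self.emb[ids].copy()

    def push(self, ids, grads, lr=0.1):
        # Apply sparse gradient updates to just those rows.
        self.emb[ids] -= lr * grads

server = ToyParamServer(num_entities=1_000_000, dim=128)
batch_ids = np.array([3, 42, 977])        # entities appearing in one mini-batch
local = server.pull(batch_ids)             # small message: 3 rows, not 1M rows
grads = 0.01 * local                       # stand-in for real gradients
server.push(batch_ids, grads)
```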

 

(B) Single-machine multi-process training based on shared memory

 

Multi-core (Multi-core) processors have become the standard in modern computer architecture; many powerful workstations pack dozens of CPU cores and hundreds of gigabytes, or even terabytes, of memory into a single machine. For graph data with tens of millions of nodes, a single machine of this class is already powerful enough to handle data of that size.

DGL-KE also makes corresponding system optimizations for this scenario, allowing users to push a single machine as close to its performance limits as possible. Unlike traditional parallel optimization based on multi-threading (Multi-thread), DGL-KE adopts coarse-grained parallelism based on multiple processes (Multi-Process). Coarse-grained parallelism maximizes the parallelism of the running program and thus improves speedup. In addition, DGL-KE synchronizes the model between different processes through shared memory (Shared-memory), greatly reducing inter-process communication overhead.
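Below is a minimal sketch of the shared-memory pattern using PyTorch's multiprocessing support (DGL-KE is built on PyTorch, but this code is only an illustration, not its internals): the embedding table lives in shared memory once, and each worker process updates it in place instead of exchanging messages.

```python
import torch
import torch.multiprocessing as mp

def worker(rank, emb):
    # Each process updates the shared table in place; no tensors are copied between processes.
    rows = torch.arange(rank * 100, (rank + 1) * 100)
    emb[rows] += 0.01 * torch.randn(100, emb.shape[1])

if __name__ == "__main__":
    emb = torch.zeros(1000, 128)
    emb.share_memory_()            # place the embedding table in shared memory
    procs = [mp.Process(target=worker, args=(r, emb)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```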

 

Figure 3: Single-machine multi-process training based on shared memory

 

(C) CPU-GPU hybrid training

 

Training knowledge graph embeddings involves a large number of matrix operations, and matrix operations can be accelerated by GPUs. For small-scale graph data, DGL-KE lets users put the entire model on the GPU for training, achieving optimal performance. However, GPU memory is much smaller than CPU memory; once the size of the model embeddings exceeds the GPU memory limit, training this way is no longer possible. For such scenarios, DGL-KE provides a CPU-GPU hybrid training mode.


In CPU-GPU hybrid training mode, the model embeddings are stored in CPU memory, and in each iteration the training process copies a small amount of data from the CPU to the GPU in mini-batch fashion. To hide the overhead of copying data between CPU and GPU, DGL-KE copies training data asynchronously so that data transfer overlaps with computation. However, asynchronous computation reduces the model's convergence speed and accuracy, so DGL-KE applies a further optimization here: entity embeddings and relation embeddings are updated in different ways: relation embeddings are updated synchronously, while entity embeddings are updated asynchronously.

This is done because, in practice, the relations in many datasets follow a long-tailed distribution: a few relation types account for the vast majority of triples, so asynchronous updates would cause a large number of conflicting updates on the relation embeddings during training, hurting the convergence and accuracy of the model. Entities, by contrast, are usually accessed sparsely during training, so asynchronous updates produce only a few conflicts. With this simple optimization, DGL-KE guarantees the convergence of model training while preserving system performance.
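The sketch below illustrates the hybrid pattern with plain PyTorch (again an illustration of the idea, not DGL-KE's actual code): the entity table stays in CPU memory, only the mini-batch rows move to the GPU with a non-blocking copy so the transfer can overlap with compute, and the small relation table lives on the GPU and is updated synchronously. Sizes and the loss are made up for the example.

```python
import torch

device = torch.device("cuda")

# The (large) entity table stays in CPU memory; in real workloads it does not fit on the GPU.
entity_emb = torch.randn(100_000, 128)
# The small relation table fits in GPU memory and is updated synchronously there.
relation_emb = torch.randn(1_000, 128, device=device, requires_grad=True)

copy_stream = torch.cuda.Stream()
ent_ids = torch.randint(0, 100_000, (1024,))
rel_ids = torch.randint(0, 1_000, (1024,), device=device)

# Gather only the mini-batch rows into pinned memory so the copy can be asynchronous.
ent_batch_cpu = entity_emb[ent_ids].pin_memory()
with torch.cuda.stream(copy_stream):
    ent_batch = ent_batch_cpu.to(device, non_blocking=True)   # overlaps with other GPU work
torch.cuda.current_stream().wait_stream(copy_stream)

ent_batch.requires_grad_(True)
# Stand-in for a real score/loss; gradients flow to both the batch entities and the relations.
loss = (ent_batch + relation_emb[rel_ids]).pow(2).sum()
loss.backward()

# Batch entity rows are written back to the CPU table (done asynchronously in DGL-KE).
entity_emb[ent_ids] -= 0.01 * ent_batch.grad.cpu()
```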

 

Figure 4: CPU-GPU hybrid training

 

Beyond the optimizations above, DGL-KE provides a number of further ones, such as joint negative sampling to accelerate the negative sampling process, relation partitioning to reduce data copying during training, and periodic synchronization to guarantee model convergence. DGL-KE also has built-in handling for multiple knowledge graph dataset formats, which users can download and use directly.
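As an illustration of the negative sampling idea (DGL-KE's scheme is joint/shared sampling across a batch; this sketch only shows the shape of it, with made-up sizes and stand-in embeddings): instead of drawing fresh corrupt tails for every positive triple, a small pool of negative entities is drawn once per batch and shared by all positives, which turns the negative scoring step into one dense matrix operation.

```python
import numpy as np

num_entities, dim = 10_000, 128
batch = 1024            # positive triples in a mini-batch
neg_size = 256          # negatives shared by the whole batch

entity_emb = np.random.randn(num_entities, dim).astype(np.float32)
head = np.random.randint(0, num_entities, batch)
rel = np.random.randn(batch, dim).astype(np.float32)   # stand-in relation embeddings

# Joint negative sampling: one shared pool of corrupt tails for the entire batch.
neg_tails = np.random.randint(0, num_entities, neg_size)

# Scoring every positive against every shared negative becomes a single matmul:
# (batch, dim) x (dim, neg_size) -> (batch, neg_size) negative scores.
query = entity_emb[head] + rel
neg_scores = query @ entity_emb[neg_tails].T
print(neg_scores.shape)   # (1024, 256)
```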

In addition, DGL-KE provides training benchmarks on two small datasets, FB15k and wn18, as well as on the large Freebase dataset; users can reproduce the reported results directly with the provided scripts. Compared with existing open-source frameworks, DGL-KE shows a significant performance advantage, as illustrated by the comparisons below against GraphVite on the FB15k dataset and against PyTorch-BigGraph on the Freebase dataset.

 

DGL-KE vs GraphVite

DGL-KE vs PyTorch-BigGraph

【end】





Origin blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/105354901