ByteDance open-sources BytePS, a high-performance distributed training framework compatible with TensorFlow and other mainstream frameworks

Recently, the ByteDance AI Lab announced the open-sourcing of BytePS, a high-performance distributed deep-learning training framework. BytePS overturns the situation of the past few years in which the allreduce camp held the upper hand in performance: its training performance is up to twice that of all other current distributed frameworks, and it supports TensorFlow, PyTorch, MXNet and other open-source libraries.

BytePS combines many months of the ByteDance AI Lab's research and optimization work on distributed training communication, including priority-based communication scheduling, an RDMA implementation of PS, optimizations for PCIe switch and NUMA topologies, and innovations in the BytePS architecture itself.

The effectiveness of deep learning depends on models and data, and the latest industry research that keeps pushing deep-learning accuracy is mostly based on larger models and larger datasets. However, large models and large data place high demands on training compute: a single GPU card, or even all the GPU cards in a single server, has long been unable to meet the needs of internal training workloads. Efficient distributed training, that is, using multiple servers to train collaboratively, has therefore become a core competitive strength of deep-learning systems.

Distributed training has long had two camps: allreduce and PS (Parameter Server). Over the past three years, especially since Baidu proposed ring allreduce and Uber open-sourced the allreduce-based Horovod, the prevailing view in the industry has been that allreduce is the best communication method for distributed training, and PS implementations have indeed lagged behind allreduce in performance.
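As a rough illustration of the difference between the two camps, the Python sketch below contrasts the PS push/pull pattern with allreduce-style aggregation. It is schematic only, not BytePS's actual implementation, and the function names are purely illustrative.

```python
import numpy as np

# Schematic only: each "worker" holds a local gradient for the same parameter.
local_grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]

def ps_aggregate(grads):
    """Parameter Server pattern: workers push gradients to a server process,
    the server sums them, and each worker pulls the result back."""
    server_state = np.zeros_like(grads[0])
    for g in grads:              # "push" from each worker
        server_state += g
    return [server_state.copy() for _ in grads]  # "pull" by each worker

def allreduce_aggregate(grads):
    """Allreduce pattern: workers cooperatively reduce among themselves
    (e.g. ring allreduce) so every worker ends up with the same sum."""
    total = sum(grads)
    return [total.copy() for _ in grads]

# Both patterns produce the same aggregated gradient; they differ in who
# does the aggregation and how the traffic flows over the network.
assert all((a == b).all() for a, b in zip(ps_aggregate(local_grads),
                                          allreduce_aggregate(local_grads)))
```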

BytePS overturns allreduce's long-standing lead: its training performance is up to twice that of all other current distributed frameworks, including NVIDIA's open-source NCCL, Uber's open-source Horovod, and the distributed training modules that ship with TensorFlow, PyTorch, and MXNet.

The BytePS development team says that in shared clusters such as public or private clouds, a carefully designed, high-quality PS implementation is not inferior to allreduce at all; in some environments it can even be up to twice as fast.

In its tests, the BytePS team used virtual machines on a public cloud. Each virtual machine had eight Tesla V100 16GB GPUs, interconnected at high speed via NVLink, and the batch size on each GPU was 64. The virtual machines were connected by a 20Gbps TCP/IP network. In this setup, since the intra-machine bandwidth is large enough, the TCP/IP network bandwidth becomes the main bottleneck.

BytePS was evaluated on two models, ResNet-50 and VGG-16. ResNet-50 is a computation-intensive model (low communication requirements, little room for optimization), while VGG-16 is a communication-intensive model (high communication requirements, large room for optimization). The control group was one of the most popular communication frameworks on the market, Horovod-NCCL (based on the allreduce algorithm). The performance metric is the number of ImageNet images trained per second; higher is better.

[Figures: training throughput (ImageNet images/second) for ResNet-50 and VGG-16, BytePS vs. Horovod-NCCL]

From the two sets of results, it can be seen that for the computation-intensive ResNet-50 model, BytePS outperforms Horovod-NCCL by nearly 44%; for the communication-intensive VGG-16 model, BytePS outperforms Horovod-NCCL by about 100%.

The BytePS team also ran tests on a private cluster equipped with a 100Gbps RDMA network, where BytePS likewise showed a clear performance improvement; see GitHub ( https://github.com/bytedance/byteps ) for the detailed analysis.

Beyond outperforming all other current distributed training frameworks, BytePS is compatible with TensorFlow, PyTorch, MXNet and other training frameworks. The BytePS team says developers only need to make very small changes to train with BytePS and enjoy the high performance it brings.
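As a rough illustration of how small those changes might be, here is a minimal PyTorch-style sketch, assuming BytePS exposes a Horovod-like plugin API under `byteps.torch` (as its repository suggests); treat the exact module and function names as assumptions rather than a definitive reference.

```python
import torch
import torch.nn as nn
import torch.optim as optim
import byteps.torch as bps  # assumed plugin module name

bps.init()                               # initialize BytePS
torch.cuda.set_device(bps.local_rank())  # one process per GPU

model = nn.Linear(1024, 1000).cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01 * bps.size())

# Wrap the optimizer so gradients are synchronized through BytePS,
# and broadcast the initial model/optimizer state from rank 0.
optimizer = bps.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    inputs = torch.randn(64, 1024).cuda()   # batch size 64 per GPU, as in the test
    targets = torch.randint(0, 1000, (64,)).cuda()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()                         # communication handled by BytePS
```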

Earlier PS implementations in the industry were each tied to a specific framework, for example a PS implemented specifically for TensorFlow, or one implemented specifically for MXNet.

The BytePS open-sourced by the ByteDance AI Lab instead implements a common abstraction layer that the various common frameworks can all hook into, making it possible to support several frameworks at once, including TensorFlow, PyTorch, MXNet and other mainstream training frameworks in the industry.
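The sketch below illustrates the general plugin-over-core pattern described above. It is a conceptual illustration only and does not reproduce BytePS's actual internal interfaces; the class and method names are invented for this example.

```python
from typing import Callable

class CommunicationCore:
    """Framework-agnostic core: owns the communication logic (e.g. push/pull
    to parameter servers) and deals only with raw buffers plus a callback."""
    def push_pull_async(self, name: str, buffer, on_done: Callable[[], None]):
        # A real core would hand the buffer to the networking stack; here we
        # simply "complete" immediately to keep the sketch self-contained.
        on_done()

class TorchPlugin:
    """Per-framework plugin: converts framework tensors to raw buffers and
    registers hooks so gradients flow through the shared core."""
    def __init__(self, core: CommunicationCore):
        self.core = core

    def synchronize_gradient(self, param_name: str, grad_tensor):
        done = []
        self.core.push_pull_async(param_name, grad_tensor,
                                  lambda: done.append(True))
        return grad_tensor  # the aggregated gradient in a real implementation

# A second plugin (e.g. for MXNet or TensorFlow) would reuse the same core,
# which is what allows one implementation to serve multiple frameworks.
core = CommunicationCore()
plugin = TorchPlugin(core)
plugin.synchronize_gradient("fc.weight", [0.1, 0.2, 0.3])
```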

[Figure: BytePS architecture, with framework plug-ins (TensorFlow / PyTorch / MXNet / Keras) on top of the BytePS core]

BytePS provides plug-ins for TensorFlow, PyTorch, MXNet and Keras; users simply reference the BytePS plug-in in their code to get high-performance distributed training. The core logic is implemented inside the BytePS core, and the specific details of communication are handled entirely by BytePS, so users do not need to worry about them at all.
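To show how the pieces above are typically wired together at run time, here is a hedged sketch of launching one worker machine. The DMLC_* environment variables and the launcher path follow the ps-lite convention documented in the BytePS repository, but the exact values, addresses, and script paths here are assumptions made for illustration.

```python
import os
import subprocess

# One copy of this runs per machine; separate processes play the
# "scheduler" and "server" roles in the PS architecture.
env = dict(os.environ)
env.update({
    "DMLC_ROLE": "worker",            # this machine runs training workers
    "DMLC_WORKER_ID": "0",            # index of this worker machine
    "DMLC_NUM_WORKER": "2",           # total worker machines
    "DMLC_NUM_SERVER": "2",           # total parameter-server processes
    "DMLC_PS_ROOT_URI": "10.0.0.1",   # scheduler address (illustrative)
    "DMLC_PS_ROOT_PORT": "1234",      # scheduler port (illustrative)
    "NVIDIA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",  # eight V100s, as in the test
})

# The BytePS launcher starts one training process per visible GPU and sets
# the local rank for each; "train.py" stands in for the user's script.
subprocess.run(
    ["python", "byteps/launcher/launch.py", "python", "train.py"],
    env=env, check=True,
)
```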

The BytePS team says that the field of deep learning still holds very large space and possibilities worth exploring together with industry peers. By open-sourcing BytePS, they hope to use its advantages in performance and functionality to lower the barrier for developers and deep-learning practitioners, and to help more people explore deep learning together and improve the efficiency of AI applications.


Source: www.oschina.net/news/107811/bytedance-opensource-byteps