AI Distributed Training: Advanced Topics

Table of contents

What are the ideas behind AI distributed algorithms?

Parameter Server algorithm

Ring Allreduce algorithm

The evolution of algorithmic ideas of Ring Allreduce

OrionX helps AI distributed training


In a previous article, "OrionX (Orion) AI accelerator resource pooling software empowering deep learning distributed training", I introduced how OrionX supports deep learning distributed training and the important role that distributed training plays in AI scenarios. Today, both the mainstream deep learning frameworks (such as TensorFlow, PyTorch, PaddlePaddle, and MXNet) and distributed training tools (Horovod, DeepSpeed) are constantly trying to break through and optimize AI distributed training algorithms to meet the need for efficient model training at larger scale and in more complex scenarios.

To talk about AI distributed training, you first need to understand the principles behind it. The idea of distributed computing did not originate with the rapid development of AI over the past decade: the earlier HPC and big data fields had already produced a large body of practice and results in distributed computing. However, the forward and backward computation pattern of today's AI algorithms poses a harder challenge for distributed algorithms, so AI distributed algorithms have drawn on the experience of their predecessors while taking their own path of innovation.


What are the ideas behind AI distributed algorithms?

The current mainstream AI distributed algorithms fall into two categories: Parameter Server and Ring Allreduce.

  • Parameter Server: a programming framework designed to make it easier to write distributed parallel programs, with a focus on distributed storage and coordination of large-scale parameters. The parameter server concept first appeared in the parallel LDA framework proposed by Alex Smola in 2010. Jeff Dean of Google then took it further with DistBelief, the first-generation Google Brain solution. In 2014, Mu Li, author of MXNet and chief scientist at AWS, proposed the third-generation parameter server design in the paper "Parameter Server for Distributed Machine Learning" (a minimal sketch of the parameter-server idea appears after this list).
  • Ring Allreduce (ring reduction): the Allreduce reduction operation originated in the parallel computing field long ago and is commonly used in HPC through MPI (Message Passing Interface), whose standard defines the collective operation MPI_Allreduce. In 2016, Baidu's Silicon Valley AI Lab (SVAIL) introduced the Ring Allreduce algorithm into deep learning for the first time, opening a new chapter in AI distributed training; it was followed by NVIDIA's NCCL, Uber's open-source Horovod, and PyTorch (a single-process sketch of the ring algorithm also appears below).
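
To make the parameter-server pattern concrete, here is a minimal single-process sketch in Python. It is not the API of any real framework: the ParameterServer class, the push/pull methods, and the toy linear-regression workers are all illustrative assumptions, meant only to show the pull-parameters / push-gradients loop that workers and the server follow.

```python
import numpy as np

# Minimal sketch of the parameter-server pattern (illustrative names, not a real API):
# one "server" holds the global parameters; each "worker" computes a gradient on its
# own data shard, pushes it to the server, and pulls the updated parameters back.

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)    # globally shared parameters
        self.lr = lr

    def push(self, grad):               # worker -> server: send a local gradient
        self.weights -= self.lr * grad  # server applies the SGD update

    def pull(self):                     # server -> worker: fetch the latest parameters
        return self.weights.copy()

def worker_step(server, x_shard, y_shard):
    w = server.pull()                                            # pull current parameters
    grad = x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)    # local gradient (toy linear regression)
    server.push(grad)                                            # push the gradient to the server

# Toy run: two workers train a shared linear model on different data shards.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
server = ParameterServer(dim=3)
for _ in range(20):
    for x_s, y_s in [(X[:50], y[:50]), (X[50:], y[50:])]:
        worker_step(server, x_s, y_s)
print(server.weights)
```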

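The ring algorithm itself can also be sketched in a single process, again only under simplifying assumptions (every "node" is just a list entry and communication is modeled as array copies): each node's gradient is split into N chunks, a scatter-reduce phase accumulates one fully reduced chunk per node, and an allgather phase circulates those reduced chunks so every node ends up with the complete sum.

```python
import numpy as np

# Single-process simulation of Ring Allreduce (illustrative only): each of the
# n "nodes" holds a gradient split into n chunks. Scatter-reduce accumulates one
# fully reduced chunk per node; allgather then circulates those reduced chunks
# around the ring until every node holds the complete element-wise sum.

def ring_allreduce(grads):
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1: scatter-reduce. At step s, node i sends chunk (i - s) to node i + 1,
    # which adds it to its own copy. After n - 1 steps, node i holds the fully
    # reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: allgather. At step s, node i forwards the reduced chunk (i + 1 - s)
    # to node i + 1, which simply overwrites its stale copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

# Toy check: 4 nodes, each with a different gradient; all should end with the sum.
grads = [np.full(8, k, dtype=float) for k in range(4)]
for out in ring_allreduce(grads):
    assert np.allclose(out, sum(grads))
```

The appeal of the ring layout is that each node only ever exchanges data with its two neighbors, so the per-node communication volume stays roughly constant as the number of nodes grows, instead of concentrating traffic on a central server.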

Source: blog.csdn.net/m0_49711991/article/details/120287034