Paper Review: RESOURCE ELASTICITY IN DISTRIBUTED DEEP LEARNING

1 Introduction

  On the supply side, resource allocation in today's distributed learning systems is set manually, based on similar jobs run before; for a workload running for the first time, trial and error is the only way to find the optimal resource allocation.

  But trial and error is very expensive: each attempt spends several minutes rebuilding the computation graph, and deciding how many resources the current job should get requires knowing the job's characteristics in advance.

  As a result, current resource allocation policies over-allocate, which is bad for two reasons. First, it wastes resources: it is not only costly but also leaves physical hardware underutilized. Second, over-allocation cannot solve the straggler problem: since the speed of a distributed job is decided by its slowest machine, a single inefficient machine can drag down the entire cluster.

1.1 Main challenges

  Challenge 1

  Current users' resource-allocation habits are shaped by the mainstream distributed learning systems, such as TensorFlow and PyTorch.

  In TensorFlow, the cluster size is fixed when the job starts and cannot be changed dynamically once training begins; in PyTorch, dynamism is mainly reflected in inputs and operations, not in resources.
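  For example, with TensorFlow's multi-worker tf.distribute API, the cluster membership is read from the TF_CONFIG environment variable when the strategy is constructed. A minimal sketch, assuming TF 2.x (the worker addresses are placeholders):

```python
import json
import os

import tensorflow as tf

# Cluster membership is declared once, before training starts. Adding or
# removing a worker later means restarting every process with a new
# TF_CONFIG. (Addresses below are placeholders.)
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["10.0.0.1:2222", "10.0.0.2:2222"]  # fixed at startup
    },
    "task": {"type": "worker", "index": 0},
})

# The strategy reads TF_CONFIG at construction time; the worker set it
# sees here is the one it keeps for the entire job.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```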

  So an existing job's resource allocation stays the same throughout its life cycle; in the face of strong demand for dynamic resource changes, existing systems struggle to provide this kind of elasticity.

  Challenge 2

  Simply scaling out training increases the batch size, which affects the convergence of the model (refer to the detailed blog post). Simply put, training with a large batch size tends to converge to a sharp minimum, while training with a small batch size tends to converge to a flat minimum.
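  Concretely, under synchronous data parallelism the effective batch size grows linearly with the worker count:

$$B_{\text{global}} = N_{\text{workers}} \times B_{\text{per-worker}}$$

  For example, scaling from 4 to 16 workers at a fixed per-worker batch of 64 pushes the effective batch from 256 to 1024, exactly the regime where sharp-minimum convergence becomes a concern.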

1.2 Autoscaling engine for distributed learning   

  The autoscaling engine designed here can change a job's resource allocation while it runs: it reuses the existing system's processes and keeps all relevant state in memory, in order to minimize idle time.
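  A minimal sketch of that control loop follows; every class and method name here is an illustrative assumption, not the paper's actual API. The point is the shape of the loop: measure throughput, decide on a resize, and hand in-memory state to the new configuration instead of restarting from a disk checkpoint.

```python
import time

class FakeCluster:
    """Stand-in for the real worker set so the sketch below runs as-is."""
    def __init__(self, size):
        self._size = size
        self._steps = 0
    def size(self): return self._size
    def finished(self): self._steps += 1; return self._steps > 3
    def measured_throughput(self): return 100.0 * self._size   # samples/sec
    def snapshot_state(self): return {"params": "in-memory tensors"}
    def restore_state(self, state): pass
    def resize(self, new_size): self._size = new_size

def suggest_delta(throughput, size):
    # Placeholder policy; the real heuristic (contribution 2 below) weighs
    # cost against throughput. Here we just grow toward 4 workers.
    return 1 if size < 4 else 0

def autoscale(cluster, interval_s=0.0):
    """Measure, decide, resize -- keeping state in memory and reusing
    existing processes so the job never restarts from a checkpoint."""
    while not cluster.finished():
        delta = suggest_delta(cluster.measured_throughput(), cluster.size())
        if delta != 0:
            state = cluster.snapshot_state()    # state stays in memory
            cluster.resize(cluster.size() + delta)
            cluster.restore_state(state)        # idle time ~ the resize only
        time.sleep(interval_s)

autoscale(FakeCluster(size=2))
```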

  With these ideas in mind, the paper makes the following contributions:

  1. Describes, at the architectural level, the resource-scaling limits of current state-of-the-art distributed learning systems.

  2. Designs heuristic scaling policies for distributed learning that take both cost and throughput into account (a sketch of such a policy follows this list).

  3. Presents the first distributed learning engine to solve the straggler problem without over-allocating resources.
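  As a sketch of what a cost- and throughput-aware scaling heuristic could look like (the rule, function name, and threshold below are illustrative assumptions, not necessarily the paper's exact policy): keep adding workers only while the marginal throughput bought by each extra worker is worth its cost.

```python
def should_add_worker(history, cost_per_worker_hour, min_gain_per_dollar=0.1):
    """history: list of (num_workers, samples_per_sec) observed so far.
    Returns True if one more worker still buys enough throughput per dollar.
    (Function name, arguments, and threshold are assumptions.)"""
    if len(history) < 2:
        return True  # not enough data yet: probe by scaling up once
    (n_prev, t_prev), (n_curr, t_curr) = history[-2:]
    if n_curr == n_prev:
        return False
    marginal_throughput = (t_curr - t_prev) / (n_curr - n_prev)
    # Scale up only while each extra worker still buys enough throughput
    # per dollar; otherwise the allocation has hit diminishing returns.
    return marginal_throughput / cost_per_worker_hour > min_gain_per_dollar

# Example: throughput grew from 800 to 900 samples/sec going 8 -> 9 workers.
print(should_add_worker([(8, 800.0), (9, 900.0)], cost_per_worker_hour=3.0))
```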

2 Background

 
