How to solve "staleness" in ASGD? An intensive reading of the paper *Staleness-aware Async-SGD for Distributed Deep Learning*

Original paper: Staleness-aware Async-SGD for Distributed Deep Learning

Abstract

When using asynchronous SGD, how to control the hyperparameters (for example, how to adjust the learning rate as training progresses) is an important issue, and the hyperparameter settings have a very strong influence on asynchronous SGD.

This paper proposes a variant of asynchronous SGD that adjusts the learning rate according to the gradient staleness, and the convergence of the algorithm is theoretically guaranteed.

1 Introduction

Commonly used distributed machine learning methods include SSGD (Synchronous Stochastic Gradient Descent) and ASGD (Asynchronous Stochastic Gradient Descent).

With SSGD, in each round every worker node must wait for the slowest worker to finish its computation before the next round can start, which greatly hurts training speed.

With ASGD, there is no need to wait for all worker nodes to finish the current round: as soon as a worker node's gradient has been applied, that worker can directly proceed to the next round. This introduces a new problem: stale gradients, i.e. the parameters a node computes its gradient on may lag behind the latest parameters. The effect is large: for a fixed number of training epochs, ASGD trains to a much worse result than SSGD.

The authors observe that the convergence of ASGD is strongly affected by the hyperparameters (such as the learning rate and batch size) and by the distributed-system implementation (such as the synchronization protocol and the number of nodes), and that there is currently a lack of studies on how to set the hyperparameters to improve ASGD.

So the authors propose an ASGD variant that automatically adjusts the learning rate, so that ASGD avoids the trouble caused by stale gradients. In this method the staleness of each gradient is recorded, and the learning rate is divided by that staleness to obtain the new learning rate. They also prove that the convergence rate of this method matches that of SSGD.

Some earlier related work achieves good results on distributed systems with a small number of nodes by decaying the learning rate exponentially. In large distributed systems, however, the learning rate then becomes extremely small (close to 0) as training progresses, which makes the whole system converge too slowly.

2 System Architecture

This part describes the design of the whole algorithm. Two protocols are presented:

  • n-softsync protocol: aware of gradient staleness, and limits excessive staleness through soft synchronization;
  • hardsync protocol: i.e. SSGD, used as the baseline for comparison.

2.1 Architecture Overview

Description of some parameters:

  • $\lambda$: the number of worker nodes;
  • $\mu$: the batch size of each node;
  • $\alpha$: the learning rate;
  • Epoch: one complete pass over the training dataset;
  • Timestamp: the timestamp, incremented every time the parameters are updated; each gradient is tagged with the timestamp of the weights it was computed on.
  • $\tau_{i,l}$: the gradient staleness of worker node $l$. If at timestamp $i$ the server receives from node $l$ a gradient computed on weights from timestamp $j$, that gradient is stale and $\tau_{i,l} = i - j$ (a tiny numeric illustration follows this list).
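
A tiny (made-up) numeric illustration of this bookkeeping:

```python
# A worker pulls weights when the server timestamp is j = 5 and computes a gradient on them.
# Other workers' updates move the server timestamp to i = 8 before that gradient arrives.
i, j = 8, 5
tau = i - j   # staleness of this gradient: tau = 3
```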

Each worker node performs the following operations in sequence:

  • getMinibatch: fetch the mini-batch used for this round's computation.
  • pullWeights: fetch the weights from the parameter server.
  • calcGradient: compute the gradient of the current mini-batch based on those weights.
  • pushGradient: send the computed gradient to the parameter server.

The parameter server performs the following operations (a minimal sketch of both the worker and server sides follows this list):

  • sumGradient: aggregate the received gradients.
  • applyUpdate: update the parameters with the aggregated gradient.
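
To make the control flow concrete, here is a minimal single-machine sketch of both sides on a toy least-squares problem. The names (`ToyParameterServer`, `worker_step`, the batch size `mu`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class ToyParameterServer:
    """In-memory stand-in for the parameter server."""

    def __init__(self, dim, lr=0.01):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.timestamp = 0        # incremented on every parameter update
        self.buffer = []          # gradients received since the last update

    def pull_weights(self):
        # pullWeights: hand the worker the current weights and their timestamp
        return self.weights.copy(), self.timestamp

    def push_gradient(self, grad):
        # pushGradient arrives here; the server just buffers it
        self.buffer.append(grad)

    def sum_gradient(self):
        # sumGradient: aggregate (average) whatever has been buffered
        return np.mean(self.buffer, axis=0)

    def apply_update(self):
        # applyUpdate: one SGD step with the aggregated gradient
        self.weights -= self.lr * self.sum_gradient()
        self.buffer = []
        self.timestamp += 1


def worker_step(server, X, y, mu=8):
    """One worker iteration: getMinibatch -> pullWeights -> calcGradient -> pushGradient."""
    idx = np.random.choice(len(X), size=mu, replace=False)   # getMinibatch (batch size mu)
    xb, yb = X[idx], y[idx]
    w, ts = server.pull_weights()                             # pullWeights
    grad = 2.0 * xb.T @ (xb @ w - yb) / mu                    # calcGradient (least-squares loss)
    server.push_gradient(grad)                                # pushGradient
    return ts                                                 # timestamp the gradient was based on
```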

2.2 Synchronization Protocols

hardsync protocol: during each update round, the parameter server executes sumGradient and then applyUpdate. Only after these two operations are finished can the worker nodes obtain the new parameters via pullWeights, which means every worker has to wait for the slowest worker to finish. This method is slower, but its training accuracy is high.
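
Using the toy classes from the sketch above, one hardsync (SSGD) round on the server side would look roughly like this (workers simulated sequentially for simplicity):

```python
def hardsync_round(server, X, y, num_workers):
    """One hardsync (SSGD) round: collect a gradient from every worker, then update once."""
    for _ in range(num_workers):      # the real system runs these in parallel
        worker_step(server, X, y)
    # sumGradient + applyUpdate happen only after all lambda gradients have arrived;
    # only then can the workers pullWeights again.
    server.apply_update()
```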

n-softsync protocol: here the parameter server only needs to receive at least $c=\lfloor \lambda/n \rfloor$ gradients before it aggregates those $c$ gradients and updates the parameters; the workers that have finished can then fetch the new parameters and continue training ($n$ is a hyperparameter; if $n=\lambda$ this is similar to previously proposed ASGD, except that the learning rate in this paper's method changes dynamically). The update formula is shown below:

*(Figure: the n-softsync parameter update formula.)*

The learning rate in this update is computed from the staleness of each gradient.
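
A minimal sketch of the n-softsync server logic under the same toy setup: the server updates as soon as $c = \lfloor \lambda/n \rfloor$ gradients have arrived, scaling each one by a learning rate divided by its staleness. The averaging over $c$ and the function name `softsync_update` are my reading of the update rule, not the paper's code:

```python
import numpy as np

def softsync_update(server, pushed, lam, n, lr0):
    """Apply one n-softsync update.

    pushed: list of (gradient, pull_timestamp) pairs collected by the server,
            where pull_timestamp is the timestamp of the weights the gradient
            was computed on.
    Returns True if an update was applied, False if fewer than c gradients arrived.
    """
    c = lam // n                                   # c = floor(lambda / n)
    if len(pushed) < c:
        return False                               # keep waiting for more gradients
    step = np.zeros_like(server.weights)
    for grad, ts in pushed[:c]:
        tau = server.timestamp - ts                # staleness of this gradient
        lr = lr0 / tau if tau > 0 else lr0         # staleness-dependent learning rate
        step += lr * grad
    server.weights -= step / c                     # average of the c rescaled gradients
    server.timestamp += 1
    return True
```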

2.3 Implementation Details

The method in this paper still keeps some synchronization: for example, a worker's pullWeights can only fetch the parameters after its pushGradient has completed. This ensures that the model parameters do not become inconsistent due to concurrency.

In the implementation, one machine may host several worker nodes; model parallelism is not used.

2.4 Staleness Analysis

For the hardsync protocol, the server must receive the gradients from all nodes before updating, so no staleness occurs, i.e. $\tau=0$.

Different values of $n$ lead to different amounts of staleness. The authors measured $\tau$ for $\lambda=30$ and various $n$; the results are shown in Figure 1:

*(Figure 1: distribution of gradient staleness $\tau$ for $\lambda=30$ and different values of $n$.)*

What puzzles me about these results is the case $n=1$: there the parameter server does not update until it has received at least 30 gradients, which should behave the same as SSGD, i.e. $\tau=0$ should hold. The experimental results show otherwise, so my guess is that two of the 30 received gradients may come from the same node; but according to the description in 2.3, pullWeights can only run after pushGradient has completed, which seems to conflict with that guess.

Through experiments, the authors found that for the n-softsync protocol, $\tau$ is mostly equal to $n$. With this intuition, we can tune $n$ to control how much staleness is tolerated.

At timestamp $i$, the learning rate of node $l$ is determined by the following formula:
$$
\alpha_{i,l}=\frac{\alpha_0}{\tau_{i,l}}\quad \text{if } \tau_{i,l} > 0, \qquad
\alpha_{i,l}=\alpha_0\quad \text{if } \tau_{i,l} = 0
$$
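
In code, this rule is just a guarded division; a tiny illustrative helper (not from the paper):

```python
def staleness_lr(alpha0, tau):
    """alpha0 / tau for stale gradients, alpha0 for fresh ones (tau == 0)."""
    return alpha0 / tau if tau > 0 else alpha0

print(staleness_lr(0.1, 4))   # 0.025 -- a gradient 4 updates old gets a quarter of the rate
print(staleness_lr(0.1, 0))   # 0.1   -- a fresh gradient keeps the base rate
```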

3 Theoretical Analysis

This part proves convergence, but with my limited ability I don't understand it very well (mainly because I don't want to read it).

4 Experimental Results

4.1 Hardware and Benchmark Datasets

This section gives details of the hardware and datasets used; see the original paper.

4.2 Runtime Evaluation

Figure 2 shows, on the two datasets and for different numbers of nodes, the speedup gained by updating without waiting for all gradients:

*(Figure 2: speedup versus number of worker nodes on the two datasets.)*

The plots show that the proposed method achieves approximately linear speedup as the number of nodes increases.

4.3 Model Accuracy Evaluation

Figure 3 shows the training error and test accuracy on CIFAR10 (the top two plots use fixed learning rates, the bottom two use the dynamically adjusted learning rate):

*(Figure 3: training error and test accuracy on CIFAR10.)*

Figure 4 shows the results on ImageNet:

*(Figure 4: results on ImageNet.)*

The comparison of the plots shows that when the learning rate is adjusted dynamically according to staleness, n-softsync matches SSGD much more closely.

5 Conclusion

Contributions of this article:

  • Proved that ASGD can achieve the same convergence rate as SGD;
  • Quantified gradient staleness on several datasets. The experiments used up to 30 worker nodes, which can be regarded as an attempt at a relatively large distributed system.


Source: blog.csdn.net/qq_45523675/article/details/129303946