How can distributed machine learning handle geo-distributed resources? DLion: Decentralized Distributed Deep Learning in Micro-Clouds — Paper Intensive Reading

Original link: DLion (acm.org)

Abstract

With the popularity of smartphones, edge devices collect more and more user data. However, gathering all of this data in a data center for distributed training is unrealistic, mainly because of user privacy concerns.

At the same time, if the data is processed directly on the user's own device, training a large model becomes impractical because of the limited performance of edge hardware.

For these two reasons, the authors propose a distributed training architecture built on micro-clouds.

1 Introduction

The popularity of edge devices has generated a huge amount of data. Building data centers in the traditional way faces two problems:

  • Too much data would have to be moved, which is hard to achieve;
  • User privacy: users are unwilling to hand over their data.

Federated learning was proposed to address the privacy issue: training runs directly on the edge devices. However, when the model is large, training still takes a long time even if each edge device only has to process a small amount of data. (A fairly large model can contain hundreds of megabytes of parameters, meaning each training step needs hundreds of megabytes of memory; that is already difficult for current smartphones with 6–8 GB of RAM, and it does not even account for the compute gap between phones and servers.)

Micro-clouds are far more powerful than edge devices. At the same time, with micro-clouds, data only needs to be uploaded within a local region, which also protects user privacy to some extent.

Figure 1 shows the basic architecture of distributed computing with micro-clouds:


There are two major challenges in using micro clouds for computing:

  • Heterogeneous and dynamic compute resources: the performance of micro-clouds in different regions can vary greatly. Moreover, because machines in a micro-cloud may also be assigned to other services, the performance of a given micro-cloud changes dynamically over time;
  • Heterogeneous and dynamic network resources: nodes within the same micro-cloud communicate over a LAN, while different micro-clouds communicate over a WAN, whose bandwidth is very limited. LAN bandwidth is also diluted as the number of nodes in a micro-cloud grows, so network resources change dynamically as well.

Most distributed training systems do not take this resource heterogeneity and dynamism into account, which increases training time.

DLion, proposed in this paper, is mainly designed to handle distributed training under these geo-distributed and dynamic conditions.

DLion uses three key techniques to address these problems:

  • Weighted dynamic batching: addresses heterogeneous compute performance.
  • Per-link prioritized gradient exchange: addresses the limited and uneven network resources across micro-clouds.
  • Direct knowledge transfer: improves model accuracy.

The main contributions of this article:

  • A distributed training system that handles geo-distributed and dynamic resources is proposed;
  • The design is simple and flexible, easy to modify, and applicable to different kinds of distributed training;
  • A prototype of the system is built on TensorFlow, supporting both CPU and GPU versions.

2 Background and Motivation

2.1 Distributed Deep Learning

This section introduces machine learning using gradient descent and distributed machine learning using gradient descent.

Machine learning: Calculate gradients to update model parameters;

Distributed machine learning: each worker node computes its own gradient, the gradients are aggregated, and then the parameters are updated.
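As a concrete illustration, below is a minimal NumPy sketch of this idea (my own toy example, not the paper's code): each worker computes a gradient on its local data shard, the gradients are averaged, and all workers apply the same update. The linear model, the data shards, and the learning rate are placeholder assumptions.

```python
import numpy as np

def worker_gradient(w, X, y):
    """Least-squares gradient on this worker's local mini-batch."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def distributed_step(w, shards, lr=0.01):
    """One synchronous step: aggregate worker gradients, then update."""
    grads = [worker_gradient(w, X, y) for X, y in shards]
    avg_grad = np.mean(grads, axis=0)       # "summarize" the per-worker gradients
    return w - lr * avg_grad                # identical update applied on every node

rng = np.random.default_rng(0)
w = np.zeros(5)
shards = [(rng.normal(size=(32, 5)), rng.normal(size=32)) for _ in range(4)]
for _ in range(100):
    w = distributed_step(w, shards)
```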

2.2 Distributed Deep Learning Systems

Frameworks that implement distributed machine learning with a parameter server: both TensorFlow and MXNet provide a PS (Parameter Server) implementation.

Frameworks that implement distributed training without a parameter server: Ako, Hop, Prague, etc.

Figure 2 shows the difference between the two: (a) uses a parameter server, (b) has no parameter server (decentralized).


These frameworks simplify the process of distributed machine learning.

2.3 DL in Micro-Clouds

The overall structure is shown in Figure 3: nodes inside a micro-cloud are connected through a LAN, and micro-clouds are connected to each other through a WAN.


2.4 Challenges and Motivation

This subsection revisits the two challenges already mentioned in Section 1, so they are not repeated here.

3 Our Approach: DLion

3.1 Design Goals and Overview

The framework is designed without a central parameter server.

Design goals:

  • Maximize data parallelism: greater data parallelism reduces training time, but the resulting accuracy loss must be kept as small as possible.
  • Reduce communication time: cut the communication time between nodes while, again, losing as little model accuracy as possible.
  • Improve model accuracy: offset the negative effects above by sharing knowledge between nodes.

These three goals correspond, in order, to the three key techniques introduced earlier. Figure 4 shows when each technique is applied and the overall workflow:


3.2 Weighted Dynamic Batching

Two concepts are introduced first:

  • LBS (Local Batch Size): the local batch size of each worker node participating in the computation.
  • GBS (Global Batch Size): the global batch size of the whole distributed system in one training round, i.e., the sum of the LBS of all nodes.

Note that in the traditional distributed architecture, every worker node uses the same LBS.

GBS can usually be increased either by enlarging the LBS or by adding more nodes; here we only discuss changing GBS by changing LBS, with the number of worker nodes fixed at n.

Increasing GBS has both advantages and disadvantages:

  • Benefit: more work is done per round, so the time required for training decreases;
  • Drawback: increasing GBS usually lowers the final training accuracy.

So when increasing GBS, one needs to find a suitable value that reduces training time as much as possible without losing too much accuracy.

An earlier paper pointed out that the model's final convergence accuracy does not have to be maintained by controlling the learning rate; it can also be maintained by periodically changing GBS. Inspired by this, the authors designed an automatically adjusting controller that reduces computation time with almost no loss in model accuracy.

When every worker has identical, stable compute performance, setting each node's LBS to $\frac{GBS}{n}$ is reasonable. In practice, however, workers differ in performance and each node's performance may fluctuate. If LBS is still set equally in this case, the nodes that finish first must wait for the node that finishes last, which increases the time required per training step.

The weighted dynamic batching method has three components:

  • GBS Controller: automatically controls how GBS grows or shrinks. The design is inspired by two observations: increasing GBS at the very beginning of training causes a severe drop in accuracy, while increasing it later causes only a small and stable drop (see Figure 5, where the horizontal axis is the epoch at which GBS is increased and the vertical axis is the final model accuracy). These observations lead the authors to split the controller into two phases, warm-up and speed-up. In the first phase, $GBS_{t+1} = GBS_t + C_{warmup}$, and the increase stops once GBS reaches 1% of the total data volume (to prevent an overly aggressive increase from hurting accuracy). In the second phase, $GBS_{t+1} = GBS_t \times C_{speedup}$, and the increase stops once GBS reaches 10% of the total data volume (existing work finds that GBS should not be made too large). The constants $C_{warmup}$ and $C_{speedup}$ need to be set. (A code sketch of this controller, together with the LBS allocation below, follows this list.)

  • LBS Controller: once GBS is determined, this component decides the LBS of each worker node. The idea is simple: nodes with stronger compute power get a larger LBS, i.e., data is allocated in proportion to compute power (this keeps the computation time of all nodes close). Figure 6 shows how LBS is adjusted. LBS is computed by the following formula, where RCP stands for relative computation power:
    $$LBS_i = GBS \cdot \frac{RCP_i}{\sum_{j=1}^{n} RCP_j}$$

  • Weighted model update module: this module is responsible for aggregating the parameters. After node $j$ computes its gradient $g_t^j$ and node $k$ receives it, node $k$ performs the following weighted update (a sketch of this rule also follows the list):
    $$w^k_{t+1} = w^k_t - \eta \, \frac{1}{n} \sum_{j=1}^{n} db^k_j \, g^j_t, \qquad db^k_j = \frac{LBS_j}{LBS_k}$$


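And a small sketch of the weighted update rule itself, matching the formula above; the gradients, batch sizes, and learning rate are placeholder values.

```python
import numpy as np

def weighted_update(w_k, grads, lbs, k, lr=0.1):
    """Node k scales each node j's gradient by db_j^k = LBS_j / LBS_k before averaging."""
    n = len(grads)
    scaled = sum((lbs[j] / lbs[k]) * grads[j] for j in range(n))
    return w_k - lr * scaled / n

w_k = np.zeros(3)
grads = [np.array([0.1, -0.2, 0.05]), np.array([0.3, 0.1, -0.1])]
lbs = [128, 64]                      # node 0 processed a larger local batch
print(weighted_update(w_k, grads, lbs, k=1))
```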

3.3 Per-Link Prioritized Gradient Exchange

Data quality assurance module: this module selects the important gradients to exchange, using the Max N algorithm, which picks the largest N% of the gradients for the update.

Figure 7 shows the impact of different values of N on model accuracy.


Transmission speed assurance module: this module automatically determines the N used by the Max N algorithm. The goal is to pick as large an N as possible without the network becoming a bottleneck. In this module, the amount of gradient data that node $i$ can send to node $j$ is estimated as
$$\frac{BW\_net_j}{Iter\_com_i}$$
where $BW\_net_j$ is the available bandwidth on the link between the two nodes and $Iter\_com_i$ is the number of iterations node $i$ can perform per unit time.
The core idea of this formula is to make computation time and communication time equal.
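Below is a minimal sketch of both modules under my own assumptions: gradients are flattened into one vector, "largest" means largest absolute value, and the bandwidth and iteration numbers are placeholders used only to size N.

```python
import numpy as np

def max_n(grad, n_percent):
    """Keep only the largest n_percent of gradient entries (by magnitude)."""
    k = max(1, int(len(grad) * n_percent / 100))
    idx = np.argsort(np.abs(grad))[-k:]          # indices of the k largest-magnitude entries
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

def n_for_link(bw_bytes_per_s, iters_per_s, grad_bytes):
    """Choose N so per-iteration transfer time roughly matches compute time."""
    budget = bw_bytes_per_s / iters_per_s        # bytes we can afford to send each iteration
    return min(100.0, 100.0 * budget / grad_bytes)

grad = np.random.default_rng(0).normal(size=1000)
n = n_for_link(bw_bytes_per_s=12.5e3, iters_per_s=5, grad_bytes=grad.nbytes)
print(f"N = {n:.1f}% -> {np.count_nonzero(max_n(grad, n))} of {grad.size} entries sent")
```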

Figure 8 shows how the amount of gradient data sent from node 1 to node 3 and node 5 is automatically adjusted as the iterations proceed:


3.4 Direct Knowledge Transfer

Since the approach above does not use a parameter server and runs asynchronously, the parameters held by the different nodes may diverge during training.

The idea in this part is to periodically exchange parameters: select the node whose model is training best, and let the other nodes obtain parameters from it.

Several problems need to be solved when adopting this method:

  • When during training should the transfer happen;
  • Should the parameters be sent to all nodes (which is a fairly large overhead);
  • How should the received parameters be aggregated: direct replacement or averaging?

To answer these three questions, the authors ran some experiments.

Figure 9 shows the experimental results:


  • (a) compares DKT performed at different times, measured as the time needed to reach the same accuracy. early DKT means DKT in the early stage of training and late DKT means DKT in the later stage; the authors found that early DKT reaches better accuracy. DKT 100iter means DKT is performed once every 100 iterations: performing it more frequently consumes a lot of network resources and increases training time, while too long an interval slows convergence because parameters are not exchanged often enough, which also increases training time.
  • (b) compares different transfer targets: no DKT, sending only to the currently worst node, and sending to all nodes. Sending only to the worst node already gives a large accuracy improvement, but for better accuracy the final implementation still sends to all nodes.
  • For aggregation, the rule $w_{local} = (1-\lambda)\, w_{local} + \lambda\, w_{best}$ is used: as $\lambda$ increases, the best node's weights take a larger share, and the best result is obtained at $\lambda = 0.75$ (a small sketch of this blending rule follows this list).
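As a concrete sketch of the blending rule (my own illustration, with λ = 0.75 as the value the experiments favor):

```python
import numpy as np

def dkt_blend(w_local, w_best, lam=0.75):
    """Direct knowledge transfer: blend local weights with the best node's weights."""
    return (1.0 - lam) * w_local + lam * w_best

w_local = np.array([0.2, -0.1, 0.4])
w_best = np.array([0.5, 0.0, 0.3])
print(dkt_blend(w_local, w_best))    # local model pulled 75% of the way to the best model
```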

4 Implementation

This section describes how DLion is implemented; interested readers can refer to the original paper.

4.1 Key Components and Operations

The key components are listed here; they have already been introduced above, so they are not repeated.

4.2 Generic and Flexible DLion

This section explains how DLion can easily be applied to different training algorithms.

5 Evaluation

5.1 Methodology

5.1.1 Applications and Datasets

Omitted.

5.1.2 Experimental Platforms

Omitted.

Table 2 reports the measured bandwidth between different Amazon regions:


5.1.3 Performance Metrics

  • The accuracy a model can achieve at a given time;
  • The time required to achieve a certain accuracy;
  • The time required for convergence;

5.1.4 Comparison Systems

Compare with some previous methods:

  • Baseline: plain asynchronous SGD that exchanges all gradients without any selection (presumed to be asynchronous rather than synchronous SGD based on the results of the last experiment);
  • Ako: see the earlier intensive-reading notes on the Ako paper;
  • Gaia;
  • Hop.

5.1.5 Experimental Setup

Table 3 gives the details of each environment; the numbers under compute are the number of cores per node, and the numbers under network are the bandwidth of each node:

"No emulation" in the table presumably means that every node has the same computing resources.


5.2 Evaluation Results

5.2.1 System Heterogeneity

This section evaluates the situation on a system composed of nodes with different network resources and computing resources.

Figure 11 shows the accuracy each method can reach within a fixed time on the various configurations:


Hetero SYS A has nodes with strong compute power and strong network resources, while Hetero SYS B has nodes with strong compute power but weak network resources.

5.2.2 System Robustness in Heterogeneous GPU cluster

The model used in this part is larger, so the amount of data that needs to be exchanged is larger (a larger model was chosen because training runs on GPUs); in other words, the network becomes an obvious bottleneck.

Figure 12 likewise shows the accuracy achieved after a fixed training time:


5.2.3 Heterogeneous Compute Resources

This section evaluates the situation where the network resources are the same but the computing resources are different.

Figure 13 shows the accuracy each method can reach within a fixed time when every node has the same network resources:


Figure 14 shows the time required to reach a given accuracy, where DLion-no-DBWU denotes not using dynamic batch selection and DLion-no-WU denotes not using the weighted aggregation update:


5.2.4 Heterogeneous Network Resources

This section evaluates the situation where the network resources are different but the computing resources are the same.

Figure 15 shows the accuracy reached within a fixed time on the different systems:


Figure 16 compares Max-10 (i.e., Baseline plus Max-10) with the other methods under a fixed time budget:


5.2.5 Deviation of Model Accuracy

Figure 17 shows the standard deviation of the model accuracy:


5.2.6 Dynamic Resource Changes

In this part, the compute and network resources of the environment change dynamically during training.

Figure 18 shows the experimental results:

Figure 19 shows how the local batch size of each node changes during training:


Figure 20 shows how the size of the selected gradient update changes during training:


5.2.7 Effect on Improving Model Accuracy

This section focuses on the final convergence accuracy and convergence time of the model.

Figure 21 shows the final convergence accuracy and convergence time of each method:


6 Related Work

The related work involved is as follows:

  • Distributed deep learning;
  • Resource-aware distributed deep learning;
  • Federated learning;
  • Deep learning inference for edge devices;
  • Parallel computing;
  • Geo-distributed data analysis.

7 Conclusion

A distributed machine learning architecture suitable for micro-clouds is proposed. It mainly improves speed through three aspects:

  • Weighted dynamic batching: maximize data parallelism;
  • Per-link prioritized gradient exchange: reduce communication cost;
  • Direct knowledge transfer: improve model accuracy.

A prototype of DLion was also built on TensorFlow. Compared with existing architectures, it achieves good improvements in both speed and accuracy.


Origin blog.csdn.net/qq_45523675/article/details/129365344