Paper notes ASYNCHRONOUS FEDERATED OPTIMIZATION

The paper proposes an asynchronous federated optimization algorithm (FedAsync).

Synchronous federated optimization is not scalable, efficient, or flexible. If too many devices check in at the same time, they cause network congestion on the server side, so in each global epoch the server can only select a subset of the available devices to trigger the training task.

Classic asynchronous SGD sends the gradient to the server directly after each local update, which is not feasible for edge devices because their communication is unreliable and slow.

 

The paper uses asynchronous optimization and updates the global model with a weighted average.

Each client device runs a worker that performs local training and uploads the local training result to the server. The server and the workers update asynchronously, and the communication between them is non-blocking.

 

Algorithm:
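
As a rough illustration (my own sketch, not the paper's pseudocode), the interaction can be simulated in a single process: workers train on a possibly stale copy of the global model, and the server mixes each arriving result into the global model with a staleness-dependent weight α_t = α · s(t − τ). The toy quadratic loss, the random delay model, and the constants ALPHA and A below are illustrative assumptions.

```python
import random
import numpy as np

ALPHA = 0.6   # base mixing hyperparameter alpha
A = 0.5       # exponent a of the polynomial staleness function

def poly_staleness(delay, a=A):
    # s(t - tau) = (t - tau + 1) ** (-a): the older the result, the smaller the weight
    return (delay + 1.0) ** (-a)

def local_training(x, target, lr=0.1, steps=5):
    # stand-in for a worker's local SGD on its own data (toy quadratic loss)
    for _ in range(steps):
        x = x - lr * (x - target)   # gradient of 0.5 * ||x - target||^2
    return x

x_global = np.zeros(4)   # global model x_t
in_flight = []           # (arrival_epoch, tau, locally trained model)

for t in range(50):      # t plays the role of the server's global epoch counter
    # dispatch: hand the current global model to a worker; its reply comes back
    # after a random delay, so by then it was computed from a stale model
    x_local = local_training(x_global.copy(), target=np.ones(4))
    in_flight.append((t + random.randint(1, 5), t, x_local))

    # receive: apply every reply that has arrived by now, immediately and
    # one at a time, weighted by its staleness t - tau (no barrier, no waiting)
    arrived = [m for m in in_flight if m[0] <= t]
    in_flight = [m for m in in_flight if m[0] > t]
    for _, tau, x_worker in arrived:
        alpha_t = ALPHA * poly_staleness(t - tau)               # adaptive alpha
        x_global = (1 - alpha_t) * x_global + alpha_t * x_worker

print("final global model:", x_global)
```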

The paper uses a function s(t − τ) of the staleness to determine the value of α, and lists several options for s(t − τ), parameterized by a > 0 and b ≥ 0; a sketch of these options is given below.
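
The concrete formulas appear in the paper and are not reproduced in these notes. Based on my reading of the paper, the constant, polynomial, and hinge variants (the ones the experiments call FedAsync, FedAsync + Poly, and FedAsync + Hinge) look roughly like the following; the default values of a and b here are placeholders, not the values used in the experiments.

```python
def s_constant(delay):
    # no adaptation: alpha_t = alpha regardless of staleness
    return 1.0

def s_poly(delay, a=0.5):
    # polynomial decay: s(t - tau) = (t - tau + 1) ** (-a)
    return (delay + 1.0) ** (-a)

def s_hinge(delay, a=10.0, b=4):
    # hinge: full weight while staleness <= b, then decay as 1 / (a * (t - tau - b) + 1)
    return 1.0 if delay <= b else 1.0 / (a * (delay - b) + 1.0)

# in all cases the server uses the adaptive mixing weight alpha_t = alpha * s(t - tau)
```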

Experimental results

Conclusions

  1. When the overall staleness is small, FedAsync converges as fast as SGD and faster than FedAvg. When the staleness is large, FedAsync converges more slowly; in the worst case its convergence is similar to FedAvg's. When α is too large, convergence can become unstable, but with adaptive α the convergence is robust to large α. Note that when the maximum staleness is 4, FedAsync and FedAsync + Hinge (b = 4) are identical.
  2. For the same communication overhead, FedAsync converges faster than FedAvg when the staleness is very small; when the staleness is large, its performance is similar to FedAvg's.
  3. Larger staleness slows down convergence, but the effect is not catastrophic. In addition, the adaptive mixing hyperparameter reduces the instability caused by large staleness.
  4. In general, FedAsync is robust to different α (the differences are so small that the plots have to be zoomed in to see them). When the staleness is small, the adaptive mixing hyperparameter is unnecessary. When the staleness is larger, a smaller α is better for plain FedAsync, while a larger α is better for FedAsync + Poly and FedAsync + Hinge, because the adaptive α is automatically scaled down when the staleness is large, so there is no need to reduce it manually.

Personal interpretation

In each round, the FedAvg algorithm selects a fraction C of the clients for local training, and the selected clients upload their training results to the server. Only devices that are currently available can be selected to trigger the training task.

FedAsync, as described in this paper, can trigger training regardless of whether a device is currently available: the server does not wait for workers to respond, and a worker that is unavailable now can start its training task later.

In FedAvg, the server waits until the number of responding clients reaches C * n before computing a weighted average to obtain the new global model parameters. FedAsync does not wait: whenever a worker sends back its result (the model parameters together with their staleness), the server immediately updates the global model with a weighted average in which the influence of the stale parameters is controlled by the mixing hyperparameter α. A side-by-side sketch of the two update rules is given below.
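
To make the contrast concrete, here is a small sketch of the two update rules (my own illustration, not code from the paper); the uniform weighting in fedavg_aggregate and the polynomial staleness weighting in fedasync_update are simplifying assumptions.

```python
import numpy as np

def fedavg_aggregate(client_models):
    """FedAvg-style round: the server has waited for C * n clients to respond and
    now averages all of their local models at once (uniform weights assumed here)."""
    return sum(client_models) / len(client_models)

def fedasync_update(x_global, x_worker, staleness, alpha=0.6):
    """FedAsync-style step: the server updates as soon as one worker responds,
    shrinking the mixing weight according to the reported staleness t - tau."""
    alpha_t = alpha * (staleness + 1) ** -0.5       # polynomial s(t - tau), a = 0.5
    return (1 - alpha_t) * x_global + alpha_t * x_worker

# example: one synchronous round vs. one asynchronous update
x = np.zeros(3)
x_sync = fedavg_aggregate([np.ones(3), 2 * np.ones(3), 3 * np.ones(3)])
x_async = fedasync_update(x, x_worker=np.ones(3), staleness=4)
```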

Understanding of staleness:

t: the global epoch in which the server receives the uploaded parameters.

τ: the global epoch in which the server sent out the model that the worker used to initialize its local training.

The staleness is therefore the gap between the epoch the parameters were based on and the epoch in which they are actually applied, i.e. t − τ. For example, if the server dispatched the model at epoch τ = 10 and receives the worker's result at epoch t = 13, the staleness is 3.

In other words, because the server and the workers do not run in lockstep, the server does not wait for a worker to respond. Due to communication delays, a device receives the server's parameters x_{t-1}, performs local training to obtain the local parameters x', and uploads them; but by the time the server receives x', the global model may already have been updated several more times. The greater the delay, the greater the impact on the result, so the paper introduces the mixing hyperparameter α to control the weight that a delayed update receives, as the small example below shows.
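
As a small numeric illustration (my own numbers, assuming a polynomial s(t − τ) with a = 0.5 and base α = 0.6, not values taken from the paper), the effective weight given to a delayed update shrinks quickly as the staleness grows:

```python
# effective mixing weight alpha_t = alpha * (staleness + 1) ** (-a)
alpha, a = 0.6, 0.5
for delay in (0, 1, 4, 9):
    print(f"staleness {delay}: alpha_t = {alpha * (delay + 1) ** (-a):.3f}")
# staleness 0: alpha_t = 0.600
# staleness 1: alpha_t = 0.424
# staleness 4: alpha_t = 0.268
# staleness 9: alpha_t = 0.190
```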

This mechanism makes the system more flexible and scalable, and greatly reduces network congestion.

 

If you have a different understanding, please leave a comment so we can discuss and learn together!



Origin: blog.csdn.net/GJ_007/article/details/105121452