This work proposes federated versions of the adaptive optimizers ADAGRAD, ADAM, and YOGI, and analyzes their convergence in general non-convex settings in the presence of heterogeneous client data.
The analysis rests on the three classic assumptions: each client objective has a Lipschitz-continuous gradient (L-smoothness), the stochastic gradients have bounded variance, and the gradients themselves are bounded:
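Written out explicitly (a sketch using standard notation, where f_i is client i's objective and g_i(x) its stochastic gradient; the exact constants and coordinate-wise form follow the paper's conventions), the three assumptions are:

```latex
% 1. L-smoothness of each client objective
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\| \quad \forall x, y
% 2. Bounded variance of the stochastic gradient
\mathbb{E}\left[\|g_i(x) - \nabla f_i(x)\|^2\right] \le \sigma^2
% 3. Bounded gradient
\|\nabla f_i(x)\| \le G \quad \forall x
```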
The paper then fixes its notation, e.g. the model weights x and the pseudo-gradient (the averaged change in client weights), before presenting the algorithm:
The key parts are the two placeholder functions ClientOpt and ServerOpt, which can be instantiated with many different update rules; this is what makes FedOpt a framework rather than a single algorithm. In the paper, ClientOpt is essentially SGD, while ServerOpt admits many choices (ADAGRAD, ADAM, YOGI).
The following is the pseudocode of the algorithm:
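The pseudocode can be sketched in Python roughly as follows. This is a minimal toy version assuming ClientOpt = local SGD and ServerOpt = the FedAdam variant; the function names, learning rates, and the two-client quadratic example are all illustrative, not from the paper:

```python
import numpy as np

def client_opt(x, grad_fn, lr=0.1, steps=5):
    """ClientOpt: plain local SGD starting from the server weights x.
    Returns the local change Delta_i = x_local - x."""
    x_local = x.copy()
    for _ in range(steps):
        x_local -= lr * grad_fn(x_local)
    return x_local - x

def server_opt_adam(x, delta, state, lr=0.05, b1=0.9, b2=0.99, tau=1e-3):
    """ServerOpt: FedAdam-style update using the averaged pseudo-gradient."""
    m, v = state
    m = b1 * m + (1 - b1) * delta          # momentum m_t
    v = b2 * v + (1 - b2) * delta**2       # ADAM's second-moment rule v_t
    x = x + lr * m / (np.sqrt(v) + tau)    # tau controls adaptivity
    return x, (m, v)

# Toy run: two clients with shifted quadratic losses (heterogeneous data).
targets = [np.array([1.0, -2.0]), np.array([3.0, 0.0])]
x = np.zeros(2)
state = (np.zeros(2), np.zeros(2))
for _ in range(200):
    deltas = [client_opt(x, lambda w, t=t: w - t) for t in targets]
    delta = np.mean(deltas, axis=0)        # average the client changes
    x, state = server_opt_adam(x, delta, state)
# x approaches the average of the client optima, roughly [2.0, -1.0]
```

Note how the server treats the averaged client change Δ as a pseudo-gradient: it never sees raw data or true gradients, only the aggregated weight deltas.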
Among them, τ controls the degree of adaptivity of the algorithm: the smaller τ is, the more adaptive the update. Δ_i stores the change in client i's local weights over one local round, and Δ_t is the average of these changes across the participating clients. Most important is the computation of m_t and v_t:
- m_t is the momentum term, controlled by the momentum parameter β_1.
- v_t changes with the chosen server optimizer and is what realizes the adaptivity; ADAGRAD, ADAM, and YOGI each update it differently.
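The three server optimizers differ only in this second-moment step. A sketch of the three v_t update rules, applied to the averaged change Δ_t (β_2 is the usual decay parameter; the helper name is illustrative):

```python
import numpy as np

def update_v(v, delta, method, b2=0.99):
    """Second-moment update v_t: the only step that differs
    between the three adaptive server optimizers."""
    d2 = delta**2
    if method == "adagrad":
        return v + d2                               # accumulate all history
    if method == "adam":
        return b2 * v + (1 - b2) * d2               # exponential moving average
    if method == "yogi":
        return v - (1 - b2) * d2 * np.sign(v - d2)  # sign-controlled change
    raise ValueError(method)

v, delta = np.array([0.5]), np.array([1.0])
print(update_v(v, delta, "adagrad"))  # [1.5]
print(update_v(v, delta, "adam"))     # [0.505]
print(update_v(v, delta, "yogi"))     # [0.51]  (v < delta**2, so v grows)
```

YOGI's signed update makes v_t increase or decrease by at most (1 - β_2)Δ_t² per round, which avoids ADAM's abrupt drops in v_t when gradients suddenly shrink.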