[Literature Reading] Adaptive Federated Optimization

        This work proposes federated versions of the adaptive optimizers ADAGRAD, ADAM, and YOGI, and analyzes their convergence in general non-convex settings in the presence of heterogeneous data.


        The three classic assumptions are that each local objective has Lipschitz gradients, that the variance of the stochastic gradients is bounded, and that the gradients themselves are bounded:
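        As a sketch, for a local objective F_i with stochastic gradient \nabla F_i(x, \xi), these assumptions are commonly written as follows (the constant names L, \sigma, G are assumed here, not taken from the text above):

        (A1) L-smoothness:        \| \nabla F_i(x) - \nabla F_i(y) \| \le L \, \| x - y \| for all x, y;
        (A2) bounded variance:    \mathbb{E}\, \| \nabla F_i(x, \xi) - \nabla F_i(x) \|^2 \le \sigma^2;
        (A3) bounded gradients:   \| \nabla F_i(x) \| \le G.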

         In between, some notation is defined, such as the representation of the weight parameter x and of the pseudo-gradient, before the algorithm itself is presented.
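        Roughly, with K local steps and \mathcal{S} the set of sampled clients (a sketch of the notation; the symbol names are assumed here): x^t is the server model at round t, x_i^{t,k} is client i's model after k local steps starting from x^t, and

        \Delta_i^t = x_i^{t,K} - x^t,   \qquad   \Delta_t = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \Delta_i^t,

        so that -\Delta_t plays the role of the pseudo-gradient fed to the server optimizer.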

        The most important parts are the two functions ClientOpt and ServerOpt, each of which can be instantiated with many different update rules; this is what makes FedOpt a framework rather than a single algorithm. In what follows, ClientOpt is simply local SGD, while ServerOpt admits many choices.

        The paper presents the algorithm as pseudocode; one round proceeds as follows:
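        As a sketch of one round in Python (ClientOpt taken as local SGD and ServerOpt as the adaptive update; the function names client_update and server_update and the hyper-parameter values are illustrative, not taken from the paper):

import numpy as np

def client_update(x_server, grads, lr_local=0.01):
    # ClientOpt taken as plain local SGD; grads holds one stochastic gradient per local step.
    x = x_server.copy()
    for g in grads:
        x = x - lr_local * g
    return x - x_server                     # Delta_i^t: the client's total local change

def server_update(x, delta, m, v, variant="adam",
                  lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    # ServerOpt: adaptive update driven by the averaged pseudo-gradient delta = Delta_t.
    m = beta1 * m + (1 - beta1) * delta     # momentum of Delta_t
    if variant == "adagrad":
        v = v + delta ** 2
    elif variant == "yogi":
        v = v - (1 - beta2) * delta ** 2 * np.sign(v - delta ** 2)
    else:                                   # "adam"
        v = beta2 * v + (1 - beta2) * delta ** 2
    x = x + lr * m / (np.sqrt(v) + tau)     # tau controls the degree of adaptivity
    return x, m, v

# Toy round: 3 sampled clients, 4 local SGD steps each, a 5-dimensional model.
rng = np.random.default_rng(0)
x = np.zeros(5)
m, v = np.zeros(5), np.full(5, 1e-6)        # v_0 >= tau^2
deltas = [client_update(x, [rng.normal(size=5) for _ in range(4)]) for _ in range(3)]
x, m, v = server_update(x, np.mean(deltas, axis=0), m, v, variant="yogi")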

         Here \tau controls the degree of adaptivity of the algorithm: the smaller its value, the higher the adaptivity. \Delta_i^t stores the change in client i's local weights over one round of local training, and \Delta_t is the average of these changes across the sampled clients. Most important is the computation of m_t and v_t:

  •         m_t is defined as the momentum of \Delta_t, with the momentum scale controlled by \beta_1.
  •         v_t changes with the chosen optimization method (ADAGRAD, ADAM, or YOGI) and is what realizes the adaptivity; the update rules are written out below.
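        Written out (a sketch following the description above; \eta denotes the server learning rate, and all operations are element-wise):

        m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\Delta_t

        v_t = v_{t-1} + \Delta_t^2                                                    (ADAGRAD)
        v_t = v_{t-1} - (1 - \beta_2)\,\Delta_t^2\,\mathrm{sign}(v_{t-1} - \Delta_t^2)    (YOGI)
        v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,\Delta_t^2                             (ADAM)

        x_{t+1} = x_t + \eta\,\frac{m_t}{\sqrt{v_t} + \tau}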

 

Origin blog.csdn.net/m0_51562349/article/details/128256105