[Literature Learning] Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge

        To address the reality of resource constraints on edge devices, this paper reformulates FL as a group knowledge transfer training algorithm called FedGKT. FedGKT designs a variant of the alternating minimization method that trains small CNNs on edge nodes and periodically transfers their knowledge to a large server-side CNN through knowledge distillation.

        FedGKT reduces the demand for edge computation, lowers the communication bandwidth required for large CNNs, and can train asynchronously, all while maintaining model accuracy comparable to FedAvg. The results show that FedGKT can achieve comparable or even slightly higher accuracy than FedAvg. Moreover, FedGKT makes edge training affordable: compared to edge training with FedAvg, FedGKT requires 9x to 17x less computation (FLOPs) on edge devices and 54x to 105x fewer parameters in the edge CNNs.


        FedGKT can transfer knowledge from many compact CNNs trained at the edge to a large CNN trained on the cloud server. The essence of FedGKT is to reformulate FL as an alternating minimization (AM) method, which optimizes two random variables (the edge model and the server model) by alternately fixing one and optimizing the other.

        In general, we can formulate CNN-based federated learning as a distributed optimization problem:
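        The formula image is missing here; as far as I can reconstruct it, this is the standard federated objective, where $N^{(k)}$ is the number of samples on client $k$ and $N$ is the total number of samples across all $K$ clients:

$$
\min_{W} F(W) := \min_{W} \sum_{k=1}^{K} \frac{N^{(k)}}{N}\, f^{(k)}(W),
\qquad
f^{(k)}(W) = \frac{1}{N^{(k)}} \sum_{i=1}^{N^{(k)}} \ell\big(W;\, X_i^{(k)},\, y_i^{(k)}\big),
$$

        where $\ell$ is the per-sample loss of the single global CNN $W$.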

         The authors point out that the main drawback of existing federated optimization methods is that they require training the full large CNN on resource-constrained edge devices, which lack GPU accelerators and sufficient memory.


        To address the resource constraints in existing FL, the paper considers another way to solve the FL problem: splitting the network weights W into a small feature extractor W_e and a large-scale server-side model W_s. A classifier W_c is also added on top of W_e so that the edge holds a small but fully trainable model. The single-global-model optimization is therefore reformulated as a non-convex optimization problem that requires solving both the server objective F_s and the edge objective F_c.
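        The formulas themselves were images in the original post and did not carry over. Reconstructing them from the description above and below (my notation, so treat it as a sketch rather than a verbatim copy of the paper), the server model f_s is trained on the extracted features:

$$
\min_{W_s} F_s(W_s) = \sum_{k=1}^{K} \sum_{i=1}^{N^{(k)}} \ell_s\Big(f_s\big(W_s;\, H_i^{(k)}\big),\, y_i^{(k)}\Big),
\qquad H_i^{(k)} = f_e^{(k)}\big(W_e^{(k)};\, X_i^{(k)}\big),
$$

        while each edge objective can be written either with the full edge model $f^{(k)} = f_c^{(k)} \circ f_e^{(k)}$ (the Eq. (4) form) or with the classifier acting on the extracted features (the Eq. (5) form):

$$
\min_{W_e^{(k)},\, W_c^{(k)}} F_c\big(W_e^{(k)}, W_c^{(k)}\big)
= \sum_{i=1}^{N^{(k)}} \ell\Big(f^{(k)}\big(W^{(k)};\, X_i^{(k)}\big),\, y_i^{(k)}\Big)
= \sum_{i=1}^{N^{(k)}} \ell_c\Big(f_c^{(k)}\big(W_c^{(k)};\, H_i^{(k)}\big),\, y_i^{(k)}\Big).
$$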

         Note that Equation (5) can be solved independently by each client. For large CNN training, the communication bandwidth needed to transfer H_i^(k) to the server is much smaller than the bandwidth needed to communicate all model parameters in traditional federated learning. I did not understand these formulas at first; they are easier to follow together with the following figure:

         The change from formula (4) to formula (5) is just a restatement: the former is written in terms of the full edge model f^(k), while the latter is written in terms of the classifier f_c applied to the extracted features. The server model f_s then uses H_i^(k) as its input features for training.
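        As a concrete sketch of this split (the layer sizes and class names below are my own toy choices, not the ResNet variants used in the paper), the edge holds a tiny feature extractor plus classifier, while the server model only ever sees the transferred feature maps H:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the model split described above; sizes are illustrative only.

class EdgeModel(nn.Module):
    """Small edge-side CNN: feature extractor W_e plus a thin classifier W_c."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.extractor = nn.Sequential(                # W_e: produces H_i^(k)
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.classifier = nn.Sequential(               # W_c: makes the edge model trainable end to end
            nn.Flatten(), nn.Linear(16 * 8 * 8, num_classes),
        )

    def forward(self, x):
        h = self.extractor(x)                          # H_i^(k): what is actually sent to the server
        z_c = self.classifier(h)                       # edge logits z_c^(k)
        return h, z_c


class ServerModel(nn.Module):
    """Large server-side CNN f_s that takes the transferred features H as input."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(16, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, h):
        return self.body(h)                            # server logits z_s


edge, server = EdgeModel(), ServerModel()
x = torch.randn(4, 3, 32, 32)                          # a toy batch of CIFAR-sized images
h, z_c = edge(x)
z_s = server(h)
print(h.shape, z_c.shape, z_s.shape)
```

        Only h (and later the logits) ever crosses the network, which is the bandwidth argument above; comparing `sum(p.numel() for p in edge.parameters())` with the server's count also makes the size gap concrete.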

        A core advantage of the above reformulation is that edge training becomes affordable, under the assumption that the model size of f^(k) is orders of magnitude smaller than that of f_s.

        Intuitively, the knowledge transferred from the server model can facilitate the optimization at the edge (Eq. (5)). The server CNN absorbs knowledge from multiple edge CNNs, and each individual edge CNN obtains enhanced knowledge back from the server CNN:
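        The loss formulas were also images; reconstructing them from the description below (so the exact form is my reading of the paper, not a verbatim copy), the server loss adds one KL term per client, and each client loss adds a KL term toward the server:

$$
\ell_s = \ell_{CE}\big(\mathrm{softmax}(z_s),\, y\big) + \sum_{k=1}^{K} D_{KL}\big(\mathrm{softmax}(z_c^{(k)})\,\big\|\, \mathrm{softmax}(z_s)\big),
$$
$$
\ell_c^{(k)} = \ell_{CE}\big(\mathrm{softmax}(z_c^{(k)}),\, y\big) + D_{KL}\big(\mathrm{softmax}(z_s)\,\big\|\, \mathrm{softmax}(z_c^{(k)})\big).
$$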

         The KL divergence D_KL is used here. ℓ_s and ℓ_c^(k) are the loss functions of the server model f_s and the edge model f^(k), respectively, and z_s and z_c^(k) are the outputs of the last fully connected layer (the logits) of the server model and the client model, respectively. Next, the paper proposes a variant of Alternating Minimization (AM) to solve the reformulated optimization problem:
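         The update equations are again missing here; reconstructing them from the description that follows, the alternation is roughly

$$
W_s \leftarrow \underset{W_s}{\arg\min}\; F_s\big(W_s,\, W^{(k)*}\big),
\qquad
W^{(k)} \leftarrow \underset{W^{(k)}}{\arg\min}\; F_c\big(W_s^{*},\, W^{(k)}\big),
$$

         where the first update corresponds to (8) (fix the edge models, train the server) and the second to (10) (fix the server, train edge model k).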

         (Why is there a k in the second input parameter of (8)?)

        The * superscript in the above equations indicates that the corresponding variables are held fixed during that optimization step. W^(k) is the combination of W_e^(k) and W_c^(k).

        In (8), we fix W^(k) and optimize (train) W_s for a few epochs; then we switch to (10), fix W_s, and optimize W^(k) for a few epochs. This alternation between (8) and (10) is repeated over many rounds until convergence.
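        Put together, one communication round looks roughly like the sketch below. This is my own synchronous simplification (FedGKT itself trains asynchronously and over sampled clients); kd_loss is the hypothetical distillation term from the reconstructed losses above, and EdgeModel/ServerModel are the toy modules from the earlier sketch:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits):
    """D_KL(teacher || student) computed on softened predictions."""
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits, dim=1),
                    reduction="batchmean")

def fedgkt_round(edge_models, server_model, client_loaders, edge_opts, server_opt,
                 server_epochs=1, edge_epochs=1):
    """One alternation between (8) and (10). Assumes each loader iterates in a fixed
    order (shuffle=False) so the uploaded features line up with later batches."""

    # Edge -> server: each client k uploads features H_i^(k), edge logits z_c^(k), labels y.
    uploads = {}
    for k, (edge, loader) in enumerate(zip(edge_models, client_loaders)):
        with torch.no_grad():
            uploads[k] = [(*edge(x), y) for x, y in loader]      # (h, z_c, y) per batch

    # (8): fix the edge models W^(k), train the server model W_s on the uploaded
    # features with cross-entropy plus distillation from the edge logits.
    for _ in range(server_epochs):
        for batches in uploads.values():
            for h, z_c, y in batches:
                z_s = server_model(h)
                loss = F.cross_entropy(z_s, y) + kd_loss(z_s, z_c)
                server_opt.zero_grad(); loss.backward(); server_opt.step()

    # Server -> edge: send back the server's logits for each client's batches.
    with torch.no_grad():
        feedback = {k: [server_model(h) for h, _, _ in batches]
                    for k, batches in uploads.items()}

    # (10): fix the server model W_s, train each edge model W^(k) with
    # cross-entropy plus distillation from the server logits.
    for k, (edge, loader, opt) in enumerate(zip(edge_models, client_loaders, edge_opts)):
        for _ in range(edge_epochs):
            for (x, y), z_s in zip(loader, feedback[k]):
                _, z_c = edge(x)
                loss = F.cross_entropy(z_c, y) + kd_loss(z_c, z_s)
                opt.zero_grad(); loss.backward(); opt.step()
```

        Calling fedgkt_round repeatedly is the "many rounds until convergence" described above.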

 
