FLMix: A new paradigm of federated learning - combining local and global models

Paper: Federated Learning of a Mixture of Global and Local Models

Venue: ICLR 2021 Conference (a top machine learning conference)

This post examines the difference between traditional federated learning, seen through its optimization objective ($\Diamond$), and the new federated learning formulation ($\clubsuit$).

1. Background introduction

In a paper from King Abdullah University of Science and Technology (KAUST), Filip Hanzely and Peter Richtárik give an early mathematical treatment of mixing global and local models in federated learning.

Interestingly, their stated motivation is to keep mobile-device data from being exposed while still being able to use that data for machine learning. They give two very simple reasons.

  • First, many device users are increasingly sensitive to privacy concerns and prefer their data to never leave their devices.
  • Second, moving data from their place of origin to a centralized location is very inefficient in terms of energy and time.

    One reason is that it is unsafe, and another reason is that it is inconvenient.

2. Traditional federated learning

To date, FL has emerged as an interdisciplinary field focused on training machine learning models directly on edge devices to solve this problem. In the traditional FL framework, every client participates in training a single global model.

Parameter definitions: number of participating clients $n$; global model structure $M_G$; global model parameters $\theta \in \mathbb{R}^{d_1}$, i.e., a vector of dimension $d_1$.
The FL training objective is:
$$\Diamond \quad \min_{\theta \in \mathbb{R}^{d_1}} F(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$$
For each $f_i$, the data distribution differs across clients. Let the data distribution of client $i$ be $\mathcal{D}_i$; then
$$f_i(\theta) = \mathbb{E}_{\xi \sim \mathcal{D}_i}\big[f(\theta, \xi)\big],$$
where $f(\theta, \xi)$ is the loss of model $\theta$ on a sample $\xi$, so $f_i(\cdot)$ is the expected loss (objective) of client $i$.
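For concreteness, here is one possible instantiation (my illustration, not taken from the paper): for binary classification with the logistic loss, a sample is $\xi = (a, y)$ with features $a \in \mathbb{R}^{d_1}$ and label $y \in \{-1, +1\}$, and client $i$'s objective becomes
$$f_i(\theta) = \mathbb{E}_{(a,y) \sim \mathcal{D}_i}\!\left[\log\!\left(1 + e^{-y\, a^{\top}\theta}\right)\right] \;\approx\; \frac{1}{|S_i|}\sum_{(a,y) \in S_i}\log\!\left(1 + e^{-y\, a^{\top}\theta}\right),$$
where $S_i$ is a finite training set drawn from $\mathcal{D}_i$ (the usual empirical-risk approximation; the notation $S_i$ is mine).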

The most popular method for solving $F(\theta)$ is the FedAvg algorithm. In its simplest form, i.e., when no partial participation, model compression, or stochastic approximation is used, FedAvg reduces to local gradient descent (LGD): an extension of GD in which each device performs multiple gradient steps before the results are aggregated.

FedAvg has proven empirically effective, especially on non-convex problems (problems with multiple local minima). But when the data is heterogeneous, FedAvg's convergence guarantees are weak compared to its non-local counterparts.
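To make the LGD picture concrete, here is a minimal self-contained simulation (mine, not the authors' code), using synthetic quadratic losses $f_i(\theta) = \tfrac12\|\theta - a_i\|^2$; the client optima $a_i$, the step size, and the number of local steps are illustrative choices.

```python
import numpy as np

# Synthetic setup: n clients, each with a quadratic loss f_i(theta) = 0.5 * ||theta - a_i||^2,
# so grad f_i(theta) = theta - a_i and the minimizer of (1/n) * sum_i f_i is mean(a_i).
rng = np.random.default_rng(0)
n, d = 10, 5
A = rng.normal(size=(n, d))              # a_i = A[i]: heterogeneous client optima

def local_grad(i, theta):
    return theta - A[i]

def lgd(rounds=50, local_steps=5, lr=0.1):
    """Local GD (FedAvg without sampling, compression, or stochasticity)."""
    theta = np.zeros(d)                      # global model
    for _ in range(rounds):
        locals_ = np.tile(theta, (n, 1))     # broadcast the global model to every client
        for _ in range(local_steps):         # each client runs several local GD steps
            for i in range(n):
                locals_[i] -= lr * local_grad(i, locals_[i])
        theta = locals_.mean(axis=0)         # server averages the local models
    return theta

theta_hat = lgd()
print(np.allclose(theta_hat, A.mean(axis=0), atol=1e-3))  # approaches the global optimum
```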

Although plenty of theory supports the feasibility of FL, its end product is a single global model. We should therefore ask: for individuals with heterogeneous data, is a global solution necessarily a good answer to each individual's problem?

The answer is no. Data heterogeneity not only raises challenges for the design of training methods that solve $\Diamond$; it also inevitably calls into question the utility of such global solutions for individual users. In fact, a global model trained on all data from all devices may be so far removed from the typical data and usage patterns of an individual user as to be virtually useless for that user.


3. The new FL paradigm

The paper proposes a new optimization formulation for training federated learning models. Standard FL aims to find a single global model from the private data stored on all participating devices. In contrast, the new formulation seeks a trade-off between the global model and purely local models, each of which a device could learn from its own private data without any communication.

The paper develops an efficient stochastic gradient descent (SGD) variant to solve the new formulation and proves guarantees on its communication complexity. The main contributions include: a new formulation of FL that mixes global and local models; theoretical properties of this formulation; Loopless Local Gradient Descent (L2GD); a convergence theory for L2GD; and insights into the role of local steps in federated learning. The paper also highlights the potential of local SGD to outperform traditional SGD in terms of communication complexity, and the benefits of personalized federated learning.

The new formulation proposed in the paper for training supervised federated learning models is as follows:

$$\clubsuit \quad \min_{x_1,\dots,x_n \in \mathbb{R}^d} \Big\{ F(x) := f(x) + \lambda\,\psi(x) \Big\}$$
$$f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x_i), \qquad \psi(x) := \frac{1}{2n}\sum_{i=1}^{n} \left\| x_i - \overline{x} \right\|^2$$
where $\lambda \ge 0$ is a penalty hyperparameter, $x_1,\dots,x_n \in \mathbb{R}^d$ are the local model parameters, $x := (x_1, x_2, \dots, x_n) \in \mathbb{R}^{nd}$, and $\overline{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$ is the average of all local models.

The paper notes that since each $f_i$ is strongly convex, $F$ is strongly convex as well. (A convex function curves upward everywhere, so every local minimum is a global minimum; strong convexity means the curvature is bounded below by a positive constant, which guarantees a unique minimizer.) Hence $\clubsuit$ has a unique solution, which we write as
$$x(\lambda) := (x_1(\lambda), \dots, x_n(\lambda)) \in \mathbb{R}^{nd}, \qquad \overline{x}(\lambda) := \frac{1}{n}\sum_{i=1}^{n} x_i(\lambda).$$
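The paper's algorithm for $\clubsuit$ is Loopless Local Gradient Descent (L2GD). Below is a minimal L2GD-style sketch written from the formulation above rather than from the authors' pseudocode: at each step, with probability $1-p$ every device takes a local gradient step, and with probability $p$ all devices take a step toward the average model (the only step that needs communication); the divisions by $1-p$ and $p$ keep the stochastic estimator of $\nabla F$ unbiased. The quadratic losses, step size, and probability $p$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 10, 5, 1.0                    # clients, dimension, penalty lambda
A = rng.normal(size=(n, d))               # f_i(x_i) = 0.5 * ||x_i - a_i||^2, grad = x_i - a_i

def grad_f_i(i, xi):
    return xi - A[i]

def l2gd_style(steps=2000, alpha=0.05, p=0.2):
    """Loopless local GD sketch: a coin flip decides local step vs. averaging step."""
    X = np.zeros((n, d))                          # row i = local model x_i
    for _ in range(steps):
        if rng.random() > p:                      # prob 1-p: local steps, no communication
            for i in range(n):
                # gradient of f w.r.t. x_i is (1/n) * grad f_i(x_i); divide by (1-p) for unbiasedness
                X[i] -= alpha * grad_f_i(i, X[i]) / (n * (1.0 - p))
        else:                                     # prob p: communication (averaging) step
            x_bar = X.mean(axis=0)
            # gradient of lambda*psi w.r.t. x_i is (lambda/n) * (x_i - x_bar); divide by p
            X -= alpha * lam * (X - x_bar) / (n * p)
    return X

X = l2gd_style()
X_star = (A + lam * A.mean(axis=0)) / (1.0 + lam)  # closed-form minimizer of F for these quadratics
print("max error vs. closed-form optimum:", np.abs(X - X_star).max())
# With a constant step size the iterates settle in a small neighbourhood of x(lambda);
# shrinking alpha (or averaging iterates) tightens the gap.
```

The probability $p$ directly controls the expected frequency of communication rounds, which is where the paper's communication-complexity analysis comes from.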


Theoretical intuition

The intuition behind the proposed formulation $\clubsuit$ can be read off from three regimes of $\lambda$ (a worked toy example follows this list):

  • Local models ($\lambda = 0$): the problem decouples into purely local problems, and each device only needs to minimize its own loss, i.e., solve $\min_{x_i \in \mathbb{R}^d} f_i(x_i)$. In other words, $x_i(0)$ is the local model based only on the data $D_i$ stored on device $i$, and it can be computed by device $i$ without any communication. Usually, however, $D_i$ is not rich enough for this local model to be useful; to learn a better model, the data of other clients must also be taken into account, which comes at a communication cost.
  • Mixed models ($\lambda \in (0, \infty)$): as $\lambda$ increases, the penalty $\lambda\psi(x)$ has a stronger and stronger effect, and communication is needed to keep the local models from becoming too dissimilar, since otherwise the penalty $\lambda\psi(x)$ grows.
  • Global model ($\lambda = \infty$): now consider the limit $\lambda \to \infty$. Intuitively, this limit should force the optimal local models to be identical while still minimizing the loss $f$, i.e., it drives $\psi(x) \to 0$, where $\psi(x) := \frac{1}{2n}\sum_{i=1}^{n}\|x_i - \overline{x}\|^2$. The limiting problem is $\min\{\, f(x) : x_1, \dots, x_n \in \mathbb{R}^d,\ x_1 = \dots = x_n \,\}$, i.e., standard FL. Indeed, arguing by contradiction, if $\lambda = \infty$ and $x_1 = x_2 = \dots = x_n$ does not hold, then $F(x) = \infty$.
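A worked toy example (my illustration, not from the paper) makes the three regimes concrete. Take quadratic local losses $f_i(x_i) = \tfrac12\|x_i - a_i\|^2$, where $a_i \in \mathbb{R}^d$ is client $i$'s local optimum. Setting the gradient of $F$ with respect to $x_i$ to zero (using $\nabla_{x_i}\psi(x) = \tfrac{1}{n}(x_i - \overline{x})$) gives
$$\nabla_{x_i} F(x) = \tfrac{1}{n}(x_i - a_i) + \tfrac{\lambda}{n}(x_i - \overline{x}) = 0 .$$
Summing over $i$ yields $\overline{x}(\lambda) = \overline{a} := \frac{1}{n}\sum_i a_i$ (the penalty terms cancel), and substituting back gives
$$x_i(\lambda) = \frac{a_i + \lambda\,\overline{a}}{1 + \lambda} = \frac{1}{1+\lambda}\,a_i + \frac{\lambda}{1+\lambda}\,\overline{a} .$$
So the optimal $x_i(\lambda)$ is literally a convex combination of the purely local solution $a_i$ (recovered at $\lambda = 0$) and the global solution $\overline{a}$ (recovered as $\lambda \to \infty$), exactly the local/global mixture described above.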

Key assumption

For every device $i$, its objective function $f_i: \mathbb{R}^d \rightarrow \mathbb{R}$ is assumed to be $L$-smooth and $\mu$-strongly convex.

  • $L$-smooth: this describes how smooth a function is. A function is called $L$-smooth if its first derivative (gradient) is Lipschitz continuous, i.e., the change of the gradient is constrained.
    Concretely, there exists a constant $L > 0$ such that the gradient $\nabla f(x)$ of $f$ satisfies, for all $x$ and $y$, the inequality $\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$, where $\|\cdot\|$ is the vector norm. This definition says that the variation of the gradient is limited by $L$, i.e., the change of the gradient between neighbouring points on the function's surface is bounded.
  • $\mu$-strongly convex: this describes the curvature of a function, i.e., the extent to which it is more strongly curved than a merely convex function. There exists a constant $\mu > 0$ such that for all $x$ and $y$: $f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\,\|y - x\|^2$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. This inequality says that at any point $x$ the curvature of $f$ is at least $\mu$, i.e., the graph of the function is sufficiently curved everywhere.

The $L$-smoothness property makes the optimization problem more tractable and stable: for gradient-based algorithms such as gradient descent, a Lipschitz-continuous gradient makes it easier to converge to an optimal solution, avoiding the oscillation or divergence that drastic gradient changes can cause. It ensures convergence.

A $\mu$-strongly convex function has a strict quadratic lower bound around every point. This guarantees a unique global minimizer and allows optimization algorithms to converge to the globally optimal solution more quickly. It accelerates convergence.
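A standard example (not from the paper) pins down both constants: for a quadratic $f(x) = \tfrac12 x^{\top} A x - b^{\top} x$ with a symmetric positive definite matrix $A$, the gradient is $\nabla f(x) = Ax - b$, so
$$\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \le \lambda_{\max}(A)\,\|x - y\|,$$
and the second-order expansion $f(y) = f(x) + \langle \nabla f(x), y - x\rangle + \tfrac12 (y - x)^{\top} A (y - x)$ is exact. Hence $f$ is $L$-smooth with $L = \lambda_{\max}(A)$ and $\mu$-strongly convex with $\mu = \lambda_{\min}(A)$. The toy losses $f_i(x_i) = \tfrac12\|x_i - a_i\|^2$ used above correspond to $A = I$, so $L = \mu = 1$.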


Solution properties

The optimal solution of $\clubsuit$ has the following three properties:

We characterize the loss $f(x(\lambda))$ and the penalty $\psi(x(\lambda))$ as functions of the variable $\lambda$ (a small numerical check on a toy problem follows this list).

  • Property 1: $\psi(x(\lambda))$ is non-increasing in $\lambda$, and for all $\lambda > 0$,
    $$\psi(x(\lambda)) \le \frac{f(x(\infty)) - f(x(0))}{\lambda} .$$
    Furthermore, $f(x(\lambda))$ is non-decreasing in $\lambda$, and therefore $f(x(\infty)) \ge f(x(\lambda))$.

    These statements say: as $\lambda$ increases, the penalty $\psi(x(\lambda))$ decays towards $0$ at rate $O(1/\lambda)$, i.e., the optimal local models $x_i(\lambda)$ become more and more similar as $\lambda$ grows. At the same time, by the second statement, $f(x(\lambda))$ increases with $\lambda$ but never exceeds $f(x(\infty))$, the optimal global loss of the standard FL formulation.
  • Property 2: for all $\lambda > 0$ and $1 \le i \le n$, the optimal local models satisfy
    $$x_i(\lambda) = \overline{x}(\lambda) - \frac{1}{\lambda}\nabla f_i(x_i(\lambda)), \qquad \sum_{i=1}^{n} \nabla f_i(x_i(\lambda)) = 0, \qquad \psi(x(\lambda)) = \frac{1}{2 n \lambda^{2}}\sum_{i=1}^{n}\big\|\nabla f_i(x_i(\lambda))\big\|^{2} .$$
    In words: the optimal local model is obtained by subtracting a multiple of the local gradient from the average model, and at the optimum the local gradients always sum to zero. The latter is obvious for $\lambda = \infty$, but it holds for every $\lambda > 0$, which is rather surprising.
  • Property 3: the average $\overline{x}(\lambda)$ of the optimal local models converges to the standard FL solution $x(\infty)$ at rate $O(1/\lambda)$. Let $P(z) := \frac{1}{n}\sum_{i=1}^{n} f_i(z)$ denote the standard FL objective, so that $x(\infty)$ corresponds to the unique minimizer of $P$; then
    $$\|\nabla P(\overline{x}(\lambda))\|^{2} \le \frac{2L^{2}}{\lambda}\big(f(x(\infty)) - f(x(0))\big) .$$
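A small numerical check of all three properties on the quadratic toy problem introduced earlier (my own illustration, not the paper's code), using the closed form $x_i(\lambda) = (a_i + \lambda\overline{a})/(1 + \lambda)$ derived above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10, 5
A = rng.normal(size=(n, d))               # f_i(x_i) = 0.5 * ||x_i - a_i||^2, so L = 1
a_bar = A.mean(axis=0)

def x_of_lambda(lam):
    """Closed-form solution of the clubsuit problem for these quadratic losses."""
    return (A + lam * a_bar) / (1.0 + lam)

def f(X):                                 # f(x)   = (1/n) sum_i f_i(x_i)
    return 0.5 * np.mean(np.sum((X - A) ** 2, axis=1))

def psi(X):                               # psi(x) = (1/2n) sum_i ||x_i - x_bar||^2
    return 0.5 * np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))

f0 = f(x_of_lambda(0.0))                  # f(x(0)):   purely local models
f_inf = f(np.tile(a_bar, (n, 1)))         # f(x(inf)): global model (all x_i equal)

for lam in [0.5, 1.0, 5.0, 50.0]:
    X = x_of_lambda(lam)
    grads = X - A                         # grad f_i(x_i(lambda))
    # Property 1: psi(x(lam)) <= (f(x(inf)) - f(x(0))) / lam, and f(x(lam)) <= f(x(inf))
    assert psi(X) <= (f_inf - f0) / lam + 1e-12 and f(X) <= f_inf + 1e-12
    # Property 2: the local gradients sum to zero at the optimum
    assert np.allclose(grads.sum(axis=0), 0.0)
    # Property 3: ||grad P(x_bar(lam))||^2 <= (2 L^2 / lam) * (f(x(inf)) - f(x(0))), with L = 1.
    # For these quadratics x_bar(lambda) = a_bar exactly, so the left-hand side is ~0.
    grad_P = X.mean(axis=0) - a_bar       # grad P(z) = (1/n) sum_i (z - a_i) at z = x_bar(lambda)
    assert np.sum(grad_P ** 2) <= (2.0 / lam) * (f_inf - f0) + 1e-12
print("all three properties hold on this toy problem")
```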

(Figure from the paper: the distance from the $\clubsuit$ solution $x(\lambda)$ to the purely local solution $x(0)$ and to the global solution $x(\infty)$, as a function of $\lambda$.)


Original post: blog.csdn.net/cold_code486/article/details/134400124