Notes on FedAvg Convergence in Federated Learning

Notation

| Symbol | Meaning |
| --- | --- |
| \(F(w)\) | Overall objective function |
| \(w\) | Parameters being optimized |
| \(F_k(w)\) | Objective function of client \(k\) |
| \(p_k\) | Weight of client \(k\) in the overall objective |
| \(E\) | Number of local update steps per communication round |
| \(T\) | Total number of iterations, so the number of communications is \(T/E\) |
| \(\eta_t\) | Learning rate at step \(t\) |
| \(L\) | Each \(F_k\) is \(L\)-smooth, i.e. \(\nabla^2 F_k(w)\preceq L\,I\) |
| \(\mu\) | Each \(F_k\) is \(\mu\)-strongly convex, i.e. \(\nabla^2 F_k(w)\succeq \mu\,I\) |
| \(\sigma_k\) | \(E\Vert\nabla F_k(w_t^k, \xi^k_t)-\nabla F_k(w_t^k)\Vert^2\leq\sigma^2_k\) |
| \(G\) | \(E\Vert\nabla F_k(w_t^k, \xi^k_t)\Vert^2\leq G^2\) |
| \(F^*\) | Optimal value of the overall objective |
| \(F_k^*\) | Optimal value obtained by optimizing client \(k\)'s objective alone |
| \(\Gamma\) | \(F^*-\sum_k\,p_kF_k^*\), a measure of heterogeneity |
| \(\kappa\) | \(\frac{L}{\mu}\), which can roughly be viewed as a condition number |
| \(w^*\) | Optimal parameters |
| \(\xi_t^k\) | Sample drawn by client \(k\) at step \(t\) for the stochastic gradient step |

Assumptions

  1. Each \(F_k\) is \(L\)-smooth, for all \(k\).
  2. Each \(F_k\) is \(\mu\)-strongly convex, for all \(k\).
  3. The variance of the stochastic gradients computed by client \(k\) is bounded by \(\sigma_k^2\).
  4. The expected squared norm of every stochastic gradient is bounded by \(G^2\) (the inequality forms of all four assumptions are written out below).
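
For reference (the proof below uses these in inequality form rather than via the Hessian), the standard equivalent statements of the four assumptions are:

\[\begin{align*} &F_k(v) \leq F_k(w) + \langle \nabla F_k(w), v-w\rangle + \frac{L}{2}\Vert v-w\Vert^2 &&\text{($L$-smooth)}\\ &F_k(v) \geq F_k(w) + \langle \nabla F_k(w), v-w\rangle + \frac{\mu}{2}\Vert v-w\Vert^2 &&\text{($\mu$-strongly convex)}\\ &E\Vert\nabla F_k(w, \xi) - \nabla F_k(w)\Vert^2 \leq \sigma_k^2 &&\text{(bounded variance)}\\ &E\Vert\nabla F_k(w, \xi)\Vert^2 \leq G^2 &&\text{(bounded gradient)} \end{align*} \]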

Convergence Proof under Full Participation

Introduce two variables, \(v_t^k\) and \(w_t^k\):

\[\begin{align*} v_{t+1}^k &= w_t^k - \eta_t\nabla F_k(w_t^k, \xi_t^k)\\ w_{t+1}^k &= \left\{ \begin{matrix} v_{t+1}^k \quad &\text{for }t+1 \notin I_E\\ \sum_{k=1}^{N}\,p_k v_{t+1}^k \quad &\text{for }t+1 \in I_E\end{matrix} \right. \end{align*} \]

Define

\[\begin{align*} \bar v_t &= \sum\nolimits_k \, p_k v_t^k\\ \bar w_t &= \sum\nolimits_k \, p_k w_t^k\\ \bar g_t &= \sum\nolimits_k\, p_k\nabla F_k(w_t^k)\\ g_t &=\sum\nolimits_k p_k\nabla F_k(w_t^k, \xi_t^k) \end{align*} \]

Here \(I_E\) is the set of synchronization steps: the aggregated parameter \(w\) is only updated at rounds where communication happens, while the variable \(v\) represents the parameters at rounds where no data exchange takes place. Under full participation, \(\bar v_t = \bar w_t\) for all \(t\), and \(\bar v_{t+1} = \bar w_t - \eta_t g_t\).
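
To make the bookkeeping concrete, here is a minimal Python sketch of this update rule on hypothetical quadratic client objectives \(F_k(w)=\frac{1}{2}a_k(w-c_k)^2\); the constants and the noise model playing the role of \(\xi_t^k\) are illustrative assumptions, not from the paper:

```python
# Minimal simulation of the v/w bookkeeping above on toy quadratic clients.
import numpy as np

np.random.seed(0)
N, E, T, eta = 4, 5, 200, 0.05            # clients, local steps, iterations, lr
a = np.random.uniform(1.0, 2.0, N)        # per-client curvature (illustrative)
c = np.random.uniform(-1.0, 1.0, N)       # per-client minimizer (illustrative)
p = np.ones(N) / N                        # aggregation weights p_k

def stoch_grad(k, w):
    # Stochastic gradient of F_k at w; Gaussian noise models xi_t^k.
    return a[k] * (w - c[k]) + 0.1 * np.random.randn()

w = np.zeros(N)                           # w_t^k, identical at t = 0
for t in range(T):
    # v_{t+1}^k = w_t^k - eta * grad F_k(w_t^k, xi_t^k)
    v = w - eta * np.array([stoch_grad(k, w[k]) for k in range(N)])
    if (t + 1) % E == 0:                  # t + 1 in I_E: synchronization round
        w = np.full(N, p @ v)             # every client receives the average
    else:                                 # t + 1 not in I_E: keep local iterate
        w = v

print("averaged iterate:", p @ w)
print("true optimum    :", (p * a) @ c / (p * a).sum())
```

With a fixed small learning rate the averaged iterate settles near the weighted optimum of \(\sum_k p_k F_k\); the proof below quantifies this with a decaying \(\eta_t\).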

Personal understanding:

To prove convergence we must show that the parameters converge. Since \(\bar w_t\) is produced by gradient descent, we need to show

\[E\Vert \bar w_{t+1} - w^*\Vert \leq l(\Vert\bar w_{t} - w^*\Vert) \]

i.e., the distance from the current iterate to the optimum \(w^*\) is smaller than the distance from the previous iterate to \(w^*\), and the bound \(l\) can be unrolled recursively. In other words, the upper bound on the distance to \(w^*\) shrinks with each iteration.

The paper tracks \(\bar v - w^*\) rather than \(\bar w - w^*\): \(\bar v\) aggregates over all clients, whereas under partial participation \(\bar w\) would be biased.

\[\begin{align*} E\Vert \bar v_{t+1} - w^*\Vert^2 &= E\Vert \bar w_t - \eta_t g_t -w^*-\eta_t \bar g_t + \eta_t \bar g_t\Vert^2\\ &= E\left(\Vert \bar w_t -\eta _t \bar g_t -w^*\Vert ^2 + 2\eta_t \langle\bar w_t-\eta_t\bar g_t -w^*, \bar g_t - g_t\rangle + \eta_t^2 \Vert\bar g_t-g_t\Vert^2\right) \end{align*}\label{eq:1} \tag{1} \]

The reason for inserting \(-\eta_t \bar g_t + \eta_t \bar g_t\) above is to exploit \(E(\eta_t g_t - \eta_t \bar g_t)=0\) when splitting \(\ref{eq:1}\): the cross term vanishes in expectation.
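
Concretely, each stochastic gradient is unbiased, \(E_{\xi}\nabla F_k(w_t^k, \xi_t^k)=\nabla F_k(w_t^k)\), so taking expectation over the sampling (with the iterates held fixed),

\[E[g_t] = \sum\, p_k\, E\left[\nabla F_k(w_t^k, \xi_t^k)\right] = \sum\, p_k \nabla F_k(w_t^k) = \bar g_t, \]

and the inner-product term in \(\ref{eq:1}\) has zero mean.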

For \(E\left(\Vert \bar w_t -\eta _t \bar g_t -w^*\Vert ^2\right)\), we continue the computation:

\[\begin{align*} \Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2&= \Vert \bar w_t - w^*\Vert^2 - 2\eta_t\langle\bar w_t-w^*, \bar g_t\rangle +\eta_t^2\Vert \bar g_t\Vert^2 \label{eq:2}\tag{2} \end{align*} \]

By \(L\)-smoothness [1],

\[\begin{align*} \Vert \nabla F_k(w_t^k)\Vert^2 \leq 2L(F_k(w_t^k) - F_k^*) \label{eq:3} \tag{3} \end{align*} \]
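
For completeness, \(\ref{eq:3}\) follows by plugging \(v = w_t^k - \frac{1}{L}\nabla F_k(w_t^k)\) into the \(L\)-smoothness upper bound:

\[F_k^* \leq F_k\left(w_t^k - \tfrac{1}{L}\nabla F_k(w_t^k)\right) \leq F_k(w_t^k) - \tfrac{1}{2L}\Vert \nabla F_k(w_t^k)\Vert^2, \]

and rearranging.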

Since the squared norm is convex, combining Jensen's inequality with \(\ref{eq:3}\) gives

\[\begin{align*} \eta^2_t \Vert \bar g_t \Vert^2 &\leq \eta_t^2 \sum \, p_k\Vert \nabla F_k(w_t^k)\Vert^2\\ &\leq 2L\eta_t^2\sum\, p_k(F_k(w_t^k) - F_k^*) \end{align*} \]

Expanding \(-2\eta_t \langle\bar w_t -w^*, \bar g_t\rangle\),

\[\begin{align*} -2\eta_t \langle\bar w_t -w^*, \bar g_t\rangle&=-2\eta_t \sum\, p_k \langle\bar w_t-w^*, \nabla F_k(w^k_t)\rangle \\ &= -2\eta_t \sum\, p_k \langle\bar w_t - w_t^k, \nabla F_k(w_t^k)\rangle-2\eta_t\sum\, p_k \langle w_t^k - w^*, \nabla F_k(w_t^k)\rangle \end{align*} \]

By the Cauchy-Schwarz inequality and the AM-GM inequality,

\[\begin{align*} -2\langle\bar w_t-w_t^k, \nabla F_k(w_t^k) \rangle \leq \frac{1}{\eta_t} \Vert \bar w_t - w_t^k\Vert^2+\eta_t \Vert \nabla F_k(w_t^k)\Vert^2 \end{align*} \]
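
This is just an expanded perfect square: for any vectors \(a, b\) and any \(\eta_t>0\),

\[0 \leq \Big\Vert \tfrac{1}{\sqrt{\eta_t}}\,a + \sqrt{\eta_t}\,b\Big\Vert^2 = \tfrac{1}{\eta_t}\Vert a\Vert^2 + 2\langle a, b\rangle + \eta_t\Vert b\Vert^2, \]

applied with \(a=\bar w_t - w_t^k\) and \(b=\nabla F_k(w_t^k)\).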

By \(\mu\)-strong convexity,

\[\begin{align*} -\langle w_t^k - w^*, \nabla F_k(w_t^k)\rangle \leq -(F_k(w_t^k)-F_k(w^*)) - \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2 \end{align*} \]
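
This is the strong-convexity inequality from the assumptions, instantiated at \(v=w^*\), \(w=w_t^k\):

\[F_k(w^*) \geq F_k(w_t^k) + \langle \nabla F_k(w_t^k), w^* - w_t^k\rangle + \frac{\mu}{2}\Vert w^* - w_t^k\Vert^2, \]

which rearranges to the displayed bound.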

Hence \(\ref{eq:2}\) becomes

\[\begin{align*} \Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2&= \Vert \bar w_t - w^*\Vert^2 - 2\eta_t\langle\bar w_t-w^*, \bar g_t\rangle +\eta_t^2\Vert \bar g_t\Vert^2\\ & \leq \Vert \bar w_t - w^*\Vert^2 + 2L\eta_t^2\sum\, p_k(F_k(w_t^k) - F_k^*) + \\ &\quad\eta_t\sum\, p_k\left(\frac{1}{\eta_t} \Vert \bar w_t - w_t^k\Vert^2+\eta_t \Vert \nabla F_k(w_t^k)\Vert^2\right) -\\ &\quad2\eta_t \sum\, p_k\left(F_k(w_t^k)-F_k(w^*) + \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2\right)\\ & \leq(1-\mu\eta_t)\Vert \bar w_t- w^*\Vert^2 + \sum\, p_k \Vert \bar w_t-w^k_t\Vert^2+ \\&\quad4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*)) \end{align*} \]

where the last step bounds \(\eta_t^2\sum p_k\Vert\nabla F_k(w_t^k)\Vert^2\) by \(\ref{eq:3}\) again, and uses convexity of the squared norm, \(\sum\, p_k\Vert w_t^k-w^*\Vert^2\geq\Vert\bar w_t-w^*\Vert^2\), to absorb the \(\mu\) term into \((1-\mu\eta_t)\Vert\bar w_t-w^*\Vert^2\).

Take the last two terms, \(4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*))\), and define \(\gamma_t=2\eta_t(1-2L\eta_t)\). Assume \(\eta_t\) is non-increasing in \(t\) (i.e., the learning rate decays) and \(\eta_t\leq \frac{1}{4L}\); then \(\eta_t\leq\gamma_t\leq 2 \eta_t\). Rearranging,

\[\begin{align*} &4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*))\\ &=-2\eta_t(1-2L\eta_t)\sum \, p_k(F_k(w_t^k)-F_k^*)+2\eta_t\sum\, p_k(F_k(w^*)-F_k^*)\\ &=-\gamma_t \sum \, p_k(F_k(w_t^k)-F^*) + (2\eta_t - \gamma_t)\sum\, p_k(F^*-F_k^*)\\ &=-\gamma_t \sum\, p_k(F_k(w_t^k)-F^*) + 4L\eta_t^2\Gamma \end{align*} \]

In the second equality, \(F_k(w_t^k)-F_k^*\) is split into \(F_k(w_t^k)-F^*+F^* - F_k^*\), and \(\sum p_k F_k(w^*)=F^*\) is used. Next, bound \(\sum\, p_k(F_k(w_t^k)-F^*)\) from below:

\[\begin{align*} \sum\, p_k(F_k(w_t^k)-F^*) & = \sum\, p_k (F_k(w_t^k)-F_k(\bar w_t)) + \sum\, p_k(F_k(\bar w_t)-F^*)\\ & \geq \sum\, p_k \langle\nabla F_k(\bar w_t), w_t^k-\bar w_t\rangle +F(\bar w_t)-F^*\\ &\geq -\frac 1 2\sum\, p_k \left[\eta_t\Vert \nabla F_k(\bar w_t)\Vert^2 + \frac{1}{\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right] + (F(\bar w_t)-F^*)\\ &\geq -\sum\, p_k \left[\eta_t L(F_k(\bar w_t) - F_k^*) + \frac{1}{2\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right] + (F(\bar w_t)-F^*) \end{align*} \]

Putting everything together,

\[\begin{align*} &4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*))\\ &\leq \gamma_t \sum\, p_k \left[\eta_t L(F_k(\bar w_t) - F_k^*) +\frac{1}{2\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right]-\gamma_t(F(\bar w_t)-F^*) + 4L\eta_t^2\Gamma\\ & = \gamma_t(\eta_tL-1)\sum \, p_k(F_k(\bar w_t) - F^*) + (4L\eta_t^2+\gamma_t\eta_tL)\Gamma+\frac{\gamma_t}{2\eta_t}\sum \, p_k \Vert w_t^k - \bar w_t\Vert^2\\ &\leq 6L\eta_t^2\Gamma + \sum\, p_k \Vert w_t^k - \bar w_t\Vert^2 \end{align*} \]

The last inequality uses \((\eta_tL-1)\leq -\frac{3}{4}\) and \(\sum \, p_k(F_k(\bar w_t) - F^*)\geq0\), so the first term is non-positive.
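
The remaining coefficients are handled with \(\gamma_t\leq 2\eta_t\); a worked check of all three estimates:

\[\begin{align*} \gamma_t(\eta_tL-1)\sum\, p_k(F_k(\bar w_t)-F^*) &\leq 0,\\ (4L\eta_t^2+\gamma_t\eta_tL)\Gamma &\leq (4L\eta_t^2 + 2L\eta_t^2)\Gamma = 6L\eta_t^2\Gamma,\\ \frac{\gamma_t}{2\eta_t}\sum\, p_k\Vert w_t^k-\bar w_t\Vert^2 &\leq \sum\, p_k\Vert w_t^k-\bar w_t\Vert^2. \end{align*} \]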

Therefore,

\[\Vert \bar w_t -\eta _t \bar g_t -w^*\Vert ^2\leq (1-\mu \eta_t)\Vert \bar w_t- w^*\Vert^2 +6L\eta_t^2\Gamma + 2\sum\, p_k \Vert w_t^k - \bar w_t\Vert^2 \]

(one copy of the divergence term \(\sum p_k\Vert w_t^k-\bar w_t\Vert^2\) comes from each of the two bounds above).

Adding in the bounded-variance assumption on the stochastic gradients,

\[\begin{align*} E\Vert g_t - \bar g_t\Vert^2 &= E\Big\Vert \sum \, p_k(\nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k))\Big\Vert^2\\ & \leq \sum\, p_k^2 \sigma_k^2 \end{align*} \]
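
The inequality holds because the samples \(\xi_t^k\) are independent across clients: writing \(X_k = p_k(\nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k))\) with \(E[X_k]=0\), the cross terms satisfy \(E\langle X_j, X_k\rangle = \langle E[X_j], E[X_k]\rangle = 0\) for \(j\neq k\), so

\[E\Big\Vert \sum\nolimits_k X_k\Big\Vert^2 = \sum\nolimits_k E\Vert X_k\Vert^2 \leq \sum\nolimits_k p_k^2\sigma_k^2. \]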

To bound the divergence term, let \(t_0\) be the most recent synchronization step, so that \(t-t_0\leq E-1\) and \(w_{t_0}^k=\bar w_{t_0}\) for every \(k\). Then

\[\begin{align*}E\sum\, p_k \Vert \bar w_t - w^k_t\Vert^2 &= E\sum \,p_k\Vert (w^k_t-\bar w_{t_0})-(\bar w_t - \bar w_{t_0})\Vert^2\\ &\leq E\sum\, p_k \Vert w_t^k - \bar w_{t_0}\Vert^2\\ &\leq \sum\, p_k\, E\sum\nolimits_{i=t_0}^{t-1}\, (E-1)\eta^2_i \Vert \nabla F_k(w_i^k, \xi_i^k)\Vert^2\\ &\leq 4\eta_t^2 (E-1)^2 G^2 \end{align*} \]
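
The third step is Cauchy-Schwarz applied to the at most \(E-1\) local SGD steps accumulated since \(t_0\), using \(w_t^k - \bar w_{t_0} = -\sum_{i=t_0}^{t-1}\eta_i\nabla F_k(w_i^k, \xi_i^k)\):

\[\Big\Vert \sum\nolimits_{i=t_0}^{t-1}\eta_i\nabla F_k(w_i^k, \xi_i^k)\Big\Vert^2 \leq (t-t_0)\sum\nolimits_{i=t_0}^{t-1}\eta_i^2\Vert\nabla F_k(w_i^k, \xi_i^k)\Vert^2 \leq (E-1)\sum\nolimits_{i=t_0}^{t-1}\eta_i^2\Vert\nabla F_k(w_i^k, \xi_i^k)\Vert^2. \]

The final step bounds each expected squared gradient norm by \(G^2\) and uses that \(\eta\) is non-increasing with \(\eta_{t_0}\leq 2\eta_t\) (which holds for step sizes of the form \(\eta_t=\frac{\beta}{t+\gamma}\) with \(\gamma\geq E-1\)), giving \((E-1)^2\eta_{t_0}^2G^2\leq 4\eta_t^2(E-1)^2G^2\).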

Finally, since \(\bar w_{t+1}=\bar v_{t+1}\) under full participation, let \(\Delta_t=E\Vert \bar w_{t}-w^*\Vert^2\); then

\[\begin{align*} \Delta_{t+1}\leq(1-\mu\eta_t)\Delta_t + \eta_t^2 B \end{align*} \]

where \(B=\sum\, p_k^2 \sigma_k^2+6L\Gamma+8(E-1)^2G^2\). With a decaying learning rate such as \(\eta_t=\frac{\beta}{t+\gamma}\), this recursion can be unrolled by induction to give \(\Delta_t=O(1/t)\), which establishes convergence.
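
As a sanity check, the recursion can be iterated numerically. The constants below (\(\mu\), \(L\), \(B\), and the step-size schedule) are illustrative choices satisfying \(\eta_t\leq\frac{1}{4L}\), not values from the paper:

```python
# Iterate Delta_{t+1} <= (1 - mu*eta_t) * Delta_t + eta_t^2 * B and watch
# t * Delta_t stay bounded, i.e. Delta_t = O(1/t).
mu, L, B = 1.0, 4.0, 10.0
beta, gamma = 2.0 / mu, 8.0 * L / mu      # eta_0 = 1/(4L), decaying afterwards
delta = 1.0                               # Delta_0
for t in range(100000):
    eta = beta / (t + gamma)
    delta = (1 - mu * eta) * delta + eta**2 * B
    if t + 1 in (10, 100, 1000, 10000, 100000):
        print(f"t={t+1:6d}  Delta_t={delta:.3e}  t*Delta_t={(t+1)*delta:.3f}")
```

The printed \(t\cdot\Delta_t\) column stabilizes, matching the \(O(1/t)\) rate claimed above.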

References

  1. Francis Bach, Statistical machine learning and convex optimization

Reposted from www.cnblogs.com/DemonHunter/p/12984659.html