Notation
| Symbol | Meaning |
| --- | --- |
| \(F(w)\) | global objective function |
| \(w\) | parameters being optimized |
| \(F_k(w)\) | objective function of the \(k\)-th client |
| \(p_k\) | weight of the \(k\)-th client in the global objective |
| \(E\) | number of local updates per communication round |
| \(T\) | total number of iterations, so the number of communication rounds is \(T/E\) |
| \(\eta_t\) | learning rate at step \(t\) |
| \(L\) | each \(F_k\) is \(L\)-smooth, i.e. \(\nabla^2 F_k(w)\preceq LI\) |
| \(\mu\) | each \(F_k\) is \(\mu\)-strongly convex, i.e. \(\nabla^2 F_k(w)\succeq \mu I\) |
| \(\sigma_k\) | \(E\Vert\nabla F_k(w_t^k, \xi^k_t)-\nabla F_k(w_t^k)\Vert^2\leq\sigma^2_k\) |
| \(G\) | \(E\Vert\nabla F_k(w_t^k, \xi^k_t)\Vert^2\leq G^2\) |
| \(F^*\) | optimal value of the global objective |
| \(F_k^*\) | optimal value when the \(k\)-th client is optimized alone |
| \(\Gamma\) | \(F^*-\sum_k\,p_kF_k^*\), a measure of heterogeneity |
| \(\kappa\) | \(\frac{L}{\mu}\), roughly the condition number |
| \(w^*\) | optimal parameters |
| \(\xi_t^k\) | sample drawn by client \(k\) at step \(t\) for the stochastic gradient |
Assumptions
- Each \(F_k\) is \(L\)-smooth, for all \(k\)
- Each \(F_k\) is \(\mu\)-strongly convex, for all \(k\)
- The variance of the stochastic gradient computed by client \(k\) is bounded by \(\sigma_k^2\)
- The squared norm of every stochastic gradient is bounded by \(G^2\)
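As a quick numerical illustration of these constants: for quadratic clients \(F_k(w)=\frac12 w^\top A_k w\), \(L\) and \(\mu\) are just the extreme Hessian eigenvalues across clients. A minimal sketch (the random SPD matrices are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(dim):
    # Random symmetric positive definite matrix with eigenvalues in [0.5, 4]
    Q = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
    return Q @ np.diag(rng.uniform(0.5, 4.0, size=dim)) @ Q.T

# Hypothetical quadratic clients F_k(w) = 0.5 * w^T A_k w
hessians = [random_spd(5) for _ in range(3)]

# L-smoothness / mu-strong-convexity constants shared by all F_k:
# largest / smallest Hessian eigenvalue over all clients.
L = max(np.linalg.eigvalsh(A)[-1] for A in hessians)
mu = min(np.linalg.eigvalsh(A)[0] for A in hessians)
kappa = L / mu  # approximate condition number
print(L, mu, kappa)
```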
Convergence proof under full participation
Introduce two sequences, \(v^k_t\) and \(w_t^k\):
\[\begin{align*} v_{t+1}^k &= w_t^k - \eta_t\nabla F_k(w_t^k, \xi_t^k)\\ w_{t+1}^k &= \left\{ \begin{matrix} v_{t+1}^k \quad &\text{for }t+1 \notin I_E\\ \sum_{k=1}^{N}\,p_k v_{t+1}^k \quad &\text{for }t+1 \in I_E\end{matrix} \right. \end{align*} \]
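The two-sequence update can be simulated directly. Below is a toy full-participation run on quadratic clients (all shapes, seeds, and constants are illustrative assumptions, and exact gradients stand in for the stochastic ones):

```python
import numpy as np

rng = np.random.default_rng(1)
N, dim, E, T = 4, 3, 5, 100
p = np.full(N, 1.0 / N)  # client weights p_k
# Client objectives F_k(w) = 0.5 w^T A_k w - b_k^T w (heterogeneous minima)
A = [np.diag(rng.uniform(0.5, 2.0, dim)) for _ in range(N)]
b = [rng.normal(size=dim) for _ in range(N)]

def grad(k, w):
    return A[k] @ w - b[k]  # exact gradient, noise-free for clarity

w = [np.full(dim, 5.0) for _ in range(N)]
eta = 0.1  # constant stepsize <= 1/(4L), since eigenvalues are at most 2
for t in range(T):
    v = [w[k] - eta * grad(k, w[k]) for k in range(N)]  # local step
    if (t + 1) % E == 0:  # t+1 in I_E: communication round, average
        avg = sum(p[k] * v[k] for k in range(N))
        w = [avg.copy() for k in range(N)]
    else:  # no communication: w_{t+1}^k = v_{t+1}^k
        w = v

# Global optimum solves (sum_k p_k A_k) w* = sum_k p_k b_k
w_star = np.linalg.solve(sum(p[k] * A[k] for k in range(N)),
                         sum(p[k] * b[k] for k in range(N)))
w_bar = sum(p[k] * w[k] for k in range(N))
print(np.linalg.norm(w_bar - w_star))
```

With a constant stepsize the iterates only reach a neighborhood of \(w^*\); the decaying \(\eta_t\) in the analysis below is what removes this residual error.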
Define
\[\begin{align*} \bar v_t &= \sum\nolimits_k \, p_k v_t^k\\ \bar w_t &= \sum\nolimits_k \, p_k w_t^k\\ \bar g_t &= \sum\nolimits_k\, p_k\nabla F_k(w_t^k)\\ g_t &=\sum\nolimits_k p_k\nabla F_k(w_t^k, \xi_t^k) \end{align*} \]
The updated parameters \(w\) are only exchanged at the communication rounds \(t\in I_E\); the variable \(v\) represents the parameters during the rounds with no communication. Under full participation, \(\bar v_t = \bar w_t\) for all \(t\), and \(\bar v_{t+1} = \bar w_t - \eta_t g_t\).
Personal understanding:
To show convergence, we need to show that the parameters converge. Since \(\bar w_t\) is produced by gradient descent, it suffices to show
\[E\Vert \bar w_{t+1} - w^*\Vert \leq l(\Vert\bar w_{t} - w^*\Vert) \]
i.e. the distance between the current iterate and the optimum \(w^*\) is bounded by the distance of the previous iterate to \(w^*\), where the function \(l\) can be applied recursively. In other words, the upper bound on the distance to \(w^*\) shrinks as the iterations proceed.
The paper tracks \(\bar v - w^*\) rather than \(\bar w - w^*\), because \(\bar v\) always aggregates all clients; under partial participation \(\bar w\) would be biased.
\[\begin{align*} E\Vert \bar v_{t+1} - w^*\Vert^2 &= E\Vert \bar w_t - \eta_t g_t -w^*-\eta_t \bar g_t + \eta_t \bar g_t\Vert^2\\ &= E\left(\Vert \bar w_t -\eta _t \bar g_t -w^*\Vert ^2 + 2\eta_t <\bar w_t-\eta_t\bar g_t -w^*, \bar g_t-g_t> + \eta_t^2 \Vert\bar g_t-g_t\Vert^2\right) \end{align*}\label{eq:1} \tag{1} \]
The term \(-\eta_t \bar g_t + \eta_t \bar g_t\) is inserted above in order to exploit \(E(\eta_t g_t - \eta_t \bar g_t)=0\) when splitting \(\ref{eq:1}\).
We now expand \(E\left(\Vert \bar w_t -\eta _t \bar g_t -w^*\Vert ^2\right)\):
\[\begin{align*} \Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2&= \Vert \bar w_t - w^*\Vert^2 - 2\eta_t<\bar w_t-w^*, \bar g_t> +\eta_t^2\Vert \bar g_t\Vert^2 \label{eq:2}\tag{2} \end{align*} \]
By \(L\)-smoothness [1],
\[\begin{align*} \Vert \nabla F_k(w_t^k)\Vert^2 \leq 2L(F_k(w_t^k) - F_k^*) \label{eq:3} \tag{3} \end{align*} \]
Since the squared norm is convex, combining with \(~\ref{eq:3}\) gives
\[\begin{align*} \eta^2_t \Vert \bar g_t \Vert^2 &\leq \eta_t^2 \sum \, p_k\Vert \nabla F_k(w_t^k)\Vert^2\\ &\leq 2L\eta_t^2\sum\, p_k(F_k(w_t^k) - F_k^*) \end{align*} \]
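Inequality \((3)\) is easy to verify numerically on a toy \(L\)-smooth function; here a random quadratic \(F(w)=\frac12 w^\top A w - b^\top w\) with \(L=\lambda_{\max}(A)\) (an illustrative construction, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4
M = rng.normal(size=(dim, dim))
A = M @ M.T + 0.1 * np.eye(dim)  # SPD Hessian of the toy objective
b = rng.normal(size=dim)
L = np.linalg.eigvalsh(A)[-1]    # smoothness constant = largest eigenvalue

F = lambda w: 0.5 * w @ A @ w - b @ w
F_star = F(np.linalg.solve(A, b))  # optimal value, attained at w* = A^{-1} b

# Check ||grad F(w)||^2 <= 2 L (F(w) - F*) at random points
for _ in range(1000):
    w = 3.0 * rng.normal(size=dim)
    g = A @ w - b
    assert g @ g <= 2 * L * (F(w) - F_star) + 1e-8
```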
Expanding \(-2\eta_t <\bar w_t -w^*, \bar g_t>\):
\[\begin{align*} -2\eta_t <\bar w_t -w^*, \bar g_t>&=-2\eta_t \sum\, p_k <\bar w_t-w^*, \nabla F_k(w^k_t)> \\ &= -2\eta_t \sum\, p_k <\bar w_t - w_t^k, \nabla F_k(w_t^k)>-2\eta_t\sum\, p_k < w_t^k - w^*, \nabla F_k(w_t^k)> \end{align*} \]
By the Cauchy–Schwarz inequality and the AM–GM inequality,
\[\begin{align*} -2<\bar w_t-w_t^k, \nabla F_k(w_t^k) > \leq \frac{1}{\eta_t} \Vert \bar w_t - w_t^k\Vert^2+\eta_t \Vert \nabla F_k(w_t^k)\Vert^2 \end{align*} \]
By \(\mu\)-strong convexity,
\[\begin{align*} -<w_t^k - w^*, \nabla F_k(w_t^k)> \leq -(F_k(w_t^k)-F_k(w^*)) - \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2 \end{align*} \]
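The \(\mu\)-strong-convexity bound can be checked the same way, with \(\mu=\lambda_{\min}(A)\) (again a made-up quadratic for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 4
M = rng.normal(size=(dim, dim))
A = M @ M.T + 0.1 * np.eye(dim)  # SPD Hessian
b = rng.normal(size=dim)
mu = np.linalg.eigvalsh(A)[0]    # strong-convexity constant = smallest eigenvalue

F = lambda w: 0.5 * w @ A @ w - b @ w
w_star = np.linalg.solve(A, b)

# Check -<w - w*, grad F(w)> <= -(F(w) - F(w*)) - (mu/2) ||w - w*||^2
for _ in range(1000):
    w = 3.0 * rng.normal(size=dim)
    g = A @ w - b
    lhs = -np.dot(w - w_star, g)
    rhs = -(F(w) - F(w_star)) - 0.5 * mu * np.dot(w - w_star, w - w_star)
    assert lhs <= rhs + 1e-8
```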
Hence \(~\ref{eq:2}\) can be written as
\[\begin{align*} \Vert \bar w_t - \eta_t \bar g_t - w^*\Vert^2&= \Vert \bar w_t - w^*\Vert^2 - 2\eta_t<\bar w_t-w^*, \bar g_t> +\eta_t^2\Vert \bar g_t\Vert^2\\ & \leq \Vert \bar w_t - w^*\Vert^2 + 2L\eta_t^2\sum\, p_k(F_k(w_t^k) - F_k^*) + \\ &\quad\eta_t\sum\, p_k\left(\frac{1}{\eta_t} \Vert \bar w_t - w_t^k\Vert^2+\eta_t \Vert \nabla F_k(w_t^k)\Vert^2\right) -\\ &\quad2\eta_t \sum\, p_k\left(F_k(w_t^k)-F_k(w^*) + \frac{\mu}{2}\Vert w_t^k-w^*\Vert^2\right)\\ & \leq(1-\mu\eta_t)\Vert \bar w_t- w^*\Vert^2 + \sum\, p_k \Vert \bar w_t-w^k_t\Vert^2+ \\&\quad4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*)) \end{align*} \]
Take the last two terms, \(4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*))\), define \(\gamma_t=2\eta_t(1-2L\eta_t)\), and assume \(\eta_t\) is non-increasing in \(t\) (a decaying learning rate) with \(\eta_t\leq \frac{1}{4L}\). Then \(\eta_t\leq\gamma_t\leq 2 \eta_t\). Rearranging,
\[\begin{align*} &4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*))\\ &=-2\eta_t(1-2L\eta_t)\sum \, p_k(F_k(w_t^k)-F_k^*)+2\eta_t\sum\, p_k(F_k(w^*)-F_k^*)\\ &=-\gamma_t \sum \, p_k(F_k(w_t^k)-F^*) + (2\eta_t - \gamma_t)\sum\, p_k(F^*-F_k^*)\\ &=-\gamma_t \sum\, p_k(F_k(w_t^k)-F^*) + 4L\eta_t^2\Gamma \end{align*} \]
Here \(F_k(w_t^k)-F_k^*\) is split into \(F_k(w_t^k)-F^*+F^* - F_k^*\), using \(\sum\, p_k F_k(w^*)=F^*\). Next, bound \(\sum\, p_k(F_k(w_t^k)-F^*)\):
\[\begin{align*} \sum\, p_k(F_k(w_t^k)-F^*) & = \sum\, p_k (F_k(w_t^k)-F_k(\bar w_t)) + \sum\, p_k(F_k(\bar w_t)-F^*)\\ & \geq \sum\, p_k <\nabla F_k(\bar w_t), w_t^k-\bar w_t> +(F(\bar w_t)-F^*)\\ &\geq -\frac 1 2\sum\, p_k \left[\eta_t\Vert \nabla F_k(\bar w_t)\Vert^2 + \frac{1}{\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right] + (F(\bar w_t)-F^*)\\ &\geq -\sum\, p_k \left[\eta_t L(F_k(\bar w_t) - F_k^*) + \frac{1}{2\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right] + (F(\bar w_t)-F^*) \end{align*} \]
Putting these together,
\[\begin{align*} &4L\eta_t^2 \sum \, p_k(F_k(w_t^k)-F_k^*) - 2\eta_t\sum\, p_k(F_k(w_t^k)-F_k(w^*))\\ &\leq \gamma_t \sum\, p_k \left[\eta_t L(F_k(\bar w_t) - F_k^*) +\frac{1}{2\eta_t}\Vert w_t^k -\bar w_t\Vert^2\right]-\gamma_t(F(\bar w_t)-F^*)+ 4L\eta_t^2\Gamma\\ & = \gamma_t(\eta_tL-1)\sum \, p_k(F_k(\bar w_t) - F^*) + (4L\eta_t^2+\gamma_t\eta_tL)\Gamma+\frac{\gamma_t}{2\eta_t}\sum \, p_k \Vert w_t^k - \bar w_t\Vert^2\\ &\leq 6L\eta_t^2\Gamma + \sum\, p_k \Vert w_t^k - \bar w_t\Vert^2 \end{align*} \]
The last inequality holds because \(\eta_tL-1\leq -\frac{3}{4}\) and \(\sum \, p_k(F_k(\bar w_t) - F^*)=F(\bar w_t)-F^*\geq0\), so the first term can be dropped, while \(\gamma_t\leq2\eta_t\) gives \((4L\eta_t^2+\gamma_t\eta_tL)\Gamma\leq 6L\eta_t^2\Gamma\) and \(\frac{\gamma_t}{2\eta_t}\leq 1\).
Therefore
\[\Vert \bar w_t -\eta _t \bar g_t -w^*\Vert ^2\leq (1-\mu \eta_t)\Vert \bar w_t- w^*\Vert^2 +6L\eta_t^2\Gamma + \sum\, p_k \Vert w_t^k - \bar w_t\Vert^2 \]
Adding the bounded-variance assumption on the stochastic gradients (the cross terms vanish because the clients sample independently),
\[\begin{align*} E\Vert g_t - \bar g_t\Vert^2 &= E\Vert \sum \, p_k(\nabla F_k(w_t^k, \xi_t^k) - \nabla F_k(w_t^k))\Vert^2\\ & \leq \sum\, p_k^2 \sigma_k^2 \end{align*} \]
Next, bound the deviation of the local parameters from their average. Let \(t_0\leq t\) be the most recent communication step, so that \(t-t_0\leq E-1\), \(w_{t_0}^k=\bar w_{t_0}\) for every \(k\), and \(\eta_{t_0}\leq 2\eta_t\) for the stepsizes considered:
\[\begin{align*}E\sum\, p_k \Vert \bar w_t - w^k_t\Vert^2 &= E\sum \,p_k\Vert (w^k_t-\bar w_{t_0})-(\bar w_t - \bar w_{t_0})\Vert^2\\ &\leq E\sum\, p_k \Vert w_t^k - \bar w_{t_0}\Vert^2\\ &\leq \sum\, p_k\, E\sum\nolimits_{i=t_0}^{t-1} (E-1)\eta^2_i \Vert \nabla F_k(w_i^k, \xi_i^k)\Vert^2\\ &\leq 4\eta_t^2 (E-1)^2 G^2 \end{align*} \]
Finally, letting \(\Delta_t=E\Vert \bar w_{t}-w^*\Vert^2\) (recall \(\bar w_{t+1}=\bar v_{t+1}\) under full participation),
\[\begin{align*} \Delta_{t+1}\leq(1-\mu\eta_t)\Delta_t + \eta_t^2 B \end{align*} \]
where \(B=\sum\, p_k^2 \sigma_k^2+6L\Gamma+8(E-1)^2G^2\).
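Unrolling this recursion with the decaying stepsize \(\eta_t=\frac{\beta}{\gamma+t}\), \(\beta=\frac{2}{\mu}\), gives by induction \(\Delta_t\leq\frac{v}{\gamma+t}\) with \(v=\max\{\frac{\beta^2 B}{\beta\mu-1},\,(\gamma+1)\Delta_0\}\), i.e. an \(O(1/t)\) rate. A numerical sketch of the scalar recursion (the constants \(\mu, L, B, \Delta_0\) below are made up for illustration):

```python
# Iterate Delta_{t+1} <= (1 - mu*eta_t) Delta_t + eta_t^2 B with
# eta_t = beta / (gamma + t), and check the induction Delta_t <= v / (gamma + t).
mu, L, B = 1.0, 4.0, 10.0   # illustrative constants
beta = 2.0 / mu
gamma = 8.0 * L / mu        # ensures eta_0 = beta / gamma <= 1/(4L)
Delta = 5.0                 # Delta_0
v = max(beta**2 * B / (beta * mu - 1.0), (gamma + 1.0) * Delta)

for t in range(1000):
    eta = beta / (gamma + t)
    assert eta <= 1.0 / (4.0 * L) + 1e-12
    Delta = (1.0 - mu * eta) * Delta + eta**2 * B  # worst-case recursion
    assert Delta <= v / (gamma + t + 1.0) + 1e-9   # O(1/t) bound holds
print(Delta)
```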
References
- Francis Bach, Statistical machine learning and convex optimization