16 [NLP Training Camp] The L-Lipschitz Theorem and a Convergence Proof for Gradient Descent



Theorem 1

If a smooth function $f$ satisfies the L-Lipschitz condition (that is, its gradient is L-Lipschitz continuous), then for any $x,y\in R^d$ we have:
$$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|\tag{Theorem 1}$$
Here $L$ is a constant.
Take linear regression as an example. First write down its loss function:
$$f(w)=\cfrac{1}{n}\|Xw-y\|^2$$
Here $X$ is the training data and $w$ is the parameter vector. The gradient is $\nabla f(w)=\cfrac{2}{n}X^T(Xw-y)$.
Now compute:
$$\begin{aligned}\|\nabla f(w_1)-\nabla f(w_2)\|&=\cfrac{2}{n}\|X^T(Xw_1-y)-X^T(Xw_2-y)\|\\&=\cfrac{2}{n}\|X^TX(w_1-w_2)\|\leq \cfrac{2}{n}\|X^TX\|\cdot\|w_1-w_2\|\end{aligned}$$
Since $X$ is the training data, it is known and fixed. So $\cfrac{2}{n}\|X^TX\|$ plays the role of the constant $L$ in the L-Lipschitz condition.
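The bound for linear regression can be checked numerically. The following is a minimal sketch in plain Python, assuming a made-up toy dataset `X`, `y` with two features (not from the original post) and taking $\|X^TX\|$ to be the spectral norm; all helper names are illustrative.

```python
import math
import random

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def grad(X, y, w):
    """Gradient of f(w) = (1/n) * ||Xw - y||^2, i.e. (2/n) * X^T (Xw - y)."""
    n = len(X)
    residual = [p - t for p, t in zip(matvec(X, w), y)]
    return [2.0 / n * g for g in matvec(transpose(X), residual)]

def spectral_norm_sym_2x2(A):
    """Largest absolute eigenvalue of a symmetric 2x2 matrix (closed form)."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return max(abs(mean + radius), abs(mean - radius))

# Hypothetical toy data: 4 samples, 2 features.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
y = [1.0, 0.0, 2.0, -1.0]
n = len(X)

# X^T X is a symmetric 2x2 matrix here.
XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(2)]
       for i in range(2)]
L_const = 2.0 / n * spectral_norm_sym_2x2(XtX)

# Largest observed ratio ||grad(w1) - grad(w2)|| / ||w1 - w2|| over random pairs.
rng = random.Random(0)
max_ratio = 0.0
for _ in range(1000):
    w1 = [rng.uniform(-5, 5) for _ in range(2)]
    w2 = [rng.uniform(-5, 5) for _ in range(2)]
    diff_w = norm([p - q for p, q in zip(w1, w2)])
    if diff_w == 0.0:
        continue
    diff_g = norm([p - q for p, q in zip(grad(X, y, w1), grad(X, y, w2))])
    max_ratio = max(max_ratio, diff_g / diff_w)

print(max_ratio, "<=", L_const)
```

The observed ratio should never exceed `L_const`, which is the Lipschitz inequality with $L=\cfrac{2}{n}\|X^TX\|$.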

Theorem 2

If a smooth function $f$ satisfies the L-Lipschitz condition and is convex, then for any $x,y\in R^d$ we have:
$$f(y)\leq f(x)+\nabla f(x)^T(y-x)+\cfrac{L}{2}\|y-x\|^2\tag{Theorem 2}$$
Here $L$ is a constant.
Proof:
By the fundamental theorem of calculus, any differentiable one-variable function $h$ satisfies:
$$h(1)=h(0)+\int_0^1h'(\tau)d\tau$$
Define $h$ as the restriction of $f$ to the line segment from $x$ to $y$, which turns the multivariate statement into a one-variable one:
$$h(\tau)=f(x+\tau(y-x))$$
Then:
$$h(1)=f(y),\quad h(0)=f(x)$$
Substituting these into the identity above:
$$f(y)=f(x)+\int_0^1h'(\tau)d\tau$$
Differentiating $h$ with the chain rule gives $h'(\tau)=\nabla f(x+\tau(y-x))^T(y-x)$, so:

$$f(y)=f(x)+\int_0^1\nabla f(x+\tau(y-x))^T(y-x)d\tau$$
Add a term $\nabla f(x)^T(y-x)$ outside the integral and subtract the same term inside it (this is valid because $\int_0^1\nabla f(x)^T(y-x)d\tau=\nabla f(x)^T(y-x)$):
$$f(y)=f(x)+\nabla f(x)^T(y-x)+\int_0^1(\nabla f(x+\tau(y-x))-\nabla f(x))^T(y-x)d\tau$$
By the Cauchy-Schwarz inequality and then Theorem 1:
$$\begin{aligned}f(y)&=f(x)+\nabla f(x)^T(y-x)+\int_0^1(\nabla f(x+\tau(y-x))-\nabla f(x))^T(y-x)d\tau\\ &\leq f(x)+\nabla f(x)^T(y-x)+\int_0^1L\|\tau(y-x)\|\|y-x\|d\tau\end{aligned}$$
The trailing factor $\|y-x\|$ comes from the Cauchy-Schwarz inequality $a^Tb\leq\|a\|\|b\|$. Simplifying and integrating, using $\int_0^1L\tau\|y-x\|^2d\tau=\cfrac{L}{2}\|y-x\|^2$, gives:
$$f(y)\leq f(x)+\nabla f(x)^T(y-x)+\cfrac{L}{2}\|y-x\|^2$$
This completes the proof.
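As a sanity check on Theorem 2, the quadratic upper bound can be evaluated on a grid. This is a small sketch under assumptions that are not in the original post: the test function is the one-dimensional softplus $f(x)=\log(1+e^x)$, whose second derivative is at most $1/4$, so $L=0.25$ is used.

```python
import math

def f(x):
    """Softplus, a smooth convex function chosen here as an example."""
    return math.log1p(math.exp(x))

def df(x):
    """Derivative of softplus: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

# f''(x) = sigmoid(x) * (1 - sigmoid(x)) <= 1/4, so the derivative is 0.25-Lipschitz.
L = 0.25

def upper_bound(x, y):
    """Right-hand side of Theorem 2 in one dimension."""
    return f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2

grid = [i * 0.5 for i in range(-12, 13)]  # points from -6.0 to 6.0

# Pairs (x, y) where f(y) exceeds the bound (up to a tiny floating-point tolerance).
violations = [(x, y) for x in grid for y in grid
              if f(y) > upper_bound(x, y) + 1e-12]
min_gap = min(upper_bound(x, y) - f(y) for x in grid for y in grid)

print("violations:", len(violations), "smallest gap:", min_gap)
```

The gap is zero when $y=x$ and positive elsewhere, which matches the picture of a parabola touching $f$ at $x$ and lying above it.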

Corollary 1

In Theorem 2, take $x_{i+1}$ as $y$ and $x_{i}$ as $x$:
$$f(x_{i+1})\leq f(x_{i})+\nabla f(x_{i})^T(x_{i+1}-x_{i})+\cfrac{L}{2}\|x_{i+1}-x_{i}\|^2$$
By the gradient descent update rule:
$$x_{i+1}=x_{i}-\eta_t \nabla f(x_{i})\to x_{i+1}-x_{i}=-\eta_t \nabla f(x_{i})$$
Therefore:
$$f(x_{i+1})\leq f(x_{i})+\nabla f(x_{i})^T(-1)\eta_t \nabla f(x_{i})+\cfrac{L}{2}\eta_t ^2\|\nabla f(x_{i})\|^2$$
$$f(x_{i+1})\leq f(x_{i})-\eta_t \|\nabla f(x_{i})\|^2+\cfrac{L\eta_t ^2}{2}\|\nabla f(x_{i})\|^2$$
$$f(x_{i+1})\leq f(x_{i})-\eta_t (1-\cfrac{L\eta_t }{2})\|\nabla f(x_{i})\|^2\tag1$$
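Inequality (1) can be checked on a single gradient step. The sketch below assumes a hypothetical diagonal quadratic $f(x)=\frac{1}{2}\sum_j a_jx_j^2$ (not from the original post), for which $\nabla f(x)_j=a_jx_j$ and $L=\max_j a_j$; the starting point and step sizes are arbitrary.

```python
# Hypothetical curvature coefficients of a diagonal quadratic.
a = [1.0, 4.0, 10.0]
L = max(a)  # Lipschitz constant of the gradient for this quadratic

def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x):
    return [ai * xi for ai, xi in zip(a, x)]

def check_step(x, eta):
    """Return (f(x_next), right-hand side of inequality (1)) for one step."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    x_next = [xi - eta * gi for xi, gi in zip(x, g)]
    bound = f(x) - eta * (1.0 - L * eta / 2.0) * g_sq
    return f(x_next), bound

x = [3.0, -2.0, 1.0]
results = {eta: check_step(x, eta) for eta in (0.01, 0.05, 0.1, 0.15)}

for eta, (value, bound) in results.items():
    print(f"eta={eta}: f(x_next)={value:.4f} <= bound={bound:.4f}")
```

Note that the factor $(1-\cfrac{L\eta_t}{2})$ is positive only for $\eta_t<\cfrac{2}{L}$; beyond that the bound no longer guarantees a decrease.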


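Supplementary note: Convergence Analysis of Gradient Descent

This is an analysis of how many iterations gradient descent needs in order to converge.
Theorem
Suppose the function satisfies the L-Lipschitz condition (condition 1) and is convex (condition 2), and let $x^*=\arg\min_x f(x)$. Then for a step size $\eta_t\leq\cfrac{1}{L}$ ($L$ is a constant), we have:
$$f(x_k)\leq f(x^*)+\cfrac{\|x_0-x^*\|^2_2}{2\eta_tk}$$
After $k=\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}$ iterations (with $\eta_t=\cfrac{1}{L}$), we are guaranteed to reach an $\varepsilon$-approximate optimal value.
$x_k$ is the value of $x$ at the $k$-th iteration. The last term on the right-hand side shrinks as $k$ grows, so $f(x_k)$ gradually approaches $f(x^*)$; the faster this term shrinks, the faster the convergence. For example:
[Figure: comparison of the convergence of schemes A and B]
Scheme B converges more slowly than scheme A.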

Now look more closely at $\cfrac{\|x_0-x^*\|^2_2}{2\eta_tk}$. $x_0$ is fixed, and $x^*$, the optimal solution, is also fixed, so the numerator is constant; in the denominator, $2\eta_t$ is constant as well. The whole term therefore decreases as $k$ grows, at the rate $O(1/k)$.

Choosing $k=\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}$ ties the iteration count to a target accuracy $\varepsilon$. Substituting it into the term above:
$$\cfrac{\|x_0-x^*\|^2_2}{2\eta_tk}=\cfrac{\|x_0-x^*\|^2_2}{2\eta_t\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}}=\cfrac{\varepsilon}{2\eta_tL}$$
Setting the step size to $\eta_t=\cfrac{1}{L} \to L=\cfrac{1}{\eta_t}$ and substituting:

$$\cfrac{\varepsilon}{2\eta_tL}=\cfrac{\varepsilon}{2\eta_t\cfrac{1}{\eta_t}}=\cfrac{\varepsilon}{2}$$
To summarize, when $k=\cfrac{L\|x_0-x^*\|^2_2}{\varepsilon}$:
$$f(x_k)\leq f(x^*)+\cfrac{\varepsilon}{2}$$
This can also be written as:
$$f(x_k)\leq f(x^*)+O(\varepsilon)$$
When $\varepsilon$ is very small, the gap between $f(x_k)$ and $f(x^*)$ is also very small.
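The iteration count from the theorem can be tried out directly. The following sketch assumes a hypothetical diagonal quadratic $f(x)=\frac{1}{2}\sum_j a_jx_j^2$ with minimizer $x^*=0$ (not from the original post), computes $k$ for a chosen $\varepsilon$, runs gradient descent with $\eta_t=1/L$, and compares the final gap with $\varepsilon/2$. The function names are illustrative.

```python
import math

# Hypothetical problem: diagonal quadratic with minimizer at the origin.
a = [1.0, 4.0, 10.0]
L = max(a)        # Lipschitz constant of the gradient
eta = 1.0 / L     # step size used in the theorem

def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x):
    return [ai * xi for ai, xi in zip(a, x)]

def iterations_needed(L, x0, x_star, eps):
    """k = L * ||x0 - x*||^2 / eps, rounded up to an integer."""
    dist_sq = sum((p - q) ** 2 for p, q in zip(x0, x_star))
    return math.ceil(L * dist_sq / eps)

def gradient_descent(x0, eta, k):
    """Run k plain gradient descent steps from x0."""
    x = list(x0)
    for _ in range(k):
        x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    return x

x0 = [3.0, -2.0, 1.0]
x_star = [0.0, 0.0, 0.0]
eps = 0.1

k = iterations_needed(L, x0, x_star, eps)
x_k = gradient_descent(x0, eta, k)
gap = f(x_k) - f(x_star)

print("k =", k, "gap =", gap, "eps/2 =", eps / 2)
```

On this example the actual gap is far below $\varepsilon/2$: the theorem gives a worst-case guarantee over all convex functions with an L-Lipschitz gradient, and this particular quadratic converges much faster.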


Using the condition from the supplementary note, $\eta_t\leq\cfrac{1}{L}$, we have $1-\cfrac{L\eta_t}{2}\geq 1-\cfrac{L\cfrac{1}{L}}{2}=\cfrac{1}{2}$. Substituting this into (1):
$$\begin{aligned}f(x_{i+1})&\leq f(x_{i})-\eta_t (1-\cfrac{L\eta_t }{2})\|\nabla f(x_{i})\|^2\\ &\leq f(x_{i})-\eta_t (1-\cfrac{L\cfrac{1}{L} }{2})\|\nabla f(x_{i})\|^2=f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\end{aligned}$$
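To illustrate why the step-size condition matters, the sketch below runs gradient descent with three step sizes written as multiples of $1/L$. It assumes a hypothetical diagonal quadratic (not from the original post); the multipliers 0.5, 1.0 and 2.5 are arbitrary choices, and the last one deliberately violates $\eta_t\leq 1/L$.

```python
# Hypothetical diagonal quadratic f(x) = 0.5 * sum(a_j * x_j^2).
a = [1.0, 4.0, 10.0]
L = max(a)
x0 = [3.0, -2.0, 1.0]

def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def run_gd(eta, steps=100):
    """Return f after `steps` gradient descent steps with step size eta."""
    x = list(x0)
    for _ in range(steps):
        # The gradient of this quadratic is a_j * x_j in each coordinate.
        x = [xi - eta * ai * xi for ai, xi in zip(a, x)]
    return f(x)

# Keys are the multipliers c in eta = c / L.
final_values = {c: run_gd(c / L) for c in (0.5, 1.0, 2.5)}

for c, value in final_values.items():
    print(f"eta = {c}/L -> f after 100 steps: {value:.3e}")
```

With $\eta_t=0.5/L$ and $\eta_t=1/L$ the function value decreases, while with $\eta_t=2.5/L$ the coordinate with the largest curvature is multiplied by $-1.5$ at every step and the iterates diverge on this example.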

Corollary 2

From Corollary 1:
$$f(x_{i+1})\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\tag2$$
By the first-order condition for convex functions (first-order convexity; for the image source see its watermark, https://zhuanlan.zhihu.com/p/57652786):
[Figure: first-order condition for convexity, $f(y)\geq f(x)+\nabla f(x)^T(y-x)$]
Applying it with $x=x_i$ and $y=x^*$ gives $f(x_i)\leq f(x^*)+\nabla f(x_i)^T(x_i-x^*)$, so we can expand the first term of (2):
$$\begin{aligned}f(x_{i+1})&\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\\ &\leq f(x^*)+\nabla f(x_i)^T(x_i-x^*)-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\end{aligned}\tag3$$


The next relation is simply the gradient descent update rule, rearranged:
$$x_{i+1}=x_i-\eta_t\nabla f(x_{i})\to \nabla f(x_{i})=\cfrac{x_{i}-x_{i+1}}{\eta_t}$$
$$x_{i+1}=x_i-\eta_t\nabla f(x_{i})\to \eta_t\nabla f(x_{i})=x_{i}-x_{i+1}$$
Substituting into (3):

$$\begin{aligned}
f(x_{i+1})&\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2\\
&\leq f(x^*)+\cfrac{(x_{i}-x_{i+1})^T}{\eta_t}(x_i-x^*)-\cfrac{\eta_t }{2}\left\|\cfrac{x_{i}-x_{i+1}}{\eta_t}\right\|^2\\
&=f(x^*)+\cfrac{(x_{i}-x_{i+1})^T}{\eta_t}(x_i-x^*)-\cfrac{1 }{2\eta_t}\|x_{i}-x_{i+1}\|^2\\
&=f(x^*)+\cfrac{2(x_i^Tx_i-x_i^Tx^*-x_i^Tx_{i+1}+x_{i+1}^Tx^*)}{2\eta_t}-\cfrac{x_i^Tx_i-2x_i^Tx_{i+1}+x_{i+1}^Tx_{i+1}}{2\eta_t}
\end{aligned}$$
Adding and subtracting a term (completing the square), this consolidates into:
$$\begin{aligned}
&=f(x^*)+\cfrac{1 }{2\eta_t}\|x_i-x^*\|^2-\cfrac{1 }{2\eta_t}\left (\|x_i-x^*\|^2-2\eta_t\nabla f(x_{i})^T(x_i-x^*)+ \|\eta_t\nabla f(x_{i})\|^2\right )\\
&=f(x^*)+\cfrac{1 }{2\eta_t}\|x_i-x^*\|^2-\cfrac{1 }{2\eta_t}\|x_i-x^*-\eta_t\nabla f(x_{i})\|^2\\
&=f(x^*)+\cfrac{1 }{2\eta_t}\|x_i-x^*\|^2-\cfrac{1 }{2\eta_t}\|x_{i+1}-x^*\|^2\\
&=f(x^*)+\cfrac{1 }{2\eta_t}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2)
\end{aligned}$$
This completes Corollary 2. The result is:
$$f(x_{i+1})\leq f(x^*)+\cfrac{1 }{2\eta_t}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2)\tag{Corollary 2}$$
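Corollary 2 is a per-step statement, so it can be checked at every iteration of a run. The sketch below assumes a hypothetical diagonal quadratic with $x^*=0$ and $\eta_t=1/L$ (neither the function nor the starting point comes from the original post) and records the slack between the two sides.

```python
# Hypothetical diagonal quadratic with minimizer at the origin.
a = [1.0, 4.0, 10.0]
L = max(a)
eta = 1.0 / L
x_star = [0.0, 0.0, 0.0]

def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x):
    return [ai * xi for ai, xi in zip(a, x)]

def dist_sq(u, v):
    """Squared Euclidean distance ||u - v||^2."""
    return sum((p - q) ** 2 for p, q in zip(u, v))

x = [3.0, -2.0, 1.0]
slacks = []  # right-hand side minus left-hand side, one entry per step
for i in range(50):
    x_next = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    lhs = f(x_next) - f(x_star)
    rhs = (dist_sq(x, x_star) - dist_sq(x_next, x_star)) / (2.0 * eta)
    slacks.append(rhs - lhs)
    x = x_next

print("smallest slack over 50 steps:", min(slacks))
```

A non-negative slack at every step is what Corollary 2 asserts: each step reduces the squared distance to $x^*$ by at least $2\eta_t(f(x_{i+1})-f(x^*))$.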

Corollary 3

Rearranging Corollary 2:
$$f(x_{i+1})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_i-x^*\|^2-\|x_{i+1}-x^*\|^2)$$
Now write this out starting from $i=0$:
$$f(x_{1})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_0-x^*\|^2-\|x_{1}-x^*\|^2)$$
$$f(x_{2})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_1-x^*\|^2-\|x_{2}-x^*\|^2)$$
$$f(x_{3})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_2-x^*\|^2-\|x_{3}-x^*\|^2)$$
And so on, up to:
$$f(x_{k})-f(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_{k-1}-x^*\|^2-\|x_{k}-x^*\|^2)$$
Summing the left-hand sides and the right-hand sides of these inequalities separately, the right-hand side telescopes:
$$\sum_{i=1}^kf(x_{i})-kf(x^*)\leq \cfrac{1 }{2\eta_t}(\|x_0-x^*\|^2-\|x_{k}-x^*\|^2)$$
Loosen the right-hand side slightly by dropping the subtracted squared term:
$$\sum_{i=1}^kf(x_{i})-kf(x^*)\leq \cfrac{1 }{2\eta_t}\|x_0-x^*\|^2\tag4$$


From the conclusion of Corollary 1:
$$f(x_{i+1})\leq f(x_{i})-\cfrac{\eta_t }{2}\|\nabla f(x_{i})\|^2$$
together with the two facts $\cfrac{\eta_t }{2}\geq0$ and $\|\nabla f(x_{i})\|^2\geq0$, we get:
$$f(x_{i+1})\leq f(x_{i})$$
So we can write:
$$f(x_{k})\leq f(x_{k-1})\leq f(x_{k-2})\leq\cdots\leq f(x_{0})$$
Using this, we bound the left-hand side of (4) from below:
$$kf(x_{k})-kf(x^*)\leq \sum_{i=1}^kf(x_{i})-kf(x^*)$$


Putting these together:
$$kf(x_{k})-kf(x^*)\leq \cfrac{1 }{2\eta_t}\|x_0-x^*\|^2$$
$$f(x_{k})-f(x^*)\leq \frac{\|x_0-x^*\|^2}{2\eta_tk}$$
This completes the proof: the inequality above is exactly the gradient descent convergence bound stated in the supplementary note.
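The final bound, the monotone decrease, and the summed inequality (4) can all be observed along one run. The sketch below assumes a hypothetical test function $f(x)=\sum_j\log\cosh(x_j)$ (not from the original post), which is convex with gradient $\tanh(x_j)$, has $L=1$, and is minimized at $x^*=0$ with $f(x^*)=0$; the starting point is arbitrary.

```python
import math

def f(x):
    """Convex example function with minimum value 0 at the origin."""
    return sum(math.log(math.cosh(xi)) for xi in x)

def grad(x):
    return [math.tanh(xi) for xi in x]

L = 1.0          # the second derivative 1/cosh^2 is at most 1
eta = 1.0 / L
x0 = [4.0, -3.0]
f_star = 0.0
dist0_sq = sum(xi * xi for xi in x0)  # ||x0 - x*||^2 with x* = 0

K = 200
x = list(x0)
values = [f(x)]  # values[k] holds f(x_k)
bounds = []      # bounds[k - 1] holds ||x0 - x*||^2 / (2 * eta * k)
for k in range(1, K + 1):
    x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    values.append(f(x))
    bounds.append(dist0_sq / (2.0 * eta * k))

# Final bound: f(x_k) - f(x*) <= ||x0 - x*||^2 / (2 * eta * k) for every k.
bound_holds = all(values[k] - f_star <= bounds[k - 1] + 1e-12
                  for k in range(1, K + 1))
# Monotone decrease: f(x_k) <= f(x_{k-1}).
monotone = all(values[k] <= values[k - 1] + 1e-12 for k in range(1, K + 1))
# Inequality (4): the sum of the gaps is bounded by ||x0 - x*||^2 / (2 * eta).
gap_sum = sum(v - f_star for v in values[1:])

print(bound_holds, monotone, gap_sum, "<=", dist0_sq / (2.0 * eta))
```

All three checks should pass for any convex function with an L-Lipschitz gradient and $\eta_t\leq 1/L$; the $O(1/k)$ bound is usually loose in practice, as it is here.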

Reposted from blog.csdn.net/oldmao_2001/article/details/104647793