Dive into PyTorch: Convex Optimization

Optimization and Deep Learning

Optimization vs. Estimation

Although optimization methods can drive down the value of the loss function in deep learning, the goal that optimization pursues is fundamentally different from the goal of deep learning:

  • Goal of optimization: the loss on the training set
  • Goal of deep learning: the loss on the test set (generalization)

Challenges of Optimization in Deep Learning

  1. Local minima
  2. Saddle points
  3. Vanishing gradients

Local Minima

\[ f(x) = x\cos \pi x \]
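For instance, on the interval \([-1.0, 2.0]\) this function has a local minimum near \(x \approx -0.3\) and a global minimum near \(x \approx 1.1\), so a gradient-based method started near the local minimum can get stuck there. Below is a minimal plotting sketch of this, assuming NumPy and Matplotlib are available (the original post's figure code is not shown):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    # f(x) = x * cos(pi * x)
    return x * np.cos(np.pi * x)

x = np.linspace(-1.0, 2.0, 200)
plt.plot(x, f(x))
# Approximate locations of the two minima on this interval
plt.annotate('local minimum', xy=(-0.3, f(-0.3)), xytext=(-0.8, -1.0),
             arrowprops=dict(arrowstyle='->'))
plt.annotate('global minimum', xy=(1.1, f(1.1)), xytext=(0.5, 0.8),
             arrowprops=dict(arrowstyle='->'))
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```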

Saddle Points

\[ A=\left[\begin{array}{cccc}{\frac{\partial^{2} f}{\partial x_{1}^{2}}} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{n}}} \\ {\frac{\partial^{2} f}{\partial x_{2} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{2}^{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{2} \partial x_{n}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial^{2} f}{\partial x_{n} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{n} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{n}^{2}}}\end{array}\right] \]

At a point where the gradient of \(f\) is zero, the eigenvalues of the Hessian matrix above determine the nature of the point: if they are all positive the point is a local minimum, if all negative a local maximum, and if they have mixed signs it is a saddle point. For example, \(f(x, y)=x^{2}-y^{2}\) has a saddle point at the origin.
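As a quick numerical illustration of this criterion (the specific function and the use of `torch.autograd.functional.hessian` are my own choices, not from the original post), one can compute the Hessian of \(f(x, y)=x^{2}-y^{2}\) at the origin and inspect its eigenvalues:

```python
import torch
from torch.autograd.functional import hessian

def f(xy):
    # f(x, y) = x^2 - y^2, which has a critical point at the origin
    x, y = xy[0], xy[1]
    return x ** 2 - y ** 2

H = hessian(f, torch.zeros(2))      # Hessian evaluated at (0, 0)
eigvals = torch.linalg.eigvalsh(H)  # eigenvalues of the symmetric matrix H
print(H)                            # tensor([[ 2.,  0.],
                                    #         [ 0., -2.]])
print(eigvals)                      # tensor([-2., 2.]): mixed signs => saddle point
```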

Vanishing Gradients
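A standard illustration of this problem (the choice of \(f(x)=\tanh(x)\) here is mine, as this section has no accompanying text) is that the gradient of a saturating function is nearly zero for inputs of even moderate magnitude, so gradient descent barely moves:

```python
import torch

x = torch.tensor(4.0, requires_grad=True)
y = torch.tanh(x)
y.backward()
# d/dx tanh(x) = 1 - tanh(x)^2 ~ 0.0013 at x = 4: the gradient has almost vanished
print(x.grad)
```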

Convexity

Basics

Sets

A set \(X\) is convex if the line segment between any two of its points lies entirely in the set, i.e. for all \(a, b \in X\) and \(\lambda \in[0,1]\) we have \(\lambda a+(1-\lambda) b \in X\). The intersection of convex sets is convex, while their union in general is not.

Functions

A function \(f\) defined on a convex set \(X\) is convex if, for all \(x, x^{\prime} \in X\) and all \(\lambda \in[0,1]\):

\[ \lambda f(x)+(1-\lambda) f\left(x^{\prime}\right) \geq f\left(\lambda x+(1-\lambda) x^{\prime}\right) \]

Jensen's Inequality

For a convex function \(f\), weights \(\alpha_{i} \geq 0\) with \(\sum_{i} \alpha_{i}=1\), and a random variable \(x\):

\[ \sum_{i} \alpha_{i} f\left(x_{i}\right) \geq f\left(\sum_{i} \alpha_{i} x_{i}\right) \text { and } E_{x}[f(x)] \geq f\left(E_{x}[x]\right) \]
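A small numerical sanity check of Jensen's inequality, using the convex function \(f(x)=e^{x}\) and random weights (both choices are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
f = torch.exp                                  # exp is convex
x = torch.randn(5)                             # arbitrary points x_i
alpha = torch.softmax(torch.randn(5), dim=0)   # weights alpha_i >= 0 that sum to 1

lhs = (alpha * f(x)).sum()   # sum_i alpha_i f(x_i)
rhs = f((alpha * x).sum())   # f(sum_i alpha_i x_i)
print(lhs >= rhs)            # tensor(True): Jensen's inequality holds
```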


Properties

  1. No local minima
  2. Relationship with convex sets
  3. Second-order condition

No Local Minima

Proof (by contradiction): suppose \(x \in X\) is a local minimum that is not a global minimum. Then there exists \(x^{\prime} \in X\) with \(f(x^{\prime})<f(x)\), and for any \(\lambda \in[0,1)\):

\[ f(x)>\lambda f(x)+(1-\lambda) f(x^{\prime}) \geq f(\lambda x+(1-\lambda) x^{\prime}) \]

As \(\lambda \rightarrow 1\), the point \(\lambda x+(1-\lambda) x^{\prime}\) gets arbitrarily close to \(x\) while keeping a strictly smaller function value, contradicting the assumption that \(x\) is a local minimum. Hence every local minimum of a convex function is a global minimum.

Relationship with Convex Sets

For a convex function \(f(x)\), define the set \(S_{b}:=\{x \mid x \in X \text { and } f(x) \leq b\}\); then \(S_b\) is a convex set.

Proof: for points \(x, x^{\prime} \in S_b\) and any \(\lambda \in[0,1]\), we have \(f\left(\lambda x+(1-\lambda) x^{\prime}\right) \leq \lambda f(x)+(1-\lambda) f\left(x^{\prime}\right) \leq b\), hence \(\lambda x+(1-\lambda) x^{\prime} \in S_{b}\).

Conversely, a non-convex function can have non-convex below-sets. For example, \(f(x, y)=0.5 x^{2}+\cos (2 \pi y)\) is not convex, and for suitable \(b\) its set \(S_b\) consists of disjoint stripes in \(y\) and is therefore not convex.
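To confirm numerically that this function is not convex, it suffices to find one pair of points where the convexity inequality fails; a minimal check (the particular points are an illustrative choice):

```python
import numpy as np

def f(x, y):
    # f(x, y) = 0.5 * x^2 + cos(2 * pi * y)
    return 0.5 * x ** 2 + np.cos(2 * np.pi * y)

# Take p = (0, -0.25) and q = (0, 0.25); their midpoint (lambda = 0.5) is (0, 0).
lhs = 0.5 * f(0.0, -0.25) + 0.5 * f(0.0, 0.25)  # ~ 0.0
rhs = f(0.0, 0.0)                                # = 1.0
# Convexity would require lhs >= rhs, but it fails here, so f is not convex.
print(lhs, rhs, lhs >= rhs)                      # ~0.0 1.0 False
```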

Convex Functions and Second Derivatives

\(f^{\prime \prime}(x) \geq 0 \Longleftrightarrow f(x)\) is convex

Necessity (\(\Leftarrow\)):

If \(f\) is convex, then for any \(x\) and \(\epsilon>0\):

\[ \frac{1}{2} f(x+\epsilon)+\frac{1}{2} f(x-\epsilon) \geq f\left(\frac{x+\epsilon}{2}+\frac{x-\epsilon}{2}\right)=f(x) \]

Hence:

\[ f^{\prime \prime}(x)=\lim _{\epsilon \rightarrow 0} \frac{\frac{f(x+\epsilon)-f(x)}{\epsilon}-\frac{f(x)-f(x-\epsilon)}{\epsilon}}{\epsilon} \]

\[ f^{\prime \prime}(x)=\lim _{\epsilon \rightarrow 0} \frac{f(x+\epsilon)+f(x-\epsilon)-2 f(x)}{\epsilon^{2}} \geq 0 \]
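The finite-difference quotient above can also be checked numerically; a small sketch using the convex function \(f(x)=x^{2}\) (an illustrative choice), whose second difference is non-negative everywhere:

```python
import numpy as np

def f(x):
    return x ** 2  # convex, with f''(x) = 2

def second_difference(f, x, eps=1e-4):
    # (f(x + eps) + f(x - eps) - 2 f(x)) / eps^2 approximates f''(x)
    return (f(x + eps) + f(x - eps) - 2 * f(x)) / eps ** 2

for x in np.linspace(-2.0, 2.0, 5):
    print(x, second_difference(f, x))  # approximately 2.0 >= 0 everywhere
```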

Sufficiency (\(\Rightarrow\)):

Let \(a<x<b\) be three points in the domain of \(f(x)\). By the mean value theorem:

\[ \begin{array}{l}{f(x)-f(a)=(x-a) f^{\prime}(\alpha) \text { for some } \alpha \in[a, x] \text { and }} \\ {f(b)-f(x)=(b-x) f^{\prime}(\beta) \text { for some } \beta \in[x, b]}\end{array} \]

Since \(f^{\prime \prime} \geq 0\), \(f^{\prime}\) is monotonically non-decreasing, so \(f^{\prime}(\beta) \geq f^{\prime}(\alpha)\) and:

\[ \begin{aligned} f(b)-f(a) &=f(b)-f(x)+f(x)-f(a) \\ &=(b-x) f^{\prime}(\beta)+(x-a) f^{\prime}(\alpha) \\ & \geq(b-a) f^{\prime}(\alpha) \end{aligned} \]

Substituting \(f^{\prime}(\alpha)=\frac{f(x)-f(a)}{x-a}\) and rearranging gives \(f(x) \leq \frac{(b-x) f(a)+(x-a) f(b)}{b-a}\), which is exactly the convexity inequality with \(\lambda=\frac{b-x}{b-a}\).

Constraints

\[ \begin{array}{l}{\underset{\mathbf{x}}{\operatorname{minimize}} f(\mathbf{x})} \\ {\text { subject to } c_{i}(\mathbf{x}) \leq 0 \text { for all } i \in\{1, \ldots, N\}}\end{array} \]

Lagrange Multipliers

Following Boyd & Vandenberghe (2004), the constrained problem above can be handled through its Lagrangian:

\[ L(\mathbf{x}, \alpha)=f(\mathbf{x})+\sum_{i} \alpha_{i} c_{i}(\mathbf{x}) \text { where } \alpha_{i} \geq 0 \]

Penalty Terms

Instead of enforcing \(c_i(x) \leq 0\) exactly, one can add the term \(\alpha_i c_i(x)\) to the objective function as a penalty, as with the weight decay term \(\frac{\lambda}{2} \|w\|^2\) in a multilayer perceptron.
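In PyTorch the same penalty is usually applied through the optimizer's `weight_decay` argument, which for plain SGD is equivalent to adding \(\frac{\lambda}{2}\|w\|^{2}\) to the loss. A minimal sketch, where the linear model and random data are placeholders of my own:

```python
import torch
from torch import nn

# Toy model and data used only for illustration
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

# weight_decay adds lambda * w to each parameter's gradient,
# i.e. the gradient of the penalty (lambda / 2) * ||w||^2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-2)
loss_fn = nn.MSELoss()

for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```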

Projection

The projection of \(\mathbf{x}\) onto a convex set \(X\) is the point of \(X\) closest to \(\mathbf{x}\) in the Euclidean norm:

\[ \operatorname{Proj}_{X}(\mathbf{x})=\underset{\mathbf{x}^{\prime} \in X}{\operatorname{argmin}}\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|_{2} \]
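A common special case in deep learning is projection onto an L2 ball of radius \(\theta\), which is what gradient clipping by norm does. A minimal sketch of that projection (the radius and the example vector are illustrative assumptions):

```python
import torch

def project_onto_l2_ball(x, radius=1.0):
    # Proj_X(x) for X = {x' : ||x'||_2 <= radius}: points inside the ball are
    # unchanged, points outside are rescaled onto its surface.
    norm = torch.norm(x)
    if norm <= radius:
        return x
    return x * (radius / norm)

g = torch.tensor([3.0, 4.0])           # ||g||_2 = 5
print(project_onto_l2_ball(g, 1.0))    # tensor([0.6000, 0.8000])
```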



Reprinted from www.cnblogs.com/54hys/p/12335909.html