Gradient descent

Gradient descent minimizes J(\theta_{0},\theta_{1}) as follows:

  1. Pick an arbitrary starting point \theta_{0},\theta_{1}
  2. Repeat the following update until convergence:

\theta_{j} := \theta _{j} - \alpha\frac{\partial J(\theta_{0},\theta_{1})}{\partial \theta_{j}},\quad j=0,1

Update the parameters simultaneously (compute both right-hand sides before assigning):

temp0 = \theta_{0} - \alpha\frac{\partial J(\theta_{0},\theta_{1})}{\partial \theta_{0}}

temp1 = \theta_{1} - \alpha\frac{\partial J(\theta_{0},\theta_{1})}{\partial \theta_{1}}

\theta_{0} = temp0

\theta_{1} = temp1
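As a concrete illustration, here is a minimal NumPy sketch of this simultaneous update. The quadratic cost J(\theta_{0},\theta_{1}) = \theta_{0}^{2} + 2\theta_{1}^{2}, its gradient, the learning rate, and the starting point are all illustrative assumptions, not values from the original.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the assumed toy cost J(theta) = theta_0^2 + 2*theta_1^2
    return np.array([2.0 * theta[0], 4.0 * theta[1]])

alpha = 0.1                       # learning rate (illustrative)
theta = np.array([3.0, -2.0])     # arbitrary starting point

for _ in range(100):
    g = grad_J(theta)             # evaluate both partial derivatives first ...
    theta = theta - alpha * g     # ... then update both parameters at once
```

Evaluating the full gradient before assigning is exactly what the temp0/temp1 variables enforce.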

Problems faced: local minima and saddle points.

Momentum

Add a momentum term to the iteration formula; the momentum accumulates the gradient update values from previous steps:

x_{t+1} = x_{t}+V_{t+1}

V_{t+1} = -\alpha \nabla f(x_{t})+\mu V_{t}

The momentum term accumulates past gradient information and keeps the update's inertia, which damps back-and-forth oscillation and speeds up convergence.
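Below is a minimal sketch of the momentum update, reusing the assumed toy gradient from the earlier example; the momentum coefficient \mu = 0.9 is an illustrative choice, not a value from the original.

```python
import numpy as np

alpha, mu = 0.1, 0.9              # step size and momentum coefficient (illustrative)
x = np.array([3.0, -2.0])
v = np.zeros_like(x)              # V_0 = 0

for _ in range(100):
    g = np.array([2.0 * x[0], 4.0 * x[1]])  # nabla f(x_t) of the assumed toy cost
    v = mu * v - alpha * g        # V_{t+1} = -alpha * grad + mu * V_t
    x = x + v                     # x_{t+1} = x_t + V_{t+1}
```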

Adaptive Gradient (AdaGrad)

Here g_{t} is the gradient vector of the parameters at iteration t, and \varepsilon is a small constant that prevents division by zero:

x_{t+1} = x_{t}-\alpha \frac{g_{t}}{\sqrt{\sum_{j=1}^{t}g_{j}^{2}+\varepsilon }}

The only difference from standard gradient descent is the extra denominator: it accumulates the squared-gradient history up to the current iteration and uses it to scale the step size of each parameter individually.
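A sketch of this update on the same assumed toy gradient; the squared gradients are accumulated element-wise, so a parameter with a large gradient history takes smaller steps.

```python
import numpy as np

alpha, eps = 0.5, 1e-8            # step size and epsilon (illustrative)
x = np.array([3.0, -2.0])
hist = np.zeros_like(x)           # running sum of squared gradients

for _ in range(100):
    g = np.array([2.0 * x[0], 4.0 * x[1]])  # gradient of the assumed toy cost
    hist += g ** 2                # accumulate g_j^2 element-wise
    x = x - alpha * g / np.sqrt(hist + eps)
```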

Adam (Adaptive Moment Estimation)

Two variables, m and v, are constructed from the gradients; both are initialized to 0:

\begin{aligned} m_{t} &= \beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\\ v_{t} &= \beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2} \end{aligned}

where \beta_{1},\beta_{2} are manually set hyperparameters:

x_{t+1} = x_{t}-\alpha \frac{\sqrt{1-\beta_{2}^{t}}}{1-\beta_{1}^{t}} \frac{m_{t}}{\sqrt{v_{t}}+\varepsilon }

m takes the place of the raw gradient, while v is used to construct a per-parameter learning rate.
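Here is a sketch of the Adam update on the same assumed toy gradient, using the commonly cited defaults \beta_{1}=0.9, \beta_{2}=0.999, \varepsilon=10^{-8}; these defaults are assumptions, not values stated in the original.

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # common defaults (assumed)
x = np.array([3.0, -2.0])
m = np.zeros_like(x)              # first moment, m_0 = 0
v = np.zeros_like(x)              # second moment, v_0 = 0

for t in range(1, 101):           # t starts at 1 so beta**t is well-defined
    g = np.array([2.0 * x[0], 4.0 * x[1]])  # gradient of the assumed toy cost
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # bias-corrected step, matching the update formula above
    x = x - alpha * (np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)) * m / (np.sqrt(v) + eps)
```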


Reposted from blog.csdn.net/linshuo1994/article/details/83060877