Notes on Hung-yi Lee's Machine Learning course --- Gradient Descent

gradient

  • gradient is a vector
  • the gradient points along the normal direction of the contour lines (a quick numerical check follows this list)
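
A minimal sketch of the two points above; the quadratic $f(x,y)=x^2+y^2$ and the test point are my own choices, not from the course. The gradient is a 2-D vector, and on the circular contours of this $f$ it is parallel to the outward normal of the level set.

```python
import numpy as np

def grad_f(p):
    # analytic gradient of f(x, y) = x**2 + y**2, whose contours are circles
    x, y = p
    return np.array([2 * x, 2 * y])

p = np.array([3.0, 4.0])
g = grad_f(p)

# the contour through p is the circle x**2 + y**2 = 25; its outward normal
# at p is p / |p|, and the gradient should be parallel to that normal
normal = p / np.linalg.norm(p)
print(g)                                    # [6. 8.]
print(g[0] * normal[1] - g[1] * normal[0])  # ~0.0 -> the two vectors are parallel
```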

set the learning rate carefully

  • too small - the iterations take too long to reach the minimum
  • too large - the updates overshoot and cannot find the minima (a small sketch of both failure modes follows this list)
  • if there are more than three parameters, you cannot visualize the loss surface
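
To make the two failure modes concrete, here is a minimal sketch on a toy loss $L(w)=w^2$; the loss, step count, and learning rates are my own choices, not from the notes. A tiny learning rate barely moves after many iterations, while an overly large one makes $|w|$ grow and never finds the minimum.

```python
def descend(lr, steps=20, w=5.0):
    """Plain gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(descend(lr=0.001))  # too small: after 20 steps w is still ~4.8, far from 0
print(descend(lr=0.4))    # moderate: w is already very close to the minimum at 0
print(descend(lr=1.1))    # too large: |w| grows every step, so it never finds 0
```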

adaptive learning rate (adagrad)

  • 1/t decay: $ \eta^t = \frac{\eta}{\sqrt{t+1}}$
  • vanilla gradient descent: $w^{t+1} = w^t - \eta^t g^t$; Adagrad: $w^{t+1} = w^t - \frac{\eta^t}{\sigma^t} g^t$, where $\sigma^t$ is the root mean square of the previous gradients of that parameter; combined with the 1/t decay above this simplifies to $w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$ (a sketch of both updates follows this list)
  • the apparent contradiction in Adagrad's learning rate: a larger gradient $g^t$ enlarges the numerator (bigger step) but also enlarges $\sigma^t$ in the denominator (smaller step)
  • larger gradient, larger step? not necessarily once more than one parameter is involved: comparing across different parameters, a larger gradient does not mean the parameter is farther from its minimum
  • the best step size is $\frac{|\text{first derivative}|}{\text{second derivative}}$; Adagrad uses the accumulated first derivatives to estimate the size of the second derivative without extra computation
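
A minimal sketch of the two update rules compared above, on a one-dimensional toy loss; the loss $L(w)=(w-3)^2$ and the hyperparameters are my own assumptions. Vanilla gradient descent uses a fixed learning rate, while Adagrad divides it by the root of the accumulated squared gradients, so its steps shrink over time.

```python
import numpy as np

def vanilla_gd(grad, w, eta=0.1, steps=100):
    # fixed learning rate: w <- w - eta * g
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def adagrad(grad, w, eta=0.1, steps=100, eps=1e-8):
    # adaptive rate: divide eta by the root of the accumulated squared gradients
    acc = 0.0
    for _ in range(steps):
        g = grad(w)
        acc += g ** 2
        w = w - eta / (np.sqrt(acc) + eps) * g
    return w

grad = lambda w: 2 * (w - 3)     # gradient of L(w) = (w - 3)**2
print(vanilla_gd(grad, w=10.0))  # converges to ~3
print(adagrad(grad, w=10.0))     # creeps toward 3 with ever smaller steps
```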

stochastic gradient descent

  • the loss $L(\theta)$ changes from summing the errors of all training examples to considering only the error of the current example (a sketch of the two variants follows below)
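
A minimal sketch of the difference; the toy linear-regression data and learning rate are my own assumptions. Batch gradient descent sums the error over all examples before each update, while stochastic gradient descent updates immediately after looking at a single example.

```python
import numpy as np

# toy data generated from y = 2 * x; per-example loss: (w * x - y)**2
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs

def batch_gd(w=0.0, eta=0.01, epochs=50):
    # one update per pass, using the gradient of the summed loss
    for _ in range(epochs):
        grad = np.sum(2 * (w * xs - ys) * xs)
        w -= eta * grad
    return w

def sgd(w=0.0, eta=0.01, epochs=50):
    # one update per example, using only that example's error
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            grad = 2 * (w * x - y) * x
            w -= eta * grad
    return w

print(batch_gd())  # ~2.0
print(sgd())       # ~2.0, reached with many more (noisier) small updates
```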

feature scaling

  • make different features have the same scaling (one common recipe is sketched after this list)
  • the contour plot then becomes close to circular, so each update points more directly at the minimum and training is more efficient
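
One common way to do this is standardization; the sketch below, with a toy feature matrix of my own choosing, subtracts each feature's mean and divides by its standard deviation so every column ends up with mean 0 and standard deviation 1.

```python
import numpy as np

# rows are examples, columns are features with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

mean = X.mean(axis=0)        # per-feature mean
std = X.std(axis=0)          # per-feature standard deviation
X_scaled = (X - mean) / std  # every feature now has mean 0 and std 1

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```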

gradient descent is based on the Taylor series

e.g.: pick a random point on the contour plot as the initial point, draw a small circle centered on it, and look for the point $\theta=(\theta_1 ,\theta_2)$ inside the circle that minimizes the loss function:
$$ L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$$
Let $u= \frac{\partial L(a,b)}{\partial \theta_1}$, $v=\frac{\partial L(a,b)}{\partial \theta_2}$, $\Delta \theta_1=\theta_1-a$, $\Delta \theta_2=\theta_2-b$, and recall that for two vectors the inner product is $a\cdot b=|a||b|\cos(a,b)$.

In this approximation of $L(\theta)$, the terms $L(a,b)$, $u$, and $v$ are all constants, so $L(\theta)$ is minimized when the inner product $(u,v)\cdot (\Delta \theta_1,\Delta \theta_2)$ is minimized.

To make the inner product as small as possible, $\cos$ should equal $-1$ and the length of $(\Delta \theta_1,\Delta \theta_2)$ should be as large as the circle allows, i.e. $(\Delta \theta_1,\Delta \theta_2)$ must point exactly opposite to $(u,v)$: $(\Delta \theta_1,\Delta \theta_2)=-\eta (u,v)$ for some positive constant $\eta$. Substituting back gives $(\theta_1,\theta_2)=(a,b)-\eta (u,v)$.

The expression above is the basic gradient descent update, where $(u,v)=(\frac{\partial L(a,b)}{\partial \theta_1},\frac{\partial L(a,b)}{\partial \theta_2})$ is exactly the gradient.
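
A minimal numerical check of this derivation; the toy loss, the center point $(a,b)$, and the circle radius are my own assumptions. Among the points on a small circle around $(a,b)$, the one with the smallest loss lies in the direction of $-(u,v)$, exactly as the inner-product argument predicts.

```python
import numpy as np

def L(t1, t2):
    # toy loss with its minimum at (1, -2)
    return (t1 - 1) ** 2 + 2 * (t2 + 2) ** 2

a, b = 3.0, 1.0          # current point (a, b)
u = 2 * (a - 1)          # dL/d(theta1) at (a, b)
v = 4 * (b + 2)          # dL/d(theta2) at (a, b)

radius = 0.01            # small circle, so the first-order Taylor term dominates
angles = np.linspace(0, 2 * np.pi, 3600)
points = np.stack([a + radius * np.cos(angles),
                   b + radius * np.sin(angles)], axis=1)
losses = L(points[:, 0], points[:, 1])

best_step = points[np.argmin(losses)] - np.array([a, b])  # best move on the circle
neg_grad = -np.array([u, v])
# the two unit vectors should approximately coincide
print(best_step / np.linalg.norm(best_step))
print(neg_grad / np.linalg.norm(neg_grad))
```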

Reposted from www.cnblogs.com/cartmanfatass/p/11625560.html