Optimization algorithms (Week 2)

  1. Batch vs. mini-batch gradient descent
    (1) The training set can be split into 5000 mini-batches of 1000 examples each, and one gradient step is taken per mini-batch (a partitioning sketch follows item 3 below).
    for t = 1, ..., 5000:
    Forward prop on $X^{\{t\}}$:
    $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
    $A^{[1]} = g^{[1]}(Z^{[1]})$
    ...
    $A^{[L]} = g^{[L]}(Z^{[L]})$
    (2) Compute the cost (for a mini-batch of 1000 examples):
    $J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \|W^{[l]}\|_F^2$
    (3) Backprop to compute the gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$), then update every layer $l$:
    $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$
    $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$
  2. Choosing mini-batch size
    (1) If mini-batch size = m, the size of the training set: this is batch gradient descent. (Each iteration takes a very long time when the training set is large.)
    (2) If mini-batch size = 1: this is stochastic gradient descent; every example is its own mini-batch. (Very noisy, and it ends up oscillating around the minimum instead of converging.)
    (3) In practice, choose something in between (a mini-batch size that is neither too big nor too small).
  3. Some guidelines about choosing your mini-batch size:
    (1) If the training set is small ($m \le 2000$): use batch gradient descent.
    (2) Typical mini-batch sizes: 64, 128, 256, 512 (powers of 2 are said to make the code run faster).
    (3) Make sure each mini-batch $X^{\{t\}}, Y^{\{t\}}$ fits in CPU/GPU memory (see the sketch below).
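
A minimal NumPy sketch of the splitting described in items 1-3, assuming the column-per-example convention of these notes ($X$ of shape $(n_x, m)$, $Y$ of shape $(1, m)$); the function name `make_mini_batches` and the example shapes are illustrative assumptions, not code from the course.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m training examples and split them into mini-batches.

    Assumes X has shape (n_x, m) and Y has shape (1, m),
    i.e. one column per example.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # random shuffle of the columns
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size              # the last batch may be smaller
        mini_batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return mini_batches

# Example usage (shapes are made up):
# X = np.random.randn(100, 3200); Y = np.random.randint(0, 2, size=(1, 3200))
# batches = make_mini_batches(X, Y, batch_size=64)   # 50 mini-batches of 64
```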
  4. Exponentially weighted moving averages
    $V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
    $\beta = 0.9$: averages the temperature over roughly $\frac{1}{1-\beta} = 10$ days.
    $\beta = 0.98$: averages the temperature over roughly $\frac{1}{1-\beta} = 50$ days.
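
As a quick illustration of $V_t = \beta V_{t-1} + (1-\beta)\,\theta_t$, here is a minimal sketch; the `temps` series in the comment is made up.

```python
import numpy as np

def ewma(thetas, beta=0.9):
    """V_t = beta * V_{t-1} + (1 - beta) * theta_t, starting from V_0 = 0."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

# beta = 0.9  -> averages over roughly 1 / (1 - 0.9)  = 10 days
# beta = 0.98 -> averages over roughly 1 / (1 - 0.98) = 50 days
# temps = np.array([4.0, 9.0, 6.0, 7.0, 12.0])
# ewma(temps, beta=0.9)
```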
  5. Bias correction in exponentially weighted averages
    $V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
    Bias-corrected estimate: $\dfrac{V_t}{1 - \beta^t}$
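
The same loop with the correction $V_t / (1 - \beta^t)$ applied, which removes the low initial estimates caused by starting from $V_0 = 0$ (a sketch under the same assumptions as above):

```python
def ewma_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction V_t / (1 - beta^t)."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # bias-corrected estimate
    return out
```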
  6. Gradient descent with momentum:
    (1) Compute $dW$, $db$ on the current mini-batch, then take exponentially weighted averages of the gradients (the same form $V_\theta = \beta V_\theta + (1-\beta)\,\theta_t$ as above, applied to $dW$ and $db$):
    $V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
    $V_{db} = \beta V_{db} + (1 - \beta)\,db$
    (2) Update $W$, $b$:
    $W := W - \alpha V_{dW}$
    $b := b - \alpha V_{db}$
    This shrinks the oscillations of gradient descent in the vertical direction and speeds up progress in the horizontal direction, so gradient descent converges faster.
    (3) Implementation details (see the sketch below):
    Initialize $V_{dW} = 0$, $V_{db} = 0$.
    On iteration t:
    Compute $dW$, $db$ on the current mini-batch.
    $V_{dW} = \beta V_{dW} + (1 - \beta)\,dW$
    $V_{db} = \beta V_{db} + (1 - \beta)\,db$
    $W := W - \alpha V_{dW}$, $b := b - \alpha V_{db}$
    Hyperparameters: $\alpha$, $\beta$ (commonly $\beta = 0.9$).
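
A minimal sketch of one momentum update matching the implementation details above; the caller is assumed to supply $dW$, $db$ from backprop, and the defaults are the common choices mentioned in the notes.

```python
def momentum_step(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update.

    vdW and vdb start at zero arrays with the same shapes as W and b.
    """
    vdW = beta * vdW + (1 - beta) * dW   # exponentially weighted average of dW
    vdb = beta * vdb + (1 - beta) * db   # exponentially weighted average of db
    W = W - alpha * vdW
    b = b - alpha * vdb
    return W, b, vdW, vdb
```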
    (4) RMSprop (root mean square prop)
    On iteration t:
    Compute $dW$, $db$ on the current mini-batch.
    $S_{dW} = \beta S_{dW} + (1 - \beta)\,(dW)^2$  (the square $(dW)^2$ is element-wise)
    $S_{db} = \beta S_{db} + (1 - \beta)\,(db)^2$
    Update:
    $W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \varepsilon}$
    $b := b - \alpha \dfrac{db}{\sqrt{S_{db}} + \varepsilon}$
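
A corresponding sketch of one RMSprop update; the default `beta` and `eps` values are typical choices, not values prescribed by the notes above.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, sdW, sdb, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop update; (dW)**2 and (db)**2 are element-wise squares.

    sdW and sdb start at zero arrays with the same shapes as W and b.
    """
    sdW = beta * sdW + (1 - beta) * dW ** 2
    sdb = beta * sdb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sdW) + eps)
    b = b - alpha * db / (np.sqrt(sdb) + eps)
    return W, b, sdW, sdb
```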
    (5) Adam optimization algorithm
    Initialize $V_{dW} = 0$, $S_{dW} = 0$, $V_{db} = 0$, $S_{db} = 0$.
    On iteration t:
    Compute $dW$, $db$ using the current mini-batch (the mini-batch gradient).
    "Momentum":
    $V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\,dW$, $V_{db} = \beta_1 V_{db} + (1 - \beta_1)\,db$
    "RMSprop":
    $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\,(dW)^2$, $S_{db} = \beta_2 S_{db} + (1 - \beta_2)\,(db)^2$
    Bias correction:
    $V_{dW}^{\text{corrected}} = \dfrac{V_{dW}}{1 - \beta_1^t}$, $V_{db}^{\text{corrected}} = \dfrac{V_{db}}{1 - \beta_1^t}$
    $S_{dW}^{\text{corrected}} = \dfrac{S_{dW}}{1 - \beta_2^t}$, $S_{db}^{\text{corrected}} = \dfrac{S_{db}}{1 - \beta_2^t}$
    Update:
    $W := W - \alpha \dfrac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}$
    $b := b - \alpha \dfrac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$
    Hyperparameter choices:
    $\alpha$: needs to be tuned
    $\beta_1$: 0.9 (for the $dW$ averages)
    $\beta_2$: 0.999 (for the element-wise $(dW)^2$ averages)
    $\varepsilon$: $10^{-8}$
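
A minimal sketch of one Adam step for a single parameter array (apply the same update to $b$); it combines the momentum, RMSprop, and bias-correction formulas above with the hyperparameter defaults listed, and leaves $\alpha$ as a required argument since it needs to be tuned.

```python
import numpy as np

def adam_step(W, dW, vdW, sdW, t, alpha,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array W at iteration t (t starts at 1)."""
    vdW = beta1 * vdW + (1 - beta1) * dW          # "momentum" term
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2     # "RMSprop" term, element-wise square
    v_corr = vdW / (1 - beta1 ** t)               # bias correction
    s_corr = sdW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, vdW, sdW
```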
  7. Learning rate decay
    epoch: one full pass through the training set
    $\alpha = \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \, \alpha_0$
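
The decay formula above as a one-liner, with a small worked example; the numbers $\alpha_0 = 0.2$ and decay_rate $= 1$ are chosen only for illustration.

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Example: alpha0 = 0.2, decay_rate = 1
# epoch 1 -> 0.1, epoch 2 -> 0.067, epoch 3 -> 0.05, ...
```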
  8. Local optima in neural networks

Reposted from blog.csdn.net/qq_31805127/article/details/79788571