Contents
Fancier optimization
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # step in the direction of the negative gradient
```
For this type of objective function, what does a saddle point mean? It means that at my current point, the loss goes up in some directions and goes down in others, even though the gradient is (nearly) zero; for example, f(x, y) = x^2 - y^2 has a saddle point at the origin.
Nesterov momentum
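As a concrete illustration, here is a minimal sketch of Nesterov momentum on a toy quadratic, where the gradient is evaluated at the "look-ahead" point x + rho * v. The toy loss, variable names, and hyperparameter values are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Toy quadratic bowl f(x) = 0.5 * x^T A x (illustrative assumption)
A = np.diag([1.0, 10.0])
def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
v = np.zeros_like(x)               # velocity
rho, learning_rate = 0.9, 0.01

for _ in range(100):
    dx = grad(x + rho * v)         # gradient at the look-ahead point
    v = rho * v - learning_rate * dx
    x += v
```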
AdaGrad
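A minimal AdaGrad sketch under the same toy setup (the loss, names, and constants are assumptions): squared gradients are accumulated per parameter, so coordinates with consistently large gradients take smaller steps, and the effective step size decays over training.

```python
import numpy as np

A = np.diag([1.0, 10.0])           # same toy quadratic as above
def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 0.1

for _ in range(100):
    dx = grad(x)
    grad_squared += dx * dx        # accumulated sum of squared gradients (never decays)
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```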
RMSProp
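RMSProp keeps the same per-parameter scaling but replaces AdaGrad's ever-growing sum with an exponential moving average, so the step size no longer decays toward zero. Again a sketch with assumed names and constants:

```python
import numpy as np

A = np.diag([1.0, 10.0])
def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
decay_rate, learning_rate = 0.99, 0.01

for _ in range(100):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx  # leaky average
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```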
At the very first time step, we've initialized our second moment to zero. The second-moment decay rate beta2 is typically something like 0.9 or 0.99, very close to one, so after one update the second moment is still very close to zero. When we make our update step and divide by (the square root of) this second moment, we're dividing by a very small number, which can produce a very large step right at the start of training. Adam adds a bias correction term to avoid this problem of taking very large steps early on.
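Here is a sketch of the full Adam update with those bias correction terms (hyperparameter values follow the common defaults; the toy loss and names are assumptions). Note how at t = 1 the correction divides the near-zero moments by 1 - beta, rescaling them to a sensible magnitude.

```python
import numpy as np

A = np.diag([1.0, 10.0])
def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
m = np.zeros_like(x)               # first moment (momentum-like)
v = np.zeros_like(x)               # second moment (RMSProp-like)
beta1, beta2, learning_rate, eps = 0.9, 0.999, 1e-3, 1e-8

for t in range(1, 101):            # t starts at 1 for bias correction
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    m_hat = m / (1 - beta1 ** t)   # bias correction: undoes the zero initialization
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```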
If you can afford to do full-batch updates, then try out L-BFGS.
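For a full-batch, deterministic objective, here is a sketch using SciPy's L-BFGS implementation (assuming scipy is available; the toy objective is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0])

def f(x):
    return 0.5 * x @ A @ x         # full-batch loss, no minibatch noise

def grad(x):
    return A @ x

result = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method="L-BFGS-B")
print(result.x)                    # near the minimum at the origin
```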
Regularization
Dropout
More common: Inverted dropout
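A minimal sketch of inverted dropout for one hidden layer (the layer shapes, W1, b1, and p are illustrative assumptions): the mask is divided by the keep probability p at training time, so activations keep the same expected scale and no rescaling is needed at test time.

```python
import numpy as np

p = 0.5                                       # probability of keeping a unit
W1 = np.random.randn(100, 20) * 0.01          # illustrative layer parameters
b1 = np.zeros(100)

def train_step(x):
    h = np.maximum(0, W1 @ x + b1)
    mask = (np.random.rand(*h.shape) < p) / p  # drop units and rescale at train time
    return h * mask

def predict(x):
    return np.maximum(0, W1 @ x + b1)          # test time: no masking, no scaling
```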
Data augmentation
(color jitter)
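A sketch of a training-time augmentation pipeline with color jitter, assuming torchvision is available (the parameter values are illustrative):

```python
from torchvision import transforms

# Color jitter: randomly perturb brightness, contrast, and saturation
# of each training image, so the network sees slightly different colors every epoch.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```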