Learning Note 03: Error and Gradient Descent
1. Bias and Variance
If the bias is large, the model underfits: design a more complex model or add more features.
If the variance is large, the model overfits: collect more data or apply regularization.
Two ways to select a model and estimate its true error:
- Cross validation
- N-fold cross validation
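N-fold cross validation splits the data into N folds and, in turn, validates on one fold while training on the rest. A minimal sketch of the index splitting (the helper name `n_fold_splits` is mine, not from the course):

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices) pairs for N-fold cross validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # the last fold absorbs any remainder samples
        end = (k + 1) * fold_size if k < n_folds - 1 else n_samples
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, val

folds = list(n_fold_splits(10, 3))  # 3 folds over 10 samples
```

Each sample appears in exactly one validation fold, and the average validation error over the N runs estimates the generalization error.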
2. Gradient Descent
2.1 Tuning your learning rates
- Adaptive Learning Rates
Popular & simple idea: reduce the learning rate by some factor every few epochs.
- Adagrad
Divide the learning rate of each parameter by the root mean square of its previous derivatives:
w^{t+1} = w^t - (η / σ^t) g^t, where σ^t = sqrt( (1/(t+1)) Σ_{i=0}^{t} (g^i)^2 )
Vanilla gradient descent: the larger the gradient, the larger the step.
Adagrad: the step is additionally scaled down by the root mean square of the previous derivatives of parameter w.
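The Adagrad rule above can be sketched in a few lines. This is a minimal illustration (the function names and the toy quadratic objective are my own, not from the note); in practice the running root mean square is usually implemented as an accumulated sum of squared gradients:

```python
import numpy as np

def adagrad(grad_fn, w0, lr=1.0, epochs=500, eps=1e-8):
    """Per-parameter learning rate: divide lr by the root of the
    accumulated squared past gradients (Adagrad)."""
    w = np.asarray(w0, dtype=float)
    sum_sq = np.zeros_like(w)              # accumulated squared gradients
    for _ in range(epochs):
        g = grad_fn(w)
        sum_sq += g ** 2
        w -= lr * g / (np.sqrt(sum_sq) + eps)
    return w

# toy example: minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = adagrad(lambda w: 2 * (w - 3.0), w0=[0.0])
```

Parameters that have seen large gradients get a smaller effective learning rate, so frequently-updated directions slow down while rarely-updated ones keep a larger step.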
2.2 Stochastic Gradient Descent: Make the Training Faster
Compute the loss on only one example at a time and update the parameters after each example, instead of after seeing the whole dataset.
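A minimal SGD sketch on synthetic linear data (the toy dataset and squared-error loss are my illustration, assuming the note's per-example update):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: y = 2x + 1 plus a little noise
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 1 + rng.normal(scale=0.1, size=200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(x)):    # visit examples in random order
        err = (w * x[i] + b) - y[i]      # loss gradient for this ONE example
        w -= lr * err * x[i]             # update immediately
        b -= lr * err
```

Each parameter update is cheap and happens 200 times per epoch, so progress is much faster per data pass than full-batch gradient descent, at the cost of noisier steps.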
2.3 Feature Scaling
Rescale the features so they share the same range; otherwise the feature with the largest values dominates the gradient, the loss surface becomes elongated, and convergence is slow.
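A common way to do this is standardization, i.e. shifting each feature to zero mean and unit variance (a minimal sketch; the helper name `standardize` is mine):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_scaled = standardize(X)
```

After scaling, a single learning rate works reasonably well for every parameter.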
2.4 Gradient Descent Theory
Each time we update the parameters, we obtain θ that makes L(θ) smaller. (This only holds when the learning rate is small enough.)
2.5 Warning of Math
Formal derivation: gradient descent follows from approximating L(θ) by its first-order Taylor series around the current point.
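The Taylor-series argument can be written out for two parameters (standard derivation, stated here with my own symbols a, b for the current point):

```latex
% First-order Taylor expansion of L around the current point (a, b):
L(\theta) \approx L(a, b)
  + \frac{\partial L}{\partial \theta_1}\Big|_{(a,b)} (\theta_1 - a)
  + \frac{\partial L}{\partial \theta_2}\Big|_{(a,b)} (\theta_2 - b)

% Inside a small circle around (a, b), this linear approximation is
% minimized by moving opposite to the gradient:
\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}
  = \begin{bmatrix} a \\ b \end{bmatrix}
  - \eta \begin{bmatrix}
      \partial L / \partial \theta_1 \\
      \partial L / \partial \theta_2
    \end{bmatrix}
```

This is exactly the gradient descent update, and it is only justified when η is small enough that the first-order approximation holds inside the circle.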
Limitations of gradient descent: it can get stuck at local minima, get stuck at saddle points, and be very slow on plateaus where the gradient is close to zero.