1. When the Gradient Is Small
1.1 Why Optimization Fails
When the gradient is (near) zero, training is stuck at a critical point. Which kind of critical point is it?
- Local minimum: no direction to go.
- Saddle point: it is possible to escape.
1.2 Taylor Series Approximation
Hessian:
- Local minimum: H is positive definite (all eigenvalues are positive)
- Local maximum: H is negative definite (all eigenvalues are negative)
- Saddle point: some eigenvalues are positive, and some are negative
Tip: H may tell us a parameter-update direction. At a saddle point, an eigenvector with a negative eigenvalue points in a direction along which the loss decreases.
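The classification above can be sketched in code. This is a minimal illustration using a hypothetical toy loss L(w1, w2) = w1² − w2², whose Hessian at the origin is constant; the eigenvalues of H identify the type of critical point, and an eigenvector with a negative eigenvalue gives an escape direction.

```python
import numpy as np

# Hessian of the toy loss L(w1, w2) = w1^2 - w2^2 at the origin (a saddle point).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(H)

if np.all(eigvals > 0):
    kind = "local minimum"      # H positive definite
elif np.all(eigvals < 0):
    kind = "local maximum"      # H negative definite
else:
    kind = "saddle point"       # mixed signs

# The eigenvector of the most negative eigenvalue is a direction
# along which the loss decreases, i.e. a way to escape the saddle.
escape_dir = eigvecs[:, np.argmin(eigvals)]
```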
2. Batch and Momentum
2.1 Review: Optimization with Batches
1 epoch = seeing all the batches once; shuffle the data after each epoch.
- Large batch
Long time for cooldown (per update), but powerful (less noisy)
A larger batch size does not require longer time to compute the gradient, thanks to GPU parallelism (unless the batch size is too large)
- Small batch
Short time for cooldown, but noisy
A smaller batch requires longer time for one epoch (longer time to see all data once)
Conclusion:
A smaller batch size tends to have better performance.
"Noisy" updates are better for training.
2.2 Momentum
Concept: like a ball rolling down the error surface, the update keeps some of its previous motion, so it does not stop at a point with a small gradient.
- Vanilla gradient descent: move in the opposite direction of the gradient.
- Gradient descent + momentum:
Movement: movement of the last step minus the gradient at present.
Movement is based not just on the gradient, but also on the previous movement.
3. Adaptive Learning Rate
Training stuck ≠ small gradient
Training can be difficult even without critical points
Different parameters need different learning rates
3.1 Root Mean Square
Learning rate adapts dynamically
Error Surface can be very complex.
RMSProp
The recent gradient has larger influence, and the past gradients have less influence.
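A minimal sketch of the RMSProp-style update on a hypothetical toy loss L(w) = w². The per-parameter step is the learning rate divided by an exponentially weighted root mean square of past gradients, so recent gradients dominate; the hyperparameter values here are illustrative assumptions.

```python
import math

def grad(w):
    # Gradient of the toy loss L(w) = w^2.
    return 2 * w

w = 5.0
lr = 0.1
alpha = 0.9    # decay: weight given to past squared gradients
sigma2 = 0.0   # running mean of squared gradients
eps = 1e-8     # avoids division by zero

for _ in range(500):
    g = grad(w)
    # Exponential moving average: the recent gradient has larger influence,
    # past gradients decay by a factor of alpha each step.
    sigma2 = alpha * sigma2 + (1 - alpha) * g ** 2
    w = w - lr * g / (math.sqrt(sigma2) + eps)
# The effective step size adapts to the gradient magnitude; w approaches 0.
```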
3.2 Learning Rate Scheduling
- Learning Rate Decay
- Warm up
- Summary of Optimization
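The two scheduling ideas above can be combined in one schedule function. This is a common recipe (linear warm-up followed by cosine decay), shown here as an illustrative sketch; the function name, step counts, and the cosine shape are assumptions, not prescribed by the source.

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    # Warm up: linearly increase the learning rate from 0 to base_lr,
    # so early updates with a poorly estimated sigma are not too large.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Decay: cosine-anneal the learning rate from base_lr down to 0,
    # since we get closer to the minimum as training proceeds.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```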
4. Classification (Short Version)
4.1 Classification as Regression?
- Regression
- Classification as regression
4.2 Class as one-hot vector
Regression
Classification
Soft-max
Loss of Classification
Mean Square Error (MSE)
Cross-entropy
Minimizing cross-entropy is equivalent to maximizing likelihood.
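The soft-max and cross-entropy steps can be written out directly. This sketch uses an illustrative logit vector; the final comment checks the stated equivalence, since with a one-hot target the cross-entropy reduces to the negative log-likelihood of the correct class.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y_onehot):
    # Small constant avoids log(0).
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

logits = np.array([2.0, 1.0, 0.1])   # illustrative network outputs
y = np.array([1.0, 0.0, 0.0])        # one-hot target: class 0

p = softmax(logits)                  # probabilities summing to 1
loss = cross_entropy(p, y)
# With a one-hot target, loss == -log p[correct class]:
# minimizing cross-entropy = maximizing the likelihood of the correct class.
```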
5. Quick Introduction of Batch Normalization
5.1 Changing Landscape
- Feature Normalization
In general, feature normalization makes gradient descent converge faster.
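Feature normalization is typically z-score normalization: for each feature dimension, subtract the mean and divide by the standard deviation computed over the training data. A minimal sketch with an illustrative two-feature dataset:

```python
import numpy as np

# Two features on very different scales (illustrative data):
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

mean = X.mean(axis=0)   # per-feature mean over the training set
std = X.std(axis=0)     # per-feature standard deviation
X_norm = (X - mean) / std
# Every feature now has mean 0 and std 1, so the error surface is
# less elongated and gradient descent converges faster.
```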
5.2 Considering Deep Learning
Batch normalization
We do not always have a batch at the testing stage; instead, use moving averages of the batch means and variances computed during training.
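The train/test distinction above can be sketched as a minimal batch-norm layer. This is an assumed, simplified implementation (class name, momentum value, and epsilon are illustrative): during training it normalizes with batch statistics and updates moving averages; at test time, where a batch may not exist, it normalizes with the moving averages instead.

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)        # learnable scale
        self.beta = np.zeros(dim)        # learnable shift
        self.moving_mean = np.zeros(dim)
        self.moving_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use statistics of the current batch, and update moving averages.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # No batch at test time: fall back to the moving averages.
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

Usage: after a training step on a batch, a single test sample can still be normalized, because the moving statistics stand in for batch statistics.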