1. When the Gradient Is Small
1.1 Why Optimization Fails
When the gradient is (near) zero, training is stuck at a critical point. Which kind of critical point is it?
- Local minimum: no direction to go.
- Saddle point: it is possible to escape.
1.2 Taylor Series Approximation
Hessian:
- Local minimum: H is positive definite (all eigenvalues are positive)
- Local maximum: H is negative definite (all eigenvalues are negative)
- Saddle point: some eigenvalues are positive, and some are negative
Tip: H may tell us a parameter-update direction. At a saddle point, an eigenvector with a negative eigenvalue points in a direction along which the loss decreases.
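The classification above can be sketched in code. This is a minimal illustration using a hypothetical toy loss L(w1, w2) = w1² − w2², whose Hessian at the origin is constant; the eigenvalues of H identify the type of critical point, and an eigenvector with a negative eigenvalue gives an escape direction.

```python
import numpy as np

# Hessian of the toy loss L(w1, w2) = w1^2 - w2^2 at the origin (a saddle point).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(H)

if np.all(eigvals > 0):
    kind = "local minimum"      # H positive definite
elif np.all(eigvals < 0):
    kind = "local maximum"      # H negative definite
else:
    kind = "saddle point"       # mixed signs

# The eigenvector of the most negative eigenvalue is a direction
# along which the loss decreases, i.e. a way to escape the saddle.
escape_dir = eigvecs[:, np.argmin(eigvals)]
```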
2. Batch and Momentum
2.1 Review: Optimization with Batches
1 epoch = seeing all the batches once; shuffle the data after each epoch.
- Large batch
Long time for cooldown (per update), but powerful (less noisy)
A larger batch size does not require longer time to compute the gradient, thanks to GPU parallelism (unless the batch size is too large)
- Small batch
Short time for cooldown, but noisy
A smaller batch requires longer time for one epoch (longer time to see all data once)
Conclusion:
A smaller batch size tends to have better performance.
"Noisy" updates are better for training.
2.2 Momentum
Concept: like a ball rolling down the error surface, the update keeps some of its previous motion, so it does not stop at a point with a small gradient.
- Vanilla gradient descent: move in the opposite direction of the gradient.
- Gradient descent + momentum:
Movement: movement of the last step minus the gradient at present.
Movement is based not just on the gradient, but also on the previous movement.
3. Adaptive Learning Rate
Training stuck ≠ small gradient
Training can be difficult even without critical points
Different parameters need different learning rates
3.1 Root Mean Square
Learning rate adapts dynamically
Error Surface can be very complex.
RMSProp
The recent gradient has larger influence, and the past gradients have less influence.
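A minimal sketch of the RMSProp-style update on a hypothetical toy loss L(w) = w². The per-parameter step is the learning rate divided by an exponentially weighted root mean square of past gradients, so recent gradients dominate; the hyperparameter values here are illustrative assumptions.

```python
import math

def grad(w):
    # Gradient of the toy loss L(w) = w^2.
    return 2 * w

w = 5.0
lr = 0.1
alpha = 0.9    # decay: weight given to past squared gradients
sigma2 = 0.0   # running mean of squared gradients
eps = 1e-8     # avoids division by zero

for _ in range(500):
    g = grad(w)
    # Exponential moving average: the recent gradient has larger influence,
    # past gradients decay by a factor of alpha each step.
    sigma2 = alpha * sigma2 + (1 - alpha) * g ** 2
    w = w - lr * g / (math.sqrt(sigma2) + eps)
# The effective step size adapts to the gradient magnitude; w approaches 0.
```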
3.2 Learning Rate Scheduling
- Learning Rate Decay
- Warm up
- Summary of Optimization
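The two scheduling ideas above can be combined in one schedule function. This is a common recipe (linear warm-up followed by cosine decay), shown here as an illustrative sketch; the function name, step counts, and the cosine shape are assumptions, not prescribed by the source.

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    # Warm up: linearly increase the learning rate from 0 to base_lr,
    # so early updates with a poorly estimated sigma are not too large.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Decay: cosine-anneal the learning rate from base_lr down to 0,
    # since we get closer to the minimum as training proceeds.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```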
4. Classification (Short Version)
4.1 Classification as Regression?
- Regression
- Classification as regression
4.2 Class as one-hot vector
Regression
Classification
Soft-max
Loss of Classification
Mean Square Error (MSE)
Cross-entropy
Minimizing cross-entropy is equivalent to maximizing likelihood.
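The soft-max and cross-entropy steps can be written out directly. This sketch uses an illustrative logit vector; the final comment checks the stated equivalence, since with a one-hot target the cross-entropy reduces to the negative log-likelihood of the correct class.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y_onehot):
    # Small constant avoids log(0).
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

logits = np.array([2.0, 1.0, 0.1])   # illustrative network outputs
y = np.array([1.0, 0.0, 0.0])        # one-hot target: class 0

p = softmax(logits)                  # probabilities summing to 1
loss = cross_entropy(p, y)
# With a one-hot target, loss == -log p[correct class]:
# minimizing cross-entropy = maximizing the likelihood of the correct class.
```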
5. Quick Introduction of Batch Normalization
5.1 Changing Landscape
- Feature Normalization
In general, feature normalization makes gradient descent converge faster.
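Feature normalization is typically z-score normalization: for each feature dimension, subtract the mean and divide by the standard deviation computed over the training data. A minimal sketch with an illustrative two-feature dataset:

```python
import numpy as np

# Two features on very different scales (illustrative data):
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

mean = X.mean(axis=0)   # per-feature mean over the training set
std = X.std(axis=0)     # per-feature standard deviation
X_norm = (X - mean) / std
# Every feature now has mean 0 and std 1, so the error surface is
# less elongated and gradient descent converges faster.
```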
5.2 Considering Deep Learning
Batch normalization
We do not always have a batch at the testing stage; instead, use moving averages of the batch means and variances computed during training.
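The train/test distinction above can be sketched as a minimal batch-norm layer. This is an assumed, simplified implementation (class name, momentum value, and epsilon are illustrative): during training it normalizes with batch statistics and updates moving averages; at test time, where a batch may not exist, it normalizes with the moving averages instead.

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)        # learnable scale
        self.beta = np.zeros(dim)        # learnable shift
        self.moving_mean = np.zeros(dim)
        self.moving_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use statistics of the current batch, and update moving averages.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # No batch at test time: fall back to the moving averages.
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

Usage: after a training step on a batch, a single test sample can still be normalized, because the moving statistics stand in for batch statistics.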