Learning Note 05 -- Neural Network Training

1. When the Gradient Is Small

1.1 Why Optimization Fails

The gradient can be (close to) zero at a critical point. Which kind of critical point is it?

Local minimum: there is no direction to go; the loss cannot be reduced further.

Saddle point: it is possible to escape along a direction in which the loss decreases.

1.2 Taylor Series Approximation

Near a critical point, the loss can be approximated by a second-order Taylor expansion; the Hessian H determines the type of critical point (see the expansion sketched after the list below):

  • Local minimum:
    H is positive definite = all eigenvalues are positive
  • Local maximum:
    H is negative definite = all eigenvalues are negative
  • Saddle point:
    Some eigenvalues are positive and some are negative
    Tip: H also tells us a parameter update direction: moving along an eigenvector whose eigenvalue is negative decreases the loss, so a saddle point can be escaped.
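
For reference, the second-order Taylor expansion around a point θ' behind this classification (a standard formulation, with g the gradient and H the Hessian at θ'):

$$
L(\boldsymbol{\theta}) \approx L(\boldsymbol{\theta}') + (\boldsymbol{\theta}-\boldsymbol{\theta}')^{\top}\mathbf{g} + \frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}')^{\top}\mathbf{H}(\boldsymbol{\theta}-\boldsymbol{\theta}')
$$

At a critical point g = 0, so the sign of the quadratic term, and hence the eigenvalues of H, decides whether θ' is a local minimum, local maximum, or saddle point.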

2. Batch and Momentum

2.1 Review: Optimization with Batch

1 epoch = seeing all the batches once; the data are shuffled after each epoch.

  • Large batch
    Long time per update ("cooldown"), but each update is powerful
    A larger batch size does not require a longer time to compute the gradient, because of GPU parallel computation (unless the batch size is too large)
  • Small batch
    Short time per update ("cooldown"), but the updates are noisy
    A smaller batch requires a longer time for one epoch (a long time to see all the data once)

Conclusion:
A smaller batch size tends to give better performance.
The "noisy" updates are better for training (a minimal data-loader sketch follows below).
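
A minimal PyTorch-style sketch of how batch size and per-epoch shuffling are typically set up (the toy dataset and the specific batch sizes are illustrative assumptions, not from the notes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1000 examples with 16 features each (illustrative only).
x = torch.randn(1000, 16)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(x, y)

# shuffle=True reshuffles the data at the start of every epoch,
# so each epoch sees the batches in a different order.
small_loader = DataLoader(dataset, batch_size=16, shuffle=True)   # noisy but frequent updates
large_loader = DataLoader(dataset, batch_size=256, shuffle=True)  # fewer, more stable updates

for epoch in range(2):
    for batch_x, batch_y in small_loader:
        pass  # one gradient update per batch would go here
```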

2.2 Momentum

(Vanilla) Gradient descent
Movement: move in the opposite direction of the current gradient, so a small gradient means a small movement

Gradient descent + Momentum
Movement: the movement of the last step minus the gradient at present
The movement is based not just on the current gradient but also on the previous movement (a minimal sketch follows below)
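
A minimal sketch of the momentum update on a toy 1-D loss (the loss function, learning rate, and momentum coefficient are illustrative assumptions):

```python
def grad(theta):
    # Gradient of a toy quadratic loss L(theta) = theta ** 2 (illustrative only).
    return 2.0 * theta

theta = 5.0      # initial parameter
movement = 0.0   # the first movement is zero
eta = 0.1        # learning rate (assumed value)
lam = 0.9        # momentum coefficient (assumed value)

for step in range(50):
    # Movement = movement of the last step minus the gradient at present.
    movement = lam * movement - eta * grad(theta)
    theta = theta + movement
```

Because the previous movement is carried over, the update can keep moving even where the current gradient is small.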

3 Adaptive Learning Rate

Training being stuck ≠ a small gradient.
Training can be difficult even without critical points; the error surface can be very complex.
Different parameters need different learning rates.

Root Mean Square: each parameter's learning rate is divided by the root mean square of its past gradients, so the learning rate adapts dynamically.

RMSProp: the recent gradient has a larger influence, and the past gradients have less influence (via an exponentially decaying average); a minimal sketch follows below.
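
A minimal sketch of an RMSProp-style per-parameter step size (the decay factor, learning rate, and toy gradient are illustrative assumptions):

```python
import math

def grad(theta):
    # Toy 1-D gradient (illustrative only).
    return 2.0 * theta

theta = 5.0
eta = 0.01    # base learning rate (assumed value)
alpha = 0.9   # decay factor: how much weight the past gradients keep (assumed value)
sigma = 0.0   # running root-mean-square of the gradients
eps = 1e-8    # avoids division by zero

for step in range(100):
    g = grad(theta)
    # Exponential moving average of squared gradients: the recent gradient
    # gets weight (1 - alpha), while older gradients decay by alpha.
    sigma = math.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    theta = theta - eta / (sigma + eps) * g
```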

3.2 Learning Rate Scheduling

  • Learning Rate Decay: as training goes on and we get closer to the destination, the learning rate is reduced
  • Warm Up: the learning rate first increases and then decreases (a toy schedule is sketched after this list)
  • Summary of Optimization: the final update combines momentum, the root-mean-square term, and the scheduled learning rate
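
A toy warm-up-then-decay schedule (the warm-up length, peak learning rate, and decay form are illustrative assumptions, not from the notes):

```python
def lr_schedule(step, peak_lr=1e-3, warmup_steps=1000):
    """Linear warm-up to peak_lr, then inverse-square-root decay (assumed form)."""
    if step < warmup_steps:
        # Warm up: the learning rate grows from (almost) 0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: the learning rate shrinks as training approaches the destination.
    return peak_lr * (warmup_steps / (step + 1)) ** 0.5

# Learning rate at a few representative steps.
for s in (0, 500, 1000, 5000, 20000):
    print(s, lr_schedule(s))
```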

4 Classification (Short Version)

4.1 Classification as Regression?

  • Regression: the model outputs a scalar that should be as close as possible to the label
  • Classification as regression: directly treating class labels as numbers (class 1, 2, 3, ...) wrongly assumes that some classes are numerically closer to each other than others

4.2 Class as a One-Hot Vector

Representing each class as a one-hot vector avoids imposing an artificial ordering on the classes.

Regression: the network output y is compared with the label ŷ directly.
Classification: the network output y is passed through soft-max to get y', which is compared with the one-hot label ŷ.

Soft-max: normalizes the outputs into values between 0 and 1 that sum to 1.

Loss of classification:
  • Mean Square Error (MSE)
  • Cross-entropy (the usual choice for classification)
Minimizing cross-entropy is equivalent to maximizing likelihood (the formulas are sketched below).
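
The standard soft-max and cross-entropy formulas referred to above (standard definitions, with y the network outputs and ŷ the one-hot label):

$$
y_i' = \frac{\exp(y_i)}{\sum_j \exp(y_j)}, \qquad e = -\sum_i \hat{y}_i \ln y_i'
$$

Compared with MSE, cross-entropy gives a larger gradient when the prediction is far from the target, which makes the loss easier to optimize.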

5 Quick Introduction to Batch Normalization

5.1 Changing Landscape

  • Feature Normalization
    For each feature dimension, subtract the mean and divide by the standard deviation. In general, feature normalization makes gradient descent converge faster (a minimal sketch follows after this item).
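
A minimal sketch of per-dimension feature normalization (the toy data and tensor shapes are illustrative assumptions):

```python
import torch

# Toy design matrix: 1000 examples, 16 feature dimensions (illustrative only).
x = torch.randn(1000, 16) * 10 + 3

# Mean and standard deviation computed per feature dimension, over the examples.
mean = x.mean(dim=0, keepdim=True)
std = x.std(dim=0, keepdim=True)

# After normalization each dimension has roughly zero mean and unit variance,
# so no single input dimension dominates the error surface.
x_norm = (x - mean) / (std + 1e-8)
```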

5.2 Considering Deep Learning

Batch normalization: inside the network, the mean and standard deviation are computed over the examples in the current batch, so normalization is applied batch by batch during training.
We do not always have a batch at the testing stage, so the moving averages of the mean and standard deviation accumulated during training are used instead (a minimal sketch follows below).
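
A minimal PyTorch-style sketch of the train/test behavior (the layer sizes and data are illustrative assumptions; BatchNorm1d keeps moving averages of the batch statistics by default):

```python
import torch
import torch.nn as nn

# A tiny network with batch normalization after the linear layer (illustrative sizes).
model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU())

# Training mode: normalization uses the statistics of the current batch,
# and the running (moving-average) mean and variance are updated.
model.train()
out_train = model(torch.randn(64, 16))

# Evaluation mode: no batch is required; the stored moving averages are used.
model.eval()
out_test = model(torch.randn(1, 16))
```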


Reposted from blog.csdn.net/minovophy/article/details/118979852