Huashu Reading Notes (7): Optimization for Training Deep Models

Index of all notes: "Deep Learning" (the Flower Book) reading notes summary

"Deep Learning" PDF free download: "Deep Learning"

1. What is the difference between learning and pure optimization

In most machine learning problems, we care about some performance measure $P$ that is defined with respect to the test set and may be intractable. We therefore optimize $P$ only indirectly: we hope that reducing a cost function $J(\theta)$ will improve $P$. This differs from pure optimization, where minimizing $J$ is itself the goal.

  • Empirical risk minimization: minimize the expected loss under the empirical distribution,
    $E_{(x,y)\sim\hat{p}_{data}}[L(f(x;\theta),y)] = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})$.
    Minimizing empirical risk easily leads to overfitting: high-capacity models can simply memorize the training set. In many cases, minimizing empirical risk is not really feasible (see the sketch after this list).
  • Surrogate loss function and early stopping:
    training algorithms usually do not halt at a local minimum. Instead, machine learning typically optimizes a surrogate loss function but halts when a convergence criterion based on early stopping is satisfied.
  • Batch and minibatch algorithms
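Below is a minimal NumPy sketch of empirical risk as the mean per-example loss, matching the formula above. The model `f`, loss `L`, and data `X`, `y` are toy placeholders for illustration, not anything taken from the book.

```python
import numpy as np

def empirical_risk(f, L, theta, X, y):
    """Mean loss over m training examples: (1/m) * sum_i L(f(x_i; theta), y_i)."""
    m = X.shape[0]
    return np.mean([L(f(X[i], theta), y[i]) for i in range(m)])

# Toy example: squared loss and a linear model (placeholders).
f = lambda x, theta: x @ theta
L = lambda yhat, y: 0.5 * (yhat - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta = np.zeros(3)

print(empirical_risk(f, L, theta, X, y))
```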

2. Challenges in neural network optimization

Optimization in general is an extremely difficult task. Traditional machine learning avoids this difficulty by carefully designing the objective function and constraints so that the optimization problem is convex. When training neural networks, we inevitably face the general non-convex case, and even convex optimization is not free of problems.

  • Ill-conditioning (Newton's method requires significant modification before it can be applied to neural networks)
  • Local minima
  • Plateaus, saddle points and other flat regions (at a saddle point, the Hessian matrix has both positive and negative eigenvalues)
  • Cliffs and exploding gradients
  • Long-term dependencies
  • Inexact gradients
  • Poor correspondence between local and global structure
  • Theoretical limits of optimization (any optimization algorithm has performance limits)

3. Basic algorithms

  • Stochastic gradient descent (SGD)
    Sample a minibatch of m independent and identically distributed examples from the data-generating distribution; the mean of their gradients is an unbiased estimate of the gradient.
  • Momentum
    The momentum method is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.
  • Nesterov momentum
    Improves the convergence rate of the excess error from $O(1/k)$ to $O(1/k^2)$ after $k$ steps (in the convex batch case), but in the stochastic gradient setting Nesterov momentum does not improve the convergence rate. A sketch of these update rules follows this list.
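Below is a minimal NumPy sketch of one SGD step with (optionally Nesterov) momentum. The names `grad_fn`, `eps` (learning rate), `alpha` (momentum coefficient) and the toy quadratic objective are illustrative assumptions, not the book's pseudocode.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad_fn, eps=0.01, alpha=0.9, nesterov=False):
    """One parameter update with (Nesterov) momentum.

    v     : velocity (exponentially decaying accumulation of past gradients)
    eps   : learning rate, alpha : momentum coefficient
    """
    if nesterov:
        g = grad_fn(theta + alpha * v)   # gradient at the look-ahead point
    else:
        g = grad_fn(theta)               # plain momentum: gradient at theta
    v = alpha * v - eps * g
    theta = theta + v
    return theta, v

# Toy quadratic objective f(theta) = 0.5 * ||theta||^2, so grad = theta.
grad_fn = lambda th: th
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = sgd_momentum_step(theta, v, grad_fn, nesterov=True)
print(theta)  # should approach the minimum at the origin
```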

4. Parameter initialization strategies

Training a deep model is a sufficiently difficult problem that most algorithms are strongly affected by the choice of initialization.

Modern initialization strategies are simple and heuristic. Designing improved initialization strategies is difficult because neural network optimization is still not well understood.

Usually, we set the bias of each unit to a heuristically chosen constant and initialize only the weights randomly. Extra parameters (for example, those encoding the conditional variance of a prediction) are, like the biases, usually set to heuristically chosen constants.
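As one concrete example of such a heuristic, here is a minimal sketch of normalized ("Glorot/Xavier") uniform weight initialization with constant zero biases; the layer sizes below are illustrative assumptions.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Normalized (Glorot) uniform initialization for a fully connected layer.

    Weights are drawn uniformly from [-r, r] with r = sqrt(6 / (n_in + n_out));
    biases are set to a heuristically chosen constant (here 0).
    """
    r = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-r, r, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(784, 256, rng)   # e.g. an MNIST-sized hidden layer
print(W.shape, W.std())
```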

5. Adaptive learning rate algorithms

The learning rate is one of the most difficult hyperparameters to set because it significantly affects model performance. The momentum algorithm can mitigate these problems to some extent, but at the cost of introducing another hyperparameter.

  • AdaGrad
    Independently adapts the learning rate of every model parameter, scaling each one inversely proportional to the square root of the sum of all of its historical squared gradient values. Parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives see a comparatively small decrease.
  • RMSProp
    Modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
  • Adam
    Incorporates momentum directly as an estimate of the (exponentially weighted) first moment of the gradient. The most straightforward way to add momentum to RMSProp would be to apply momentum to the rescaled gradients. A sketch of the Adam update follows this list.
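Below is a minimal sketch of the Adam update rule (exponentially weighted first and second moments with bias correction). The hyperparameter values are common defaults, and the toy quadratic objective is an assumption for illustration.

```python
import numpy as np

def adam_step(theta, g, m, v, t, eps=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update.

    m, v : exponentially decaying averages of the gradient and its square
    t    : time step (1-based), used for bias correction
    """
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (as in RMSProp)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eps * m_hat / (np.sqrt(v_hat) + delta)
    return theta, m, v

# Toy quadratic: the gradient of 0.5*||theta||^2 is theta itself.
theta = np.array([5.0, -3.0])
m = v = np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = adam_step(theta, theta, m, v, t, eps=0.1)
print(theta)  # should be close to the minimum at the origin
```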

6. Second-order approximation methods

  • Newton's method
    The cost of each update is $O(k^3)$ in the number of parameters $k$ (see the sketch after this list).
  • Conjugate gradient
  • BFGS
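Below is a minimal sketch of a single Newton update $\theta \leftarrow \theta - H^{-1}g$ on a toy quadratic; solving the $k \times k$ linear system is what gives the $O(k^3)$ cost mentioned above. The objective and its derivatives are illustrative assumptions.

```python
import numpy as np

def newton_step(theta, grad, hess):
    """Newton update: theta <- theta - H^{-1} g.

    Solving the k x k linear system costs O(k^3), which is what makes
    Newton's method impractical for large neural networks.
    """
    return theta - np.linalg.solve(hess(theta), grad(theta))

# Toy quadratic f(theta) = 0.5 * theta^T A theta with a fixed positive definite A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda th: A @ th
hess = lambda th: A
theta = np.array([5.0, -3.0])
theta = newton_step(theta, grad, hess)   # on a quadratic, one step reaches the minimum
print(theta)
```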

7. Optimization strategies and meta-algorithms

  • Batch normalization: not actually an optimization algorithm but a method of adaptive reparameterization, motivated by the difficulty of training very deep models.
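A minimal sketch of the training-time batch normalization transform (normalize each unit over the minibatch, then scale by gamma and shift by beta); the running statistics used at inference time are omitted, and the toy minibatch is an assumption.

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Batch-normalize a minibatch of activations H (shape: batch x units).

    Each unit is normalized by its minibatch mean and standard deviation,
    then reparameterized with a learnable scale gamma and shift beta.
    """
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mu) / np.sqrt(var + eps)
    return gamma * H_hat + beta

rng = np.random.default_rng(0)
H = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # a toy minibatch
out = batch_norm(H, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per unit
```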

  • Coordinate descent: optimize one coordinate (or one block of coordinates) at a time.

  • Polyak averaging: average several points along the trajectory that the optimization algorithm visits in parameter space.
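A minimal sketch of Polyak averaging: maintain a running average of the iterates, or an exponentially decaying average, which is the variant typically preferred in non-convex settings. The toy trajectory below is purely illustrative.

```python
import numpy as np

def polyak_average(theta_avg, theta, t):
    """Running arithmetic mean of the iterates theta^(1), ..., theta^(t)."""
    return theta_avg + (theta - theta_avg) / t

def polyak_average_ema(theta_avg, theta, alpha=0.999):
    """Exponentially decaying average, usually preferred in non-convex problems."""
    return alpha * theta_avg + (1 - alpha) * theta

# Illustrative use: average the iterates produced by some optimizer.
theta_avg = np.zeros(2)
for t, theta in enumerate(np.linspace([1.0, -1.0], [0.0, 0.0], 10), start=1):
    theta_avg = polyak_average(theta_avg, theta, t)
print(theta_avg)  # mean of the visited points
```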

  • Supervised pretraining: before training the target model to solve the target problem, train a simpler model to solve a simplified problem.

  • Designing models to aid optimization

  • Continuation methods: used to overcome the problem of local minima.

  • Curriculum learning: exemplified by training recurrent neural networks to capture long-term dependencies.

Next chapter: Huashu Reading Notes (8): Convolutional Networks

Source: blog.csdn.net/qq_41485273/article/details/112912765