Deep learning (machine learning) optimization algorithms

First, the loss function: evaluation is a core part of building a machine learning model, and the loss function defines the metric by which the model is evaluated!!

Common loss functions include (a Keras usage sketch follows the list):

  • mean_squared_error
  • mean_absolute_error
  • mean_absolute_percentage_error
  • mean_squared_logarithmic_error
  • squared_hinge
  • hinge
  • categorical_hinge
  • logcosh
  • categorical_crossentropy
  • sparse_categorical_crossentropy
  • binary_crossentropy (binary cross-entropy, also known as log loss)
  • kullback_leibler_divergence
  • poisson
  • cosine_proximity
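
As a quick illustration, here is a minimal sketch (assuming TensorFlow 2.x / Keras; the tiny model and the random data are placeholders) of how one of these loss names is selected when compiling a model:

```python
# Minimal sketch: selecting a loss function in Keras (assumes TensorFlow 2.x).
# The model architecture and the data below are placeholders for illustration.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Any of the loss names listed above can be passed as the `loss` argument.
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data, just to show that the pieces fit together.
x = np.random.rand(100, 10).astype("float32")
y = (np.random.rand(100, 1) > 0.5).astype("float32")
model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```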

Second, classical optimization algorithms in machine learning

(1) Direct methods: produce the optimal solution of the problem directly. This requires that:

  • the optimization problem is convex;
  • a closed-form solution exists;

(2) Indirect methods: iteratively refine an estimate of the optimal solution (both approaches are sketched in code below)

  • First-order methods: optimize a first-order Taylor expansion of the objective function (gradient descent);
  • Second-order methods: optimize a second-order Taylor expansion of the objective function (Newton's method);
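
As a small illustration of the difference (the quadratic objective, the matrix A and the vector b below are made up purely for this sketch): gradient descent uses only first-order information, Newton's method also uses the Hessian, and the closed-form solution plays the role of a direct method.

```python
# Sketch: first-order (gradient descent) vs. second-order (Newton) updates on a
# toy convex quadratic f(w) = 0.5 * w^T A w - b^T w. Values are illustrative only.
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite, so f is convex
b = np.array([1.0, -2.0])

def grad(w):    # first-order information: gradient of f, i.e. A w - b
    return A @ w - b

def hess(w):    # second-order information: Hessian of f (constant here)
    return A

w_gd, w_newton, lr = np.zeros(2), np.zeros(2), 0.1

for _ in range(100):                      # gradient descent: many cheap steps
    w_gd = w_gd - lr * grad(w_gd)

for _ in range(5):                        # Newton: few steps, each solves a linear system
    w_newton = w_newton - np.linalg.solve(hess(w_newton), grad(w_newton))

print("gradient descent    :", w_gd)
print("Newton's method     :", w_newton)
print("direct (closed form):", np.linalg.solve(A, b))
```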

Third, gradient checking: when you write code that computes the gradient of the objective function, you need to verify that the gradient code is correct!!
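
A common way to do this is a numerical gradient check with central differences. A minimal sketch, using a toy sum-of-squares loss as the objective (the loss function and the tolerance are only examples):

```python
# Sketch of a numerical gradient check: compare an analytic gradient against
# central finite differences. The loss here (sum of squares) is just an example.
import numpy as np

def loss(w):
    return np.sum(w ** 2)

def analytic_grad(w):
    return 2 * w            # the gradient code we want to verify

def numerical_grad(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)   # central difference
    return g

w = np.random.randn(5)
ga, gn = analytic_grad(w), numerical_grad(loss, w)
rel_err = np.linalg.norm(ga - gn) / (np.linalg.norm(ga) + np.linalg.norm(gn))
print("relative error:", rel_err)   # should be tiny, e.g. below 1e-7
```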

Fourth, the stochastic gradient descent algorithm

(1) Overview

  • Classical gradient descent: every parameter update requires traversing all of the training data, which is computationally expensive and time-consuming;
  • Stochastic gradient descent: the model parameters can be updated using a single training example;

(2) Common gradient descent variants (compared in the sketch after this list)

  • Batch gradient descent, BGD (Batch Gradient Descent):
    computes the gradient direction over the entire dataset, using all samples.
    Advantages: can reach the global optimum for convex objectives; easy to parallelize.
    Disadvantages: when there are many samples, each update is computationally expensive and training is slow.
  • Mini-batch gradient descent, MBGD (mini-batch Gradient Descent):
    splits the data into several batches and updates the parameters batch by batch, so the samples within a batch jointly determine the gradient direction; the update is less likely to drift, which reduces randomness.
    Advantages: reduces the computational overhead per update and reduces randomness.
  • Stochastic gradient descent, SGD (stochastic gradient descent):
    computes the loss and the gradient needed for the update from one sample at a time.
    Advantages: fast per-update computation.
    Disadvantages: noisy updates, so convergence behaviour is poor.
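
The three variants differ only in how many samples feed each parameter update. A minimal sketch on a toy linear-regression problem (the data, sizes, learning rate and epoch count are illustrative, not prescriptive):

```python
# Sketch: BGD, MBGD and SGD differ only in the batch size used per update.
# Toy linear-regression objective; names and values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

def gradient(w, Xb, yb):
    # gradient of the mean squared error on the (mini-)batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.05, epochs=50):
    w = np.zeros(3)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)            # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * gradient(w, X[batch], y[batch])
    return w

print("BGD  (batch = all):", train(batch_size=200))
print("MBGD (batch = 32) :", train(batch_size=32))
print("SGD  (batch = 1)  :", train(batch_size=1))
```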

Fifth, accelerating the stochastic gradient descent algorithm

(1) In some settings the reason for poor training results is not the model itself, but the failure of stochastic gradient descent on the optimization problem!!

(2) Main causes

  • The difficulty of optimization is often attributed to falling into local optima, but the main problems for stochastic gradient descent are ravines (valleys) and saddle points;
  • Ravine: the iterates bounce back and forth between the walls of the ravine instead of descending rapidly along it, leading to unstable and slow convergence;
  • Saddle point: at a saddle point the stochastic gradient lands on flat ground (the gradient is negligible), loses its sense of direction, and stops making progress;

(3) Solutions (the corresponding update rules are sketched after this list)

  • Momentum: maintains inertia
  • AdaGrad: environment awareness (per-parameter adaptive learning rates)
  • Adam: combines inertia with environment awareness
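
A rough sketch of the three update rules as plain NumPy steps (the hyperparameter values are common defaults, used here purely for illustration):

```python
# Sketch of the three update rules named above, written as plain NumPy steps.
import numpy as np

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v - lr * g            # accumulate "inertia" from past gradients
    return w + v, v

def adagrad_step(w, g, r, lr=0.01, eps=1e-8):
    r = r + g ** 2                   # per-parameter history of squared gradients
    return w - lr * g / (np.sqrt(r) + eps), r

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # inertia (first moment)
    v = b2 * v + (1 - b2) * g ** 2   # environment awareness (second moment)
    m_hat = m / (1 - b1 ** t)        # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Each function takes the current parameters w, a stochastic gradient g and the optimizer state, and returns the updated values; t in the Adam step is the update count starting from 1.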

Sixth, L1 regularization and sparsity

Sparsity means that many of the model's parameters are zero, which amounts to built-in feature selection: only the more important features are kept, improving the model's generalization ability and reducing the risk of over-fitting!

  • The L1 solution space (constraint region) is a polygon, while the L2 solution space is a circle; the corners of the polygon make solutions with zero components more likely;
  • L1 regularization introduces a Laplace prior on the model parameters w, while L2 regularization introduces a Gaussian prior; the Laplace prior puts more probability on a parameter being exactly 0 (demonstrated in the sketch below);
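
A quick way to see the sparsity effect (this sketch assumes scikit-learn is available; the toy data and the regularization strength alpha are arbitrary) is to fit Lasso (L1) and Ridge (L2) on the same data and count coefficients that are exactly zero:

```python
# Sketch (assumes scikit-learn): L1 (Lasso) vs. L2 (Ridge) on the same toy data,
# to show that L1 drives many coefficients exactly to zero while L2 does not.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -3.0, 1.5]          # only 3 of 20 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)     # L1 regularization
ridge = Ridge(alpha=0.1).fit(X, y)     # L2 regularization

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0))   # many exact zeros (sparse)
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none
```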