First, loss functions: a machine learning model must be evaluated, and the loss function defines the metric by which the model is assessed!
Common loss functions include:
mean_squared_error
mean_absolute_error
mean_absolute_percentage_error
mean_squared_logarithmic_error
squared_hinge
hinge
categorical_hinge
logcosh
categorical_crossentropy
sparse_categorical_crossentropy
binary_crossentropy (cross-entropy for binary classification)
kullback_leibler_divergence
poisson
cosine_proximity
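The names above follow the Keras loss identifiers. As a sketch, here are minimal NumPy versions of two of them (these are simplified stand-ins, not the library's actual implementations):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of squared differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Cross-entropy for binary targets; eps guards against log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7])
print(mean_squared_error(y_true, y_pred))   # small for good predictions
print(binary_crossentropy(y_true, y_pred))
```

Both losses go to zero as predictions approach the targets, which is what makes them usable as evaluation metrics during training.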
Second, classical optimization algorithms in machine learning
(1) Direct methods: solve for the optimal solution in one shot; they require that
the problem is a convex optimization problem;
a closed-form solution exists;
(2) Indirect methods: iteratively refine an estimate of the optimal solution
First-order methods: optimize a first-order Taylor expansion of the objective (gradient descent)
Second-order methods: optimize a second-order Taylor expansion of the objective (Newton's method)
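The first-order vs. second-order distinction can be seen on a toy one-dimensional quadratic (a sketch with a made-up objective, learning rate, and iteration count):

```python
# Minimize f(x) = (x - 3)**2 + 1; gradient 2(x - 3), second derivative 2.
def grad(x):
    return 2.0 * (x - 3.0)

def hess(x):
    return 2.0

# First-order method: gradient descent steps against the gradient.
x_gd = 0.0
for _ in range(100):
    x_gd -= 0.1 * grad(x_gd)

# Second-order method: Newton's method also uses the curvature,
# and lands on the exact minimum of a quadratic in one step.
x_newton = 0.0
x_newton -= grad(x_newton) / hess(x_newton)

print(x_gd, x_newton)  # both close to 3
```

Gradient descent needs many small steps whose size must be tuned; Newton's method rescales the step by the second derivative, at the cost of computing (and inverting) the Hessian in higher dimensions.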
Third, gradient checking: when you write code that computes the gradient of the objective function, you need to verify numerically that the computed gradient is correct!
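A standard way to do this check is to compare the analytic gradient against a central-difference approximation. A minimal sketch, using f(w) = ||w||² (whose analytic gradient is 2w) as the test function:

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    # Central-difference approximation of the gradient of f at x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda w: np.sum(w ** 2)
w = np.array([1.0, -2.0, 3.0])
analytic = 2 * w                 # hand-derived gradient to be verified
numeric = numerical_grad(f, w)
print(np.max(np.abs(analytic - numeric)))  # should be tiny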
Fourth, stochastic gradient descent
(1) Overview
Classical (batch) gradient descent: every parameter update traverses all the training data, which is computationally expensive and slow;
Stochastic gradient descent: the model parameters can be updated using a single training example;
(2) Common gradient descent variants
Batch gradient descent (BGD): computes the gradient direction over all samples in the entire data set. Advantages: converges to the global optimum on convex problems; easy to parallelize. Disadvantages: when there are many samples, each update is computationally expensive and slow.
Mini-batch gradient descent (MBGD): splits the data into small batches and updates the parameters batch by batch, so one batch of examples jointly determines the gradient direction and the updates deviate less. Advantages: reduces the computational cost per update and reduces the randomness.
Stochastic gradient descent (SGD): computes the loss and the gradient on a single example for each parameter update. Advantages: very fast updates. Disadvantages: noisy updates and poor convergence behavior.
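The mini-batch variant above can be sketched for linear regression with an MSE loss (the data, learning rate, and batch size here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)   # synthetic regression data

w = np.zeros(2)
lr, batch_size = 0.1, 20
for epoch in range(50):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]      # indices of one mini-batch
        # Gradient of the batch MSE: 2 X_b^T (X_b w - y_b) / |b|
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

print(w)  # close to [2, -1]
```

Setting `batch_size = len(X)` recovers BGD, and `batch_size = 1` recovers plain SGD, which makes the trade-off between cost per update and gradient noise explicit.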
Fifth, why stochastic gradient descent sometimes fails
(1) In some settings training results are poor; the problem is not the model itself, but the failure of the stochastic gradient descent algorithm during optimization!
(2) Main causes
For most optimization problems the worry is falling into a local optimum, but the main obstacles for stochastic gradient descent are valleys and saddle points;
Valleys: the iterate bounces back and forth between the valley walls instead of descending quickly along the valley floor, leading to unstable and slow convergence;
Saddle points: near a saddle point the stochastic gradient lands on a plateau (the gradient is nearly zero), so progress stalls or heads in the wrong direction;
(3) Solutions
Momentum: maintain inertia
AdaGrad: adapt to the environment (per-parameter learning rates)
Adam: maintain inertia + adapt to the environment
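The standard update rules for these three fixes can be sketched as single-step functions (hyperparameter defaults here are the commonly cited ones, but treat the exact values as assumptions):

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.05, beta=0.9):
    # Momentum: accumulate an exponentially decaying velocity (inertia),
    # which damps the back-and-forth bouncing inside valleys.
    v = beta * v + grad
    return w - lr * v, v

def adagrad_step(w, grad, s, lr=0.01, eps=1e-8):
    # AdaGrad: per-parameter learning rates shrink as squared gradients
    # accumulate, adapting each coordinate's step to its history.
    s = s + grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: first moment (inertia) plus second moment (AdaGrad-style
    # scaling), with bias correction for the early steps t = 1, 2, ...
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w**2 (gradient 2w) for a few momentum steps.
w, v = np.array([1.0]), np.zeros(1)
for _ in range(10):
    w, v = momentum_step(w, 2 * w, v)
print(w)  # closer to the minimum at 0 than the start
```

Momentum helps at valleys (inertia carries the iterate along the valley floor), the accumulated second moment helps at plateaus and saddle points, and Adam combines both mechanisms.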
Sixth, L1 regularization and sparsity
Sparsity means that many of the model's parameters are exactly zero, which amounts to built-in feature selection: only the more important features are kept, improving the model's generalization ability and reducing the risk of overfitting!
The L1 constraint region is a polytope with corners on the coordinate axes, while the L2 constraint region is a ball; the loss contours tend to first touch the L1 region at a corner, where some coordinates are exactly zero;
From the Bayesian view, L1 regularization corresponds to a Laplace prior on the model parameters w, while L2 regularization corresponds to a Gaussian prior; the Laplace prior concentrates more mass at exactly 0, making zero-valued parameters more likely;
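The sparsity difference shows up directly in the two penalties' shrinkage operators. A minimal sketch (the weight vector and penalty strength are illustrative):

```python
import numpy as np

def l1_prox(w, lam):
    # Soft-thresholding: the proximal operator of the L1 penalty.
    # Entries with |w_i| <= lam are set exactly to zero -> sparsity.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    # The L2 penalty shrinks every weight proportionally,
    # but never drives a weight exactly to zero.
    return w / (1.0 + lam)

w = np.array([0.05, -0.3, 1.2, -0.02])
print(l1_prox(w, 0.1))    # small entries become exactly 0
print(l2_shrink(w, 0.1))  # everything shrinks, nothing is exactly 0
```

This is why L1-regularized models (e.g. the lasso) act as feature selectors while L2-regularized models (ridge) merely shrink all weights.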