Loss functions and optimization methods commonly used in machine learning

What are the common loss functions? (Strictly speaking, these are objective functions, but they are generally called loss functions.)

See in particular:

https://blog.csdn.net/iqqiqqiqqiqq/article/details/77413541

1) 0-1 loss function

The number of misclassified samples: the loss is 1 when the prediction differs from the true label and 0 otherwise.

2) Absolute-value loss function

Commonly used in regression.

3) Squared loss function

That is, the sum of squared differences between the actual and predicted values. Typically used with linear models, such as the linear regression model. The reason for the squared form, rather than cubes or absolute values, is that under the assumption of Gaussian noise, maximum likelihood estimation and minimizing the squared loss are equivalent.

4) Log loss function

5) Exponential loss function
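
As a rough sketch, these losses might be written in Python as follows (the function names are illustrative, not taken from any library; the label conventions assumed for the log and exponential losses are noted in the comments):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: the number of misclassified samples."""
    return np.sum(y_true != y_pred)

def absolute_loss(y_true, y_pred):
    """Absolute-value loss: sum of absolute errors, common in regression."""
    return np.sum(np.abs(y_true - y_pred))

def squared_loss(y_true, y_pred):
    """Squared loss: sum of squared errors, equivalent to maximum
    likelihood estimation under Gaussian noise."""
    return np.sum((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred, eps=1e-12):
    """Log loss, assuming binary labels y in {0, 1} and predicted
    probabilities p for the positive class."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def exponential_loss(y_true, f_pred):
    """Exponential loss, assuming labels y in {-1, +1} and real-valued
    scores f(x); this is the loss minimized by AdaBoost."""
    return np.sum(np.exp(-y_true * f_pred))
```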

What are the commonly used optimization methods?

Optimizing the loss function:

When we improve a classification loss, we do so by gradient descent, with each optimization step moving a step size along the gradient. This requires taking the partial derivative of the loss with respect to each weight matrix and applying the chain rule.

Least squares (mainly as the optimization algorithm for linear regression), the gradient descent method, Newton's method, quasi-Newton methods, and the conjugate gradient method.

Gradient descent in detail

In machine learning, when solving for model parameters, i.e., unconstrained optimization, gradient descent (Gradient Descent) is one of the most commonly used methods. Gradient descent may not find the global optimum; it may find a local optimum instead. Of course, if the loss function is convex, the solution obtained by gradient descent is necessarily the global optimum.

1) The gradient

In calculus, for a multivariate function, we take the partial derivative ∂ with respect to each of its parameters; the vector formed by these partial derivatives is the gradient.

So what does the gradient vector mean once we have it? Geometrically, it points in the direction in which the function increases fastest; in other words, along the gradient vector it is easiest to find the function's maximum. Conversely, in the direction opposite to the gradient vector, i.e., the direction -(∂f/∂x0, ∂f/∂y0)^T, the function decreases fastest, so it is easiest to find the function's minimum.
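
As a small illustration, the gradient of a two-variable function can be approximated numerically with central differences (the example function below is an assumption for demonstration, not from the original text):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate (∂f/∂x0, ∂f/∂x1, ...) by central differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# f(x, y) = x^2 + y^2: the gradient at (3, 4) is (6, 8), pointing away
# from the minimum; the negative gradient points toward it.
f = lambda v: v[0] ** 2 + v[1] ** 2
print(numerical_gradient(f, np.array([3.0, 4.0])))  # ≈ [6. 8.]
```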

2) Gradient descent and gradient ascent

In machine learning algorithms, when minimizing the loss function, gradient descent can be used to solve for it iteratively, step by step, obtaining the minimized loss function and the corresponding model parameter values. Conversely, if we need to find the maximum of the loss function, we need to iterate with the gradient ascent method.

  Gradient descent and gradient ascent can be converted into one another. For example, if we need to find the minimum of a loss function f(θ), we can solve it iteratively with gradient descent; but in fact we can instead maximize the loss function -f(θ), and then the gradient ascent method comes in handy.
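
A minimal sketch of this equivalence (the quadratic below, the learning rate, and the iteration count are illustrative assumptions): minimizing f(θ) by gradient descent gives the same answer as maximizing -f(θ) by gradient ascent.

```python
def gradient_descent(grad_f, theta, lr=0.1, n_iters=100):
    """Minimize f by stepping against its gradient."""
    for _ in range(n_iters):
        theta = theta - lr * grad_f(theta)
    return theta

def gradient_ascent(grad_g, theta, lr=0.1, n_iters=100):
    """Maximize g by stepping along its gradient."""
    for _ in range(n_iters):
        theta = theta + lr * grad_g(theta)
    return theta

# f(θ) = (θ - 3)^2 has its minimum at θ = 3.
grad_f = lambda t: 2 * (t - 3)
grad_neg_f = lambda t: -2 * (t - 3)  # gradient of -f(θ)

print(gradient_descent(grad_f, theta=0.0))     # ≈ 3.0
print(gradient_ascent(grad_neg_f, theta=0.0))  # ≈ 3.0
```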

3) Tuning the gradient descent algorithm

When using gradient descent, several things need to be tuned.

First, the choice of the algorithm's step size. In the algorithm described earlier, I mentioned taking a step size of 1, but in fact its value depends on the data sample. One can try several values from largest to smallest, run the algorithm with each, and look at the effect on the iterations: if the loss function value decreases, the value is valid; otherwise, decrease the step size. As said earlier, if the step size is too large, iteration moves too fast and may even miss the optimal solution; if the step size is too small, iteration is too slow and the algorithm may not finish for a very long time. So the step size needs several runs of the algorithm to arrive at a fairly good value.

Second, the choice of the parameters' initial values. Different initial values may yield different minima, so gradient descent only guarantees a local minimum; of course, if the loss function is convex, the result is necessarily the global optimum. Because of the risk of local optima, the algorithm needs to be run several times with different initial values, observing the minimum of the loss function each time, and the initial value that minimizes the loss function should be selected.

Third, normalization. Because the value ranges of different sample features differ, iteration can be very slow; to reduce the influence of feature scale, the feature data can be normalized: for each feature x, find its mean x̄ and standard deviation std(x), and then transform it to (x - x̄) / std(x).

The new mean of this feature is 0 and its new variance is 1, which can greatly speed up the iterations.
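
A short sketch of this normalization (assuming NumPy, with features stored as columns):

```python
import numpy as np

def standardize(X):
    """Transform each feature column x to (x - x̄) / std(x)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (X - mean) / std

# Two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_norm = standardize(X)
print(X_norm.mean(axis=0))  # ≈ [0. 0.]  (new mean 0)
print(X_norm.std(axis=0))   # ≈ [1. 1.]  (new variance 1)
```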

4) Types of gradient descent

First, the batch gradient descent method. Each parameter update needs to use the entire training data set; it can obtain the global optimum (when the loss is convex), but it can be slow when the amount of training data is large.

Second, the stochastic gradient descent method. It selects only one sample per iteration, which of course greatly improves training speed, but accuracy declines and the solution obtained may not be optimal; it easily falls into local optima, and it may not converge but instead fluctuate around a minimum.

Third, the mini-batch (partial) gradient descent method, which is a combination of the two above; a sketch of all three follows below.
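
To make the three variants concrete, here is a hedged sketch for linear regression with squared loss; the learning rate, epoch count, and batch sizes are illustrative assumptions:

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of the mean squared loss for a linear model f(x) = X·w."""
    return 2 * X.T @ (X @ w - y) / len(y)

def gd(X, y, lr=0.1, epochs=500, batch_size=None, seed=0):
    """batch_size=None: batch GD; 1: stochastic GD; k > 1: mini-batch GD."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if batch_size is None:
            w -= lr * gradient(w, X, y)        # use the whole training set
        else:
            idx = rng.permutation(n)           # shuffle, then take slices
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]
                w -= lr * gradient(w, X[b], y[b])
    return w

X = np.c_[np.ones(100), np.linspace(0.0, 1.0, 100)]  # bias column + one feature
y = 4 + 3 * X[:, 1]                                   # true weights: [4, 3]
print(gd(X, y))                  # batch
print(gd(X, y, batch_size=1))    # stochastic
print(gd(X, y, batch_size=16))   # mini-batch
```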

On understanding the loss function and model training

The loss function (Loss function) is used to estimate the degree to which your model's predicted value f(x) is inconsistent with the true value Y. It is a non-negative real-valued function, usually written L(Y, f(x)); the smaller the loss function, the better the model's robustness. The loss function is the core of the empirical risk function and also an important part of the structural risk function. The structural risk function consists of an empirical risk term and a regularization term, and can usually be expressed by an equation of the following form:

                    θ* = argmin_θ (1/N) Σ_{i=1}^{N} L(y_i, f(x_i; θ)) + λ J(f)

  Strictly speaking, the loss function (the objective function) is composed of the two parts in the formula above. The first part computes the distance between the algorithm's predicted values and the true labels of the training samples; different distance computations represent different loss functions. The second part, J(f), represents the regularization term: when the trained function is too complex, the trained parameters may overfit, so a regularization term weighted by a factor λ needs to be introduced to control model complexity and prevent overfitting.

 Next, let's introduce how the parameters are learned from the loss function:

        ωj = ωj - λ ∂L(ωj) / ∂ωj

By computing the gradient of the loss function with respect to the parameters ω and adjusting the parameters step by step so that the loss function becomes smaller, one completes the training of the model parameters and achieves convergence.
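
Putting this together, here is a hedged end-to-end sketch that trains a linear model by repeatedly applying the update above, with an L2 penalty playing the role of the regularization term J(f) from the structural risk formula. Note that the text above uses λ for the learning rate; to avoid a clash with the regularization weight, the code calls the learning rate lr (both values are illustrative):

```python
import numpy as np

def train(X, y, lr=0.1, reg_lambda=0.01, epochs=500):
    """Repeatedly apply w_j <- w_j - lr * ∂L/∂w_j, where
    L(w) = (1/N) Σ (X·w - y)^2 + reg_lambda * ||w||^2
    (empirical risk plus an L2 regularization term)."""
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / n + 2 * reg_lambda * w
        w -= lr * grad  # step against the gradient until convergence
    return w

X = np.c_[np.ones(50), np.linspace(-1.0, 1.0, 50)]
y = 1.0 + 2.0 * X[:, 1]
print(train(X, y))  # ≈ [1, 2], pulled slightly toward 0 by the L2 term
```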
