What is gradient descent

Gradient descent (Gradient Descent, GD) is simply a method for minimizing an objective function: it uses gradient information to iteratively adjust the parameters until a suitable minimum is found. This article introduces its principle and implementation.

What is the gradient?

To introduce the gradient, we can build it up through four concepts: derivative -> partial derivative -> directional derivative -> gradient.

Derivative: for a function whose domain and range are the real numbers, the derivative at a point represents the slope of the tangent line to the function's curve at that point.
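For reference, this is the usual limit definition of the derivative:

\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]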

Partial derivative: the partial derivative of a multivariate function is its derivative with respect to one variable while all the other variables are held constant. On a surface, every point has infinitely many tangent lines, so describing "the" derivative of the function there is difficult. The partial derivative picks out one of these tangent lines and determines its slope. Its geometric meaning is the slope of the tangent line along a fixed coordinate direction on the surface.

In effect we reduce the dimension of the function: for a function of two variables, fix y so that only x varies, and then study the change as a one-variable function of x.

\[ \frac{\partial f}{\partial x} \] refers to the rate of change of the function value along the x-axis direction while y is held constant;

\[ \frac{\partial f}{\partial y} \] refers to the rate of change of the function value along the y-axis direction while x is held constant.
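As a small worked example (mine, not from the original text): for f(x, y) = x^2 + 3xy, holding y fixed and differentiating in x, and then holding x fixed and differentiating in y, gives

\[ \frac{\partial f}{\partial x} = 2x + 3y, \qquad \frac{\partial f}{\partial y} = 3x \]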

However, partial derivatives have a limitation: they only describe the rate of change of a multivariate function along the coordinate axes. Often we need the rate of change in an arbitrary direction, and this leads to the directional derivative.

Directional derivative: the derivative along a specified direction. At a point A the function has infinitely many tangent lines; each tangent line corresponds to a direction, and the slope along each such direction is a directional derivative.

Gradient: the gradient is a vector whose direction is the direction in which the directional derivative is largest. In other words, at a given point the function changes fastest along the direction of the gradient, and the magnitude of the gradient is that maximum rate of change.
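For a function of two variables, the gradient simply collects the partial derivatives into a vector; continuing the small example above, for f(x, y) = x^2 + 3xy the gradient is (2x + 3y, 3x):

\[ \nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \ \frac{\partial f}{\partial y} \right) \]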

This is why, when an optimization problem is solved iteratively in machine learning, we usually follow the gradient: the function increases fastest along the direction of the gradient vector, so moving along it makes it easier to find the function's maximum; conversely, moving in the direction opposite to the gradient makes the function decrease fastest, which makes it easier to find the minimum.

What is gradient descent

A common analogy: you are standing somewhere on a mountain and want to get down as quickly as possible, so you decide to proceed one step at a time. At each position you compute the gradient and take a step in its negative direction, which is the steepest way down from where you currently stand. Walking this way, you are likely to end up not at the foot of the mountain but at a local low point partway down. As shown below:

[Figure: gradient descent illustrated as walking downhill step by step; the walker may stop at a local low point rather than the foot of the mountain]

From the above we can summarize: gradient descent finds a minimum by moving along the direction of the negative gradient. Moving along the direction of the gradient instead finds a maximum, and that variant is called gradient ascent.
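Written as an update rule, with parameters θ, step size α and loss function J (this notation is mine, chosen to match the concepts introduced below), one step of gradient descent is:

\[ \theta := \theta - \alpha \nabla J(\theta) \]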

From the figure we can also see that the result is affected by the starting point and by the shape of the objective function: gradient descent does not always give a global optimal solution; it may only reach a local optimum. So when is the solution it finds guaranteed to be the global optimum? That depends on the loss function: when the loss function is convex, the solution found is the global optimum.

Some important concepts

Following the principle of gradient descent described above, there are a few important related concepts we need to understand:

Step size (learning rate): the distance moved along the negative gradient direction in each step. In the mountain analogy above, the step size is the length of the step you take along the steepest downhill direction from your current position.

Hypothesis function: in supervised learning, the function used to fit the input samples, commonly written h(). For a linear regression model, the hypothesis function is
\[ h(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n \]
Loss function (loss function): commonly written J(). To evaluate how good a model is, a loss function is used to measure how well it fits the data. Minimizing the loss function means obtaining the best fit, and the corresponding model parameters are the optimal parameters. Every machine learning model has a loss function, and learning amounts to minimizing it.
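For the linear regression hypothesis above, a common (though not the only) choice of loss over m training samples is the mean squared error:

\[ J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \]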

The algorithm in detail

The concrete steps of the gradient descent algorithm are as follows (a minimal code sketch is given after the list):

  1. Determine the hypothesis function and the loss function of the model.
  2. Initialize the relevant quantities, including the parameters, the step size, and the termination distance of the algorithm.
  3. Compute the gradient of the loss function at the current position.
  4. Multiply the gradient by the step size to obtain the distance to descend from the current position.
  5. Check whether the descent distance of every parameter is smaller than the termination distance; if so, the algorithm terminates, otherwise continue to the next step.
  6. Update all parameters and return to step 3 with the updated values.
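As a concrete illustration of these steps, here is a minimal NumPy sketch for linear regression with a mean-squared-error loss. The function name and parameter names (alpha for the step size, eps for the termination distance) are mine, not from the original article.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, eps=1e-6, max_iters=10000):
    """Minimal batch gradient descent for linear regression (MSE loss).

    X: (m, n) feature matrix, y: (m,) targets.
    alpha: step size (learning rate), eps: termination distance.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # add a bias column for w_0
    w = np.zeros(n + 1)                    # step 2: initialize parameters

    for _ in range(max_iters):
        error = Xb @ w - y                 # h(x) - y for all samples
        grad = Xb.T @ error / m            # step 3: gradient of the MSE loss
        step = alpha * grad                # step 4: step size times gradient
        if np.all(np.abs(step) < eps):     # step 5: termination check
            break
        w -= step                          # step 6: update all parameters
    return w

# Tiny usage example: fit y = 1 + 2x on synthetic data
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel()
print(batch_gradient_descent(X, y))        # approximately [1.0, 2.0]
```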

Problems we face

Two problems are commonly encountered when optimizing with gradient descent: local minima and saddle points.

Local minimum

This is the problem most commonly encountered with gradient descent: when the function has many local minima, gradient descent is likely to stop as soon as it finds one of them.

How to avoid it?

From the example above we saw that different initial values may lead to different minima. So the simplest way to avoid a poor local minimum is to run the algorithm several times with different initial values and keep the initial value whose result gives the smallest loss.

Saddle Point

A saddle point is a phenomenon often encountered in optimization problems. Its mathematical meaning is: a point at which the gradient of the objective function is zero, but which, starting from that point, is a maximum of the function along one direction and a minimum along another. Typical examples are the point (0, 0) for the function f(x) = x^3 and the point (0, 0, 0) for the surface z = x^2 - y^2; a surface can also contain several saddle points, for example at (0, 0, 0), (1, 1, 0) and (2, 2, 0) for a suitably chosen function.
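A quick check of the z = x^2 - y^2 example:

\[ \nabla z = (2x, \, -2y) = (0, 0) \ \text{at} \ (0, 0); \quad z|_{y=0} = x^2 \ \text{has a minimum there}, \quad z|_{x=0} = -y^2 \ \text{has a maximum there} \]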

In highly non-convex spaces there are large numbers of saddle points, which can cause gradient descent to fail: the iteration appears to have converged even though it has not reached a minimum.

Tuning

Looking at the algorithm's execution steps, the quantities that need tuning include:

  1. Step size: the appropriate step size has to be chosen by experiment and by weighing the trade-off for each scenario. If the step size is too large, each iteration moves faster but may overshoot the optimal solution; if it is too small, iteration is too slow and the algorithm may not finish for a very long time. So the algorithm usually needs to be run several times to find a reasonably good step size (see the small sketch after this list).
  2. Initial values: different initial values may lead to different minima, possibly only local ones; of course, if the loss function is convex, the solution found is necessarily the optimum. The algorithm should be run several times with different initial values, and the initial value that minimizes the loss function should be selected.
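To make the step-size trade-off concrete, here is a tiny toy sketch of my own (not from the article): minimizing f(x) = x^2, whose gradient is 2x. A step size of 1.1 overshoots and diverges, 0.001 barely moves in 50 iterations, and 0.1 converges quickly.

```python
def descend(alpha, x0=5.0, iters=50):
    """Run gradient descent on f(x) = x^2 (gradient 2x) with step size alpha."""
    x = x0
    for _ in range(iters):
        x -= alpha * 2 * x
    return x

for alpha in (1.1, 0.001, 0.1):
    print(alpha, descend(alpha))   # diverges, barely moves, converges
```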

Common variants of gradient descent

Batch gradient descent (Batch Gradient Descent BGD)

The algorithm described above is in fact batch gradient descent (the NumPy sketch earlier is of this kind). It first computes the loss over all the data and then performs one gradient descent step. Concretely: traverse the entire data set to compute the loss function once, compute the gradient with respect to each parameter, and update the parameters. With this method, every parameter update requires computing over all samples in the data set, so the amount of computation is large, the speed is slow, and it does not support online learning.

Stochastic gradient descent (Stochastic Gradient Descent SGD)

Instead of using all the samples to compute the gradient, stochastic gradient descent approximates it with a single sample, which greatly reduces the amount of computation and improves efficiency. Concretely: at each step, pick one sample at random from the training set, compute the corresponding loss and gradient, and update the parameters iteratively.

In this way the computational cost is reduced for large data sets. The gradient from a single sample is, in a probabilistic sense, an unbiased estimate of the gradient over the whole data set, but it carries some uncertainty, so the convergence is slower than that of batch gradient descent.
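A minimal sketch of this idea, reusing the linear-regression setup from the earlier example (the function name and hyperparameter values are illustrative, not prescriptive):

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=50):
    """Stochastic gradient descent: update on one random sample at a time."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])
    w = np.zeros(n + 1)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):           # visit samples in random order
            grad = (Xb[i] @ w - y[i]) * Xb[i]  # gradient from a single sample
            w -= alpha * grad                  # immediate parameter update
    return w
```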

Small batch gradient descent (Mini-batch Gradient Descent)

To overcome the drawbacks of the two methods above, a compromise is used: split the data into a number of batches and update the parameters batch by batch. On one hand, the samples in a batch jointly determine the direction of the gradient step, so it is less likely to go astray and the randomness is reduced; on the other hand, because a batch contains far fewer samples than the whole data set, the amount of computation is not large.

Each update uses several samples to estimate the gradient, which reduces the uncertainty and improves the convergence rate; the number of samples used in each iteration is called the batch size.
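A minimal mini-batch variant under the same assumptions as the earlier sketches; note that with batch_size equal to 1 it reduces to stochastic gradient descent, and with batch_size equal to the data set size it reduces to batch gradient descent:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, batch_size=16, epochs=100):
    """Mini-batch gradient descent: average the gradient over a small batch."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])
    w = np.zeros(n + 1)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            error = Xb[batch] @ w - y[batch]
            grad = Xb[batch].T @ error / len(batch)  # batch-averaged gradient
            w -= alpha * grad
    return w
```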

