Machine learning review (gradient descent)

When solving for the model parameters of a machine learning algorithm, i.e., an unconstrained optimization problem, gradient descent (Gradient Descent) is one of the most commonly used methods; another commonly used method is the least squares method. Here we give a complete summary of gradient descent.

  1. Gradient
        In calculus, taking the partial derivative ∂ of a multivariate function with respect to each of its parameters, and writing the resulting partial derivatives as a vector, gives the gradient. For example, for the function f(x, y), taking the partial derivatives with respect to x and y yields the gradient vector (∂f/∂x, ∂f/∂y)ᵀ, written grad f(x, y) or ∇f(x, y). In particular, the gradient vector at the point (x0, y0) is (∂f/∂x0, ∂f/∂y0)ᵀ, or ∇f(x0, y0). For a function of three parameters, the gradient vector is (∂f/∂x, ∂f/∂y, ∂f/∂z)ᵀ, and so on.

So what is the point of finding the gradient vector? Geometrically, its significance is that it is the direction in which the function increases fastest. Specifically, for the function f(x, y) at the point (x0, y0), the direction of the gradient vector (∂f/∂x0, ∂f/∂y0)ᵀ is the direction in which f(x, y) increases fastest. In other words, moving along the direction of the gradient vector, it is easiest to find the maximum of the function. Conversely, moving in the opposite direction of the gradient vector, i.e., the direction −(∂f/∂x0, ∂f/∂y0)ᵀ, the function decreases fastest, and it is easiest to find the minimum of the function.
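This steepest-ascent property can be checked numerically. Below is a minimal sketch using the made-up function f(x, y) = x² + y² (the function, point, and step length are illustrative choices, not from the post): a small step along the gradient increases f the most, a step against it decreases f, and a perpendicular step barely changes f.

```python
import math

# f(x, y) = x^2 + y^2 has gradient (2x, 2y).
def f(x, y):
    return x ** 2 + y ** 2

def grad_f(x, y):
    return (2 * x, 2 * y)

x0, y0 = 3.0, 4.0
gx, gy = grad_f(x0, y0)
norm = math.hypot(gx, gy)
ux, uy = gx / norm, gy / norm        # unit vector along the gradient
step = 1e-3

up   = f(x0 + step * ux, y0 + step * uy) - f(x0, y0)  # along the gradient
down = f(x0 - step * ux, y0 - step * uy) - f(x0, y0)  # against the gradient
side = f(x0 - step * uy, y0 + step * ux) - f(x0, y0)  # perpendicular direction

print(up > 0, down < 0, abs(side) < abs(down))  # prints: True True True
```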

  2. Gradient descent and gradient ascent
        In machine learning algorithms, when minimizing a loss function, we can use the iterative gradient descent method to solve step by step for the minimized loss function and the corresponding model parameter values. Conversely, if we need to maximize the loss function, we iterate with the gradient ascent method instead.

Gradient descent and gradient ascent can be converted into each other. For example, if we need the minimum of the loss function f(θ), we can solve for it iteratively with gradient descent. But in fact we can instead solve for the maximum of the loss function −f(θ), and then the gradient ascent method comes in handy.

The gradient descent method is summarized in detail below.

  3. The gradient descent algorithm in detail
    3.1 An intuitive explanation of gradient descent
        First, let us look at an intuitive explanation of gradient descent. Suppose we are somewhere on a mountain. Since we do not know how to get down, we decide to take it one step at a time: at each position, we solve for the gradient of the current position and take one step in the negative gradient direction, that is, downhill along the steepest descent from where we stand; then we solve for the gradient of the new position and again take one step along the steepest descent from that position. We continue step by step in this way until we feel we have reached the foot of the mountain. Of course, going on like this, we may not reach the foot of the mountain at all, but only a low point of some local valley.

As this explanation shows, gradient descent may not find the global optimal solution; it may find only a local optimum. Of course, if the loss function is convex, the solution obtained by gradient descent is necessarily the global optimal solution.

3.2 Related concepts of gradient descent
    Before learning the gradient descent algorithm in detail, let us look at some related concepts.

1. Step size (learning rate): the step size determines the length of each step taken along the negative gradient direction during the iterations of gradient descent. Using the downhill example above, the step size is the length of the step we take along the steepest descent from the current position.

2. Feature: the input part of a sample. For example, for two single-feature samples (x(0), y(0)) and (x(1), y(1)), the feature of the first sample is x(0) and the output of the first sample is y(0).

3. Hypothesis function (hypothesis function): in supervised learning, the function used to fit the input samples, written hθ(x). For example, for m single-feature samples (x(i), y(i)) (i = 1, 2, ..., m), the hypothesis function may be hθ(x) = θ0 + θ1x.

4. Loss function (loss function): to evaluate how well a model fits, the goodness of fit is usually measured with a loss function. Minimizing the loss function means the best goodness of fit, and the corresponding parameters are the optimal model parameters. In linear regression, the loss function is usually the squared difference between the sample output and the hypothesis function. For example, for m samples (xi, yi) (i = 1, 2, ..., m), the loss function for linear regression is:

         J(θ0, θ1) = ∑i=1..m (hθ(xi) − yi)²

where xi denotes the feature of the i-th sample, yi the output of the i-th sample, and hθ(xi) the hypothesis function.
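The concepts above can be sketched for one-feature linear regression with a tiny made-up dataset (the sample values below are illustrative assumptions, not from the post):

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated by y = 2x, so theta0 = 0, theta1 = 2 fits exactly

def h(theta0, theta1, x):
    # hypothesis function h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def loss(theta0, theta1):
    # J(theta0, theta1) = sum_i (h_theta(x_i) - y_i)^2
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))

print(loss(0.0, 2.0))   # perfect fit -> 0.0
print(loss(0.0, 1.0))   # slope too small -> 1 + 4 + 9 = 14.0
```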

3.3 Detailed description of the gradient descent algorithm
    The gradient descent algorithm can be represented in two ways: the algebraic method and the matrix method (also called the vector method). If you are not familiar with matrix analysis, the algebraic method is easier to understand; the matrix method, however, is more concise, and with matrices the implementation logic is clearer at a glance. We first introduce the algebraic method, then the matrix method.

3.3.1 Gradient descent described algebraically
    1. Prerequisites: confirm the hypothesis function and the loss function of the optimization model.

For example, for linear regression, suppose the hypothesis function is hθ(x1, x2, ..., xn) = θ0 + θ1x1 + ... + θnxn, where the θi (i = 0, 1, 2, ..., n) are the model parameters and the xi (i = 1, 2, ..., n) are the n feature values of each sample. This representation can be simplified: we add a feature x0 = 1, so that hθ(x0, x1, ..., xn) = ∑i=0..n θixi.

Also for linear regression, corresponding to the hypothesis function above, the loss function is:

       J(θ0, θ1, ..., θn) = (1/2m)∑j=1..m (hθ(x(j)0, x(j)1, ..., x(j)n) − yj)²

2. Algorithm parameter initialization: mainly initialize θ0, θ1, ..., θn, the algorithm termination distance ε, and the step size α. In the absence of any prior knowledge, I like to initialize all the θ values to 0 and the step size to 1, then tune them later during optimization.

3. Algorithm process:

1) Determine the gradient of the loss function at the current position. For θi, its gradient expression is as follows:

∂J(θ0, θ1, ..., θn)/∂θi

2) Multiply the gradient of the loss function by the step size to obtain the descent distance from the current position, i.e., α·∂J(θ0, θ1, ..., θn)/∂θi, corresponding to one step in the earlier downhill example.

3) Check whether the gradient descent distance for every θi is less than ε. If all of them are less than ε, the algorithm terminates, and the current values θi (i = 0, 1, ..., n) are the final result. Otherwise, proceed to step 4.

4) Update all the θ values. For θi, the update expression is as follows. When the update is complete, go back to step 1.

θi = θi − α·∂J(θ0, θ1, ..., θn)/∂θi
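The four steps above can be sketched as a generic loop (a minimal sketch; the test loss, step size, tolerance, and iteration cap below are illustrative assumptions, not from the post):

```python
def gradient_descent(grad, theta, alpha=0.1, eps=1e-8, max_iter=10_000):
    """Generic gradient descent following steps 1)-4) above.

    grad(theta) must return the list of partial derivatives of the loss
    with respect to each theta_i at the current position.
    """
    for _ in range(max_iter):
        g = grad(theta)                          # step 1: gradient at current position
        deltas = [alpha * gi for gi in g]        # step 2: descent distance per parameter
        if all(abs(d) < eps for d in deltas):    # step 3: stop when every distance < eps
            break
        theta = [t - d for t, d in zip(theta, deltas)]  # step 4: update, then loop again
    return theta

# Minimize J(theta) = (theta0 - 3)^2 + (theta1 + 1)^2, whose gradient
# is (2(theta0 - 3), 2(theta1 + 1)); the minimum is at (3, -1).
theta = gradient_descent(lambda t: [2 * (t[0] - 3), 2 * (t[1] + 1)], [0.0, 0.0])
print(theta)  # approaches [3.0, -1.0]
```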
    Below, the gradient descent method for linear regression is described concretely with an example. Suppose our samples are (x(1)1, x(1)2, ..., x(1)n, y1), (x(2)1, x(2)2, ..., x(2)n, y2), ..., (x(m)1, x(m)2, ..., x(m)n, ym), and the loss function is as in the prerequisites above:

J(θ0, θ1, ..., θn) = (1/2m)∑j=1..m (hθ(x(j)0, x(j)1, ..., x(j)n) − yj)²

In step 1 of the algorithm process, the partial derivative with respect to θi is calculated as follows:

∂J(θ0, θ1, ..., θn)/∂θi = (1/m)∑j=1..m (hθ(x(j)0, x(j)1, ..., x(j)n) − yj)·x(j)i
    Since there is no x0 in the samples, we let all the x(j)0 equal 1 in the formula above.

The update expression for θi in step 4 is as follows:

       θi = θi − α·(1/m)∑j=1..m (hθ(x(j)0, x(j)1, ..., x(j)n) − yj)·x(j)i

As this example shows, the gradient direction at the current point is determined by all the samples. The factor 1/m is added for easier understanding. Since the step size is also a constant, and the product of two constants is still a constant, α·(1/m) can be represented here by a single constant.

The gradient descent variants mentioned in Section 4 below differ mainly in how they use the samples. Here we have used all the samples.
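The update rule above can be made runnable for single-feature linear regression (a minimal sketch; the dataset, step size, and iteration count are illustrative assumptions, not from the post). Each sample has the added constant feature x0 = 1 and one real feature x1:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated by y = 1 + 2x
m = len(xs)

theta0, theta1 = 0.0, 0.0
alpha = 0.1

for _ in range(5000):
    # h_theta(x^(j)) - y_j for every sample j, using the current parameters
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # partial derivatives (1/m) * sum_j error_j * x_i^(j), with x0^(j) = 1
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # simultaneous update of both parameters (step 4)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # approaches 1.0 and 2.0
```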

3.3.2 Gradient descent described with matrices
    This section mainly describes gradient descent with matrix expressions. Compared with the algebraic method of 3.3.1, it requires some basic knowledge of matrix analysis, in particular of matrix derivatives.

1. Prerequisites: as in 3.3.1, we need to confirm the hypothesis function and the loss function of the optimization model. For linear regression, the matrix representation of the hypothesis function hθ(x1, x2, ..., xn) = θ0 + θ1x1 + ... + θnxn is:

hθ(X) = Xθ, where the hypothesis function hθ(X) is an m×1 vector, θ is an (n+1)×1 vector containing the n+1 model parameters of the algebraic method, and X is an m×(n+1) matrix. m represents the number of samples, and n+1 represents the number of features of each sample.

         The expression for the loss function is: J(θ) = ½(Xθ − Y)ᵀ(Xθ − Y), where Y is the output vector of the samples, of dimension m×1.

2. Algorithm parameter initialization: the vector θ may be initialized to default values, or to values obtained from tuning. The algorithm termination distance ε and step size α are the same as in 3.3.1.

3. Algorithm process:

1) Determine the gradient of the loss function at the current position. For the vector θ, its gradient expression is as follows:

∂J(θ)/∂θ

2) Multiply the gradient of the loss function by the step size to obtain the descent distance from the current position, i.e., α·∂J(θ)/∂θ, corresponding to one step in the earlier downhill example.

3) Check whether the descent distance for every value inside the vector θ is less than ε. If all are less than ε, the algorithm terminates, and the current vector θ is the final result. Otherwise, proceed to step 4.

4) Update the vector θ; its update expression is as follows. When the update is complete, go back to step 1.

θ = θ − α·∂J(θ)/∂θ

Again, linear regression is used to describe the concrete algorithm process.

The partial derivative of the loss function with respect to the vector θ is calculated as follows:

∂J(θ)/∂θ = Xᵀ(Xθ − Y)
    The update expression for the vector θ in step 4 is: θ = θ − αXᵀ(Xθ − Y)
    Compared with the algebraic method of 3.3.1, the matrix method is clearly much more concise. The matrix derivative here uses the chain rule and two matrix derivative formulas:

Formula 1: ∂(xᵀx)/∂x = 2x, where x is a vector
      Formula 2: ∇X f(AX + B) = Aᵀ∇Y f, where Y = AX + B and f(Y) is a scalar
    If you need to become familiar with matrix derivatives, the book Matrix Analysis and Applications by Zhang Xianda is recommended as a reference.
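The matrix-form update θ = θ − αXᵀ(Xθ − Y) can be sketched with plain Python lists so it stays self-contained (the dataset, step size, and iteration count are illustrative assumptions, not from the post). Note that X already contains the constant column x0 = 1, and that this form carries no 1/m factor:

```python
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]   # m x (n+1) design matrix
Y = [3.0, 5.0, 7.0, 9.0]                               # generated by y = 1 + 2x
theta = [0.0, 0.0]
alpha = 0.025                                          # smaller than before: no 1/m here

def matvec(A, v):
    # matrix-vector product A v
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

for _ in range(5000):
    residual = [h - y for h, y in zip(matvec(X, theta), Y)]   # X theta - Y
    grad = matvec(transpose(X), residual)                     # X^T (X theta - Y)
    theta = [t - alpha * g for t, g in zip(theta, grad)]      # theta = theta - alpha * grad

print(round(theta[0], 3), round(theta[1], 3))  # approaches 1.0 and 2.0
```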

3.4 Tuning the gradient descent algorithm
    When using gradient descent, tuning is needed. Which aspects need tuning?

1. Choosing the algorithm's step size. In the earlier algorithm description I mentioned taking a step size of 1, but in fact the value depends on the data samples. You can take several values, from large to small, run the algorithm with each, and look at the iteration behavior: if the loss function is getting smaller, the value is valid; otherwise, decrease the step size. As mentioned earlier, if the step size is too large, the iterations move too fast and may even miss the optimal solution; if the step size is too small, the iterations are too slow and the algorithm will not finish for a very long time. So the algorithm may need to be run several times to get a good step size.

2. Choosing the initial values of the parameters. Different initial values may lead to different minima being obtained, since gradient descent only guarantees a local minimum; of course, if the loss function is convex, it is necessarily the optimal solution. Because of the risk of local optima, the algorithm needs to be run several times with different initial values, comparing the minimum values of the loss function, and the initial values that minimize the loss function should be selected.

3. Normalization. Because the value ranges of the samples' features differ, the iterations may be very slow. To reduce the influence of feature scale, the feature data can be normalized: for each feature x, find its expectation x̄ and standard deviation std(x), and transform it to:

(x − x̄) / std(x)
    The new expectation of the feature is then 0 and its new variance is 1, which can greatly reduce the number of iterations needed.
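The normalization above can be sketched with the standard library (the feature values below are illustrative assumptions):

```python
import statistics

# Transform each feature value to (x - mean) / std so the feature
# has expectation 0 and variance 1.
feature = [10.0, 20.0, 30.0, 40.0, 50.0]
mean = statistics.mean(feature)
std = statistics.pstdev(feature)        # population standard deviation
normalized = [(x - mean) / std for x in feature]

print(round(statistics.mean(normalized), 10))    # 0.0
print(round(statistics.pstdev(normalized), 10))  # 1.0
```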

  4. The gradient descent family (BGD, SGD, MBGD)
    4.1 Batch gradient descent (Batch Gradient Descent)
        Batch gradient descent is the most commonly used form of gradient descent. Its concrete approach is to use all the samples when updating the parameters. This method corresponds to the linear regression gradient descent algorithm of 3.3.1 above; that is, the gradient descent algorithm of 3.3.1 is the batch gradient descent method.

θi = θi − α∑j=1..m (hθ(x(j)0, x(j)1, ..., x(j)n) − yj)·x(j)i
    Since we have m samples, the data of all m samples are used when computing the gradient.

4.2 Stochastic gradient descent (Stochastic Gradient Descent)
    Stochastic gradient descent is in fact similar in principle to batch gradient descent. The difference is that it does not use the data of all m samples when computing the gradient, but selects only a single sample j to compute the gradient. The corresponding update formula is:

θi = θi − α(hθ(x(j)0, x(j)1, ..., x(j)n) − yj)·x(j)i
    Stochastic gradient descent and the batch gradient descent of 4.1 are two extremes: one uses all the data for gradient descent, the other uses a single sample. The advantages and disadvantages of each are naturally very prominent. For training speed, stochastic gradient descent is fast because each iteration uses only one sample, while batch gradient descent cannot achieve satisfactory training speed when the sample size is large. For accuracy, because stochastic gradient descent decides the gradient direction with only one sample, the resulting solution is very likely not optimal. For convergence speed, because stochastic gradient descent iterates on one sample at a time, the iteration direction varies a lot, and it cannot quickly converge to the local optimal solution.
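A minimal sketch of stochastic gradient descent on the same illustrative single-feature data as before (dataset, seed, step size, and iteration count are assumptions for the sketch): each update uses one randomly chosen sample j.

```python
import random

random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated by y = 1 + 2x

theta0, theta1 = 0.0, 0.0
alpha = 0.01

for _ in range(20000):
    j = random.randrange(len(xs))              # pick one sample j at random
    error = theta0 + theta1 * xs[j] - ys[j]    # h_theta(x^(j)) - y_j
    theta0 -= alpha * error                    # x0^(j) = 1
    theta1 -= alpha * error * xs[j]

print(round(theta0, 2), round(theta1, 2))  # approaches 1.0 and 2.0
```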

So, is there a moderate approach that can combine the advantages of the two methods? There is! It is the mini-batch gradient descent of 4.3.

4.3 Mini-batch gradient descent (Mini-batch Gradient Descent)
  Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: for m samples, we use x of them for each iteration, with 1 < x < m. Generally x = 10 is taken; of course, the value of x may be adjusted according to the data samples. The corresponding update formula is:

θi = θi − α∑j=t..t+x−1 (hθ(x(j)0, x(j)1, ..., x(j)n) − yj)·x(j)i
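A minimal sketch of mini-batch gradient descent on the same illustrative data (the dataset, batch size of 2 rather than the typical 10 since the toy dataset is tiny, seed, step size, and averaging over the batch are all assumptions for the sketch):

```python
import random

random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated by y = 1 + 2x
batch_size = 2

theta0, theta1 = 0.0, 0.0
alpha = 0.05

for _ in range(10000):
    batch = random.sample(range(len(xs)), batch_size)   # pick a small batch at random
    errors = [(theta0 + theta1 * xs[j] - ys[j], xs[j]) for j in batch]
    # average the gradient over the batch, then take one descent step
    theta0 -= alpha * sum(e for e, _ in errors) / batch_size
    theta1 -= alpha * sum(e * x for e, x in errors) / batch_size

print(round(theta0, 2), round(theta1, 2))  # approaches 1.0 and 2.0
```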
5. Gradient descent compared with other unconstrained optimization algorithms
    Among the unconstrained optimization algorithms in machine learning, besides gradient descent there are the aforementioned least squares method, and also Newton's method and quasi-Newton methods.

Comparing gradient descent with the least squares method: gradient descent requires choosing a step size, while least squares does not. Gradient descent solves iteratively, while least squares computes an analytical solution. If the sample size is not large and an analytical solution exists, the least squares method has the advantage and computes quickly. However, if the sample size is very large, the least squares method requires inverting a huge matrix, making the analytical solution difficult or slow to obtain, and the iterative gradient descent method has the advantage.

Comparing gradient descent with Newton's method / quasi-Newton methods: both solve iteratively, but gradient descent uses the gradient, while Newton's method / quasi-Newton methods use the inverse or pseudo-inverse of the second-order Hessian matrix. In comparison, Newton's method / quasi-Newton methods converge faster, but each iteration takes longer than in gradient descent.

Origin blog.csdn.net/lgy54321/article/details/90677551