The principle of the trust region algorithm

Introduction to optimization methods

Commonly mentioned optimization methods include gradient descent (with variants such as batch gradient descent and stochastic gradient descent), Newton's method (from which quasi-Newton methods are derived), and so on. We know that in machine learning, optimization means minimizing a loss function, i.e. solving \(\mathop{\min}\limits_\theta F(x_\theta)\), where \(\theta\) denotes the parameters of the loss function; the purpose of optimization is to find the best \(\theta\), the one that makes the loss function smallest. Gradient descent computes the gradient of the loss function at the current point, takes a small step in the negative gradient direction, recomputes the gradient at the new point, and keeps iterating until the specified number of iterations is reached or the gradient becomes small enough, at which point the iteration ends and the minimum is obtained.

For Newton's method, here is a simple derivation. First approximate the objective function \(f(x)\) with a Taylor expansion:

\[\varphi (x) = f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2}f''(x_k){(x - x_k)^2}\]

\(\varphi(x)\) is the second-order expansion; the higher-order terms are omitted. Since we take \(\varphi\) as an approximation of the objective function, to find its extremum we differentiate the quadratic expansion and set the derivative to zero, i.e.:

\[\varphi '(x) = 0,\]
\[f'(x_k) + f''(x_k)(x - x_k) = 0,\]

which gives

\[x = x_k - \frac{f'(x_k)}{f''(x_k)}.\]

Here is an intuitive reading of these lines. When we Taylor-expand the function at the point \(x_k\) and keep only the first few orders (Newton's method keeps up to second order), the expansion is used to approximate the original function, so it approximates it only near the expansion point: because we keep only finitely many terms (the higher orders are dropped), at points far from \(x_k\) the expansion may differ greatly from the original function. Therefore, when we compute the extremum of the expansion above, what we find is only the extremum of the approximating function, not the extremum of the original function; but it is usually very close to the original one. We then expand again at the new point, find its extremum, and so on, getting closer and closer to the extremum of the original function. Imagine that the original function is fairly smooth around the expansion point: then the quadratic approximation will be very close to the original function near that point, the extremum found at each step will be very close to the true extremum, and Newton's method will converge quickly.
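As a concrete illustration (not from the original article), the one-dimensional Newton iteration above can be sketched in Python; the example function \(f(x) = e^x - 2x\) is my own choice, with minimizer \(x^* = \ln 2\):

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton's method for 1-D minimization: x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)  # extremum of the local quadratic model
        x -= step
        if abs(step) < tol:    # stop when the update is negligible
            break
    return x

# Example: f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
# The minimizer is x* = ln 2.
x_star = newton_minimize(lambda x: math.exp(x) - 2, math.exp, x0=0.0)
print(x_star)  # ≈ 0.6931 (ln 2)
```

Each iteration jumps to the minimum of the current quadratic model, which is why convergence is fast when the function is smooth near the iterate.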
But if the original function is extremely uneven and bumpy at the expansion point, the quadratic approximation there may differ greatly from the original function, and the extremum found may not be close to the true extremum of the original function — it may even move farther away from it; perhaps this is the reason for some of the limitations of Newton's method. Looking at the methods above, we see that gradient descent first determines a direction and then walks along it, and Newton's method is very similar: after expanding the approximating function, it walks toward that function's extremum, effectively moving along a second-order direction. This article presents a different line of thinking: instead of fixing the direction first, we first determine how long the next step should be. Starting from the initial point, we fix the step length, search for a minimum point within the spatial region of that radius, move to that point, determine a new step length at the new point, and iterate in this way until the minimum point of the function is found. The following sections describe this method in detail.

The principle of trust region methods

Following the discussion above, we pose a few questions and use them to describe how trust region methods are implemented:

  1. When the trust region method searches for a minimum point within the step-size region, which function's minimum is it seeking? Is it the minimum of the original loss function?

  2. Once the step size is determined, how is the minimum point found within the region of that radius?

  3. How is the step size determined?

We now answer them one by one.

Question one

When the trust region method searches for a minimum point within the step-size region, which function's minimum is it seeking? Is it the minimum of the original loss function?
No — it is not the extremum of the original function that is sought. Here the trust region method is very similar to Newton's method: we also take a Taylor expansion at the current point and keep the second-order terms, as follows:

\[f(x) \approx f({x_k}) + \nabla f{({x_k})^{\rm T}}(x - {x_k}) + \frac{1}{2}{(x - {x_k})^{\rm T}}{\nabla ^2}f({x_k})(x - {x_k})\]

Then, simplifying with the substitution \(d = x - x_k\), we get

\[\varphi (d) = f({x_k}) + \nabla f{({x_k})^{\rm T}}d + \frac{1}{2}{d^{\rm T}}{\nabla ^2}f({x_k})d\]

We expand the original loss function at \(x_k\), take its second-order expansion as an approximation of the original loss function, and then, according to certain rules, determine a step size and search for the minimum of the approximating function within the region of that radius. The actual mathematical formulation is as follows:

\[ \begin{cases} \varphi (d) = f({x_k}) + \nabla f{({x_k})^{\rm T}}d + \frac{1}{2}{d^{\rm T}}{\nabla ^2}f({x_k})d\\ \|d\| \le {h_k}\\ \end{cases} \]

Let us explain why \(\|d\|\) is constrained. Since \(d = x - x_k\), we can understand \(\|d\|\) as the distance between the optimal point \(x\) we are looking for and the expansion point where we currently stand.
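To make the model concrete, here is a small sketch (my own illustration, assuming NumPy and a hand-picked test function \(f(x) = x_1^2 + 2x_2^2\)) that builds \(\varphi(d)\) from the gradient and Hessian at \(x_k\):

```python
import numpy as np

def f(x):
    return x[0]**2 + 2.0 * x[1]**2

def grad(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

def phi(xk, d):
    """Second-order model: phi(d) = f(xk) + g^T d + 0.5 d^T H d."""
    g, H = grad(xk), hess(xk)
    return f(xk) + g @ d + 0.5 * d @ H @ d

xk = np.array([1.0, 1.0])
d = np.array([0.1, -0.1])
print(phi(xk, d), f(xk + d))  # equal here, since this f is itself quadratic
```

For this quadratic test function the model is exact; for a general loss function, \(\varphi(d)\) matches \(f(x_k + d)\) only for small \(\|d\|\), which is precisely why the constraint \(\|d\| \le h_k\) is needed.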

Question two

Once the step size is determined, how is the minimum point found within the region of that radius?
From the answer to the first question, we know that the trust region method expands the loss function at a certain point, keeps the second-order terms, fixes a step size, and then seeks the minimum of the resulting model. Clearly, this is a constrained optimization problem, and for optimization problems with inequality constraints we usually use the KKT conditions. Let us briefly discuss the KKT conditions here; for a detailed derivation, see my other article: derivation of the KKT conditions (not yet published). The KKT conditions are specifically for solving optimization problems with inequality constraints: they turn the objective function with its inequality and equality constraints into a system of equations that can be solved. That is, if the original problem has an extremal solution, that solution must satisfy certain conditions — the KKT conditions (they are necessary conditions); these conditions can be written as a system of equations, and solving that system yields the solution.
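In practice, the constrained subproblem is often solved only approximately rather than exactly via the KKT system. One standard cheap choice (my own addition here, not from this article) is the Cauchy point: the minimizer of the quadratic model along the steepest-descent direction, clipped to the trust region radius. A sketch, assuming NumPy:

```python
import numpy as np

def cauchy_point(g, H, h):
    """Cauchy point for the model g^T d + 0.5 d^T H d subject to ||d|| <= h.

    d_C = -tau * (h / ||g||) * g, where
    tau = 1                             if g^T H g <= 0 (model unbounded along -g)
        = min(||g||^3 / (h g^T H g), 1) otherwise.
    """
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)  # already stationary
    gHg = g @ H @ g
    tau = 1.0 if gHg <= 0.0 else min(gnorm**3 / (h * gHg), 1.0)
    return -tau * (h / gnorm) * g
```

For example, with \(H = I\) and \(g = (2, 0)\): if the radius is large (\(h = 10\)) the unconstrained minimizer along \(-g\) is returned, \(d = (-2, 0)\); if the radius is small (\(h = 1\)) the step is clipped to the boundary, \(d = (-1, 0)\).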

Question three

How is the step size determined?
Once we have found the minimum point within the current step-size region, we need to determine the next step-size region (note that at the start of the algorithm we are given an initial step size and an initial point). The principle for choosing the next step size is to compare, at the minimum point just found, the actual decrease of the loss with the decrease predicted by the quadratic approximation. If the actual decrease meets expectations, we move to that point (update to \(x_{k+1}\)) and keep using the same step size, or even enlarge it; if the actual decrease falls short of expectations, we shrink the step size; and if it is very unsatisfactory, we do not even update \(x_{k+1}\). This is described by the equations below. Suppose the optimal point of the model is \(d_k\), i.e. the vector by which we want to update. The actual decrease is \(f({x_k}) - f({x_k} + {d_k})\); the predicted decrease is the original value minus the minimum of the approximating function, namely \(f({x_k}) - \varphi ({d_k})\). The ratio of the two is
\[{\rho _k} = \frac{f({x_k}) - f({x_k} + {d_k})}{f({x_k}) - \varphi ({d_k})}\]
Based on this ratio we can define the update rules for the step size, illustrated below. Suppose the initial trust region radius is \(h_1\), the initial point is \(x_1\), and the optimal point found for the model is \(d_1\); assume two parameters \(0 < \mu < \eta < 1\). Iterate update: if the ratio \(\rho_1 \le \mu\), do not update, i.e. \(x_2 = x_1\); if \(\rho_1 > \mu\), then \(x_2 = x_1 + d_1\). Trust region radius update: if \(\rho_1 \le \mu\), shrink the trust region radius, \(h_2 = \frac{1}{2}h_1\); if \(\mu < \rho_1 < \eta\), let \(h_2 = h_1\); if \(\rho_1 \ge \eta\), let \(h_2 = 2h_1\). Then continue iterating in the same way.
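The \(\mu\)/\(\eta\) rules above can be written as a small helper (a sketch; the function and parameter names are my own):

```python
def trust_region_update(rho, x, d, h, mu=0.25, eta=0.75):
    """Apply the iterate and radius update rules given the ratio rho."""
    # Iterate update: accept the step only if rho > mu.
    x_next = x + d if rho > mu else x
    # Radius update: shrink, keep, or double depending on rho.
    if rho <= mu:
        h_next = 0.5 * h
    elif rho < eta:
        h_next = h
    else:
        h_next = 2.0 * h
    return x_next, h_next

print(trust_region_update(0.1, 1.0, 0.5, 2.0))  # step rejected, radius halved
print(trust_region_update(0.9, 1.0, 0.5, 2.0))  # step accepted, radius doubled
```

The concrete thresholds \(\mu = 0.25\) and \(\eta = 0.75\) match the values used in the algorithm steps below.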

Trust Region Algorithm steps:

(1) Start iterating from the initial point \(x_0\) and the initial trust region radius \(h_0\).

(2) At the \(k\)-th step, construct the quadratic approximation model.

(3) Solve the trust region model to obtain the displacement \(d_k\), and compute \(\rho_k\).

(4) If \(\rho_k \le 0.25\), the step went too far and the trust region radius should be shrunk: let \(h_{k+1} = \frac{1}{2}h_k\). The step should not be taken; we "stand still", i.e. \(x_{k+1} = x_k\).

(5) If \(\rho_k \ge 0.75\) and \(\|d_k\| = h_k\), this step has reached the edge of the trust region and the stride is a bit small, so we can try to enlarge the trust region radius: let \(h_{k+1} = 2h_k\), and move to the next point, i.e. \(x_{k+1} = x_k + d_k\).

(6) If \(0.25 < \rho_k < 0.75\), the step taken is between "reliable" and "unreliable"; keep the current trust region radius, let \(h_{k+1} = h_k\), and move to the next point, i.e. \(x_{k+1} = x_k + d_k\).
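Putting steps (1)–(6) together, here is a minimal sketch in Python (NumPy assumed; the test function, the simple subproblem solver that clips the Newton step to the radius, and the stopping rule are my own choices for illustration, not the article's):

```python
import numpy as np

def trust_region(f, grad, hess, x0, h0=1.0, tol=1e-8, max_iter=100):
    """Basic trust region loop following steps (1)-(6)."""
    x, h = np.asarray(x0, dtype=float), h0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        # Approximate subproblem solution: take the Newton step if it fits,
        # otherwise scale it back to the trust region boundary.
        d = np.linalg.solve(H, -g)
        if np.linalg.norm(d) > h:
            d *= h / np.linalg.norm(d)
        # Ratio of actual to predicted decrease.
        predicted = -(g @ d + 0.5 * d @ H @ d)
        rho = (f(x) - f(x + d)) / predicted
        if rho <= 0.25:                   # step (4): shrink radius, stand still
            h *= 0.5
        else:
            if rho >= 0.75 and np.isclose(np.linalg.norm(d), h):
                h *= 2.0                  # step (5): enlarge radius
            x = x + d                     # steps (5)/(6): accept the step
    return x

# Example on a convex quadratic f(x) = 0.5 x^T A x - b^T x, minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
x_min = trust_region(f, lambda x: A @ x - b, lambda x: A, [5.0, 5.0], h0=0.5)
print(x_min)  # ≈ [0.2, 0.4]
```

On this quadratic the model is exact, so \(\rho_k = 1\) at every step: each boundary step doubles the radius until the full Newton step fits inside the region, at which point the iterate lands on the minimizer.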

An intuitive understanding of the trust region algorithm

The iterative idea behind optimization algorithms:

Since the objective function is too complex for its extremum to be found in one shot, we pick a point and, near that point, use a simple function as an approximation in place of the objective function — near a single point the function can be considered smooth. We then find the extremum of the approximating function, which is approximately an extremum of the objective function near that point, update to that point, continue approximating the objective function with a new model at the new point, find a new extremum, and step by step approach the extremum of the original objective function.

The reasoning behind the trust region radius update rules

When the actual decrease is far smaller than the predicted decrease, the chosen step size is too large. Why? Because the approximating function expanded at a point is close to the objective function only in its vicinity; when the step size is too large, the approximating function can differ greatly from the objective function farther away, so at a point where the approximating function drops a lot, the objective function may not actually drop much. We therefore shrink the step size to ensure the approximating function differs little from the original function, which in turn ensures that the descent direction is right. Conversely, if the decrease matches expectations, the approximating function is very close to the objective function, for one of two possible reasons: either the objective function is smooth over a large neighborhood of this point, or our chosen step size is too small, and within a very small region the approximating function is necessarily close to the objective function. Either way, we should enlarge the step size to accelerate convergence.

The importance of Taylor expansion

My feeling is that Taylor expansion is the foundation of these basic optimization algorithms: expanding a complex objective function at some point into an approximate objective function is the whole essence of these optimization algorithms.
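A quick numeric check of this locality (my own example, using \(f(x) = \sin x\) expanded at \(x_0 = 0\)): the second-order model is accurate close to the expansion point and poor far away.

```python
import math

def taylor2(f_x0, df_x0, d2f_x0, x0, x):
    """Second-order Taylor model of f around x0."""
    dx = x - x0
    return f_x0 + df_x0 * dx + 0.5 * d2f_x0 * dx**2

# f(x) = sin(x) at x0 = 0: f = 0, f' = 1, f'' = 0, so the model is just x.
near = abs(math.sin(0.1) - taylor2(0.0, 1.0, 0.0, 0.0, 0.1))
far = abs(math.sin(2.0) - taylor2(0.0, 1.0, 0.0, 0.0, 2.0))
print(near, far)  # near ≈ 1.7e-4, far ≈ 1.09
```

The error grows by four orders of magnitude as we move from \(x = 0.1\) to \(x = 2\), which is exactly why both Newton's method and trust region methods must keep their steps within the region where the expansion is trustworthy.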

Origin www.cnblogs.com/petewell/p/11607001.html