On the differences between AdaBoost, GBDT, and XGBoost from the perspective of the loss function

Disclaimer: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/wangfenghui132/article/details/77887928

AdaBoost, GBDT, and XGBoost differ considerably in how they optimize the loss function; in fact, the three can be viewed as three different optimization strategies (I am not sure this framing is entirely appropriate, but from this vantage point I believe it is correct, even if years from now it may look less apt). For AdaBoost, Dr. Hang Li's "Statistical Learning Methods" explains the weight-update rules through the forward stagewise additive model. In that derivation, both the sample-weight update and the weight of each weak classifier are obtained directly by setting the partial derivative to zero; if you do not remember this, it is worth going back and rereading it. Whether for classification or regression, AdaBoost's goal is to find, in closed form (by setting the partial derivative to zero), the value that directly minimizes the loss function.
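To make that closed-form update concrete, here is a minimal Python sketch of one AdaBoost round (my own illustration, not code from the article), assuming binary labels in {-1, +1}: the weak-learner weight alpha is exactly the value obtained by setting the derivative of the exponential loss to zero.

```python
import numpy as np

def adaboost_step(y, pred, w):
    """One AdaBoost round: the weak-learner weight alpha comes from setting
    the derivative of the exponential loss to zero (closed form), after which
    the sample weights are re-scaled. Assumes y and pred take values in {-1, +1}."""
    err = np.sum(w * (pred != y)) / np.sum(w)    # weighted error rate of the weak learner
    alpha = 0.5 * np.log((1.0 - err) / err)      # closed-form minimizer of the exponential loss
    w = w * np.exp(-alpha * y * pred)            # up-weight misclassified samples
    return alpha, w / w.sum()                    # renormalize the sample weights
```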

GBDT instead uses gradient descent to drive the loss function down. First, why such a method is needed at all, and then why fitting the negative gradient actually reduces the loss. The method exists because only for certain loss functions does setting the partial derivative to zero have a closed-form solution; for a general loss function it does not, otherwise there would be no need for gradient descent. As to why the loss decreases: take a first-order Taylor expansion of the loss around f_{t-1}(x) (the ensemble of the first t-1 weak learners). The expansion is a constant C plus the product of the gradient g and the new weak learner f, i.e. L ≈ C + g·f. To make L smaller, we want the product g·f to be as negative as possible, so we fit the new weak learner to the negative gradient. If we regard the weak learner as the step, the loss function takes one step along the negative gradient. If the new weak learner fits the negative gradient perfectly, then L ≈ C + g·(-g) = C - g·g, and the loss decreases. One might ask: is it then enough for the new weak learner f to point opposite to the gradient g, with a norm that is merely not too large? Of course not, because the Taylor expansion of the loss is only valid near f_{t-1}(x); if the step is too long, moving in the negative-gradient direction incurs a large approximation error. So the step must be small. Why the step is defined so that the norm of f equals the norm of the gradient, I do not know; perhaps Friedman had his own considerations. If I were to try to improve on this, I would start from exactly that point: choose the step length myself and fit the new weak learner f to -a·g, with a > 0.
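As a rough illustration of the negative-gradient fitting described above (not the article's code), here is a Python sketch of GBDT with squared-error loss, where the negative gradient is simply the residual and `lr` plays the role of the short step just discussed; `DecisionTreeRegressor` from scikit-learn is assumed available.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, lr=0.1):
    """Minimal GBDT for squared-error loss: each round fits a shallow tree to
    the negative gradient of the loss at F_{t-1}(x) (for squared error this is
    just the residual y - F), then takes a short step of size lr."""
    F = np.full(len(y), y.mean())             # F_0: constant initial model
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - F                      # -dL/dF for L = 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
        F += lr * tree.predict(X)             # small step along the fitted direction
        trees.append(tree)
    return y.mean(), trees

def gbdt_predict(X, init, trees, lr=0.1):
    # Sum the initial constant and the lr-scaled contribution of each tree.
    return init + lr * sum(t.predict(X) for t in trees)
```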

XGBoost also expands the loss function around f_{t-1}(x), but the difference is that it uses a second-order Taylor expansion, and it adds regularization terms that constrain the number of leaf nodes and the norm of the leaf weights (which can in fact be thought of as the step size). Of course, the same regularization constraints could be added in GBDT as well, so they are not unique to XGBoost; the real difference is that XGBoost takes the second-order Taylor expansion of the loss at f_{t-1}(x) and optimizes the loss starting from that expansion.
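To show what the second-order expansion buys, here is a small Python sketch (again my own illustration, not XGBoost's actual code) of the closed-form leaf weight that falls out of the expansion, with `lam` standing for the L2 regularization on leaf weights:

```python
import numpy as np

def xgb_leaf_weight(g, h, lam=1.0):
    """XGBoost-style optimal leaf weight from the second-order expansion:
    for the samples falling into one leaf, minimizing
        sum_i (g_i * w + 0.5 * h_i * w^2) + 0.5 * lam * w^2
    over w gives the closed form below, where g_i and h_i are the first and
    second derivatives of the loss at the previous prediction F_{t-1}(x_i)."""
    G, H = np.sum(g), np.sum(h)
    return -G / (H + lam)

# Example with squared-error loss, where g_i = F - y and h_i = 1:
y = np.array([1.0, 2.0, 3.0])
F = np.zeros(3)                        # previous model's predictions
g, h = F - y, np.ones(3)
print(xgb_leaf_weight(g, h, lam=1.0))  # 1.5: the mean residual 2.0, shrunk toward 0 by lam
```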

These, then, are the three different methods. This article deliberately keeps formulas to a minimum; please consult the literature for more detail. Readers who already have a preliminary understanding of these models should get something out of it; if you are not familiar with them, the words above will probably offer little insight and may just read like gibberish.

        
