Comparison of random forest, GBDT, XGBoost

Random Forest (RF)

  The ensemble learning method behind random forest is bagging. The difference from plain bagging is that bagging only uses bootstrap sampling (sampling with replacement) of the training examples, whereas random forest randomly samples both the examples and the features, so it is better at preventing overfitting and reduces variance further.

Ensemble method used: bagging

  • An ensemble learning method: bootstrap (sampling with replacement) randomly selects samples to train each base learner, and the results are then combined by voting (classification) or averaging (regression).
  • It is a parallel method, because the base learners are trained independently of each other.
  • The training complexity of a bagging ensemble is of the same order as that of a single base learner (roughly n times it, where n is the number of base learners).
  • Bagging can be used for binary classification, multi-class classification, and regression.
  • The training samples not drawn for a given base learner can be used for out-of-bag (OOB) estimation of generalization performance (see the sketch after this list).
  • Bagging mainly focuses on reducing variance.
  • Two steps: 1. sampled training (sample the examples, sample the features); 2. fusion of the base learners.
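
A minimal sketch of these bagging ideas, assuming scikit-learn is installed; the synthetic dataset and parameter values are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: each base learner (a decision tree by default) is trained on a
# bootstrap sample, and predictions are combined by voting.
bag = BaggingClassifier(
    n_estimators=100,   # number of independent base learners
    bootstrap=True,     # sample training examples with replacement
    oob_score=True,     # evaluate on the samples each learner did not see
    n_jobs=-1,          # base learners can be trained in parallel
    random_state=0,
)
bag.fit(X, y)
print(bag.oob_score_)   # out-of-bag estimate of generalization accuracy
```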

Random forest further adds random attribute selection to decision tree training:

  If there are M input features, each node randomly selects m of them (m < M), and the best split is chosen only among these m features. The value of m is kept fixed while the decision tree is grown; m is commonly taken to be the square root of M.

  Therefore, random forest has both sample randomness (bootstrap sampling, inherited from bagging) and feature randomness.
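
A corresponding random-forest sketch that combines both kinds of randomness (again assuming scikit-learn; values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random forest = bagging over trees + random feature selection at each split.
rf = RandomForestClassifier(
    n_estimators=200,       # more trees -> lower variance, more computation
    max_features="sqrt",    # m = sqrt(M) features considered at each split
    oob_score=True,         # out-of-bag estimate of generalization performance
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)
```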

Advantages and disadvantages of random forests

Advantages:

a) The random forest algorithm can handle both classification and regression problems and performs well on both. Because it is an ensemble, both variance and bias are relatively low and generalization is excellent;
b) Random forest handles high-dimensional datasets well: it can work with thousands of input variables and identify the most important ones, so it is also regarded as a useful dimensionality-reduction method. In addition, the model can output feature importances, which is very useful in practice (see the short snippet after this list);
c) It can deal with missing data;
d) When classes are imbalanced, random forest provides an effective way to balance the error on the dataset;
e) It is highly parallelizable and easy to distribute;
f) Because it is a tree model, no feature normalization is required.
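
The feature importances mentioned in point b) can be read off directly in scikit-learn; a minimal sketch on synthetic data (illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance (higher = more important).
importances = rf.feature_importances_
for idx in np.argsort(importances)[::-1][:5]:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```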

Disadvantages:

a) Random forest does not perform as well on regression as on classification, because it cannot produce a truly continuous output. In regression it cannot extrapolate beyond the range of the training data, which can lead to overfitting on data with certain kinds of noise.
b) To many statistical modelers, a random forest feels like a black box: you have little control over what the model does internally and can only experiment with different parameters and random seeds.
c) It ignores dependencies between attributes.

RF parameter tuning

General tuning methods:

  1. Grid search. sklearn provides the corresponding class GridSearchCV: candidate parameter values are evaluated by cross-validation and the combination with the best score is chosen. Disadvantage: computationally expensive. This is usually the choice for small projects and competitions. (A sketch using GridSearchCV and RandomizedSearchCV follows this list.)
  2. Greedy coordinate-descent search: fix all other parameters and tune one parameter to its best value, then iterate until convergence. Advantage: less computation. Disadvantage: it may not find the global optimum and can get stuck in a local optimum.
  3. Random grid search: it avoids a grid spacing so coarse that it skips over the optimum, and lets each parameter take more values than a fixed grid would.
  • The larger n_estimators is, the more stable the result (the smaller the variance), so as long as resources allow, more trees are better, although the computation grows accordingly. This is the only parameter for which bigger is simply better; for the other parameters the optimum lies somewhere in the middle.
  • The split criterion (criterion) affects accuracy differently on different problems; whether Gini or the information gain ratio works better has to be decided case by case in practice.
  • The maximum number of features considered per split (max_features) is usually set to sqrt(total number of features).
  • Adjusting the maximum number of leaf nodes (max_leaf_nodes) or the maximum tree depth (max_depth) changes the tree structure in a coarse-grained way: the more leaf nodes or the deeper the tree, the lower the bias and the higher the variance of the sub-model. The minimum number of samples required to split a node (min_samples_split), the minimum number of samples per leaf (min_samples_leaf), and the minimum weight fraction per leaf (min_weight_fraction_leaf) adjust the structure more finely: the fewer samples required to split or to form a leaf, the more complex the sub-model.
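
A minimal sketch of options 1 and 3 above applied to a random forest (assuming scikit-learn; the parameter grid is only an example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

# 1. Exhaustive grid search with cross-validation (expensive but thorough).
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# 3. Random search: sample a fixed number of parameter combinations instead.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, n_iter=10, cv=5, n_jobs=-1,
                          random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```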

An extension of random forests (Extra Trees)

Extra trees are a variant of RF. The principle is almost the same as RF; the only differences are:

1) For the training set of each decision tree, RF uses bootstrap (random sampling with replacement) to build the training set of each tree, while extra trees generally do not subsample: each decision tree uses the original training set.

2) After the candidate split feature is chosen, an RF decision tree selects the optimal split value according to a criterion such as information gain, Gini index, or mean squared error, just like a traditional decision tree. Extra trees are more aggressive: they split on a randomly chosen value of the feature.

Because of the second point, the trees generated by extra trees are generally larger than those generated by RF, since the split value is chosen at random rather than optimally. In other words, the variance of the model is reduced further relative to RF while the bias is increased further, so in some cases extra trees generalize better than RF.
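
The corresponding estimator in scikit-learn is ExtraTreesClassifier; a minimal sketch (values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Extra trees: the whole training set is used for every tree (no bootstrap
# by default), and split thresholds are drawn at random rather than optimized.
et = ExtraTreesClassifier(
    n_estimators=200,
    max_features="sqrt",
    bootstrap=False,     # difference 1): each tree sees the original training set
    n_jobs=-1,
    random_state=0,
)
et.fit(X, y)
```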

 

GBDT (Gradient Boosting Decision Tree)

The basic principle of GBDT is the boosting tree from boosting, trained with gradient boosting.

 

The trees in GBDT are all regression trees, not classification trees, because gradient boosting fits an approximation of the residual given by the gradient of the loss function, which is a continuous value; hence only regression trees are used.

Gradient boosting:

Gradient Boosting is a boosting method. It differs from traditional boosting in that each round aims to reduce the residual of the previous round, and to do so it builds the new model in the direction of the gradient along which the residual decreases. In Gradient Boosting, each new model is therefore built to reduce the residual of the previous model along the gradient direction, which is very different from traditional boosting, which reweights correctly and incorrectly classified samples. The gradient here is the derivative of the loss function with respect to the prediction of the previous learner.

Difference from the boosting tree: the boosting tree is suited to the squared loss or the exponential loss, while Gradient Boosting works with arbitrary differentiable loss functions (with the squared loss it is equivalent to a boosting tree fitting residuals; with the exponential loss it approximates AdaBoost, except that the base learners are regression trees).

For the gradient boosting tree, the learning procedure is similar to that of the boosting tree, except that the residual is no longer used as the new training target; instead, the gradient of the loss function is used as the target (y value) of the new training data. Concretely, take the gradient of the loss with respect to f(x) and evaluate it at f_{m-1}(x):
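
In the usual gradient-boosting notation, this pseudo-residual is:

$$
r_{m,i} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f(x)\,=\,f_{m-1}(x)}, \qquad i = 1, \dots, N
$$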

The relationship between GBDT and the boosting tree:

Each round, the boosting tree model is improved by retraining on the difference between the previous prediction and the training label as the new training target. GBDT replaces this residual with the gradient direction of the loss function: the previous prediction is plugged into the gradient to obtain the training targets for the current round. The two models generate the new training data in different ways, so what is the difference behind this? What is wrong with using the residual?

Li Hang's "Statistical Learning Methods" mentions that with the squared error loss and the exponential loss the residual of the boosting tree is simple to compute, but with a general loss function the residual is not so easy to obtain. Therefore the value of the negative gradient of the loss function at the current model is used as an approximation of the residual in the regression problem.
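
For example, with the squared error loss the negative gradient is exactly the residual, which is why the two views coincide in that special case:

$$
L\big(y, f(x)\big) = \tfrac{1}{2}\big(y - f(x)\big)^2
\;\Longrightarrow\;
-\frac{\partial L}{\partial f(x)} = y - f(x)
$$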

 

(from MLAPP)

 

XGBoost

 

 

XGBoost improves on GBDT in several ways:
  • it uses a second-order Taylor expansion of the objective;
  • it adds a regularization penalty on the number of leaves and the leaf scores;
  • the gain used for splitting is different: GBDT uses an impurity criterion such as Gini, while xgboost uses the gain derived from its regularized objective.

The following comes from "A step-by-step understanding of GB, GBDT, xgboost".

Xgboost is an efficient implementation of the GB algorithm. The base learner in xgboost can be a CART tree (gbtree) or a linear model (gblinear). All the content below, including the formulas, comes from the original paper.

(1) A regularization term is added to the objective function of xgboost. When the base learner is CART, the regularization term depends on the number of leaf nodes T of the tree and on the leaf values (scores).

(2) GB uses the first derivative of the loss with respect to f(x) to compute the pseudo-residuals that fm(x) is fitted to. xgboost uses not only the first derivative but also the second derivative.

The t-th loss:
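
In the notation of the xgboost paper, the objective at iteration t is:

$$
L^{(t)} = \sum_{i=1}^{n} l\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^2
$$

where $T$ is the number of leaves of the tree $f$ and $w$ are its leaf scores.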

Take the second-order Taylor expansion of the above formula; g denotes the first derivative and h the second derivative of the loss with respect to the previous round's prediction.
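
As written in the paper, the expansion is:

$$
L^{(t)} \simeq \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^2(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big),\;\;
h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)
$$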

(3) As mentioned above, the criterion a CART regression tree uses to find the best split point is minimizing the mean squared error. The criterion xgboost uses is maximizing the split gain below, where lambda and gamma come from the regularization term.
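
In the paper's notation, the gain to be maximized when evaluating a candidate split is:

$$
\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
$$

where $G_L, H_L$ and $G_R, H_R$ are the sums of $g_i$ and $h_i$ over the samples falling into the left and right child, respectively.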

The steps of the xgboost algorithm are basically the same as those of GB: both start by initializing the model to a constant; GB then iterates using the first-order quantities ri, while xgboost iterates using the first derivatives gi and the second derivatives hi, generating a base learner in each round and adding it to update the model.

In addition to the above three differences between xgboost and GBDT, xgboost also makes many optimizations in its implementation:

  • When looking for the best split point, enumerating every possible split of every feature with the traditional greedy method is too inefficient, so xgboost implements an approximate algorithm. The general idea is to enumerate a few candidate split points chosen by percentiles (quantiles) of each feature, and then pick the best split among these candidates using the gain formula above.
  • xgboost handles sparse training data: for missing values (or other specified values) a default branch direction can be learned, which greatly improves efficiency; the paper reports a speedup of roughly 50x.
  • Feature columns are sorted and stored in memory in blocks, which can be reused across iterations; although the boosting iterations must run serially, the feature columns can be processed in parallel.
  • Storing data by feature column speeds up the search for the best split point, but computing gradients row by row then leads to non-contiguous memory access and, in bad cases, cache misses that slow the algorithm down. As mentioned in the paper, the data can first be gathered into per-thread buffers and then processed, which improves efficiency.
  • xgboost also considers how to use the disk effectively when the data is large and memory is insufficient, mainly by combining multi-threading, data compression, and sharding to keep the algorithm as efficient as possible.

 

The following content is from wepon
Author: wepon
link: https://www.zhihu.com/question/41354392/answer/98658997

 

  • Traditional GBDT uses CART as the base learner, while xgboost also supports linear base learners; in that case xgboost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.
  • Traditional GBDT only uses first-order derivative information in the optimization, while xgboost applies a second-order Taylor expansion to the cost function and uses first and second derivatives at the same time. Incidentally, the xgboost tool supports custom cost functions, as long as the function has first and second derivatives (a sketch of a custom objective follows this list).
  • xgboost adds a regular term to the cost function to control the complexity of the model. The regular term contains the number of leaf nodes of the tree and the sum of the squares of the L2 modulus of the score output on each leaf node. From the perspective of Bias-variance tradeoff, the regular term reduces the variance of the model, making the learned model simpler and preventing overfitting, which is also a feature of xgboost superior to traditional GBDT.
  • Shrinkage, equivalent to the learning rate (eta in xgboost). After xgboost completes one iteration, it will multiply the weight of the leaf node by this coefficient, mainly to weaken the influence of each tree, so that there is more room for learning later. In practical applications, eta is generally set smaller, and then the number of iterations is set larger. (Supplement: The implementation of traditional GBDT also has a learning rate)
  • Column subsampling is feature sampling . Xgboost draws on the practice of random forest and supports column sampling, which can not only reduce overfitting , but also reduce computation . This is also a feature of xgboost that is different from traditional gbdt.
  • Handling of missing values. For samples with missing feature values, xgboost can automatically learn its split direction.
  • The xgboost tool supports parallelism. Isn't boosting a serial procedure, so how can it be parallel? Note that xgboost's parallelism is not at the granularity of trees: xgboost can only start the next iteration after the current one finishes (the cost function of the t-th iteration contains the predictions of the previous t-1 iterations). Its parallelism is at the granularity of features. One of the most time-consuming steps in decision-tree learning is sorting the feature values (in order to find the best split point). Before training, xgboost sorts the data in advance and saves it in a block structure, which is reused in later iterations and greatly reduces the amount of computation. This block structure also makes parallelization possible: when a node is split, the gain of every feature has to be computed and the feature with the largest gain is chosen, and these per-feature gain computations can run in multiple threads.
  • A parallelizable approximate histogram algorithm. When a tree node is split, we need to compute the gain of every candidate split point of every feature, i.e. greedily enumerate all possible split points. When the data cannot be loaded into memory at once, or in a distributed setting, the greedy algorithm becomes very inefficient, so xgboost also proposes a parallelizable approximate histogram algorithm to generate candidate split points efficiently.
  • Multiple language packaging support.
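
As a sketch of the custom-objective point above, assuming the xgboost Python package is installed; the squared-error objective here is only an illustration of the grad/hess interface:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# A custom objective must return the first and second derivatives
# (grad, hess) of the loss with respect to the predictions, because
# xgboost's second-order method needs both.
def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels          # d/dpred of 1/2 * (pred - label)^2
    hess = np.ones_like(preds)     # second derivative is constant
    return grad, hess

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_obj)
```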

XGBoost parameter tuning
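
A hedged sketch of tuning the most commonly adjusted xgboost parameters through the scikit-learn wrapper; the grid values below are illustrative only and map onto the points discussed above (learning rate/eta, tree depth, row and column subsampling, and the regularization terms lambda and gamma):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300],       # number of boosting rounds
    "learning_rate": [0.05, 0.1],     # eta / shrinkage
    "max_depth": [3, 6],              # depth of each tree
    "subsample": [0.8, 1.0],          # row subsampling per tree
    "colsample_bytree": [0.8, 1.0],   # column (feature) subsampling per tree
    "reg_lambda": [1.0, 10.0],        # L2 penalty on leaf scores
    "gamma": [0.0, 1.0],              # minimum gain required to make a split
}

search = GridSearchCV(XGBClassifier(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```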

 

References:
  • chentq's slides
  • chentq's paper
  • chentq's Chinese blog post on 52cs
  • "Introduction and practice of xgboost", shared on Weibo (PDF)

 


Reprinted from http://www.cnblogs.com/sarahp/p/6900572.html
