Random forest, GBDT, XGBoost comparison

RF, GBDT, and XGBoost all belong to ensemble learning (Ensemble Learning). The purpose of ensemble learning is to improve generalization ability and robustness over a single learner by combining the predictions of multiple base learners.
  According to how the individual learners are generated, current ensemble learning methods fall roughly into two categories: serial methods, in which there are strong dependencies between the individual learners and they must be generated sequentially, and parallel methods, in which there are no strong dependencies and the learners can be generated at the same time. The former is represented by Boosting, the latter by Bagging and Random Forest.

1. Random Forest

    1.1 Principle

When it comes to random forest, you have to mention Bagging. Bagging can be summarized simply as: sampling with replacement, then majority voting (classification) or simple averaging (regression). The base learners of Bagging are generated in parallel, with no strong dependencies between them.
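  Below is a minimal sketch of the Bagging idea under a few assumptions: scikit-learn decision trees as base learners, a synthetic binary dataset, and illustrative names such as `n_learners`.

```python
# Minimal Bagging sketch: bootstrap sampling with replacement + equal-weight majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

n_learners = 25
learners = []
for _ in range(n_learners):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap: sample with replacement
    learners.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.stack([tree.predict(X) for tree in learners])  # shape: (n_learners, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)           # majority vote for the binary case
print("training accuracy of the bagged ensemble:", (y_pred == y).mean())
```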
  Random Forest (Random Forests) is an extension variant of Bagging: it builds a Bagging ensemble with decision trees as the base learners and further introduces random feature selection into the training of each tree. RF can therefore be summarized as four parts: 1. random sample selection (sampling with replacement); 2. random feature selection; 3. construction of the decision trees; 4. random forest voting (or averaging).
  Random sample selection is the same as in Bagging. Random feature selection means that when constructing a tree, a subset of features is randomly drawn from the full feature set at each split, and the best splitting feature is chosen from that subset. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree), but, due to averaging, its variance decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
  When building the forest, each decision tree in RF is grown as deep as possible without pruning; when combining the predicted outputs, RF usually uses simple majority voting for classification tasks and simple averaging for regression tasks.
  An important feature of RF is that it does not need cross-validation or an independent test set to obtain an unbiased estimate of its error; it can be evaluated internally, meaning an unbiased error estimate is available during training. Because each base learner uses only about 63.2% of the samples in the training set, the remaining roughly 36.8% of the samples can serve as a validation set for an "out-of-bag estimate" of its generalization performance.
  Comparison of RF and Bagging: the initial performance of RF is poor, especially when there is only one base learner, but as the number of learners increases, the random forest usually converges to a lower generalization error. The training efficiency of random forest is also higher than that of Bagging, because when selecting a splitting feature at a node, the "deterministic" tree used in Bagging must consider all features, whereas thanks to its "randomness" the random forest only needs to consider a subset of them.
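  A small scikit-learn sketch of the points above, with illustrative hyperparameter values: trees are grown without pruning by default, `max_features` controls the random feature subset considered at each split, and `oob_score=True` gives the out-of-bag estimate without a separate validation set.

```python
# Random forest with an out-of-bag (OOB) estimate; values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # more trees -> lower variance through averaging
    max_features="sqrt",   # random feature subset considered at each split
    oob_score=True,        # evaluate on the ~36.8% of samples each tree never saw
    n_jobs=-1,             # trees are independent, so they can be built in parallel
    random_state=0,
).fit(X, y)

print("OOB estimate of generalization accuracy:", rf.oob_score_)
```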

    1.2 Advantages and disadvantages

There are many advantages of random forest, which can be briefly summarized: 1. it performs well on many data sets and has advantages over other algorithms in training speed and prediction accuracy; 2. it can handle very high-dimensional data without explicit feature selection, and after training it can report feature importances; 3. it is easy to parallelize.
  Disadvantage of RF: it can overfit on classification or regression problems with noisy data.

2. GBDT

Before discussing GBDT, let's talk about Boosting. Boosting is a technique very similar to Bagging: in both, the multiple classifiers used are of the same type. In Boosting, however, the classifiers are obtained through serial training, and each new classifier is trained according to the performance of the classifiers already trained; Boosting obtains new classifiers by focusing on the data that the existing classifiers have misclassified.
  Since the classification result of Boosting is a weighted sum of the results of all classifiers, Boosting also differs from Bagging in the weighting: in Bagging the classifier weights are all equal, while in Boosting the weights are unequal, with each weight reflecting how well the corresponding classifier performed in the previous round.
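  A brief sketch to make the unequal weights concrete, using scikit-learn's AdaBoostClassifier (one classic Boosting algorithm) on an illustrative synthetic dataset: the fitted `estimator_weights_` differ per base learner, whereas a Bagging ensemble gives every member the same vote.

```python
# Boosting assigns each base learner its own weight; a sketch with AdaBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# Unequal weights: learners that performed better in their round count more,
# in contrast to Bagging, where every base learner gets the same vote.
print(ada.estimator_weights_)
```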

    2.1 Principle

GBDT differs considerably from traditional Boosting: every round of computation aims to reduce the residual, and to eliminate the residual we can build the new model in the gradient direction along which the residual decreases. Thus, in Gradient Boosting, each new model is built so that the residual of the previous model descends along the gradient, which is very different from traditional Boosting, whose focus is on reweighting correctly and incorrectly classified samples.
  In the Gradient Boosting algorithm, the key is to use the value of the negative gradient of the loss function at the current model as an approximation of the residual, and then fit a CART regression tree to it.
  GBDT accumulates (sums) the outputs of all trees, and this accumulation cannot be done with class labels, so the trees in GBDT are CART regression trees rather than classification trees (although GBDT can also be used for classification after adjustment, this does not mean its trees are classification trees).
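  A minimal sketch of this idea for squared-error loss, where the negative gradient is exactly the residual: each round fits a small regression tree to the current residuals, and the trees' outputs are accumulated. The data, `n_rounds`, and the learning-rate value are illustrative.

```python
# Gradient boosting sketch for squared-error loss: the negative gradient equals
# the residual, so each round fits a small regression tree to the residuals and
# the tree outputs are accumulated. Values are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

n_rounds, learning_rate = 100, 0.1
pred = np.full_like(y, y.mean(), dtype=float)    # start from a constant model
trees = []

for _ in range(n_rounds):
    residual = y - pred                          # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += learning_rate * tree.predict(X)      # accumulate the trees' outputs
    trees.append(tree)

print("final training MSE:", np.mean((y - pred) ** 2))
```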

    2.2 Advantages and disadvantages

The performance of GBDT is a further improvement over RF, so its advantages are also obvious: 1. it can flexibly handle various types of data; 2. with relatively little tuning time, its prediction accuracy is high.
  Of course, because it is a Boosting method, there is a serial dependency between the base learners, so it is difficult to parallelize training.


3. XGBoost

    3.1 Principle

XGBoost takes performance another step beyond GBDT, and its performance can be glimpsed from its results in various competitions. The most notable feature of XGBoost is that it can automatically use CPU multi-threading for parallel computation, while the accuracy of the algorithm has also been improved.
  Because, under reasonable parameter settings, GBDT often needs to generate a fairly large number of trees to reach satisfactory accuracy, the model may require thousands of iterations when the data set is complex. XGBoost uses parallel CPU computation to handle this problem much better.
  In fact, the differences between XGBoost and GBDT are also substantial, and this is reflected in its performance.
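  A brief sketch, assuming the `xgboost` Python package and its scikit-learn wrapper are installed; `n_jobs` controls the CPU multi-threading mentioned above, and the remaining values are illustrative.

```python
# XGBoost sketch using the scikit-learn wrapper; assumes `pip install xgboost`.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,     # number of boosting rounds
    max_depth=6,
    learning_rate=0.1,
    n_jobs=-1,            # use all CPU cores for the per-tree split search
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", model.score(X_te, y_te))
```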

4. Difference

    4.1 Difference between GBDT and XGBoost

  1. Traditional GBDT uses CART trees as base learners, while XGBoost also supports linear classifiers, in which case XGBoost is equivalent to logistic regression (classification) or linear regression (regression) with L1 and L2 regularization;
  2. Traditional GBDT only uses first-order derivative information when optimizing, while XGBoost performs a second-order Taylor expansion of the cost function and uses both first-order and second-order derivatives;
  3. XGBoost adds a regularization term to the cost function to control the complexity of the model. From the perspective of the bias-variance tradeoff, the regularization term reduces the variance of the model, making the learned model simpler and helping to prevent overfitting. This is also an advantage of XGBoost over traditional GBDT;
  4. Shrinkage, equivalent to the learning rate (eta in XGBoost). After each iteration, XGBoost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree so that later trees have more room to learn (GBDT also has a learning rate);
  5. Column sampling. XGBoost borrows the approach of random forest and supports column (feature) sampling, which not only prevents overfitting but also reduces computation;
  6. Handling of missing values. For samples with missing feature values, XGBoost can automatically learn the splitting direction for them;
  7. The XGBoost tool supports parallelism. Isn't Boosting a serial structure, so how can it be parallel? Note that XGBoost's parallelism is not at the granularity of trees: XGBoost can only start the next iteration after the previous one finishes (the cost function of the t-th iteration contains the predicted values from the previous t-1 iterations). XGBoost's parallelism is at the feature granularity. One of the most time-consuming steps in learning a decision tree is sorting the feature values (in order to determine the best split point). Before training, XGBoost pre-sorts the data and stores it in a block structure that is reused in later iterations, greatly reducing the amount of computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, and the gain computation for different features can be performed across multiple threads. (A parameter sketch mapping these points onto the xgboost Python package follows this list.)
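
  As referenced above, here is a sketch mapping several of these points onto parameters of the `xgboost` scikit-learn wrapper; the parameter values themselves are illustrative.

```python
# Mapping the listed differences onto xgboost parameters; values are illustrative.
import numpy as np
from xgboost import XGBRegressor

model = XGBRegressor(
    booster="gbtree",        # "gblinear" would switch to linear base learners (point 1)
    reg_alpha=0.1,           # L1 regularization term in the objective (points 1, 3)
    reg_lambda=1.0,          # L2 regularization term in the objective (points 1, 3)
    learning_rate=0.1,       # shrinkage / eta: scales each tree's contribution (point 4)
    colsample_bytree=0.8,    # column sampling borrowed from random forest (point 5)
    missing=np.nan,          # samples with missing values learn a default split direction (point 6)
    n_jobs=-1,               # feature-granularity parallelism during split finding (point 7)
    n_estimators=200,
)
# model.fit(X, y) would then train as usual, even on data containing np.nan entries.
```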

 

Origin blog.csdn.net/gf19960103/article/details/89476839