(Repost) RF, GBDT, XGBoost, LightGBM: principles and differences

RF, GBDT, and XGBoost are all ensemble learning methods. The purpose of ensemble learning is to improve the generalization ability and robustness of a single learner by combining the predictions of multiple base learners.
  According to how the individual learners are generated, current ensemble learning methods fall roughly into two categories: serialization methods, where there are strong dependencies between individual learners and they must be generated sequentially, represented by Boosting; and parallelization methods, where there are no strong dependencies between individual learners and they can be generated simultaneously, represented by Bagging and Random Forest.

1. RF
1.1 Principle
  When talking about random forest, Bagging has to be mentioned first. Bagging can be summarized simply as: sampling with replacement, then majority voting (classification) or simple averaging (regression). Bagging's base learners are generated in parallel, with no strong dependencies between them.
  Random Forest (RF) is an extended variant of Bagging. It builds a Bagging ensemble with decision trees as base learners and further introduces random feature selection into the training of each decision tree. RF can therefore be summarized as four parts: 1. random selection of samples (sampling with replacement); 2. random selection of features; 3. construction of decision trees; 4. random forest voting (or averaging).
  Random sample selection is the same as in Bagging. Random feature selection means that when building a tree, a subset of features is randomly drawn from the full feature set, and the optimal attribute for splitting is then chosen from this subset. This randomness slightly increases the bias of the random forest (compared with a single, non-random tree), but because the forest averages over many trees, its variance is reduced, and the reduction in variance more than compensates for the increase in bias, so overall it is a better model.
  When constructing the decision trees, each tree in RF is grown as deep as possible without pruning. When combining the predictions, RF usually uses simple voting for classification problems and simple averaging for regression tasks.
  An important property of RF is that it does not need cross-validation or an independent test set to obtain an unbiased error estimate; it can be evaluated internally during training. Each base learner uses only about 63.2% of the samples in the training set, so the remaining roughly 36.8% can serve as a validation set for an "out-of-bag" (OOB) estimate of its generalization performance.
  Comparison of RF and Bagging: the initial performance of RF is poor, especially when there is only one base learner, but as the number of learners increases, random forest usually converges to a lower generalization error. The training efficiency of random forest is also higher than that of Bagging, because when building a single decision tree, Bagging uses a "deterministic" tree that must consider all features when choosing a split, while the "random" trees in a random forest only need to consider a subset of features.
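  The RF recipe just described (bootstrap sampling, random feature subsets at each split, unpruned trees, voting/averaging, out-of-bag estimation) maps directly onto scikit-learn's RandomForestClassifier. Below is a minimal sketch, assuming scikit-learn is available; the dataset and parameter values are illustrative only, not from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped (sampling-with-replacement) trees
    max_features="sqrt",   # random feature subset considered at each split
    oob_score=True,        # out-of-bag estimate from the ~36.8% unused samples
    n_jobs=-1,             # trees have no dependencies, so they train in parallel
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)          # internal error estimate, no test set needed
print("Feature importances:", rf.feature_importances_)  # available after training
```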
1.2 Advantages and disadvantages
  The advantages of random forest can be briefly summarized: 1. it performs well on many data sets and has advantages over other algorithms in both training speed and prediction accuracy; 2. it can handle very high-dimensional data without explicit feature selection, and it provides feature importances after training (as in the sketch above); 3. it is easy to parallelize.
  Disadvantage of RF: it can overfit on noisy classification or regression problems.
2. GBDT
  Before discussing GBDT, let's talk about Boosting. Boosting is a technique very similar to Bagging: both combine multiple classifiers of the same type. But in Boosting the classifiers are obtained through serial training, and each new classifier is trained according to the performance of the classifiers already trained; Boosting builds new classifiers by concentrating on the data that the existing classifiers have misclassified.
  Since the Boosting prediction is a weighted sum over all classifiers, Boosting also differs from Bagging in another way: in Bagging the classifier weights are all equal, while in Boosting they are not, and each weight reflects how well the corresponding classifier did in the previous round (a small sketch contrasting the two combination rules follows).
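  To make the contrast concrete, here is a minimal scikit-learn sketch (the dataset and settings are illustrative assumptions): Bagging gives every tree an equal vote, while AdaBoost, a classic Boosting method, trains trees serially and assigns each one a weight based on how well it performed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Bagging: trees trained independently on bootstrap samples, each with an equal vote.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                            n_estimators=50, random_state=0).fit(X, y)

# Boosting (AdaBoost): trees trained serially, each focusing on previous mistakes,
# and each tree receives its own weight in the final weighted vote.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0).fit(X, y)

print("AdaBoost per-tree weights:", boosting.estimator_weights_[:5])
```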
2.1 Principle 
  GBDT is quite different from traditional Boosting. Each round of GBDT aims to reduce the residual of the previous round, and to do so it builds a new model in the gradient direction of residual reduction. In Gradient Boosting, each new model is established so that the residual of the combined model descends along the gradient, which is very different from traditional Boosting, which focuses on re-weighting correctly and incorrectly classified samples.
  In the Gradient Boosting algorithm, the key idea is to use the negative gradient of the loss function evaluated at the current model as an approximation of the residual, and then fit a CART regression tree to it.
  GBDT accumulates the outputs of all trees, and this accumulation does not make sense for class labels, so the trees in GBDT are CART regression trees, not classification trees (GBDT can still be adapted for classification, but that does not make its trees classification trees).
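  As a concrete illustration of "fitting a regression tree to the negative gradient", here is a toy gradient-boosting loop for squared loss, where the negative gradient is exactly the residual. It uses scikit-learn's DecisionTreeRegressor as the CART regression tree; the data, tree depth and learning rate are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
init = y.mean()                # initial constant model
pred = np.full_like(y, init)
trees = []

for _ in range(100):
    residual = y - pred                         # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3)   # CART regression tree
    tree.fit(X, residual)                       # fit the residual, not the labels
    pred += learning_rate * tree.predict(X)     # accumulate the tree outputs
    trees.append(tree)

def predict(X_new):
    """Final model = initial constant + shrunken sum of all trees."""
    out = np.full(len(X_new), init)
    for t in trees:
        out += learning_rate * t.predict(X_new)
    return out

print("training MSE:", np.mean((y - predict(X)) ** 2))
```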
2.2 Advantages and disadvantages
  GBDT further improves performance on the basis of RF, so its advantages are also obvious: 1. it can flexibly handle various types of data; 2. with relatively little tuning time, it achieves high prediction accuracy.
  Of course, because it is a Boosting method, there are serial dependencies between the base learners, so it is difficult to train the data in parallel.

3. XGBoost
3.1 Principle
  XGBoost further improves performance over GBDT, as can be seen from its results in various competitions. The most notable feature of XGBoost is that it can automatically use multiple CPU threads for parallel computation, while the accuracy of the algorithm is also improved.
  Because GBDT often needs to generate a fairly large number of trees to reach satisfactory accuracy under reasonable parameter settings, the model may require thousands of iterations when the data set is complex. XGBoost uses CPU parallelism to handle this problem much better.
3.2 Advantages 
  1. Traditional GBDT uses CART trees as base learners; XGBoost also supports linear base learners, in which case XGBoost is equivalent to logistic regression (classification) or linear regression (regression) with L1 and L2 regularization. Traditional GBDT uses only first-order derivative information when optimizing, while XGBoost performs a second-order Taylor expansion of the cost function and uses both first-order and second-order derivatives;
  2. XGBoost adds a regularization term to the cost function to control model complexity. From the bias-variance trade-off point of view, this reduces the variance of the model, makes the learned model simpler, and prevents over-fitting; this is another feature where XGBoost improves on traditional GBDT;
  3. Shrinkage, which corresponds to the learning rate (eta in XGBoost). After each iteration, XGBoost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree and leave more room for later trees to learn (GBDT also has a learning rate);
  4. Column sampling. XGBoost borrows this idea from random forest: it supports column (feature) subsampling, which both reduces over-fitting and reduces computation;
  5. Handling of missing values. For samples with missing feature values, XGBoost can automatically learn the default split direction;
  6. The XGBoost tool supports parallelism. Isn't Boosting a serial procedure, so how can it be parallel? Note that XGBoost's parallelism is not at the granularity of trees: XGBoost can only start the next iteration after the current one finishes (the cost function of the t-th iteration contains the predictions of the previous t-1 iterations). Its parallelism is at the feature granularity. One of the most time-consuming steps in decision tree learning is sorting the feature values (to determine the best split point). Before training, XGBoost sorts the data in advance and saves it in a block structure that is reused across iterations, which greatly reduces computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen, so the gain computations for different features can run in multiple threads. A parameter-level sketch of points 1-6 follows this list.
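  Here is a hedged sketch of how points 1-6 surface as parameters of the xgboost Python package; the dataset, the injected missing values and all parameter values are illustrative assumptions, not from the original post.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X[::50, 0] = np.nan                       # missing values are handled natively (point 5)

params = {
    "objective": "binary:logistic",
    "booster": "gbtree",                  # "gblinear" would use linear base learners (point 1)
    "eta": 0.1,                           # shrinkage / learning rate (point 3)
    "lambda": 1.0,                        # L2 regularization on leaf weights (point 2)
    "colsample_bytree": 0.8,              # column sampling borrowed from RF (point 4)
    "max_depth": 4,
    "nthread": 4,                         # feature-granularity parallelism (point 6)
}

dtrain = xgb.DMatrix(X, label=y)          # NaN entries learn a default split direction
booster = xgb.train(params, dtrain, num_boost_round=200)
pred = booster.predict(xgb.DMatrix(X))
```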
3.3 Disadvantages 
  1. The level-wise tree construction treats all leaf nodes of the current layer equally. Some leaf nodes have very little splitting gain and no effect on the result, but they are still split, which increases the computational cost.
  2. The pre-sorting method consumes a lot of memory: it needs to store not only the feature values but also the sorted indices of the features. It also consumes a lot of time, since the split gain must be computed for every candidate split point during traversal (although this drawback can be largely overcome with an approximate algorithm).
4. LightGBM
4.1 Comparison with XGBoost
  1. XGBoost uses a level-wise splitting strategy, while LightGBM uses a leaf-wise strategy. The difference is that XGBoost splits all nodes of each layer indiscriminately; some nodes may have very small gain and little effect on the result, but XGBoost splits them anyway, which brings unnecessary overhead. The leaf-wise approach instead selects, among all current leaf nodes, the one with the largest splitting gain and splits it, recursively. Clearly the leaf-wise approach is more prone to over-fitting, because it easily grows to a large depth, so the maximum depth must be limited to avoid over-fitting.
  2. LightGBM uses a histogram-based decision tree algorithm. This differs from the exact greedy algorithm in XGBoost; the histogram algorithm has significant advantages in both memory and computational cost.
  (1) Memory advantage: the memory cost of the histogram algorithm is roughly (#data * #features * 1 byte), because after bucketing only the discretized value of each feature needs to be stored, while the memory cost of XGBoost's exact algorithm is roughly (2 * #data * #features * 4 bytes), because XGBoost stores both the original feature values and the sorted indices of those values, each as 32-bit numbers.
  (2) Computational advantage: when selecting split features, the pre-sorting algorithm has to traverse the feature values of all samples to compute the split gain, which takes O(#data), while the histogram algorithm only needs to traverse the buckets, which takes O(#bin).
  3. Histogram subtraction speed-up. The histogram of a child node can be obtained by subtracting the histogram of its sibling from the histogram of the parent node, which speeds up the computation.
  4. LightGBM supports direct input of categorical features. When splitting on a discrete feature, each value is treated as a bucket, and the split gain is the gain of "whether it belongs to a certain category", similar to one-hot encoding.
  5. Multi-threaded optimization. (A parameter sketch of these LightGBM settings follows this list.)
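  Here is a hedged sketch of the LightGBM settings discussed above, using the lightgbm Python package; the hypothetical "color" categorical column and all parameter values are illustrative assumptions, not from the original post.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
rng = np.random.RandomState(0)
# Hypothetical categorical column to demonstrate direct categorical input (point 4).
df["color"] = pd.Categorical(rng.choice(["red", "green", "blue"], size=len(df)))

params = {
    "objective": "binary",
    "num_leaves": 31,        # leaf-wise growth is capped by the leaf count...
    "max_depth": 6,          # ...and by max_depth, to limit over-fitting (point 1)
    "max_bin": 255,          # number of histogram buckets, i.e. #bin (point 2)
    "learning_rate": 0.1,
    "num_threads": 4,        # multi-threading (point 5)
}

train_set = lgb.Dataset(df, label=y, categorical_feature=["color"])
model = lgb.train(params, train_set, num_boost_round=200)
pred = model.predict(df)
```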

