XGBoost Learning (2): Installation and Introduction

XGBoost Learning (1): Principle
XGBoost Learning (2): Installation and Introduction
XGBoost Learning (3): Detailed Model
XGBoost Learning (4): Practical
XGBoost Learning (5): Parameter Tuning
XGBoost Learning (6): Outputting Feature Importance and Selecting Features (complete code and data)

Preface

1. Introduction to Xgboost

Xgboost is one of the Boosting algorithms. The idea of Boosting is to combine many weak classifiers into one strong classifier. Xgboost is a boosted tree model, so it combines many tree models to form a strong classifier; the tree model it uses is the CART regression tree.
  Xgboost improves on GBDT, making it more powerful and applicable to a wider range of problems.
  Xgboost is usually used together with sklearn, but because Xgboost is not included in sklearn, it needs to be downloaded and installed separately.

2. Advantages of Xgboost

The Xgboost algorithm can bring improvements in predictive models. When we look more closely at its performance, we find that it has the following advantages:

2.1 Regularization

In fact, Xgboost is known for its "regularized boosting" technique. Xgboost adds a regularization term to the cost function to control the complexity of the model. The regularization term contains the number of leaf nodes of the tree and the sum of the squared L2 norms of the scores output on each leaf node. From the perspective of the bias-variance tradeoff, the regularization term reduces the variance of the model, makes the learned model simpler, and helps prevent overfitting. This is also one way in which Xgboost is superior to traditional GBDT.
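
The following is a minimal sketch (not from the original post) showing how these regularization terms surface as hyperparameters in Xgboost's scikit-learn style interface; the values are illustrative only.

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    gamma=0.1,        # penalty per leaf: minimum loss reduction required to keep a split
    reg_lambda=1.0,   # L2 penalty on the leaf scores
    reg_alpha=0.0,    # optional L1 penalty on the leaf scores
)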

2.2 Parallel processing

The Xgboost tool supports parallelism. As we all know, Boosting is a sequential procedure: the cost function of the t-th iteration contains the predictions of the first t-1 trees, so one iteration can only start after the previous one has finished, and the construction of each tree depends on the previous tree. Isn't that a serial structure? How can it run in parallel? Note that Xgboost's parallelism is not at the granularity of trees; it is at the granularity of features.
  We know that one of the most time-consuming steps in learning a decision tree is sorting the feature values (in order to determine the best split point). Before training, Xgboost sorts the data in advance and saves it as a block structure. This structure is reused in later iterations, which greatly reduces the amount of computation, and it also makes parallelization possible: when splitting a node, the gain of each feature must be calculated and the feature with the largest gain is chosen for the split, and these per-feature gain calculations can be carried out in separate threads.
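
As a small illustration (assuming the scikit-learn style interface), the number of threads used for this per-feature gain computation can be controlled with the n_jobs parameter:

from xgboost import XGBClassifier

# split finding over the presorted feature blocks runs on 4 threads
model = XGBClassifier(n_estimators=100, n_jobs=4)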

2.3 Flexibility

Xgboost supports user-defined objective functions and evaluation functions, as long as the objective function is twice differentiable. This adds a whole new dimension to the model, so our processing is not restricted in any way.
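
A minimal sketch of a custom objective, assuming a binary classification task with raw margin predictions; Xgboost only requires the function to return the first derivative (gradient) and second derivative (hessian) of the loss:

import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    # gradient and hessian of the logistic loss with respect to the raw margin
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess

# booster = xgb.train(params, dtrain, num_boost_round=50, obj=logistic_obj)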

2.4 Missing value processing

For samples with missing feature values, Xgboost can automatically learn their split direction. Xgboost has built-in rules for handling missing values. The user can provide a value that is different from all other observations and pass it in as a parameter to mark missing entries. When Xgboost encounters a missing value at a node, it tries both split directions and learns which direction to take for missing values in the future.
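
A minimal sketch of how that parameter is typically passed (here np.nan is used as the missing marker, which is also the default):

import numpy as np
import xgboost as xgb

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])
y = np.array([0, 1, 1])

# entries equal to `missing` are treated as absent and get a learned default direction
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)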

2.5 Pruning

Xgboost first builds all the subtrees it can, from the top down to the maximum depth, and then prunes back from the bottom up, removing splits that do not bring a positive gain. Compared with GBM, which stops splitting greedily as soon as a split yields no improvement, this approach is less likely to get stuck in a local optimum.
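
The depth to which the tree is grown and the gain threshold used when pruning back are exposed as parameters; a small illustrative sketch:

from xgboost import XGBRegressor

# grow trees to depth 6, then splits whose gain falls below gamma are pruned away
model = XGBRegressor(max_depth=6, gamma=1.0)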

2.6 Built-in cross-validation

Xgboost allows cross-validation to be run at each Boosting iteration, so the optimal number of Boosting iterations can be obtained easily in a single run, whereas GBM has to use grid search and can only test a limited number of values.
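
A minimal runnable sketch of the built-in cross-validation, with toy data invented here purely to make the example self-contained:

import numpy as np
import xgboost as xgb

# toy data, only so the example runs end to end
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    early_stopping_rounds=10)
print(len(cv_results))  # number of boosting rounds kept after early stopping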

3. Installing Xgboost

1. Offline installation: download the .whl file corresponding to your Python version from
https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost
2. Install the downloaded wheel:

pip3 install xgboost-1.1.0-cp37-cp37m-win_amd64.whl

3. Online installation:

pip3 install xgboost
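
Either way, a quick check that the installation succeeded:

import xgboost
print(xgboost.__version__)   # e.g. 1.1.0 if the wheel above was used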


Origin blog.csdn.net/qq_30868737/article/details/108010523