Implementing an XGBoost Model in Python

Copyright notice: if you have questions about this article, contact the author on WeChat (kxymxzs). Original post: https://blog.csdn.net/MG_ApinG/article/details/87934052

1. Download the xgboost wheel file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost

2. Install the xgboost package: open a command prompt and run pip install followed by the path of the downloaded file, e.g. pip install G:\GoogleDownload\xgboost
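A quick way to confirm the installation (a minimal check, assuming the wheel was installed into the active Python environment):

import xgboost as xgb     # should import without errors if the wheel installed correctly
print(xgb.__version__)    # prints the installed XGBoost version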

3. XGBoost advantages:

(1) Regularization: XGBoost adds a regularization term to the cost function to control model complexity. The term includes the number of leaf nodes in the tree and the L2 norm of the score output at each leaf node. From the bias-variance tradeoff point of view, regularization reduces the variance of the model, yielding a simpler model and preventing overfitting; this is one way XGBoost improves on traditional GBDT.

(2) Parallel processing: the XGBoost tool supports parallelism. Isn't boosting a sequential procedure, so how can it be parallel? Note that XGBoost is not parallel at the level of trees: an iteration can only start after the previous one finishes (the cost function of the t-th iteration contains the predictions of the preceding t-1 iterations). Instead, XGBoost parallelizes over features. The most time-consuming step of decision tree learning is sorting the feature values (to determine the best split points). Before training, XGBoost pre-sorts the data and stores it in a block structure that is reused in later iterations, greatly reducing the amount of computation. The block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the largest gain is chosen for the split, so the gain computations for the different features can be spread over multiple threads.

(3) Flexibility: XGBoost supports user-defined objective and evaluation functions; the only requirement is that the objective function be twice differentiable (a minimal sketch follows after this list).

(4) Missing values: for samples with missing feature values, XGBoost automatically learns the direction in which to send them at each split.

(5) Pruning: XGBoost first grows the tree to the maximum depth, building all possible subtrees, and then prunes from the bottom up. Compared with GBM, this makes it less likely to get stuck in a local optimum.

(6) Built-in cross-validation: XGBoost allows cross-validation to be run at every boosting iteration, so the optimal number of boosting iterations is easy to obtain. GBM has to use grid search, which can only test a limited number of values.
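As mentioned in advantage (3), XGBoost accepts a user-defined objective and evaluation function. The following is a minimal sketch (not from the original article) of a squared-error objective that returns the gradient and Hessian, together with a custom MAE metric, passed to xgb.train through the obj and feval arguments; the toy data exists only to make the sketch runnable.

import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    # custom objective: gradient and Hessian of 0.5 * (pred - label)^2
    labels = dtrain.get_label()
    grad = preds - labels         # first-order derivative
    hess = np.ones_like(preds)    # second-order derivative
    return grad, hess

def mae_eval(preds, dtrain):
    # custom evaluation metric, returned as a (name, value) pair
    labels = dtrain.get_label()
    return 'mae', float(np.mean(np.abs(preds - labels)))

# toy regression data, only to make the sketch runnable
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X.sum(axis=1)
dtrain = xgb.DMatrix(X, label=y)

model = xgb.train({'eta': 0.1, 'max_depth': 3, 'silent': 1}, dtrain,
                  num_boost_round=20,
                  obj=squared_error_obj, feval=mae_eval,
                  evals=[(dtrain, 'train')])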

4. XGBoost parameters in detail

Before running XGBoost, three types of parameters must be set: general parameters, booster parameters, and task parameters:

4.1 General parameters: these control which booster is used during boosting; the available boosters are the tree model (gbtree) and the linear model (gblinear).

booster [default = gbtree]: two models can be selected, gbtree and gblinear. gbtree uses tree-based boosting, gblinear uses linear-model boosting. The default is gbtree.

silent [default = 0]: set to 0 to print run-time information, or to 1 to run in silent mode without printing it. The default is 0.

nthread: the number of threads XGBoost uses at run time. The default is the maximum number of threads available on the current system.

num_pbuffer: size of the prediction buffer, usually set to the number of training instances. The buffer stores the predictions of the last boosting step and does not need to be set manually.

num_feature: the feature dimension used during boosting, set to the number of features. XGBoost sets it automatically; it does not need to be set manually.

4.2 Booster parameters: these depend on which booster is used.

(1) Tree booster parameters (booster=gbtree)

eta [default = 0.3]: the shrinkage step used in each update to prevent overfitting. After each boosting step the algorithm obtains the weights of the new features, and eta shrinks these weights to make the boosting process more conservative. The default is 0.3, range: [0, 1]. Typical values are 0.01 to 0.2.

gamma [default = 0]: a node is split only if the split reduces the loss function; gamma specifies the minimum loss reduction required to make a split. The larger the value, the more conservative the algorithm. Its effect is closely tied to the loss function, so it needs to be tuned. Range: [0, ∞].

max_depth [default = 6]: maximum depth of a tree. The default is 6. Range: [1, ∞]. Should be tuned with the cv function. Typical values: 3-10.

min_child_weight [default = 1]: minimum sum of instance weights in a child node. If the sum of instance weights in a leaf node falls below min_child_weight, the splitting process stops. In a regression model this corresponds to the minimum number of samples required in each node. The parameter is used to avoid overfitting: larger values keep the model from learning overly local patterns, but a value that is too high leads to underfitting. It should be tuned with cv. Range: [0, ∞].

max_delta_step [default = 0]: the maximum step allowed for each tree's weight estimates. A value of 0 means no constraint; a positive value makes the update step more conservative. This parameter is usually unnecessary, but it can help with logistic regression when the classes are extremely unbalanced; values between 1 and 10 may help control the update. Range: [0, ∞].

subsample [default = 1]: the fraction of the training set sampled to build each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the training samples to grow each tree, which helps prevent overfitting. Range: (0, 1].

colsample_bytree [default = 1]: the fraction of features sampled when building each tree. The default is 1. Range: (0, 1]. (An illustrative starting dictionary for these tree-booster parameters follows below.)
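To tie the tree-booster parameters together, here is an illustrative starting dictionary using the typical ranges given above (the values are examples, not recommendations):

# illustrative starting values within the typical ranges described above
tree_booster_params = {
    'eta': 0.1,               # shrinkage step, typically 0.01-0.2
    'gamma': 0,               # minimum loss reduction required to split
    'max_depth': 6,           # typically 3-10
    'min_child_weight': 1,    # minimum sum of instance weights in a child
    'max_delta_step': 0,      # 0 means no constraint
    'subsample': 0.8,         # fraction of rows sampled per tree
    'colsample_bytree': 0.8,  # fraction of columns sampled per tree
}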

(2) Linear Booster parameters (booster = gblinear)

lambda [default = 0]: L2 regularization penalty coefficient.

alpha [default = 0]: L1 regularization penalty coefficient.

lambda_bias: L2 regularization on the bias term. The default is 0 (there is no L1 regularization on the bias, because the bias is not important under L1). An illustrative gblinear parameter dictionary follows below.
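For completeness, an illustrative parameter dictionary for the linear booster (the values are examples only):

# illustrative parameters for the linear booster
linear_booster_params = {
    'booster': 'gblinear',
    'lambda': 0,        # L2 regularization coefficient
    'alpha': 0,         # L1 regularization coefficient
    'lambda_bias': 0,   # L2 regularization on the bias term
}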

4.3 Task parameters: these control the learning scenario; for example, regression and ranking problems use different parameters.

objective [default = reg:linear]: defines the learning task and the corresponding learning objective. Options include: reg:linear (linear regression); reg:logistic (logistic regression); binary:logistic (binary logistic regression, outputs probabilities); binary:logitraw (binary logistic regression, outputs the raw score wTx); count:poisson (Poisson regression for count data, outputs the Poisson mean; in this case the default value of max_delta_step is 0.7, to safeguard optimization); multi:softmax (uses the softmax objective for multi-class classification and requires the parameter num_class, the number of classes); multi:softprob (same as softmax, but outputs a vector of ndata * nclass values that can be reshaped into an ndata-by-nclass matrix whose rows give the probability of each class for each sample); rank:pairwise (sets XGBoost to do ranking by minimizing the pairwise loss).

base_score [default = 0.5]: the initial prediction score for all instances (the global bias); with enough boosting iterations, changing this value has little effect.

eval_metric [default depends on objective]: the evaluation metric for the validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking). Users can add multiple evaluation metrics; in Python, pass the parameters as a list of pairs rather than a dict, so that additional 'eval_metric' entries are not overwritten (see the sketch below). Options: rmse (root mean square error); mae (mean absolute error); logloss (negative log-likelihood); auc (area under the curve); error (binary classification error rate, threshold 0.5); merror (multi-class error rate); mlogloss (multi-class logloss).

seed [default = 0]: random number seed. The default is 0.
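As noted under eval_metric, several metrics can be evaluated at once by passing the parameters to xgb.train as a list of (key, value) pairs, so that repeated 'eval_metric' entries are not overwritten. A minimal sketch (the toy data and parameter values are illustrative, not from the original article):

import numpy as np
import xgboost as xgb

# toy binary-classification data, only to make the sketch runnable
rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# a list of pairs lets several 'eval_metric' entries coexist
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 4, 'silent': 1}
plst = list(params.items()) + [('eval_metric', 'logloss'), ('eval_metric', 'auc')]

bst = xgb.train(plst, dtrain, num_boost_round=20, evals=[(dtrain, 'train')])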

4.4 xgboost.train () function parameters

xgboost.train(params,
              dtrain,
              num_boost_round=10,
              evals=(),
              obj=None,
              feval=None,
              maximize=False,
              early_stopping_rounds=None,
              evals_result=None,
              verbose_eval=True,
              learning_rates=None,
              xgb_model=None) 

params: a dictionary containing the training parameters and their values, in the form params = {'booster': 'gbtree', 'eta': 0.1}.

dtrain: the training data (a DMatrix).

num_boost_round: the number of boosting iterations.

evals: a list of (DMatrix, name) pairs to evaluate during training, e.g. evals = [(dtrain, 'train'), (dval, 'val')] or evals = [(dtrain, 'train')]. The first form lets us watch the performance on the validation set as training proceeds.

obj: custom objective function.

feval: custom evaluation function.

maximize: whether the evaluation function should be maximized.

early_stopping_rounds: early stopping rounds. If set to 100, training stops when the validation error has not decreased for 100 consecutive iterations. This requires evals to contain at least one element; if there are several, the last one is used. The returned model is from the last iteration (not the best one). If early_stopping_rounds is set, the model gains three attributes: bst.best_score, bst.best_iteration, and bst.best_ntree_limit. A sketch combining several of these arguments follows after this parameter list.

evals_result: a dictionary that stores the evaluation results of the elements in evals.

verbose_eval (a number or a boolean): also requires evals to contain at least one element. If True, the evaluation results on evals are printed; if a number, say 5, they are printed every 5 iterations.

learning_rates: a list giving the learning rate for each boosting round.

xgb_model: a previously trained xgb model to load before training, allowing training to continue from it.
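A hedged sketch that combines several of the arguments above, with a validation set, early stopping, recorded evaluation results, and periodic printing (the toy data is illustrative, not from the original article):

import numpy as np
import xgboost as xgb

# toy train/validation split, only to make the sketch runnable
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X.sum(axis=1) > 2.5).astype(int)
dtrain = xgb.DMatrix(X[:200], label=y[:200])
dval = xgb.DMatrix(X[200:], label=y[200:])

params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 4, 'silent': 1}
evals_result = {}   # will hold the metric values for every entry in evals

bst = xgb.train(params,
                dtrain,
                num_boost_round=200,
                evals=[(dtrain, 'train'), (dval, 'val')],
                early_stopping_rounds=20,   # stop if 'val' shows no improvement for 20 rounds
                evals_result=evals_result,
                verbose_eval=10)            # print the evaluation every 10 rounds

print(bst.best_iteration, bst.best_score)   # available because early stopping was used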

5. XGBoost in practice:

XGBoost provides two interfaces, the native XGBoost interface and the scikit-learn interface, and supports both classification and regression tasks.



# Classification with the native XGBoost interface
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score   # accuracy


iris = load_iris()  # load the sample dataset
x_data, y_data = iris.data, iris.target
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=1234565)  # split the dataset
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}
dtrain = xgb.DMatrix(x_train, y_train)  # build the DMatrix
model = xgb.train(params,
                  dtrain,  # training data
                  num_boost_round=500  # number of boosting iterations
                  )  # train the xgboost model


# predict on the test set
dtest = xgb.DMatrix(x_test)
y_pred = model.predict(dtest)

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy*100.0))

# plot feature importance
plot_importance(model)
plt.show()


# ================ Regression with the native XGBoost interface =============

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

# load the dataset
boston = load_boston()
X,y = boston.data,boston.target

# XGBoost training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, y_train)
model = xgb.train(params, dtrain, num_boost_round=500)

# predict on the test set and report the mean squared error
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)
print("MSE: %.4f" % mean_squared_error(y_test, ans))

# plot feature importance
plot_importance(model)
plt.show()



# ============== Classification with the scikit-learn interface ================
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the sample dataset
iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)  # split the dataset

# train the model
model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='multi:softmax')
model.fit(X_train, y_train)

# predict on the test set
y_pred = model.predict(X_test)

# compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy*100.0))

# plot feature importance
plot_importance(model)
plt.show()


# ================ Regression with the scikit-learn interface ================
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

boston = load_boston()
X,y = boston.data,boston.target

# XGBoost training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='reg:gamma')
model.fit(X_train, y_train)

# predict on the test set
ans = model.predict(X_test)

# plot feature importance
plot_importance(model)
plt.show()



6. General parameter-tuning procedure

We will use an approach similar to the one used for GBM. The following steps are required:

1. Choose a relatively high learning rate. In general 0.1 works, although for some problems the ideal learning rate lies between 0.05 and 0.3. Then determine the optimal number of trees for this learning rate. XGBoost has a very useful function, cv, which runs cross-validation at every boosting iteration and returns the ideal number of trees.

2. For the chosen learning rate and number of trees, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). Different parameter combinations are tried while determining a tree, as illustrated later.

3. Tune the regularization parameters of xgboost (lambda, alpha). These parameters reduce model complexity and can thereby improve performance.

4. Lower the learning rate and determine the ideal parameters.

Let us go through these steps in detail.

Step 1: fix a learning rate and tune the number of tree-based estimators.

To determine the boosting parameters, we first need initial values for the other parameters. We take the following values:

1. max_depth = 5: this parameter should preferably be between 3 and 10. The starting value chosen here is 5, but other values work too; a starting value between 4 and 6 is a good choice.

2. min_child_weight = 1: a relatively small value is chosen here because the classification problem is highly unbalanced, so some leaf nodes will contain rather few samples.

3. gamma = 0: other small starting values, between 0.1 and 0.2, also work. This parameter will be tuned later.

4. subsample, colsample_bytree = 0.8: this is the most common initial value. Typical values range between 0.5 and 0.9.

5. scale_pos_weight = 1: this value is used because the classes are highly unbalanced.

Note that the values above are only initial estimates and will be tuned later. The learning rate is set to the default of 0.1, and xgboost's cv function is then used to determine the optimal number of trees (a sketch using xgb.cv follows below).
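A hedged sketch of this cv step with the initial values above (the toy data is illustrative; assuming pandas is installed, xgb.cv returns a DataFrame whose length is the number of rounds kept by early stopping):

import numpy as np
import xgboost as xgb

# toy binary-classification data, only to make the sketch runnable
rng = np.random.RandomState(0)
X = rng.rand(500, 8)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'eta': 0.1,               # default learning rate used at this stage
    'max_depth': 5,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'scale_pos_weight': 1,
    'silent': 1,
}

# cross-validation at each boosting round; early stopping picks the tree count
cv_results = xgb.cv(params, dtrain,
                    num_boost_round=1000,
                    nfold=5,
                    metrics='auc',
                    early_stopping_rounds=50,
                    seed=0)
print('ideal number of trees:', cv_results.shape[0])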

Step 2: tune max_depth and min_child_weight

These two parameters are tuned first because they have a large impact on the final result. We first search a wide range coarsely and then fine-tune over a smaller range.

Note: in this section I run a fairly heavy grid search, which takes about 15-30 minutes or even longer depending on the performance of your system. You can also choose different values depending on what your own system can handle.
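A hedged sketch of the coarse grid search over max_depth and min_child_weight, using scikit-learn's GridSearchCV with the XGBoost sklearn wrapper (the dataset and grid are illustrative, not the unbalanced problem discussed in the text):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

X, y = load_iris(return_X_y=True)

# coarse grid first; a finer grid around the best values can follow
param_grid = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 2)),
}

grid = GridSearchCV(
    estimator=xgb.XGBClassifier(learning_rate=0.1, n_estimators=140, gamma=0,
                                subsample=0.8, colsample_bytree=0.8,
                                objective='multi:softmax'),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)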

Step 3: tune gamma

Step 4: tune subsample and colsample_bytree

Step 5: tune the regularization parameters

Step 6: lower the learning rate

Finally, we use a lower learning rate with more trees. XGBoost's cv function can again be used for this step.
