XGBoost Library Usage Summary

    In the XGBoost Algorithm Principle Summary, we discussed the principles of the XGBoost algorithm. In this post we discuss how to use XGBoost's Python library, as well as the meaning of some important parameters and ideas for tuning them.

    This post references the XGBoost Python API documentation and the XGBoost parameter documentation.

1. XGBoost Library Overview

    Besides Python, XGBoost also supports R, Java, and other languages. This article focuses on the XGBoost Python library; installing it with "pip install xgboost" currently gives version 0.90. In addition to decision trees as weak learners, the library also supports linear models (gblinear) and decision trees with dropout (DART), but under normal circumstances the default decision tree weak learner is what we use, so this article only discusses XGBoost with the default decision tree weak learner.

    The XGBoost Python library offers two interface styles. One is XGBoost's own native Python API; the other is an sklearn-style wrapper API. The two styles are essentially the same and differ only slightly in usage, mainly in parameter names and in how the dataset is initialized.

2. Basic Usage of the XGBoost Library

    The complete example code can be found on my GitHub.

2.1 Using the Native Python API

    The XGBoost library has two interface styles; let us first look at the native Python API.

    The native XGBoost API requires you to first separate the dataset into input features and output labels, and then wrap them in a DMatrix data structure. We do not need to care about the internal details of DMatrix; initializing it with our training set X and y is enough.

import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pylab as plt
%matplotlib inline

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# X is the sample feature matrix, y is the sample label vector: 10,000 samples,
# 20 features each, 2 output classes, no redundant features, one cluster per class
X, y = make_classification(n_samples=10000, n_features=20, n_redundant=0,
                           n_clusters_per_class=1, n_classes=2, flip_y=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
dtrain = xgb.DMatrix(X_train,y_train)
dtest = xgb.DMatrix(X_test,y_test)

    In the code above, we randomly generate a binary classification dataset and then split it into a training set and a validation set. We use the training and validation sets to initialize one DMatrix each; with the DMatrix objects we can train and predict. A simple example is as follows:

param = {'max_depth':5, 'eta':0.5, 'verbosity':1, 'objective':'binary:logistic'}
raw_model = xgb.train(param, dtrain, num_boost_round=20)
from sklearn.metrics import accuracy_score
pred_train_raw = raw_model.predict(dtrain)
# predict() returns positive-class probabilities; threshold at 0.5 to get labels
for i in range(len(pred_train_raw)):
    if pred_train_raw[i] > 0.5:
        pred_train_raw[i] = 1
    else:
        pred_train_raw[i] = 0
print(accuracy_score(dtrain.get_label(), pred_train_raw))

    The training-set accuracy I get here is 0.9664. Now let's look at the performance on the validation set:

pred_test_raw = raw_model.predict(dtest)
for i in range(len(pred_test_raw)):
    if pred_test_raw[i] > 0.5:
        pred_test_raw[i] = 1
    else:
        pred_test_raw[i] = 0
print(accuracy_score(dtest.get_label(), pred_test_raw))

    The validation-set accuracy I get here is 0.9408, which is already quite high.

    But for those of us used to the sklearn-style API, or who simply do not like the native Python API, there is an sklearn wrapper, so let's try the sklearn-style interface.

2.2 Using the sklearn-Style Interface with Native Parameters

    The sklearn-style interface provides two classes: XGBClassifier for classification and XGBRegressor for regression. When using these two classes, the algorithm parameters can be passed in two ways: the first keeps the same parameter names as the native API, and the other uses sklearn-style parameter names. Here we first look at how to use the same parameter names as the native API.

    In fact, you simply pass the native parameter dict param from above into XGBClassifier / XGBRegressor via the **kwargs argument, as follows:

sklearn_model_raw = xgb.XGBClassifier(**param)
sklearn_model_raw.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="error",
        eval_set=[(X_test, y_test)])

    Here param is exactly the dict defined in section 2.1:

param = {'max_depth':5, 'eta':0.5, 'verbosity':1, 'objective':'binary:logistic'}

    Using the sklearn-style interface but with the native parameter names still feels a little odd, so I generally prefer the other approach: the sklearn-style interface with sklearn-style parameter names.

2.3 Using the sklearn-Style Interface with sklearn-Style Parameters

    Using the sklearn-style interface together with sklearn-style parameters is the approach I recommend, mainly because it makes XGBoost no different to use than sklearn's own GBDT, and you can also use sklearn's grid search.

    With this approach, however, the parameter names and definitions differ a bit from sections 2.1 and 2.2. We will cover the meaning of the specific parameters later; for now, let's look at a simple classification example of initialization, training, and prediction:

sklearn_model_new = xgb.XGBClassifier(max_depth=5,learning_rate= 0.5, verbosity=1, objective='binary:logistic',random_state=1)

    As you can see, the parameters are passed directly to the XGBClassifier constructor, just as in sklearn. Note that the step size eta we defined in the previous two sections is called learning_rate here.

    After initialization, training and prediction work just as in section 2.2:

sklearn_model_new.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="error",
        eval_set=[(X_test, y_test)])
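    Prediction with the sklearn-style wrapper is also simpler than in section 2.1, because predict() returns class labels directly rather than probabilities. A minimal sketch, continuing the example above:

from sklearn.metrics import accuracy_score

# predict() on XGBClassifier returns class labels directly, so no manual
# 0.5 thresholding is needed as with the native API
pred_test_new = sklearn_model_new.predict(X_test)
print(accuracy_score(y_test, pred_test_new))

# if probabilities are needed, predict_proba() returns them per class
pred_test_proba = sklearn_model_new.predict_proba(X_test)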

3. XGBoost Library Parameters

    In the second section we tried out the XGBoost library but did not discuss its parameters in any depth. Here we discuss them in detail, focusing mainly on the sklearn-style parameters of section 2.3. These parameters correspond to the ones covered in my earlier post, scikit-learn Gradient Boosting Tree (GBDT) Parameter Tuning Summary, so if you are already very familiar with tuning GBDT, then you have mastered 90% of tuning XGBoost.

    The XGBoost library parameters fall into three groups: boosting framework parameters, weak learner parameters, and other parameters.

3.1 XGBoost Framework Parameters

    Among the XGBoost framework parameters, the three most important are booster, n_estimators, and objective.

    1) booster determines the type of weak learner XGBoost uses. The default is gbtree, i.e. a CART decision tree; the alternatives are the linear weak learner gblinear and DART. In general gbtree is fine, and this parameter does not need tuning.

    2) n_estimators is a very important tuning parameter; it determines the complexity of our XGBoost model, since it is the number of decision-tree weak learners. This parameter corresponds to n_estimators in sklearn's GBDT. If n_estimators is too small the model easily underfits; if it is too large the model easily overfits. In general a moderate value needs to be chosen by tuning.

    3) objective specifies whether the problem to be solved is classification, regression, or something else, together with the corresponding loss function. It can take many specific values; here we only care about the ones commonly used for classification and regression.

    For regression we generally use reg:squarederror, i.e. the MSE mean-squared-error loss. For binary classification we generally use binary:logistic, and for multi-class classification multi:softmax, as in the short sketch below.
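    As a minimal sketch with the sklearn-style wrapper (the n_estimators values here are only illustrative):

import xgboost as xgb

# regression with squared-error (MSE) loss
reg_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)

# binary classification: binary:logistic outputs positive-class probabilities
bin_model = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100)

# multi-class classification: multi:softmax outputs the predicted class directly;
# the sklearn wrapper infers the number of classes from y (the native API needs num_class)
multi_model = xgb.XGBClassifier(objective='multi:softmax', n_estimators=100)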

3.2 XGBoost Weak Learner Parameters

    Here we only discuss the parameters of the default gbtree weak learner. The main decision-tree-related parameters to tune are as follows:

    1) max_depth: controls the depth of the tree structure. When the data volume is small or there are few features, this value can be left alone. If the model has many samples and many features, the maximum depth needs to be limited; the specific value is generally found by grid search. This parameter corresponds to max_depth in sklearn's GBDT.

    2) min_child_weight: the minimum child-node weight threshold. If a tree node's weight is below this threshold, the node is not split further, i.e. it becomes a leaf node. The tree-node weight here is the sum of the second derivatives over all samples in that node, i.e. the $H_{tj}$ from the XGBoost principle article: $$H_{tj} = \sum\limits_{x_i \in R_{tj}} h_{ti}$$

    This value needs grid search to find its optimum. In sklearn's GBDT there is no exactly corresponding parameter, but min_samples_split imposes a similar threshold from another angle.

    3) gamma: the threshold on the loss reduction required for XGBoost to split a decision-tree node. That is, when we try to split a node, we maximize the following expression: $$\max\; \frac{1}{2}\frac{G_L^2}{H_L + \lambda} + \frac{1}{2}\frac{G_R^2}{H_R + \lambda} - \frac{1}{2}\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} - \gamma$$

    This maximized value must exceed our gamma for the node to actually be split. This value also needs grid search to find its optimum.

    4) subsample: the row-subsampling ratio; this is sampling without replacement, with the same role as subsample in sklearn's GBDT. A value below 1 reduces variance, i.e. helps prevent overfitting, but increases the bias of the fit, so it should not be too low. You can start at 1 and, if the model overfits, grid-search for a somewhat smaller value.

    5) colsample_bytree / colsample_bylevel / colsample_bynode: these three parameters subsample the features; the default is no subsampling, i.e. every decision tree uses all features. colsample_bytree controls the fraction of features sampled for a whole tree, colsample_bylevel the fraction sampled at each tree level, and colsample_bynode the fraction sampled at each tree node. For example, with 64 features in total, if colsample_bytree, colsample_bylevel, and colsample_bynode are all 0.5, then 64 × 0.5³ = 8 randomly sampled features are considered when trying to split a given tree node.

    6) reg_alpha / reg_lambda: these are XGBoost's two regularization parameters. reg_alpha is the L1 regularization coefficient and reg_lambda is the L2 regularization coefficient; in the principle article we discussed the regularization part of XGBoost's loss: $$\Omega(h_t) = \gamma J + \frac{\lambda}{2}\sum\limits_{j=1}^{J} w_{tj}^2$$

    All of the parameters above need tuning, but the emphasis is usually on max_depth, min_child_weight, and gamma. If overfitting is still found afterwards, then also try adjusting the later few parameters; a sketch of this first tuning stage is shown below.
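    A minimal sketch of this first tuning stage, reusing the dataset from section 2 (the candidate values and cv=3 here are only illustrative assumptions):

from sklearn.model_selection import GridSearchCV

# search the three emphasized tree parameters first; revisit subsample,
# colsample_* and reg_alpha / reg_lambda afterwards if overfitting remains
tree_search = GridSearchCV(
    xgb.XGBClassifier(learning_rate=0.5, n_estimators=20,
                      objective='binary:logistic', random_state=1),
    {'max_depth': [3, 5, 7],
     'min_child_weight': [1, 3, 5],
     'gamma': [0, 0.1, 0.2]},
    cv=3)
tree_search.fit(X_train, y_train)
print(tree_search.best_params_)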

3.3 Other XGBoost Parameters

    XGBoost has a few other parameters worth attention, mainly learning_rate.

    learning_rate controls the weight shrinkage applied to each weak learner, similar to learning_rate in sklearn's GBDT; a smaller learning_rate means we need more weak-learner iterations. We usually use the maximum number of iterations and the step size together to determine the fitting effect of the algorithm, so n_estimators and learning_rate must be tuned together to be effective. You can also fix learning_rate first, tune n_estimators, then tune all the other parameters, and finally revisit learning_rate and n_estimators, as sketched below.
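    A minimal sketch of the "fix learning_rate first" strategy, assuming the same dataset and the xgboost 0.90 fit API with early stopping used earlier (the specific values are only illustrative):

# fix learning_rate, give n_estimators a generous upper bound, and let early
# stopping on the validation set pick the effective number of trees
model = xgb.XGBClassifier(learning_rate=0.3, n_estimators=1000, max_depth=4,
                          objective='binary:logistic', random_state=1)
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="error",
          eval_set=[(X_test, y_test)])
# best_iteration is set after early stopping and tells us how many rounds were actually useful
print(model.best_iteration)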

    In addition, n_jobs controls the number of threads the algorithm uses concurrently, and scale_pos_weight sets the positive/negative class ratio when classes are imbalanced, similar to class_weight in sklearn. importance_type lets you inspect how important each feature is; it can be set to "gain", "weight", "cover", "total_gain", or "total_cover". The feature importance scores can then be obtained by calling the booster's get_score method. "weight" computes importance by counting how many times a feature is chosen for a split; "gain" and "total_gain" compute importance from the average gain and the total gain, respectively, of splits on that feature; "cover" and "total_cover" compute importance from the average and total sample coverage when the feature is used for a split.
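    As a minimal sketch, continuing with the model from section 2.3, the importance scores under the different definitions can be retrieved from the underlying booster via get_score, or from the wrapper's feature_importances_ attribute:

booster = sklearn_model_new.get_booster()
print(booster.get_score(importance_type='weight'))  # split counts per feature
print(booster.get_score(importance_type='gain'))    # average split gain per feature
print(booster.get_score(importance_type='cover'))   # average sample coverage per feature

# feature_importances_ on the sklearn wrapper follows the estimator's importance_type setting
print(sklearn_model_new.feature_importances_)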

4. XGBoost Grid Search Parameter Tuning

    XGBoost can be tuned with grid search by combining it with sklearn's GridSearchCV class, no differently from ordinary sklearn classification and regression algorithms. A concrete example of the process is as follows:

gsCv = GridSearchCV(sklearn_model_new,
                   {'max_depth': [4,5,6],
                    'n_estimators': [5,10,20]})
gsCv.fit(X_train,y_train)
print(gsCv.best_score_)
print(gsCv.best_params_)

    My output here is:

    0.9533333333333334

    {'max_depth': 4, 'n_estimators': 10}

    Then, based on the result above, try searching over learning_rate:

sklearn_model_new2 = xgb.XGBClassifier(max_depth=4,n_estimators=10,verbosity=1, objective='binary:logistic',random_state=1)
gsCv2 = GridSearchCV(sklearn_model_new2, 
                   {'learning_rate': [0.3,0.5,0.7]})
gsCv2.fit(X_train,y_train)
print(gsCv2.best_score_)
print(gsCv2.best_params_)

    My output here is:

    0.9516

    {'learning_rate': 0.3}

    Of course, in a real situation you would need to keep tuning; assuming we have finished tuning here, let's try the model on the validation set:

sklearn_model_new2 = xgb.XGBClassifier(max_depth=4,learning_rate= 0.3, verbosity=1, objective='binary:logistic',n_estimators=10)
sklearn_model_new2.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="error",
        eval_set=[(X_test, y_test)])

    The final output is:

    [9]	validation_0-error:0.0588

    That is, the accuracy on the validation set is 94.12%.

    We can use the validation-set accuracy to judge whether our earlier grid-search tuning actually helped. In practice, you need to iterate between searching parameters and validating, for example as below.
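    For example, a minimal check of the tuned model on the validation set, continuing the code above, which should come out close to the 94.12% reported earlier:

from sklearn.metrics import accuracy_score

# explicitly compute validation-set accuracy for the tuned model
pred_test_final = sklearn_model_new2.predict(X_test)
print(accuracy_score(y_test, pred_test_final))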

    The above is my summary of using the XGBoost library; I hope it helps readers who want to use XGBoost to solve practical problems.

 

(Reprints are welcome, but please indicate the source. Comments and discussion are welcome: [email protected])
