XGBoost parameter explanation

Foreword

The XGBoost parameter descriptions in this article are a partial translation of the original documentation (see the link to the original text). Only some key parameters are translated here, and given my limited ability there are bound to be mistakes; corrections are welcome. The following is a rough translation.

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and learning task parameters.
• General parameters determine which booster is used for boosting, usually a tree model or a linear model; they control the model at a macro level.
• Booster parameters depend on the chosen booster type and control the booster at each step.
• Learning task parameters determine the learning objective. For example, regression tasks and ranking tasks can use different parameters.
In addition, command line parameters apply only to the CLI version of XGBoost.

General Parameters

  1. booster [default=gbtree]
    Determines which booster to use: gbtree, gblinear or dart. gbtree and dart use tree-based models, while gblinear uses a linear model.
  2. silent [default=0]
    Set to 0 to print running messages; set to 1 for silent mode (nothing is printed).
  3. nthread [default = maximum number of threads available]
    The number of threads used to run XGBoost in parallel. The value should be <= the number of CPU cores of the system; if not set, the algorithm detects the number of cores and uses all of them.
    The following two parameters do not need to be set; the defaults are fine.

  4. num_pbuffer [set automatically by XGBoost, no user setting required]
    The size of the prediction buffer, usually set to the number of training instances. This buffer is used to store the prediction results of the last boosting step.

  5. num_feature [set automatically by XGBoost, no user setting required]
    The feature dimension used in boosting, set to the maximum dimension of the features.
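
To make this concrete, here is a minimal sketch of setting the general parameters through the Python API (my own illustration, not from the original text; the toy data is made up, and recent releases replace silent with verbosity, so they may warn about it):

```python
import numpy as np
import xgboost as xgb

# toy data, just to make the sketch runnable
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gbtree",   # gbtree, gblinear or dart
    "silent": 0,           # 0 = print running messages, 1 = silent mode
    "nthread": 4,          # should be <= number of CPU cores; omit to use all cores
}
# num_pbuffer and num_feature are set automatically by xgboost
bst = xgb.train(params, dtrain, num_boost_round=10)
```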

Parameters for Tree Booster

  1. eta [default=0.3, alias: learning_rate]
    Step size shrinkage used in updates to prevent overfitting. After each boosting step we can directly get the weights of the new features, and eta shrinks these weights to make the boosting process more conservative (robust).
    Range: [0,1]
  2. gamma [default=0, alias: min_split_loss] (minimum split loss)
    A node is split only if the split yields a decrease in the loss function; gamma specifies the minimum loss reduction required to make the split. The larger this value, the more conservative the algorithm. Its effect depends strongly on the loss function, so it needs to be tuned.
    Range: [0,∞]
  3. max_depth [default=6]
    The maximum depth of a tree. This value is also used to avoid overfitting: the larger max_depth is, the more specific and local the patterns the model learns. Set to 0 for no limit.
    Range : [0,∞]
  4. min_child_weight [default=1]
    The minimum sum of instance weights required in a leaf node. In XGBoost this is a minimum sum of sample weights, whereas the corresponding GBM parameter is a minimum number of samples. This parameter is used to avoid overfitting: a larger value prevents the model from learning overly local, special samples, but too large a value leads to underfitting. It should be tuned with CV.
    Range: [0,∞]
  5. subsample [default=1]
    Controls the fraction of training instances randomly sampled for each tree. Decreasing this value makes the algorithm more conservative and avoids overfitting, but setting it too small leads to underfitting. Typical values: 0.5-1; 0.5 means half of the training data is sampled for each tree, which helps prevent overfitting.
    Range: (0,1]
  6. colsample_bytree [default=1]
    Controls the fraction of columns (features) randomly sampled for each tree. Typical values: 0.5-1
    Range: (0,1]
  7. colsample_bylevel [default=1]
    Controls the fraction of columns sampled for each split at each level of the tree. I personally rarely use this parameter, since subsample and colsample_bytree can play a similar role, but if you are interested you can explore it further.
    Range: (0,1]
  8. lambda [default=1, alias: reg_lambda]
    L2 regularization term on the weights (analogous to Ridge regression). It controls the regularization part of XGBoost. Although most data scientists rarely use this parameter, it can still be used to reduce overfitting.
  9. alpha [default=0, alias: reg_alpha]
    L1 regularization term on the weights (analogous to Lasso regression). It can be useful in very high-dimensional settings and makes the algorithm faster.
  10. scale_pos_weight [default=1]
    When the samples of each category are very unbalanced, setting this parameter to a positive value can make the algorithm converge faster. It can usually be set as the ratio of the number of negative samples to the number of positive samples.
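
Putting the tree-booster parameters above together, here is a rough sketch (my own illustration; the toy data and parameter values are arbitrary examples, not recommendations):

```python
import numpy as np
import xgboost as xgb

# toy binary classification data
X = np.random.rand(200, 10)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "eta": 0.1,               # smaller step size, more conservative boosting
    "gamma": 0,               # minimum loss reduction required to make a split
    "max_depth": 6,           # 0 would mean no depth limit
    "min_child_weight": 1,    # minimum sum of instance weights in a child
    "subsample": 0.8,         # fraction of rows sampled per tree
    "colsample_bytree": 0.8,  # fraction of columns sampled per tree
    "lambda": 1,              # L2 regularization on weights
    "alpha": 0,               # L1 regularization on weights
    "scale_pos_weight": 1,    # e.g. (#negative samples) / (#positive samples)
    "objective": "binary:logistic",
}
bst = xgb.train(params, dtrain, num_boost_round=50)
```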

Parameters for Linear Booster

  1. lambda [default=0, alias: reg_lambda]
    L2 regularization penalty coefficient. Increasing this value will make the model more conservative.
  2. alpha [default=0, alias: reg_alpha]
    L1 regularization penalty coefficient. Increasing this value makes the model more conservative.
  3. lambda_bias [default=0, alias: reg_lambda_bias]
    L2 regularization on the bias term (there is no L1 regularization on the bias because it is not important).
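
A small sketch of the linear booster (again my own illustration; lambda_bias comes from the older parameter set documented here, and newer releases may warn that it is unused):

```python
import numpy as np
import xgboost as xgb

# toy regression data
X = np.random.rand(100, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gblinear",
    "lambda": 0.1,        # L2 penalty on the weights
    "alpha": 0.1,         # L1 penalty on the weights
    "lambda_bias": 0,     # L2 penalty on the bias term (older parameter name)
    "objective": "reg:squarederror",  # called "reg:linear" in the older docs translated here
}
bst = xgb.train(params, dtrain, num_boost_round=20)
```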

Learning Task Parameters

  1. objective [default=reg:linear]
    "reg:linear" – linear regression
    "reg:logistic" – logistic regression
    "binary:logistic" – binary logistic regression, output is a probability
    "binary:logitraw" – binary logistic regression, output is the score before the logistic transformation (wTx)
    "count:poisson" – Poisson regression for count data, output is the mean of the Poisson distribution. With this objective the default value of max_delta_step is 0.7 (used to safeguard optimization)
    "multi:softmax" – set XGBoost to do multi-class classification with the softmax objective; the parameter num_class (number of classes) must also be set
    "multi:softprob" – same as softmax, but outputs a vector of ndata*nclass values giving the probability of each data point belonging to each class
  2. eval_metric [default: chosen according to the objective]
    The available options are as follows:
    "rmse": root mean square error
    "mae": mean absolute error
    "logloss": negative log-likelihood
    "error": binary classification error rate, computed as the number of misclassified cases divided by the total number of cases. For the predictions, values greater than 0.5 are treated as positive and the rest as negative.
    "error@t": a different classification threshold can be set through 't'
    "merror": multi-class classification error rate, computed as (wrong cases)/(all cases)
    "mlogloss": multi-class log loss
    "auc": area under the curve
    "ndcg": normalized discounted cumulative gain
    "map": mean average precision
  3. seed [default=0]
    Random number seed. Setting it makes results on random data reproducible, and it is also useful when tuning parameters.
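
As a concrete example of the learning task parameters, this hypothetical sketch uses a multi-class objective together with its required num_class, an eval_metric and a seed (toy data, my own illustration):

```python
import numpy as np
import xgboost as xgb

# toy 3-class data
X = np.random.rand(300, 8)
y = np.random.randint(0, 3, size=300)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",  # per-class probabilities, ndata * nclass values
    "num_class": 3,                 # required by the multi:* objectives
    "eval_metric": "mlogloss",      # multi-class log loss
    "seed": 0,                      # fix the seed to make results reproducible
    "max_depth": 4,
    "eta": 0.3,
}
bst = xgb.train(params, dtrain, num_boost_round=30, evals=[(dtrain, "train")])
preds = bst.predict(dtrain)        # per-row class probabilities
```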

Command Line Parameters

The following parameters are only used in the command line version of XGBoost.
1. use_buffer [default=1]
Whether to create a binary buffer for text input. Doing so will speed up loading.
2. num_round
The number of boosting iterations.
3. data
The path of the training data.
4. test:data
The path of the test data used for prediction.
5. save_period [default=0]
The period at which to save the model.
6. task [default=train]
Options: train, pred, eval, dump
train: train using the data
pred: make predictions on test:data
eval: perform evaluation statistics, specified by eval[name]=filename
dump: export the learned model into text format
7. model_in [default=NULL]
The path of the input model, used by the test, eval and dump tasks. During training, XGBoost will continue training from this input model.
8. model_out [default=NULL]
The path to save the model after training. If not specified, the output will be named like 0003.model, where 0003 is the number of boosting iterations.
11. name_dump [default=dump.txt]
The name of the model dump file.
12. name_pred [default=pred.txt]
The name of the prediction file, used in pred mode.
13. pred_margin [default=0]
Output the margin instead of the transformed probability.
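
As a hedged sketch of driving the CLI version: the snippet below writes a key = value configuration file in the style used by the XGBoost demos and then calls an xgboost binary assumed to be on the PATH; all file names are placeholders.

```python
import subprocess

# a key = value configuration file in the style of the xgboost CLI demos
config = """
task = train
booster = gbtree
objective = binary:logistic
num_round = 10
save_period = 0
data = "train.libsvm"
eval[test] = "test.libsvm"
model_out = "final.model"
"""

with open("train.conf", "w") as f:
    f.write(config)

# run the CLI binary with the config file (assumes "xgboost" is on the PATH)
subprocess.run(["xgboost", "train.conf"], check=True)
```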







