Tuning guide for the 10 most commonly used hyperparameters in XGBoost


This article explains what the ten most commonly used XGBoost hyperparameters do, what values they typically take, and how to tune them with Optuna.

XGBoost's default hyperparameters work reasonably well out of the box, but to get the best results you need to adjust some of them to match your data. The following parameters matter most:

  • objective

  • eta

  • num_boost_round

  • max_depth

  • subsample

  • colsample_bytree

  • gamma

  • min_child_weight

  • lambda

  • alpha

XGBoost can be called through two APIs: the native API that we use most often, and a Scikit-learn-compatible API that integrates seamlessly with the Sklearn ecosystem. We focus on the native API here, but the table below maps the parameter names between the two in case you need the other one later (a short code sketch comparing them follows the diagram below):

[Image: table mapping native API parameter names to their Scikit-learn API equivalents]

You can also refer to this table if you use hyperparameter tuning tools other than Optuna. The following diagram shows how these parameters interact with one another:

[Image: diagram of the interactions between these hyperparameters]

These relationships are not fixed, but the picture above gives a rough idea, since parameters not covered here can also affect these ten.
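To make the naming difference concrete, here is a minimal sketch of the same regression model written with both APIs (X_train and y_train are assumed to already exist, and the parameter values are illustrative):

import xgboost as xgb

# Native API: eta / num_boost_round / lambda / alpha
dtrain = xgb.DMatrix(X_train, label=y_train)
bst = xgb.train(
    {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 6, "lambda": 1.0, "alpha": 0.0},
    dtrain,
    num_boost_round=100,
)

# Scikit-learn API: learning_rate / n_estimators / reg_lambda / reg_alpha
model = xgb.XGBRegressor(
    objective="reg:squarederror", learning_rate=0.1, n_estimators=100,
    max_depth=6, reg_lambda=1.0, reg_alpha=0.0,
)
model.fit(X_train, y_train)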

1. objective

This is the training objective of our model.

[Image: table of common values for the objective parameter]

The simplest explanation is that this parameter tells XGBoost what kind of task the model is solving, which in turn determines the type of decision tree that is built and the loss function that is minimized.
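For illustration, here is a minimal sketch of common objective values in a native-API params dictionary (the objective names come from the XGBoost documentation; num_class=3 is an assumed example):

params_reg = {"objective": "reg:squarederror"}   # regression with squared loss (the default)
params_bin = {"objective": "binary:logistic"}    # binary classification, predicts probabilities
params_multi = {"objective": "multi:softmax",    # multi-class classification
                "num_class": 3}                  # the number of classes must be supplied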

2. num_boost_round - n_estimators

num_boost_round specifies the number of decision trees (often called base learners) that XGBoost generates during training. The default is 100, but that is far from enough for today's large datasets.

Increasing this parameter generates more trees, but as the model becomes more complex, the chance of overfitting increases significantly.

A trick learned from Kaggle is to set a high value for num_boost_round, say 100,000, and let early stopping pick the best version.

In each boosting round, XGBoost adds a new decision tree that improves on the overall score of the trees built so far; that is why it is called boosting. This process continues for num_boost_round rounds, whether or not each round improves on the previous one.

But with early stopping, we can stop training when the validation metric has stopped improving, which not only saves time but also prevents overfitting.

With this trick, we don't even need to tune num_boost_round. Here's what it looks like in code:

# Define the rest of the params
params = {...}

# Build the train/validation sets
dtrain_final = xgb.DMatrix(X_train, label=y_train)
dvalid_final = xgb.DMatrix(X_valid, label=y_valid)

bst_final = xgb.train(
    params,
    dtrain_final,
    num_boost_round=100000,  # Set a high number
    evals=[(dvalid_final, "validation")],
    early_stopping_rounds=50,  # Enable early stopping
    verbose_eval=False,
)

The code above tells XGBoost to generate up to 100k decision trees, but thanks to early stopping it halts once the validation score has not improved for 50 consecutive rounds. In general, the number of trees ends up somewhere between 5,000 and 10,000. num_boost_round is also one of the biggest factors affecting the runtime of training, since more trees require more resources.
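As a follow-up sketch (assuming a reasonably recent XGBoost version), the booster trained above remembers its best round, which can then be used at prediction time:

# best_iteration is set when early stopping is used
best_round = bst_final.best_iteration
print(f"Best round: {best_round}")

# Predict using only the trees up to and including the best round
preds = bst_final.predict(dvalid_final, iteration_range=(0, best_round + 1))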

3. eta - learning_rate

In each round, all existing trees return a prediction for a given input. For example, five trees might return the following predictions for sample N:

Tree 1: 0.57   Tree 2: 0.9   Tree 3: 4.25   Tree 4: 6.4   Tree 5: 2.1

These outputs need to be aggregated in order to return the final predictions, but not before XGBoost shrinks or scales them using a parameter called eta or learning rate. The final output after scaling is:

output = eta * (0.57 + 0.9 + 4.25 + 6.4 + 2.1)

A large learning rate gives more weight to each tree's contribution to the ensemble, which speeds up training but can lead to overfitting and instability. A lower learning rate suppresses the contribution of each tree, making the learning process slower but more robust. This regularizing effect of the learning rate is especially useful for complex and noisy datasets.

The learning rate is inversely related to other parameters such as num_boost_round, max_depth, subsample and colsample_bytree: lower learning rates call for higher values of these parameters, and vice versa. In general you do not need to worry about the interaction between them, because we will use automatic tuning to find the best combination.
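As a hedged sketch of that interaction (the values are illustrative assumptions, and the DMatrix objects come from the earlier snippet), a small eta is usually paired with a large num_boost_round plus early stopping:

params = {"objective": "reg:squarederror", "eta": 0.05}  # smaller eta: slower but more robust learning

bst = xgb.train(
    params,
    dtrain_final,
    num_boost_round=100000,  # compensate for the small eta
    evals=[(dvalid_final, "validation")],
    early_stopping_rounds=50,  # let early stopping choose the actual round count
    verbose_eval=False,
)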

4. subsample and colsample_bytree

Subsampling introduces more randomness into training, which helps combat overfitting.

subsample=0.7 means that each decision tree in the ensemble is trained on a random selection of 70% of the available rows. A value of 1.0 means all rows are used (no subsampling).

Similar to subsample, there is also colsample_bytree. As the name suggests, it controls the proportion of features each decision tree uses: colsample_bytree=0.8 gives each tree a random 80% of the available features (columns).

Adjusting these two parameters controls the trade-off between bias and variance. Smaller values reduce the correlation between trees and increase the diversity of the ensemble, which helps generalization and reduces overfitting.

However, smaller values can also introduce more noise and increase the model's bias, while larger values increase the correlation between trees, reduce diversity and may lead to overfitting.
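A minimal sketch of how these appear in a native-API params dictionary (the values 0.7 and 0.8 are the illustrative ones used above):

params = {
    "objective": "reg:squarederror",
    "subsample": 0.7,         # each tree is trained on a random 70% of the rows
    "colsample_bytree": 0.8,  # each tree uses a random 80% of the columns
}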

5. max_depth

The maximum depth max_depth controls the maximum number of layers a decision tree may reach during training.

[Image: decision trees of increasing depth]

Deeper trees can capture more complex interactions between features, but they also have a higher risk of overfitting because they can memorize noisy or irrelevant patterns in the training data. Limiting max_depth keeps the trees shallower and simpler, so they capture more general patterns.

A well-chosen max_depth strikes a good balance between complexity and generalization.
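One way to feel out that balance is to cross-validate a few candidate depths. The sketch below uses xgb.cv and reuses dtrain_final from the earlier snippet; the depths and round counts are illustrative assumptions:

for depth in (3, 6, 10):
    cv_results = xgb.cv(
        {"objective": "reg:squarederror", "max_depth": depth},
        dtrain_final,
        num_boost_round=500,
        nfold=5,
        early_stopping_rounds=20,
        verbose_eval=False,
    )
    # reg:squarederror reports RMSE by default
    print(depth, cv_results["test-rmse-mean"].min())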

6, 7. alpha and lambda

These two parameters are discussed together because alpha (L1) and lambda (L2) are regularization parameters that help prevent overfitting.

What sets them apart from other regularization parameters is that they can shrink the weights of unimportant or irrelevant features all the way to 0 (especially alpha), resulting in a model with fewer features and thus lower complexity.

The effect of alpha and lambda may be influenced by other parameters such as max_depth, subsample and colsample_bytree. Higher alpha or lambda values may require tuning other parameters to compensate for the increased regularization; for example, a higher alpha may benefit from a larger subsample, as this preserves model diversity and prevents underfitting.
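In the native API the keys are simply "alpha" and "lambda" (reg_alpha and reg_lambda in the Scikit-learn API). The values in this sketch are illustrative assumptions, not recommendations:

params = {
    "objective": "reg:squarederror",
    "alpha": 0.1,   # L1 regularization; can shrink weights of weak features to 0 (default 0)
    "lambda": 1.0,  # L2 regularization (default 1)
}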

8. gamma

If you read the XGBoost documentation, it says that gamma is:

Minimum loss reduction required to make a further partition on a leaf node of the tree.

I doubt anyone but the person who wrote that sentence can understand it on first reading. Let's see what it actually means; here is a two-level decision tree:

[Image: a two-level decision tree]

To justify adding more layers to the tree by splitting a leaf node, XGBoost must determine that the split reduces the loss function by a meaningful amount.

But "significantly how much?" This is gamma - it acts as a threshold to decide whether a leaf node should be further split.

If the reduction in the loss function (often called the gain) after the potential split is less than the chosen gamma, no split is performed. This means that the leaf nodes will remain unchanged and the tree will not grow from that point on.

The goal of tuning gamma is therefore to find the threshold that keeps only splits whose loss reduction is large enough to genuinely improve model performance.
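Conceptually, the rule looks like the comment in this minimal sketch (the value 1.0 is an illustrative assumption):

# A candidate split is kept only if the gain is large enough:
#   gain = loss(parent) - [loss(left child) + loss(right child)] >= gamma
params = {
    "objective": "reg:squarederror",
    "gamma": 1.0,  # minimum loss reduction (gain) required to split a leaf
}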

9. min_child_weight

XGBoost starts the training process with a single decision tree that has a single root node containing all training instances (rows). As XGBoost picks features and split criteria that minimize the loss, deeper nodes contain fewer and fewer instances.

[Image: a tree whose nodes contain fewer and fewer instances as it grows deeper]

If XGBoost is left to run unchecked, the tree may grow until the final nodes contain only a handful of insignificant instances. This situation is highly undesirable, as it is practically the definition of overfitting.

So XGBoost sets a threshold on the minimum amount of evidence a node must contain before it can be split further: it sums the weights of all instances in the node, and if this sum is less than min_child_weight, the split is rejected and the node becomes a leaf.

The above is a very simplified version of the whole process, since the goal here is mainly to introduce the concept.
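Putting the pieces together, a full native-API parameter dictionary might look like the sketch below. Every value is an illustrative assumption rather than a recommendation, and dtrain_final / dvalid_final come from the earlier snippet:

params = {
    "objective": "reg:squarederror",
    "eta": 0.05,
    "max_depth": 6,
    "subsample": 0.7,
    "colsample_bytree": 0.8,
    "gamma": 1.0,
    "min_child_weight": 10,
    "alpha": 0.1,
    "lambda": 1.0,
}

bst = xgb.train(
    params,
    dtrain_final,
    num_boost_round=100000,
    evals=[(dvalid_final, "validation")],
    early_stopping_rounds=50,
    verbose_eval=False,
)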

Summary

That covers these 10 important hyperparameters; there is still a lot more to learn if you want to understand them deeply. A good next step is to give ChatGPT the following two prompts:

1) Explain the {parameter_name} XGBoost parameter in detail and how to choose values for it wisely.

2) Describe how {parameter_name} fits into the step-by-step tree-building process of XGBoost.

It can probably explain them more clearly than I can.

Finally, if you also use Optuna for tuning, you can refer to the following gist:

https://gist.github.com/BexTuychiev/823df08d2e3760538e9b931d38439a68
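That gist is the reference implementation; as a rough, hedged sketch of the general shape (the search ranges, trial count and the use of xgb.cv with dtrain_final are my own illustrative assumptions), an Optuna study over these parameters could look like this:

import optuna
import xgboost as xgb

def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "eta": trial.suggest_float("eta", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma": trial.suggest_float("gamma", 0.0, 5.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        "alpha": trial.suggest_float("alpha", 1e-3, 10.0, log=True),
        "lambda": trial.suggest_float("lambda", 1e-3, 10.0, log=True),
    }
    cv_results = xgb.cv(
        params,
        dtrain_final,  # DMatrix from the earlier snippet
        num_boost_round=10000,
        nfold=5,
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    return cv_results["test-rmse-mean"].min()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)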

By: Bex T.


Source: blog.csdn.net/qq_33431368/article/details/132200330