"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.
Ensemble learning: comparison and hyperparameter tuning
Although deep learning is popular now, Boosting algorithms represented by XGBoost, LightGBM, and CatBoost are still widely used. Setting aside unstructured data applications such as text, images, voice, and video, which suit deep learning, Boosting algorithms remain the first choice for structured data with relatively few training samples.
1 Comparison of three major Boosting algorithms
XGBoost, LightGBM, and CatBoost are the current classic SOTA (state of the art) Boosting algorithms. All three are ensemble learning frameworks built on decision trees. XGBoost is an improvement on the original GBDT algorithm, and LightGBM and CatBoost are further optimized on the basis of XGBoost. Each has its strengths in accuracy and speed.
There are two main differences among the three Boosting algorithms:
First, the trees are built differently. XGBoost uses a level-wise growth strategy, LightGBM uses a leaf-wise growth strategy, and CatBoost uses symmetric trees (oblivious trees), whose decision trees are all complete binary trees.
Second, categorical features are handled very differently. XGBoost does not process categorical features automatically; they must be converted to numeric values by hand before being fed to the model. LightGBM only needs the names of the categorical features to be specified, and the algorithm then handles them automatically. CatBoost is best known for its handling of categorical features and processes them efficiently with encoding methods such as target statistics.
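The target-statistics idea mentioned for CatBoost can be sketched in a few lines. This is a simplified illustration, not CatBoost's actual implementation: the function name `ordered_target_stats`, the smoothing weight `a`, and the toy airline data are all assumptions made for the example. Each row is encoded using only the labels of earlier rows with the same category, so a row's own label never leaks into its feature; the row order here stands in for the random permutation CatBoost uses.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each category with a smoothed mean of the targets seen so far."""
    sums = {}    # running sum of targets per category
    counts = {}  # running number of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean: falls back to the prior when the category is unseen.
        encoded.append((s + a * prior) / (n + a))
        # Only now add the current row, so it affects later rows but not itself.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

airlines = ["AA", "UA", "AA", "AA", "UA"]   # hypothetical categorical column
delayed = [1, 0, 0, 1, 1]                   # hypothetical binary target
prior = sum(delayed) / len(delayed)         # global delay rate
codes = ordered_target_stats(airlines, delayed, prior)
print(codes)
```

The first occurrence of each airline receives the prior, and later occurrences drift toward that airline's observed delay rate.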
1.1 Data preprocessing
Let's take the Kaggle 2015 flights data set as an example and test the XGBoost, LightGBM, and CatBoost models on it in turn:
The data set has more than 5 million flight records and 31 features. We sample 1% of the original data and keep 11 columns for the demonstration (10 features plus the ARRIVAL_DELAY target). After preprocessing, we rebuild the training set; the goal is a binary classification model that predicts whether a flight is delayed.
import pandas as pd
from sklearn.model_selection import train_test_split
flights = pd.read_csv('flights.csv')
flights = flights.sample(frac=0.01, random_state=10)  # sample 1% of the data set
flights
flights = flights[["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT",
                   "ORIGIN_AIRPORT", "AIR_TIME", "DEPARTURE_TIME", "DISTANCE", "ARRIVAL_DELAY"]]  # keep 11 columns
flights = flights.reset_index(drop=True)
flights
flights["ARRIVAL_DELAY"] = (flights["ARRIVAL_DELAY"] > 10) * 1  # a delay of more than 10 minutes counts as delayed; bool converted to int
flights["ARRIVAL_DELAY"].unique()
array([0, 1], dtype=int32)
cols = ["AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]  # categorical features
for col in cols:
    print(len(flights[col].unique()))
14
6163
633
644
for item in cols:
    flights[item] = flights[item].astype("category").cat.codes
X_train, X_test, y_train, y_test = train_test_split(flights.drop(["ARRIVAL_DELAY"], axis=1), flights["ARRIVAL_DELAY"], random_state=10, test_size=0.3)  # split into training and test sets
1.2 Test of XGBoost on flights dataset
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import time
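# model parameters
params = {
    'booster': 'gbtree',  # tree-based booster
    'objective': 'binary:logistic',
    'gamma': 0.1,  # minimum loss reduction required to make a split (used in pruning)
    'max_depth': 8,
    'lambda': 2,
    'subsample': 0.7,  # fraction of samples used to train each tree
    'colsample_bytree': 0.7,  # fraction of features used to train each tree
    'min_child_weight': 3,  # minimum sum of instance weights in a leaf
    'eta': 0.001,  # learning rate
    'seed': 1000,
    'nthread': 4,  # number of threads for parallel computation
}
# training
t0 = time.time()
num_rounds = 500  # number of boosting rounds, i.e. the number of trees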
dtrain = xgb.DMatrix(X_train, y_train)
model_xgb = xgb.train(params, dtrain, num_rounds)
print('training spend {} seconds'.format(time.time()-t0))
# testing
t1 = time.time()
dtest = xgb.DMatrix(X_test)
y_pred = model_xgb.predict(dtest)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_xgb.predict(dtrain)
print(f"train AUC: {roc_auc_score(y_train, y_pred_train)}")
print(f"test AUC: {roc_auc_score(y_test, y_pred)}")
training spend 6.46193265914917 seconds
testing spend 0.057062625885009766 seconds
train AUC: 0.752049936354563
test AUC: 0.6965194979943091
1.3 Test of LightGBM on flights dataset
import lightgbm as lgb
cate_features_name = ["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]  # categorical features
# declare the categorical features on the Dataset, so the code does not rely on lgb.train() accepting this argument
d_train = lgb.Dataset(X_train, label=y_train, categorical_feature=cate_features_name)
# model parameters
params = {
    "max_depth": 5,
    "learning_rate": 0.05,
    "num_leaves": 500,
    "n_estimators": 300
}
t0 = time.time()
model_lgb = lgb.train(params, d_train)
print('training spend {} seconds'.format(time.time()-t0))
t1 = time.time()
y_pred = model_lgb.predict(X_test)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_lgb.predict(X_train)
print(f"train AUC: {roc_auc_score(y_train, y_pred_train)}")
print(f"test AUC: {roc_auc_score(y_test, y_pred)}")
training spend 0.43550848960876465 seconds
testing spend 0.0230252742767334 seconds
train AUC: 0.8867447004324996
test AUC: 0.7033506245405025
1.4 Test of CatBoost on the flights dataset
import catboost as cb
cat_features_index = [0,1,2,3,4,5,6]
t0 = time.time()
model_cb = cb.CatBoostClassifier(eval_metric="AUC", one_hot_max_size=50, depth=6, iterations=300, l2_leaf_reg=1, learning_rate=0.1)
model_cb.fit(X_train,y_train, cat_features= cat_features_index)
print('training spend {} seconds'.format(time.time()-t0))
t1 = time.time()
y_pred = model_cb.predict(X_test)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_cb.predict(X_train)
print(f"train AUC: {roc_auc_score(y_train, y_pred_train)}")
print(f"test AUC: {roc_auc_score(y_test, y_pred)}")
training spend 20.633496284484863 seconds
testing spend 0.03611636161804199 seconds
train AUC: 0.5670692017560355
test AUC: 0.5473357824098838
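From the above experimental results, without further feature engineering or hyperparameter tuning, LightGBM does better than XGBoost and CatBoost on this data set in both accuracy and speed, and CatBoost scores worst. Note, however, that the CatBoost AUC is computed from the class labels returned by predict(), whereas the XGBoost and LightGBM AUCs are computed from predicted probabilities, so the CatBoost score is not directly comparable with the other two.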
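One caveat when reading the three AUC values: the CatBoost code above passes the output of `model_cb.predict`, which is hard class labels, to `roc_auc_score`, while the XGBoost and LightGBM code pass probabilities. The sketch below shows on made-up numbers how thresholding scores into labels can lower the AUC. The helper `auc` and the toy data are assumptions for illustration; to score CatBoost on the same footing one could pass `model_cb.predict_proba(X_test)[:, 1]` instead, which was not run here.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced example: most predicted probabilities fall below 0.5.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.25, 0.60]
labels = [1 if p >= 0.5 else 0 for p in proba]  # what a hard predict() would return

auc_proba = auc(y_true, proba)    # ranking information kept
auc_labels = auc(y_true, labels)  # ranking collapsed to 0/1
print(auc_proba, auc_labels)      # 0.875 0.75
```

Thresholding discards the ordering among samples on the same side of 0.5, which is exactly the information AUC measures.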
2 Common hyperparameter tuning methods
Parameters that are set before training rather than learned by the model are called hyperparameters. Commonly used tuning methods in machine learning include grid search, random search, and Bayesian optimization.
2.1 Grid search method
Grid search is a commonly used hyperparameter tuning method. It is typically used to optimize three or fewer hyperparameters and is essentially exhaustive. For each hyperparameter, the user selects a small finite set of candidate values, and the Cartesian product of these sets gives the hyperparameter combinations to try. Grid search trains the model with every combination and selects the combination with the smallest error on the validation set as the optimal hyperparameters.
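The exhaustive enumeration can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `validation_error` is a hypothetical stand-in for "train a model and measure its validation error", not a real training run.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 6],
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
}

def validation_error(params):
    """Hypothetical score: pretend the error is lowest at max_depth=5, learning_rate=0.05."""
    return abs(params["max_depth"] - 5) + abs(params["learning_rate"] - 0.05)

# Cartesian product of the candidate sets: every combination is tried.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
best = min(combos, key=validation_error)

print(len(combos))  # 81 combinations (3 * 3 * 3 * 3)
print(best)
```

With four hyperparameters of three values each there are already 81 combinations, and 3-fold cross-validation would triple the number of model fits, which is why the method is usually reserved for a handful of hyperparameters.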
sklearn implements grid search through GridSearchCV in its model_selection module, and the tuning process is cross-validated. The following shows an example of grid search with XGBoost:
# GridSearchCV example with XGBoost
from sklearn.model_selection import GridSearchCV
model = xgb.XGBClassifier()
# example parameter grid to search
params_lst = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 6],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1]
}
# verbose: how detailed the log output is
# n_jobs: number of jobs to run in parallel; -1 uses all available CPUs
t0 = time.time()
grid_search = GridSearchCV(model, param_grid=params_lst, cv=3, verbose=10, n_jobs=-1)
grid_search.fit(X_train, y_train)
print('gridsearch for xgb spend', time.time()-t0, 'seconds.')
print(grid_search.best_params_)
gridsearch for xgb spend 67.24716424942017 seconds.
{'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 6, 'n_estimators': 300}
2.2 Random Search
Random search looks for good hyperparameters by sampling randomly within a specified range or distribution. Unlike grid search, it does not try every combination: a fixed number of hyperparameter settings is sampled from the given distribution, and only those sampled settings are actually evaluated.
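A minimal standard-library sketch of the sampling step follows. As assumptions for illustration, `validation_error` is a hypothetical stand-in for a real training-and-validation run, and the learning rate is drawn from a log-uniform distribution to show sampling from a continuous range rather than a list.

```python
import random

rng = random.Random(0)  # fixed seed so the sampled settings are reproducible

def sample_params():
    """Draw one hyperparameter setting from the search space."""
    return {
        "max_depth": rng.choice([3, 5, 7]),
        "min_child_weight": rng.choice([1, 3, 6]),
        "n_estimators": rng.choice([100, 200, 300]),
        "learning_rate": 10 ** rng.uniform(-2, -1),  # log-uniform between 0.01 and 0.1
    }

def validation_error(params):
    """Hypothetical score: pretend the error is lowest at max_depth=5, learning_rate=0.05."""
    return abs(params["max_depth"] - 5) + abs(params["learning_rate"] - 0.05)

n_iter = 10  # fixed evaluation budget, independent of the size of the search space
trials = [sample_params() for _ in range(n_iter)]
best = min(trials, key=validation_error)
print(best)
```

The cost is set by `n_iter` rather than by the size of the search space, so adding another hyperparameter does not multiply the number of model fits.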
sklearn performs random search through RandomizedSearchCV in the model_selection module.
# RandomizedSearchCV example with XGBoost
from sklearn.model_selection import RandomizedSearchCV  # n_iter sets how many parameter settings are sampled (default 10)
model = xgb.XGBClassifier()
# example parameter lists to sample from
params_lst = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 6],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1]
}
t0 = time.time()
random_search = RandomizedSearchCV(model, params_lst, random_state=0)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print('randomsearch for xgb spend', time.time()-t0, 'seconds.')
{'n_estimators': 300, 'min_child_weight': 6, 'max_depth': 5, 'learning_rate': 0.1}
randomsearch for xgb spend 60.41917014122009 seconds.
2.3 Bayesian parameter tuning
Bayesian optimization is a hyperparameter optimization method based on Bayes' theorem, mainly used to optimize black-box functions. Unlike grid search or random search, it builds a probabilistic model from the observed samples and their uncertainty, computes the posterior distribution with Bayes' formula, and picks hyperparameters with a relatively high expected score to evaluate next.
The key idea is to use the observations to keep updating the prior (the working assumption) of the model and, based on what is already known, predict the result of the next experiment. Each new result updates the posterior distribution, that is, the probability distribution once the samples are known. Statistics of that distribution, such as its expectation and variance, determine the next set of hyperparameter values, and the process repeats until a good enough result is reached. Because every new set of hyperparameters is chosen from the data already collected rather than blindly, the search is, to a certain extent, "smarter" and more efficient.
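The loop described above can be sketched on a one-dimensional toy problem. This is a minimal illustration under stated assumptions, not the algorithm of any particular library: it assumes a zero-mean Gaussian-process surrogate with an RBF kernel and an upper-confidence-bound rule (mean plus two standard deviations) for choosing the next point. The helper names and the objective, which stands in for a cross-validated score, are hypothetical.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation at x_new given the observations."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def objective(x):
    """Hypothetical black-box score to maximize; its true optimum is at x = 2."""
    return -(x - 2.0) ** 2

candidates = [i * 0.1 for i in range(51)]  # search range [0, 5]
xs = [candidates[5], candidates[45]]       # two initial points
ys = [objective(x) for x in xs]

def ucb(x):
    """Acquisition value: posterior mean plus an exploration bonus."""
    mean, std = gp_posterior(xs, ys, x)
    return mean + 2.0 * std

for _ in range(8):
    # Evaluate the untried candidate that the surrogate finds most promising.
    x_next = max((c for c in candidates if c not in xs), key=ucb)
    xs.append(x_next)
    ys.append(objective(x_next))

best_x = xs[ys.index(max(ys))]
print(best_x, max(ys))
```

Each iteration spends one expensive evaluation where the surrogate expects either a high score or high uncertainty, which is what distinguishes this search from sampling points blindly.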
Before performing Bayesian optimization, we define the objective function to be optimized on top of XGBoost's cross-validation function xgb.cv: it takes the cross-validation results and returns the mean AUC on the held-out folds as the score to maximize. The objective function and the hyperparameter search ranges are then passed to the Bayesian optimizer, which runs once the number of initial points and the number of iterations are given.
# base parameters
num_rounds = 3000
params = {
    'eta': 0.1,
    'verbosity': 0,  # replaces the deprecated 'silent' flag
    'eval_metric': 'auc',
    'seed': 2023
}
# objective function to maximize
def xgb_evaluate(min_child_weight, colsample_bytree, max_depth, subsample, gamma, alpha):
    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['alpha'] = max(alpha, 0)
    cv_result = xgb.cv(params, dtrain, num_boost_round=num_rounds, nfold=5, seed=2023,
                       callbacks=[xgb.callback.EarlyStopping(rounds=50)])
    return cv_result['test-auc-mean'].values[-1]
# pip install bayesian-optimization
from bayes_opt import BayesianOptimization
num_iter = 25
init_points = 5
t0 = time.time()
xgbBO = BayesianOptimization(xgb_evaluate, {
    'min_child_weight': (1, 20),
    'colsample_bytree': (0.1, 1),
    'max_depth': (5, 15),
    'subsample': (0.5, 1),
    'gamma': (0, 10),
    'alpha': (0, 10),
})
xgbBO.maximize(init_points=init_points, n_iter=num_iter)
print('bayesianSearch for xgb spend', time.time()-t0, 'seconds.')
| iter | target | alpha | colsam... | gamma | max_depth | min_ch... | subsample |
| 1 | 0.7193 | 7.155 | 0.5762 | 0.9578 | 9.488 | 13.18 | 0.9343 |
bayesianSearch for xgb spend 765.5525000095367 seconds.
It can be seen that, with the parameters found by Bayesian tuning, XGBoost's cross-validated AUC on the flights data (the test-auc-mean reported by xgb.cv) reaches about 0.72.