"Machine Learning Formula Derivation and Code Implementation" chapter16-integrated learning comparison and parameter adjustment

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

Ensemble learning: comparison and hyperparameter tuning

Although deep learning is popular now, Boosting algorithms represented by XGBoost, LightGBM, and CatBoost are still widely used. Unstructured data such as text, images, speech, and video is better suited to deep learning, but Boosting algorithms remain the first choice for structured data with fewer training samples.

1 Comparison of three major Boosting algorithms

XGBoost, LightGBM and CatBoost are the current classic SOTA (state of the art) Boosting algorithms. All three are ensemble learning frameworks built on decision trees. XGBoost is an improvement on the original GBDT algorithm, while LightGBM and CatBoost are further optimizations on top of XGBoost. Each has its own strengths in accuracy and speed.

There are two main differences between the three Boosting algorithms:

First, the tree construction strategies differ: XGBoost uses level-wise (layer-wise) growth, LightGBM uses leaf-wise growth, and CatBoost uses a symmetric (oblivious) tree structure in which every decision tree is a complete binary tree.

Second, the handling of categorical features differs greatly. XGBoost cannot process categorical features automatically; they must be manually converted to numeric values before being fed into the model. LightGBM can handle categorical features once their column names are specified. CatBoost is best known for its categorical feature handling, processing them efficiently with encoding methods such as target statistics, as shown in the sketch below.
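To make these two differences concrete, the following sketch (not from the book) shows how each library is typically told about categorical features and how its tree growth is controlled. Here X, y, and cat_cols are hypothetical placeholders for a pandas feature DataFrame, a label vector, and a list of categorical column names.

import xgboost as xgb
import lightgbm as lgb
import catboost as cb

# XGBoost: categories must be encoded to integers first; trees grow level-wise up to max_depth
X_enc = X.copy()
for c in cat_cols:
    X_enc[c] = X_enc[c].astype("category").cat.codes
model_xgb = xgb.train({'max_depth': 6, 'objective': 'binary:logistic'},
                      xgb.DMatrix(X_enc, y), num_boost_round=100)

# LightGBM: categorical columns are passed by name; leaf-wise growth is capped by num_leaves
model_lgb = lgb.train({'num_leaves': 31, 'objective': 'binary'},
                      lgb.Dataset(X_enc, y),
                      categorical_feature=cat_cols)

# CatBoost: raw categorical columns are passed directly; symmetric (oblivious) trees of fixed depth
model_cb = cb.CatBoostClassifier(depth=6, verbose=False)
model_cb.fit(X, y, cat_features=cat_cols)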

1.1 Data preprocessing

Let's take the Kaggle 2015 flights dataset as an example and run experiments with the XGBoost, LightGBM, and CatBoost models on it respectively:

The dataset contains more than 5 million flight records and 31 features. We sample 1% of the original data and keep 11 features for the demonstration. After preprocessing, we rebuild the training set; the goal is to build a binary classification model that predicts whether a flight is delayed.

import pandas as pd
from sklearn.model_selection import train_test_split
flights = pd.read_csv('flights.csv')
flights = flights.sample(frac=0.01, random_state=10) # sample 1% of the dataset
flights


flights = flights[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]] # select 11 features
flights= flights.reset_index(drop=True)
flights


flights["ARRIVAL_DELAY"] = (flights["ARRIVAL_DELAY"]>10)*1 # 延误超过10分钟看作是延误,bool类型转换为int类型
flights["ARRIVAL_DELAY"].unique()
array([0, 1], dtype=int32)
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"] # 类别特征
for col in cols:
    print(len(flights[col].unique()))
14
6163
633
644
for item in cols:
    flights[item] = flights[item].astype("category").cat.codes

X_train, X_test, y_train, y_test = train_test_split(flights.drop(["ARRIVAL_DELAY"], axis=1), flights["ARRIVAL_DELAY"], random_state=10, test_size=0.3) # split the dataset

1.2 Test of XGBoost on the flights dataset

from sklearn.metrics import roc_auc_score
import xgboost as xgb
import time

# set model parameters
params = {
    'booster': 'gbtree', # tree-based booster
    'objective': 'binary:logistic',
    'gamma': 0.1, # minimum loss reduction required to make a split (used in pruning)
    'max_depth': 8,
    'lambda': 2,
    'subsample': 0.7, # fraction of samples used for each tree
    'colsample_bytree': 0.7, # fraction of features used for each tree
    'min_child_weight': 3, # minimum sum of instance weight in a child node
    'eta': 0.001, # learning rate
    'seed': 1000,
    'nthread': 4, # number of threads for parallel computation
}

# training
t0 = time.time()
num_rounds = 500 # number of boosting rounds, i.e. the number of trees
dtrain = xgb.DMatrix(X_train, y_train)
model_xgb = xgb.train(params, dtrain, num_rounds)
print('training spend {} seconds'.format(time.time()-t0))

# testing
t1 = time.time()
dtest = xgb.DMatrix(X_test)
y_pred = model_xgb.predict(dtest)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_xgb.predict(dtrain)
print(f"训练集auc:{
      
      roc_auc_score(y_train, y_pred_train)}")
print(f"测试集auc:{
      
      roc_auc_score(y_test, y_pred)}")
training spend 6.46193265914917 seconds
testing spend 0.057062625885009766 seconds
Train AUC: 0.752049936354563
Test AUC: 0.6965194979943091

1.3 Test of LightGBM on the flights dataset

import lightgbm as lgb
d_train = lgb.Dataset(X_train, label=y_train)

# set model parameters
params = {
    "max_depth": 5,
    "learning_rate": 0.05,
    "num_leaves": 500,
    "n_estimators": 300
}

cate_features_name = ["MONTH","DAY","DAY_OF_WEEK","AIRLINE","DESTINATION_AIRPORT", "ORIGIN_AIRPORT"] # categorical features

t0 = time.time()
model_lgb = lgb.train(params, d_train, categorical_feature = cate_features_name)
print('training spend {} seconds'.format(time.time()-t0))
t1 = time.time()

y_pred = model_lgb.predict(X_test)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_lgb.predict(X_train)
print(f"训练集auc:{
      
      roc_auc_score(y_train, y_pred_train)}")
print(f"测试集auc:{
      
      roc_auc_score(y_test, y_pred)}")
training spend 0.43550848960876465 seconds
testing spend 0.0230252742767334 seconds
Train AUC: 0.8867447004324996
Test AUC: 0.7033506245405025

1.4 Test of CatBoost on the flights dataset

import catboost as cb
cat_features_index = [0,1,2,3,4,5,6] # column indices of the features treated as categorical

t0 = time.time()
model_cb = cb.CatBoostClassifier(eval_metric="AUC", one_hot_max_size=50, depth=6, iterations=300, l2_leaf_reg=1, learning_rate=0.1)
model_cb.fit(X_train,y_train, cat_features= cat_features_index)
print('training spend {} seconds'.format(time.time()-t0))

t1 = time.time()
y_pred = model_cb.predict(X_test)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_cb.predict(X_train)
print(f"训练集auc:{
      
      roc_auc_score(y_train, y_pred_train)}")
print(f"测试集auc:{
      
      roc_auc_score(y_test, y_pred)}")
training spend 20.633496284484863 seconds
testing spend 0.03611636161804199 seconds
Train AUC: 0.5670692017560355
Test AUC: 0.5473357824098838

From the above experimental results, without further feature engineering or hyperparameter tuning, LightGBM beats both XGBoost and CatBoost on this dataset in terms of accuracy as well as speed, while CatBoost performs the worst.

2 Common hyperparameter tuning methods

Parameters that are not learned during model training are called hyperparameters. The commonly used hyperparameter tuning methods in machine learning are grid search, random search, and Bayesian optimization.

2.1 Grid search method

The grid search method is a commonly used hyperparameter tuning method, typically applied when there are three or fewer hyperparameters to tune; it is essentially an exhaustive method. For each hyperparameter, the user selects a small finite set of candidate values, and the Cartesian product of these sets yields the candidate hyperparameter combinations. Grid search trains the model with every combination and selects the one with the smallest validation error as the optimal set of hyperparameters.
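As a quick illustration (not from the book) of this exhaustive behaviour, the Cartesian product of the candidate lists used in the example below yields 3 × 3 × 3 × 3 = 81 hyperparameter combinations, each of which is then trained once per cross-validation fold:

from itertools import product

candidates = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 6],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1]
}
combos = list(product(*candidates.values())) # every possible combination
print(len(combos)) # 81 hyperparameter sets to train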

In sklearn, grid search is implemented by GridSearchCV in the model_selection module, with cross-validation built into the tuning process. The following shows an example of grid search with XGBoost:

# GridSearch example based on XGBoost
from sklearn.model_selection import GridSearchCV

model = xgb.XGBClassifier()

# parameter grid to search
params_lst = {
    'max_depth': [3,5,7],
    'min_child_weight': [1,3,6],
    'n_estimators': [100,200,300],
    'learning_rate': [0.01, 0.05, 0.1]
}

# verbose: verbosity level of the log output
# n_jobs: number of jobs run in parallel; -1 means use all available CPUs
t0 = time.time()
grid_search = GridSearchCV(model, param_grid=params_lst, cv=3, verbose=10, n_jobs=-1)
grid_search.fit(X_train, y_train)
print('gridsearch for xgb spend', time.time()-t0, 'seconds.')
print(grid_search.best_params_)
gridsearch for xgb spend 67.24716424942017 seconds.
{'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 6, 'n_estimators': 300}

2.2 Random Search

Random search looks for the optimal hyperparameters by sampling randomly within specified hyperparameter ranges or distributions. Compared with grid search, it does not try every combination: a fixed number of parameter settings is sampled from the given distributions, and only these sampled settings are actually evaluated, as in the sketch below.
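A minimal sketch of this sampling idea (not from the book, assuming numpy is available): only a fixed number of configurations is drawn from the given ranges, and only those are evaluated.

import numpy as np

rng = np.random.default_rng(0)
n_iter = 10 # number of configurations actually evaluated
sampled = [{
    'max_depth': int(rng.choice([3, 5, 7])),
    'min_child_weight': int(rng.choice([1, 3, 6])),
    'n_estimators': int(rng.choice([100, 200, 300])),
    'learning_rate': float(rng.uniform(0.01, 0.1)) # a continuous range is also allowed
} for _ in range(n_iter)]
print(len(sampled)) # 10 sampled settings instead of 81 exhaustive ones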

In sklearn, random search is performed through RandomizedSearchCV in the model_selection module.

# RandomizedSearch example based on XGBoost
from sklearn.model_selection import RandomizedSearchCV # the number of sampled settings is set manually via n_iter or determined automatically from the size of the parameter space

model = xgb.XGBClassifier()

# parameter space to sample from
params_lst = {
    'max_depth': [3,5,7],
    'min_child_weight': [1,3,6],
    'n_estimators': [100,200,300],
    'learning_rate': [0.01, 0.05, 0.1]
}

t0 = time.time()
random_search = RandomizedSearchCV(model, params_lst, random_state=0)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print('randomsearch for xgb spend', time.time()-t0, 'seconds.')
{'n_estimators': 300, 'min_child_weight': 6, 'max_depth': 5, 'learning_rate': 0.1}
randomsearch for xgb spend 60.41917014122009 seconds.

2.3 Bayesian parameter tuning

Bayesian optimization is a hyperparameter optimization method based on Bayes' theorem, mainly used to optimize black-box functions. Compared with traditional grid search or random search, Bayesian tuning builds a model from the uncertainty of the sample observations, computes the posterior distribution via Bayes' formula, and selects the hyperparameters with relatively high expected performance to evaluate next.

The key idea of Bayesian tuning is to keep updating the prior (the premise assumptions) of the model with sample observations and to predict the result of the next trial from what is already known. The predicted result is then used to update the posterior distribution, i.e., the probability density of the hyperparameters given the observed samples. Based on statistics of this density, such as its expectation and variance, the next set of hyperparameter values is generated, and the process repeats until an optimal result is reached. Because each new trial is chosen by inferring the hyperparameter probability distribution from the data collected so far and searching around its expected value (or mode), the search becomes, to a certain extent, "smarter" and more efficient.
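To see the workflow in isolation, here is a toy sketch (not from the book) that applies the bayes_opt package used below to a known one-dimensional function; the optimizer proposes each new point from the posterior built on the points observed so far.

from bayes_opt import BayesianOptimization

def black_box(x):
    return -(x - 2) ** 2 # unknown to the optimizer; maximum at x = 2

opt = BayesianOptimization(f=black_box, pbounds={'x': (-5, 5)}, random_state=1)
opt.maximize(init_points=3, n_iter=10) # 3 random warm-up evaluations, then 10 guided steps
print(opt.max) # best observed target value and the corresponding x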

Before running Bayesian optimization, we define the objective function to be optimized on top of XGBoost's cross-validation interface xgb.cv, taking the cross-validated test AUC as the metric to maximize. The objective function and the hyperparameter search ranges are then passed to the Bayesian optimization function, and optimization proceeds given the number of initialization points and the number of iterations.

# define base parameters
num_rounds = 3000

params = {
    'eta': 0.1,
    'silent': 1,
    'eval_metric': 'auc',
    'verbose_eval': True,
    'seed': 2023
}

# define the objective function to optimize
def xgb_evaluate(min_child_weight, colsample_bytree, max_depth, subsample, gamma, alpha):

    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['alpha'] = max(alpha, 0)

    cv_result = xgb.cv(params, dtrain, num_boost_round=num_rounds, nfold=5, seed=2023, callbacks=[xgb.callback.EarlyStopping(rounds=50)])

    return cv_result['test-auc-mean'].values[-1]
# pip install bayesian-optimization
from bayes_opt import BayesianOptimization

num_iter = 25
init_points = 5

t0 = time.time()
xgbBO = BayesianOptimization(xgb_evaluate, {
    'min_child_weight': (1, 20),
    'colsample_bytree': (0.1, 1),
    'max_depth': (5, 15),
    'subsample': (0.5, 1),
    'gamma': (0, 10),
    'alpha': (0, 10),
})
xgbBO.maximize(init_points=init_points, n_iter=num_iter)
print('bayesianSearch for xgb spend', time.time()-t0, 'seconds.')
|   iter    |  target   |   alpha   | colsam... |   gamma   | max_depth | min_ch... | subsample |
| 1         | 0.7193    | 7.155     | 0.5762    | 0.9578    | 9.488     | 13.18     | 0.9343    |
bayesianSearch for xgb spend 765.5525000095367 seconds.

It can be seen that with the parameters found by Bayesian tuning, the cross-validated AUC of XGBoost on the flights data reaches about 0.72.

Notebook_Github address


Origin blog.csdn.net/cjw838982809/article/details/131342976