Parameter space and hyperparameter optimization of GBDT

        

Table of contents

1. Comparison between GBDT and other algorithms under default parameters

2. Optimize GBDT based on TPE

step1: Establish benchmark

step2: Define the algorithm required by the init parameter

step3: Define the objective function, parameter space, optimization function, and verification function

step4: Train the Bayesian optimizer

step5: Modify the search space

step6: Continue to modify the search space


A rich set of hyperparameters gives ensemble algorithms a great deal of flexibility, and Boosting algorithms, which aim to reduce bias, usually perform even better after tuning. Automatic hyperparameter optimization for GBDT is therefore an important topic. Before optimizing the hyperparameters of any ensemble algorithm, two things need to be settled: ① how strongly each parameter influences the results of the algorithm; ② which parameter space to search. For GBDT, the influence of each parameter can be roughly ranked as follows:

Influence: Parameters
⭐⭐⭐⭐⭐ Almost always a huge influence:
n_estimators (overall learning ability)
learning_rate (overall learning rate)
max_features (randomness)
⭐⭐⭐⭐ Influential most of the time:
init (initialization)
subsample (randomness)
loss (overall learning ability)
⭐⭐ May have a big influence, but most of the time the influence is not obvious:
max_depth (coarse pruning)
min_samples_split (fine pruning)
min_impurity_decrease (fine pruning)
max_leaf_nodes (fine pruning)
criterion (branch sensitivity)
⭐ Almost no impact when the amount of data is large enough:
random_state
ccp_alpha (structural risk)

Most tree ensemble models share similar hyperparameters, such as the group of anti-overfitting parameters used for pruning (max_depth, min_samples_split, etc.) and the parameters for sampling rows and features (subsample, max_features, etc.). These hyperparameters affect different ensemble models in similar ways, so in principle, parameters that matter a lot for random forest also matter a lot for GBDT. However, max_depth, which is critical in random forest, plays only a minor role in GBDT; its place is taken by learning_rate, the iteration parameter unique to Boosting. In random forests, we always care about the balance between model complexity (max_depth) and the model's overall learning ability (n_estimators): the more complex a single weak evaluator is, the more it contributes to the model as a whole, and the fewer trees are needed. In Boosting, the contribution of a single weak evaluator is controlled by learning_rate rather than by the evaluator's complexity, so what we look for is the balance between learning_rate and n_estimators. At the same time, Boosting inherently assumes that each weak evaluator is very weak, so the default max_depth is usually small (3 in GBDT). We therefore cannot reduce model complexity much by lowering max_depth, it is harder to control overfitting through max_depth, and its influence is naturally smaller.
It can be seen that although most tree ensemble algorithms share the same hyperparameters, the algorithms are built on different assumptions, so the default value of the same parameter may differ between algorithms. As a result, the importance of a given parameter and the way to tune it also differ from one algorithm to another.
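The balance between learning_rate and n_estimators can be made concrete with a deliberately idealized model. The sketch below assumes squared-error loss and a hypothetical weak evaluator that fits the current residual exactly, so each round shrinks the residual by a factor of (1 - learning_rate). Real trees are not that accurate, and the function name trees_needed is only illustrative, but the trend is the same: a smaller learning rate needs more trees.

```python
import math

def trees_needed(learning_rate, target_fraction):
    """Rounds needed until the residual falls to target_fraction of its start.

    Idealized assumption: every weak evaluator fits the residual exactly,
    so after m rounds the residual is (1 - learning_rate) ** m of the original.
    """
    if not 0 < learning_rate <= 1:
        raise ValueError("learning_rate must be in (0, 1] for this sketch")
    if learning_rate == 1:
        return 1  # a perfect learner with a full step removes the residual at once
    return math.ceil(math.log(target_fraction) / math.log(1 - learning_rate))

# Number of rounds needed to shrink the residual to 1% of its initial size.
for lr in (0.05, 0.1, 0.5):
    print(lr, trees_needed(lr, 0.01))
```

In this toy model, halving the learning rate roughly doubles the number of rounds, which is why the two parameters are searched together.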

In random forests, fine pruning tools are of limited use, and drastic coarse pruning is generally more effective. In GBDT, the default value of the coarse pruning tool max_depth is already 3, so controlling overfitting by reducing model complexity is not a practical route in Boosting. The init parameter, in particular, has a great impact on GBDT: if init is set to a specific algorithm, overfitting may become more serious, so suppressing overfitting and controlling complexity need extra attention. If the weak evaluator cannot be pruned, the best way to control overfitting is to increase randomness/diversity. max_features and subsample therefore become the core weapons for controlling overfitting in Boosting, which is one of the key reasons the Bagging idea was added to GBDT. Relying on randomness rather than on weak evaluator structure to combat overfitting gives Boosting an unexpected advantage: compared with Bagging, Boosting is better at handling small-sample, high-dimensional data, because Bagging easily overfits on small data sets.
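What subsample and max_features randomize can be illustrated without any library. The sketch below is a pure-Python illustration, not scikit-learn's implementation: it assumes subsample is the fraction of rows drawn without replacement for one tree and max_features is the number of candidate columns drawn for one split. The helper names are hypothetical; the row and column counts match the shape of the data set used later in this article.

```python
import random

def draw_rows(n_samples, subsample, rng):
    """Row indices for one tree: a fraction of rows, drawn without replacement."""
    n_inbag = max(1, int(subsample * n_samples))
    return sorted(rng.sample(range(n_samples), n_inbag))

def draw_features(n_features, max_features, rng):
    """Candidate column indices for one split."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(1412)
rows_tree_1 = draw_rows(n_samples=1460, subsample=0.5, rng=rng)
rows_tree_2 = draw_rows(n_samples=1460, subsample=0.5, rng=rng)
cols = draw_features(n_features=80, max_features=16, rng=rng)

# Each tree sees half of the rows, and the two draws are independent.
print(len(rows_tree_1), len(rows_tree_2), rows_tree_1 != rows_tree_2)
print(cols)
```

Because every tree is trained on a different random slice of the data, the trees differ from one another even when their structure cannot be pruned further.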

It should be noted that although max_depth contributes little to controlling overfitting, we still keep it in the search. When we use max_features and subsample to introduce randomness and increase the difference between trees, the learning ability of the model may suffer, so we may need to increase the complexity of a single weak evaluator. In GBDT, therefore, the best tuning direction for max_depth is to enlarge/deepen it, to explore whether the model needs more complex single evaluators. In random forests, by contrast, max_depth is tuned downward (pruning) to alleviate overfitting.

So which parameters should we tune? First consider all parameters with a huge influence. When computing power is sufficient or the optimization algorithm runs quickly, we can add the parameters that are influential most of the time. If the sample size is small, we may leave out subsample. Beyond that, we also need some parameters that affect the complexity of the weak evaluator, such as max_depth. If computing power allows, we can also add parameters that may be effective, such as criterion. Following this basic idea, and taking hardware and running time into account, the following parameters are selected for tuning, and TPE-based Bayesian optimization (HyperOpt) is used to optimize GBDT:

Parameter: Range (numeric ranges are written as (low, high, step))
loss: the 4 loss functions available for regression, ["squared_error", "absolute_error", "huber", "quantile"]
criterion: all available impurity measures, ["friedman_mse", "squared_error"]
init: not searchable with HyperOpt; tuned manually
n_estimators: early stopping suggested a central value of about 50; the final range is (25,200,25)
learning_rate: centered on 1.0 and extended to both sides; the final range is (0.05,2.05,0.05). If computing power is limited, (0.1,2.1,0.1) can be used instead
max_features: all string options, plus values between sqrt and auto
subsample: the valid range of subsample is (0,1], so the range is set to (0.1,0.8,0.1). If computing power is limited, (0.5,0.8,0.1) can be used instead
max_depth: centered on 3 and extended to both sides, with a wider range on the right; finally (2,30,2)
min_impurity_decrease: a parameter that can only be increased, not decreased; try (0,5,1) first

Generally, the first search uses a large and sparse parameter space; over multiple searches we then gradually narrow the ranges and reduce the dimension of the space. Note that init takes an evaluator object, which the HyperOpt library cannot search over, so init can only be tuned manually.

1. Comparison between GBDT and other algorithms under default parameters

Import the required libraries and data:

import pandas as pd
import numpy as np
import sklearn
import matplotlib as mlp
import matplotlib.pyplot as plt
import time
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.ensemble import AdaBoostRegressor as ABR
from sklearn.model_selection import cross_validate, KFold

# Import the optimization library
import hyperopt
from hyperopt import hp, fmin, tpe, Trials, partial
from hyperopt.early_stop import no_progress_loss

data = pd.read_csv(r"F:\\Jupyter Files\\机器学习进阶\\datasets\\House Price\\train_encode.csv",index_col=0)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
X.shape #(1460, 80)
cv = KFold(n_splits=5,shuffle=True,random_state=1412)

def RMSE(result,name):
    return abs(result[name].mean())
modelname = ["GBDT","RF","AdaBoost","RF-TPE","Ada-TPE"]

models = [GBR(random_state=1412)
         ,RFR(random_state=1412)
         ,ABR(random_state=1412)
         ,RFR(n_estimators=89, max_depth=22, max_features=14, min_impurity_decrease=0
              ,random_state=1412, verbose=False)
         ,ABR(n_estimators=39, learning_rate=0.94,loss="exponential"
              ,random_state=1412)]

colors = ["green","gray","orange","red","blue"]
for name,model in zip(modelname,models):
    start = time.time()
    result = cross_validate(model,X,y,cv=cv,scoring="neg_root_mean_squared_error"
                            ,return_train_score=True
                            ,verbose=False)
    end = time.time()-start
    print(name)
    print("\t train_score:{:.3f}".format(RMSE(result,"train_score")))
    print("\t test_score:{:.3f}".format(RMSE(result,"test_score")))
    print("\t time:{:.2f}s".format(end))
    print("\n")
--------------------------------------------------------------------------------------
GBDT
	 train_score:13990.791
	 test_score:28783.954
	 time:2.16s


RF
	 train_score:11177.272
	 test_score:30571.267
	 time:5.35s


AdaBoost
	 train_score:27062.107
	 test_score:35345.931
	 time:0.99s


RF-TPE
	 train_score:11208.818
	 test_score:28346.673
	 time:1.36s


Ada-TPE
	 train_score:27401.542
	 test_score:35169.730
	 time:0.83s
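The RMSE helper takes the absolute value because scikit-learn's "neg_root_mean_squared_error" scorer reports the negated RMSE, so that a larger score is always better. A library-free sketch of the same bookkeeping, with made-up fold errors and a hypothetical fold_scores dictionary standing in for the cross_validate result:

```python
import math

def neg_rmse(y_true, y_pred):
    """Negated root mean squared error, mirroring the scorer's sign convention."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)

def rmse_from_scores(scores):
    """Average the negated per-fold scores, then flip back to a positive RMSE."""
    return abs(sum(scores) / len(scores))

# Two made-up folds (illustrative numbers, not from the housing data).
fold_scores = {
    "test_score": [
        neg_rmse([3.0, 5.0], [0.0, 1.0]),     # errors of 3 and 4
        neg_rmse([10.0, 10.0], [7.0, 13.0]),  # errors of 3 and -3
    ]
}
print(fold_scores["test_score"])                    # both entries are negative
print(rmse_from_scores(fold_scores["test_score"]))  # positive mean RMSE
```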

 

2. Optimize GBDT based on TPE

step1: Establish benchmark
Algorithm               RF          AdaBoost    GBDT        RF (TPE)    AdaBoost (TPE)
5-fold CV running time  5.35s       0.99s       2.16s       1.36s       0.83s
Best score (RMSE)       30571.267   35345.931   28783.954   28346.673   35169.73
step2: Define the algorithm required by the init parameter
rf = RFR(n_estimators=89, max_depth=22, max_features=14,min_impurity_decrease=0
         ,random_state=1412, verbose=False)
step3: Define the objective function, parameter space, optimization function, and verification function

① Objective function

def hyperopt_objective(params):
    reg = GBR(n_estimators = int(params["n_estimators"])
              ,learning_rate = params["lr"]
              ,criterion = params["criterion"]
              ,loss = params["loss"]
              ,max_depth = int(params["max_depth"])
              ,max_features = params["max_features"]
              ,subsample = params["subsample"]
              ,min_impurity_decrease = params["min_impurity_decrease"]
              ,init = rf
              ,random_state=1412
              ,verbose=False)
    
    cv = KFold(n_splits=5,shuffle=True,random_state=1412)
    validation_loss = cross_validate(reg,X,y
                                     ,scoring="neg_root_mean_squared_error"
                                     ,cv=cv
                                     ,verbose=False
                                     ,error_score='raise'
                                    )
    return np.mean(abs(validation_loss["test_score"]))

② Parameter space

param_grid_simple = {'n_estimators': hp.quniform("n_estimators",25,200,25)
                  ,"lr": hp.quniform("learning_rate",0.05,2.05,0.05)
                  ,"criterion": hp.choice("criterion",["friedman_mse", "squared_error"])
                  ,"loss":hp.choice("loss",["squared_error","absolute_error", "huber", "quantile"])
                  ,"max_depth": hp.quniform("max_depth",2,30,2)
                  ,"subsample": hp.quniform("subsample",0.1,0.8,0.1)
                  ,"max_features": hp.choice("max_features",["log2","sqrt",16,32,64,1.0])
                  ,"min_impurity_decrease":hp.quniform("min_impurity_decrease",0,5,1)
                 }
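To get a feel for how large this space is, the number of distinct grid points can be counted by hand. The sketch below assumes each hp.quniform(low, high, q) dimension contributes the multiples of q between low and high, and each hp.choice contributes one point per option; the helper n_steps is hypothetical.

```python
def n_steps(low, high, q):
    """Number of grid values low, low + q, ..., high (endpoints included)."""
    return round((high - low) / q) + 1

sizes = {
    "n_estimators": n_steps(25, 200, 25),
    "learning_rate": n_steps(0.05, 2.05, 0.05),
    "criterion": 2,
    "loss": 4,
    "max_depth": n_steps(2, 30, 2),
    "subsample": n_steps(0.1, 0.8, 0.1),
    "max_features": 6,
    "min_impurity_decrease": n_steps(0, 5, 1),
}

total = 1
for size in sizes.values():
    total *= size

print(sizes)
print("grid points:", total)
print("share covered by 30 evaluations: {:.6%}".format(30 / total))
```

With more than eleven million combinations, a budget of a few dozen evaluations touches only a tiny fraction of the space, which is why a model-guided search such as TPE is used instead of an exhaustive grid.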

③ Optimization function

def param_hyperopt(max_evals=100):
    
    # Record the search history
    trials = Trials()
    
    # Set up early stopping
    early_stop_fn = no_progress_loss(100)
    
    # Run the TPE search (fmin builds and updates the surrogate model)
    params_best = fmin(hyperopt_objective
                       , space = param_grid_simple
                       , algo = tpe.suggest
                       , max_evals = max_evals
                       , verbose=True
                       , trials = trials
                       , early_stop_fn = early_stop_fn
                      )
    
    # Print the best parameters; fmin prints the best score automatically
    print("\n","\n","best params: ", params_best,
          "\n")
    return params_best, trials

④ Validation function

def hyperopt_validation(params):    
    reg = GBR(n_estimators = int(params["n_estimators"])
              ,learning_rate = params["learning_rate"]
              ,criterion = params["criterion"]
              ,loss = params["loss"]
              ,max_depth = int(params["max_depth"])
              ,max_features = params["max_features"]
              ,subsample = params["subsample"]
              ,min_impurity_decrease = params["min_impurity_decrease"]
              ,init = rf
              ,random_state=1412 # fixed seed so that feature and row subsampling are reproducible
              ,verbose=False)
    cv = KFold(n_splits=5,shuffle=True,random_state=1412)
    validation_loss = cross_validate(reg,X,y
                                     ,scoring="neg_root_mean_squared_error"
                                     ,cv=cv
                                     ,verbose=False
                                    )
    return np.mean(abs(validation_loss["test_score"]))
step4: Train the Bayesian optimizer
params_best, trials = param_hyperopt(30) # search less than 0.1% of the parameter space
100%|████████████████████████████████████████████████| 30/30 [02:43<00:00,  5.45s/trial, best loss: 26847.550613053456]
 
 best params:  {'criterion': 0, 'learning_rate': 0.05, 'loss': 0, 'max_depth': 14.0, 'max_features': 2, 'min_impurity_decrease': 3.0, 'n_estimators': 125.0, 'subsample': 0.5} 
params_best # note: for hp.choice, the returned value is the index of the option, not the option itself
{'criterion': 0,
 'learning_rate': 0.05,
 'loss': 0,
 'max_depth': 14.0,
 'max_features': 2,
 'min_impurity_decrease': 3.0,
 'n_estimators': 125.0,
 'subsample': 0.5}
hyperopt_validation({'criterion': "friedman_mse",
                     'learning_rate': 0.05,
                     'loss': "squared_error",
                     'max_depth': 14.0,
                     'max_features': 16,
                     'min_impurity_decrease': 3.0,
                     'n_estimators': 125.0,
                     'subsample': 0.5})
26847.550613053456
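The translation from indices to option values done by hand above can also be sketched in a few lines. The helper below is a hypothetical, library-free illustration; the option lists are copied from param_grid_simple and must stay in the same order as in the hp.choice calls. HyperOpt also ships space_eval, which evaluates a search space at a given assignment.

```python
# Options in the same order as in the hp.choice calls of the search space.
choice_options = {
    "criterion": ["friedman_mse", "squared_error"],
    "loss": ["squared_error", "absolute_error", "huber", "quantile"],
    "max_features": ["log2", "sqrt", 16, 32, 64, 1.0],
}

def decode_choices(params_best, options):
    """Replace hp.choice indices with the option they point to."""
    decoded = dict(params_best)
    for name, values in options.items():
        if name in decoded:
            decoded[name] = values[int(decoded[name])]
    return decoded

# The assignment returned by the first search.
params_best = {"criterion": 0, "learning_rate": 0.05, "loss": 0,
               "max_depth": 14.0, "max_features": 2,
               "min_impurity_decrease": 3.0, "n_estimators": 125.0,
               "subsample": 0.5}

print(decode_choices(params_best, choice_options))
```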

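It is easy to see that we have already obtained the best score so far, but GBDT's potential goes well beyond this. We can now narrow the parameter space based on the result of the first search and continue searching. Across multiple searches, I found that the best option for loss is almost always the squared error "squared_error", so this parameter can be dropped from the search. For the other parameters, we adjust the ranges according to the search results and make the space denser: generally we center the range on the selected value, extend it to both sides, and reduce the step size, and the range may lean toward the side we expect to be selected. For example, max_depth was selected as 14, so the original range (2,30,2) becomes (10,25,1). Likewise, subsample was selected as 0.5, so the new range is (0.3,0.7,0.05), and so on.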
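The narrowing rule (center on the selected value, extend to both sides, shrink the step) can be written down as a small helper. This is a sketch of the manual procedure, not part of HyperOpt; the function name and the widths passed in are illustrative choices that reproduce the two ranges used in the next step.

```python
def narrow_range(best, left, right, step, lower=None, upper=None):
    """Return (low, high, step) around `best`, clipped to optional bounds."""
    low = best - left
    high = best + right
    if lower is not None:
        low = max(low, lower)
    if upper is not None:
        high = min(high, upper)
    # Round away floating-point noise in the endpoints.
    return (round(low, 10), round(high, 10), step)

# max_depth was selected as 14: lean the range toward deeper trees.
print(narrow_range(14, left=4, right=11, step=1, lower=1))
# subsample was selected as 0.5: symmetric range with a finer step.
print(narrow_range(0.5, left=0.2, right=0.2, step=0.05, lower=0.05, upper=1.0))
```

How far to extend each side remains a judgment call; the helper only keeps the bookkeeping consistent.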

step5: Modify the search space
param_grid_simple = {'n_estimators': hp.quniform("n_estimators",100,180,5)
                     ,"lr": hp.quniform("learning_rate",0.02,0.2,0.04)
                     ,"criterion": hp.choice("criterion",["friedman_mse", "squared_error"])
                     ,"max_depth": hp.quniform("max_depth",10,25,1)
                     ,"subsample": hp.quniform("subsample",0.3,0.7,0.05)
                     ,"max_features": hp.quniform("max_features",10,20,1)
                     ,"min_impurity_decrease":hp.quniform("min_impurity_decrease",0,5,0.5)
                    }

Because the parameter space has changed, the objective function must be modified accordingly:

def hyperopt_objective(params):
    reg = GBR(n_estimators = int(params["n_estimators"])
              ,learning_rate = params["lr"]
              ,criterion = params["criterion"]
              ,max_depth = int(params["max_depth"])
              ,max_features = int(params["max_features"])
              ,subsample = params["subsample"]
              ,min_impurity_decrease = params["min_impurity_decrease"]
              ,loss = "squared_error"
              ,init = rf
              ,random_state=1412
              ,verbose=False)    
    cv = KFold(n_splits=5,shuffle=True,random_state=1412)
    validation_loss = cross_validate(reg,X,y
                                     ,scoring="neg_root_mean_squared_error"
                                     ,cv=cv
                                     ,verbose=False
                                     ,error_score='raise')
    return np.mean(abs(validation_loss["test_score"]))



def param_hyperopt(max_evals=100):    
    # Record the search history
    trials = Trials() 
    # Set up early stopping
    early_stop_fn = no_progress_loss(100)    
    # Run the TPE search (fmin builds and updates the surrogate model)
    params_best = fmin(hyperopt_objective
                       , space = param_grid_simple
                       , algo = tpe.suggest
                       , max_evals = max_evals
                       , verbose=True
                       , trials = trials
                       , early_stop_fn = early_stop_fn)    
    # Print the best parameters; fmin prints the best score automatically
    print("\n","\n","best params: ", params_best,
          "\n")
    return params_best, trials
params_best, trials = param_hyperopt(30) # search less than 0.1% of the parameter space
100%|█████████████████████████████████████████████████| 30/30 [01:24<00:00,  2.82s/trial, best loss: 26673.75433067303]

 best params:  {'criterion': 0, 'learning_rate': 0.04, 'max_depth': 12.0, 'max_features': 13.0, 'min_impurity_decrease': 3.5, 'n_estimators': 110.0, 'subsample': 0.7000000000000001}
params_best, trials = param_hyperopt(60) # try increasing the number of evaluations
100%|█████████████████████████████████████████████████| 60/60 [03:14<00:00,  3.24s/trial, best loss: 26736.01565552259]

 best params:  {'criterion': 0, 'learning_rate': 0.08, 'max_depth': 13.0, 'max_features': 15.0, 'min_impurity_decrease': 1.0, 'n_estimators': 145.0, 'subsample': 0.7000000000000001}
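Values such as 0.7000000000000001 and 12.0 in the outputs above follow from how quantized parameters are generated: the HyperOpt documentation describes hp.quniform(label, low, high, q) as returning round(uniform(low, high) / q) * q, which is a float. The pure-Python sketch below mimics that formula (it is not HyperOpt's code) and shows why the objective function casts n_estimators, max_depth and max_features with int().

```python
import random

def snap(value, q):
    """Snap a value to the nearest multiple of q, returned as a float."""
    return float(round(value / q) * q)

def quniform_sample(low, high, q, rng):
    """Draw uniformly in [low, high], then snap to the nearest multiple of q."""
    return snap(rng.uniform(low, high), q)

print(snap(0.69, 0.05))  # a multiple of 0.05 that carries floating-point noise
print(snap(12.4, 1))     # a float, so it needs int() before it reaches sklearn

rng = random.Random(1412)
depths = [quniform_sample(10, 25, 1, rng) for _ in range(5)]
print(depths, [int(d) for d in depths])
```

The tiny excess over 0.7 is harmless for subsample, but integer-valued parameters must be converted explicitly.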

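Based on this result, we can fix the values of some further parameters (such as criterion), narrow the parameter ranges again, and make the parameter space denser.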

step6: Continue to modify the search space
param_grid_simple = {'n_estimators': hp.quniform("n_estimators",100,150,1)
                     ,"lr": hp.quniform("learning_rate",0.01,0.1,0.005)
                     ,"max_depth": hp.quniform("max_depth",10,16,1)
                     ,"subsample": hp.quniform("subsample",0.65,0.85,0.0025)
                     ,"max_features": hp.quniform("max_features",12,16,1)
                     ,"min_impurity_decrease":hp.quniform("min_impurity_decrease",1,4,0.25)
                    }
def hyperopt_objective(params):
    reg = GBR(n_estimators = int(params["n_estimators"])
              ,learning_rate = params["lr"]
              ,max_depth = int(params["max_depth"])
              ,max_features = int(params["max_features"])
              ,subsample = params["subsample"]
              ,min_impurity_decrease = params["min_impurity_decrease"]
              ,criterion = "friedman_mse"
              ,loss = "squared_error"
              ,init = rf
              ,random_state=1412
              ,verbose=False)    
    cv = KFold(n_splits=5,shuffle=True,random_state=1412)
    validation_loss = cross_validate(reg,X,y
                                     ,scoring="neg_root_mean_squared_error"
                                     ,cv=cv
                                     ,verbose=False
                                     ,error_score='raise')
    return np.mean(abs(validation_loss["test_score"]))

def param_hyperopt(max_evals=100):    
    # Record the search history
    trials = Trials()    
    # Set up early stopping
    early_stop_fn = no_progress_loss(100)    
    # Run the TPE search (fmin builds and updates the surrogate model)
    params_best = fmin(hyperopt_objective
                       , space = param_grid_simple
                       , algo = tpe.suggest
                       , max_evals = max_evals
                       , verbose=True
                       , trials = trials
                       , early_stop_fn = early_stop_fn)    
    # Print the best parameters; fmin prints the best score automatically
    print("\n","\n","best params: ", params_best,
          "\n")
    return params_best, trials
params_best, trials = param_hyperopt(300) # narrow the parameter space while increasing the number of iterations
43%|███████████████████▉                          | 130/300 [06:12<08:06,  2.86s/trial, best loss: 26563.111168263265]

 best params:  {'learning_rate': 0.055, 'max_depth': 10.0, 'max_features': 13.0, 'min_impurity_decrease': 1.25, 'n_estimators': 124.0, 'subsample': 0.7675000000000001} 
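The run above ended at trial 130 of 300 because of early stopping. The idea behind a patience-based rule like no_progress_loss(100) can be sketched as follows; this is a simplified stand-in written for illustration, not HyperOpt's implementation, and the loss sequence is made up.

```python
def stop_index(losses, patience):
    """1-based trial at which the search stops: the first trial at which
    `patience` consecutive trials have failed to improve the best loss."""
    best = float("inf")
    since_best = 0
    for i, loss in enumerate(losses, start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:
            return i
    return len(losses)  # budget exhausted before the rule triggered

# Made-up losses: improvement during the first 3 trials, then a plateau.
losses = [10.0, 9.0, 8.0] + [8.5] * 10
print(stop_index(losses, patience=5))    # stops 5 trials after the last improvement
print(stop_index(losses, patience=100))  # patience never runs out, all trials are used
```

Under such a rule, a search that stops long before its budget has simply gone a full patience window without finding a better loss; it has not necessarily converged.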

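Turn off early stopping and continue iterating:

def param_hyperopt(max_evals=100):    
    # Record the search history
    trials = Trials()
    # Run the TPE search (fmin builds and updates the surrogate model)
    params_best = fmin(hyperopt_objective
                       , space = param_grid_simple
                       , algo = tpe.suggest
                       , max_evals = max_evals
                       , verbose=True
                       , trials = trials)  
    # Print the best parameters; fmin prints the best score automatically
    print("\n","\n","best params: ", params_best,
          "\n")
    return params_best, trials
params_best, trials = param_hyperopt(300) # early stopping disabled, keep iterating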
100%|███████████████████████████████████████████████| 300/300 [13:52<00:00,  2.77s/trial, best loss: 26412.12758959595]

 best params:  {'learning_rate': 0.085, 'max_depth': 10.0, 'max_features': 13.0, 'min_impurity_decrease': 2.75, 'n_estimators': 145.0, 'subsample': 0.675} 
start = time.time()
hyperopt_validation({'criterion': "friedman_mse",
                     'learning_rate': 0.085,
                     'loss': "squared_error",
                     'max_depth': 10.0,
                     'max_features': 13,
                     'min_impurity_decrease': 2.75,
                     'n_estimators': 145.0,
                     'subsample': 0.675})
26412.12758959595
end = (time.time() - start)
print(end)
2.7066071033477783
Algorithm               RF          AdaBoost    GBDT        RF (TPE)    AdaBoost (TPE)   GBDT (TPE)
5-fold CV running time  5.35s       0.99s       2.16s       1.36s       0.83s            2.70s(↑)
Best score (RMSE)       30571.267   35345.931   28783.954   28346.673   35169.73         26412.128(↓)

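GBDT achieved the best score so far. This set of parameters ends up with 145 trees, which makes GBDT slower than every other model here except the default random forest, but its RMSE is about 2,000 lower than that of the finely tuned random forest, which demonstrates GBDT's superior learning ability. Because TPE is a strongly random process, repeated runs give different results, but GBDT's score stays stable at around 26500. If computing power allows more iterations or a larger, denser parameter space, an even better score may be possible. Likewise, if a set of parameters with a large learning rate and a small number of iterations can be found, GBDT's training speed will improve as well.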
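The gap between GBDT after tuning and the other models can be checked directly from the scores reported in the tables (RMSE under 5-fold cross-validation):

```python
scores = {
    "RF": 30571.267,
    "AdaBoost": 35345.931,
    "GBDT": 28783.954,
    "RF (TPE)": 28346.673,
    "AdaBoost (TPE)": 35169.730,
    "GBDT (TPE)": 26412.128,
}

best = scores["GBDT (TPE)"]
for name, rmse in scores.items():
    gap = rmse - best  # how much worse this model is than tuned GBDT
    print("{:<15} RMSE {:>10.3f}  gap to GBDT (TPE) {:>9.3f} ({:.1%})".format(
        name, rmse, gap, gap / rmse))
```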

 


Origin blog.csdn.net/weixin_60200880/article/details/131968217