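GBDT Parameter Tuning
- Framework parameters
  - n_estimators: the maximum number of boosting iterations, i.e. the maximum number of weak learners.
  - learning_rate: the shrinkage coefficient ν applied to each weak learner's contribution, with 0 < ν ≤ 1.
  - subsample: the fraction of the training samples drawn, without replacement, to fit each weak learner. It takes values in (0, 1].
  - init: the initial estimator that supplies the starting prediction.
  - loss: the loss function optimized by the GBDT algorithm.
  - alpha: available only on GradientBoostingRegressor. It sets the quantile used by the Huber loss ("huber") and the quantile loss ("quantile"). The default is 0.9; if the data contain many noisy points, lower this value.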
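To make the interplay between learning_rate and n_estimators concrete, here is a toy sketch. It is not scikit-learn's implementation: it assumes every weak learner predicts the current residual of a single target value exactly, and `remaining_residual` is a hypothetical helper name. Under that assumption the residual after m rounds is (1 - ν)^m, so a smaller ν needs proportionally more rounds to reach a similar fit.

```python
def remaining_residual(learning_rate, n_estimators, target=1.0):
    """Boost a constant prediction toward `target` and return the final residual.

    Toy assumption: each weak learner predicts the current residual exactly,
    so the update is F_m = F_{m-1} + learning_rate * (target - F_{m-1}).
    """
    prediction = 0.0
    for _ in range(n_estimators):
        residual = target - prediction
        prediction += learning_rate * residual
    return target - prediction


# Halving the step size while doubling the rounds leaves a similar residual;
# halving the step size alone leaves a much larger one.
for nu, m in [(0.1, 60), (0.05, 120), (0.05, 60)]:
    print(f"learning_rate={nu}, n_estimators={m}: "
          f"residual={remaining_residual(nu, m):.6f}")
```

Real trees only approximate the residuals, so the numbers are illustrative, but the trade-off is the same one exploited later when the step size is reduced and the number of iterations is increased.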
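GBDT uses CART regression trees as its weak learners, so most of its remaining parameters come from the decision tree classes.
- Decision tree parameters
  - max_features: the maximum number of features considered when searching for a split
  - max_depth: the maximum depth of each tree
  - min_samples_split: the minimum number of samples an internal node must hold before it can be split further
  - min_samples_leaf: the minimum number of samples required at a leaf node
  - min_weight_fraction_leaf: the minimum fraction of the total sample weight required at a leaf node
  - max_leaf_nodes: the maximum number of leaf nodes
  - min_impurity_split: the impurity threshold below which a node is no longer split (deprecated in later scikit-learn releases in favor of min_impurity_decrease)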
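The following sketch shows how two of these constraints appear in the fitted trees. It uses synthetic data from `make_classification` rather than the tutorial's dataset, and the variable names (`X_demo`, `gbm_demo`, `depths`, `min_leaf_sizes`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the tutorial's train_modified.csv is not needed here.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

gbm_demo = GradientBoostingClassifier(
    n_estimators=5, max_depth=3, min_samples_split=50, min_samples_leaf=20,
    max_features='sqrt', random_state=0)
gbm_demo.fit(X_demo, y_demo)

depths = []
min_leaf_sizes = []
# For binary classification, estimators_ has shape (n_estimators, 1).
for stage in gbm_demo.estimators_[:, 0]:
    tree = stage.tree_
    is_leaf = tree.children_left == -1          # leaves have no left child
    depths.append(tree.max_depth)
    min_leaf_sizes.append(int(tree.n_node_samples[is_leaf].min()))

print("tree depths:", depths)                    # never above max_depth
print("smallest leaf per tree:", min_leaf_sizes) # never below min_samples_leaf
```

Raising min_samples_split or min_samples_leaf, or lowering max_depth, yields smaller trees and is the usual way to rein in overfitting in the individual weak learners.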
- Load the libraries
In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20 and is not used below.
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import matplotlib.pylab as plt
%matplotlib inline
- Inspect the data with pandas
In [2]:
train = pd.read_csv('train_modified.csv')
target = 'Disbursed'  # binary classification target
IDcol = 'ID'
train['Disbursed'].value_counts()
Out[2]:
0    19680
1      320
Name: Disbursed, dtype: int64
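The class counts in Out[2] are heavily imbalanced, which affects how the accuracy figures later in this notebook should be read. A short calculation, using only the two counts shown in Out[2], gives the accuracy of a model that always predicts class 0:

```python
# Class counts copied from Out[2].
class_counts = {0: 19680, 1: 320}

total = sum(class_counts.values())
positive_rate = class_counts[1] / total
majority_accuracy = max(class_counts.values()) / total

print(f"positive rate: {positive_rate:.3f}")                     # 0.016
print(f"accuracy of always predicting 0: {majority_accuracy:.3f}")  # 0.984
```

Because a constant prediction already reaches 0.984 accuracy, accuracy alone says little here; this is presumably why the grid searches below use scoring='roc_auc'.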
- Build the training set. The last column, Disbursed, is the class label; every other column except ID is a feature.
In [3]:
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train['Disbursed']
- Baseline: default parameters, no tuning
In [4]:
gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(X, y)
y_pred = gbm0.predict(X)
y_predprob = gbm0.predict_proba(X)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.9852
AUC Score (Train): 0.900531
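The AUC reported by metrics.roc_auc_score can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below illustrates that definition in pure Python. `pairwise_auc` is a hypothetical helper written for illustration; it is quadratic in the number of samples and is not how scikit-learn computes the score.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

Note that the figures in the cell above are computed on the training data itself, so they are optimistic; the cross-validated scores from the grid searches are the better guide to generalization.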
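Tuning
- Start with the step size (learning_rate) and the number of iterations (n_estimators).
- Fix a fairly small step size first, then grid-search the best number of iterations for it.
- We set the initial step size to 0.1 and grid-search the number of iterations as follows:
In [5]:
param_test1 = {'n_estimators': list(range(20, 81, 10))}
gsearch1 = GridSearchCV(
    estimator=GradientBoostingClassifier(
        learning_rate=0.1, min_samples_split=300, min_samples_leaf=20,
        max_depth=8, max_features='sqrt', subsample=0.8, random_state=10),
    param_grid=param_test1, scoring='roc_auc', cv=5,
    # iid=False is omitted: the argument was removed in scikit-learn 0.24.
    # return_train_score=True keeps the *_train_score entries shown in the output.
    return_train_score=True)
gsearch1.fit(X, y)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_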
Out[5]:
({'mean_fit_time': array([ 0.46405239, 0.60372586, 0.75624413, 0.89232144, 1.07256217, 1.20764227, 1.36656241]), 'mean_score_time': array([ 0.00360017, 0.00936003, 0.00311999, 0.00936007, 0.01247988, 0.01560001, 0.02184005]), 'mean_test_score': array([ 0.81284735, 0.81437929, 0.81451108, 0.81618196, 0.81778932, 0.81533362, 0.81321535]), 'mean_train_score': array([ 0.92266699, 0.93547209, 0.94317783, 0.95054641, 0.95735062, 0.96186315, 0.96634598]), 'param_n_estimators': masked_array(data = [20 30 40 50 60 70 80], mask = [False False False False False False False], fill_value = ?), 'params': [{'n_estimators': 20}, {'n_estimators': 30}, {'n_estimators': 40}, {'n_estimators': 50}, {'n_estimators': 60}, {'n_estimators': 70}, {'n_estimators': 80}], 'rank_test_score': array([7, 5, 4, 2, 1, 3, 6]), 'split0_test_score': array([ 0.80985614, 0.8153225 , 0.81237297, 0.81722204, 0.81813905, 0.81521731, 0.81381201]), 'split0_train_score': array([ 0.92169041, 0.93513613, 0.94237537, 0.94910189, 0.95769705, 0.96291767, 0.96594151]), 'split1_test_score': array([ 0.79491394, 0.79054918, 0.79181156, 0.79243879, 0.7998821 , 0.79331412, 0.79071392]), 'split1_train_score': array([ 0.92540536, 0.93807301, 0.94369035, 0.95120239, 0.95651828, 0.96082188, 0.96626927]), 'split2_test_score': array([ 0.7933955 , 0.80116036, 0.80013219, 0.80208532, 0.79902066, 0.79699608, 0.79585279]), 'split2_train_score': array([ 0.91999321, 0.93661623, 0.94265511, 0.94995018, 0.95538442, 0.95924315, 0.96427458]), 'split3_test_score': array([ 0.81871268, 0.81660871, 0.82047526, 0.82303774, 0.82663634, 0.82647358, 0.82416317]), 'split3_train_score': array([ 0.92746034, 0.93637333, 0.9459489 , 0.95354778, 0.95929489, 0.9638652 , 0.96732064]), 'split4_test_score': array([ 0.84735852, 0.84825568, 0.84776343, 0.84612591, 0.84526844, 0.84466702, 0.84153487]), 'split4_train_score': array([ 0.91878565, 0.93116177, 0.94121942, 0.94892983, 0.95785845, 0.96246784, 0.96792392]), 'std_fit_time': array([ 0.03545224, 
0.03637669, 0.03127342, 0.00623996, 0.02454179, 0.02329797, 0.01590896]), 'std_score_time': array([ 4.58717312e-03, 7.64243030e-03, 6.23998642e-03, 7.64246923e-03, 6.23993874e-03, 1.78416128e-07, 7.64231350e-03]), 'std_test_score': array([ 0.01966902, 0.01947349, 0.01932813, 0.01847797, 0.01735756, 0.0190036 , 0.01860096]), 'std_train_score': array([ 0.00327544, 0.00234852, 0.00159337, 0.00170259, 0.00132036, 0.00163918, 0.00125698])}, {'n_estimators': 60}, 0.81778931656504061)
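The raw cv_results_ dictionary in Out[5] is hard to scan. The sketch below copies the n_estimators grid and the two mean-score arrays from Out[5] and prints them as a ranked table; the variable names are illustrative.

```python
# Values copied from Out[5].
n_estimators_grid = [20, 30, 40, 50, 60, 70, 80]
mean_test_score = [0.81284735, 0.81437929, 0.81451108, 0.81618196,
                   0.81778932, 0.81533362, 0.81321535]
mean_train_score = [0.92266699, 0.93547209, 0.94317783, 0.95054641,
                    0.95735062, 0.96186315, 0.96634598]

# Sort candidates by cross-validated AUC, best first.
rows = sorted(zip(n_estimators_grid, mean_test_score, mean_train_score),
              key=lambda row: row[1], reverse=True)

print(f"{'n_estimators':>12} {'cv AUC':>8} {'train AUC':>10}")
for n, test, train_auc in rows:
    print(f"{n:>12} {test:>8.4f} {train_auc:>10.4f}")

best_n = rows[0][0]
```

The training AUC keeps rising with more trees while the cross-validated AUC peaks at 60 and then falls, which is the usual signature of overfitting. With a live search object, `pd.DataFrame(gsearch1.cv_results_)` gives the same information as a table.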
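- With a suitable number of iterations found (60 above), we move on to tuning the decision trees.
- First we grid-search the maximum tree depth max_depth together with min_samples_split, the minimum number of samples needed to split an internal node.
In [6]:
param_test2 = {'max_depth': list(range(3, 14, 2)),
               'min_samples_split': list(range(100, 801, 200))}
gsearch2 = GridSearchCV(
    estimator=GradientBoostingClassifier(
        learning_rate=0.1, n_estimators=60, min_samples_leaf=20,
        max_features='sqrt', subsample=0.8, random_state=10),
    param_grid=param_test2, scoring='roc_auc', cv=5,
    # iid=False is omitted: the argument was removed in scikit-learn 0.24.
    # return_train_score=True keeps the *_train_score entries shown in the output.
    return_train_score=True)
gsearch2.fit(X, y)
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_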
Out[6]:
({'mean_fit_time': array([ 0.5040072 , 0.4648807 , 0.45240068, 0.45864086, 0.72764158, 0.68640118, 0.66456118, 0.66424136, 1.03916183, 0.94536152, 0.88920159, 0.8642415 , 1.3946425 , 1.16688209, 1.06724191, 1.02024179, 1.70352292, 1.39584255, 1.2236423 , 1.14816208, 2.01656666, 1.50384254, 1.29792228, 1.19808207]), 'mean_score_time': array([ 0.00512018, 0.00312004, 0.00623994, 0.00311995, 0.00935993, 0.00936007, 0.01247993, 0.00624008, 0.01248007, 0.01248012, 0.01248002, 0.01560006, 0.01559997, 0.01559992, 0.01247993, 0.01248002, 0.01560001, 0.01748028, 0.01559992, 0.01560006, 0.0187201 , 0.0187201 , 0.01560001, 0.01872015]), 'mean_test_score': array([ 0.81198869, 0.81267388, 0.81237654, 0.80924956, 0.81845981, 0.81630145, 0.81314548, 0.81261631, 0.8180676 , 0.82137243, 0.8170295 , 0.81383067, 0.81107287, 0.80944487, 0.81476158, 0.81600927, 0.81100776, 0.81309427, 0.81712994, 0.81346505, 0.81483581, 0.8082464 , 0.81923074, 0.81382074]), 'mean_train_score': array([ 0.86494381, 0.8622539 , 0.86106945, 0.85794241, 0.91987404, 0.91122178, 0.90112146, 0.89483856, 0.95660931, 0.94393214, 0.9326904 , 0.92536259, 0.98193106, 0.96538959, 0.95290674, 0.94331725, 0.99221707, 0.97872749, 0.96704603, 0.95578115, 0.99648755, 0.98611237, 0.97292428, 0.96347527]), 'param_max_depth': masked_array(data = [3 3 3 3 5 5 5 5 7 7 7 7 9 9 9 9 11 11 11 11 13 13 13 13], mask = [False False False False False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'param_min_samples_split': masked_array(data = [100 300 500 700 100 300 500 700 100 300 500 700 100 300 500 700 100 300 500 700 100 300 500 700], mask = [False False False False False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'params': [{'max_depth': 3, 'min_samples_split': 100}, {'max_depth': 3, 'min_samples_split': 300}, {'max_depth': 3, 'min_samples_split': 500}, 
{'max_depth': 3, 'min_samples_split': 700}, {'max_depth': 5, 'min_samples_split': 100}, {'max_depth': 5, 'min_samples_split': 300}, {'max_depth': 5, 'min_samples_split': 500}, {'max_depth': 5, 'min_samples_split': 700}, {'max_depth': 7, 'min_samples_split': 100}, {'max_depth': 7, 'min_samples_split': 300}, {'max_depth': 7, 'min_samples_split': 500}, {'max_depth': 7, 'min_samples_split': 700}, {'max_depth': 9, 'min_samples_split': 100}, {'max_depth': 9, 'min_samples_split': 300}, {'max_depth': 9, 'min_samples_split': 500}, {'max_depth': 9, 'min_samples_split': 700}, {'max_depth': 11, 'min_samples_split': 100}, {'max_depth': 11, 'min_samples_split': 300}, {'max_depth': 11, 'min_samples_split': 500}, {'max_depth': 11, 'min_samples_split': 700}, {'max_depth': 13, 'min_samples_split': 100}, {'max_depth': 13, 'min_samples_split': 300}, {'max_depth': 13, 'min_samples_split': 500}, {'max_depth': 13, 'min_samples_split': 700}], 'rank_test_score': array([19, 16, 18, 23, 3, 7, 14, 17, 4, 1, 6, 11, 20, 22, 10, 8, 21, 15, 5, 13, 9, 24, 2, 12]), 'split0_test_score': array([ 0.81595568, 0.81469131, 0.81249008, 0.81140037, 0.81834548, 0.813157 , 0.81389934, 0.81455435, 0.81247618, 0.81584651, 0.81171002, 0.81463375, 0.81160283, 0.79811754, 0.80339534, 0.8241334 , 0.80091821, 0.79846688, 0.81144404, 0.80298249, 0.80540404, 0.80905623, 0.81889926, 0.80580896]), 'split0_train_score': array([ 0.86212766, 0.85983599, 0.85925715, 0.85583682, 0.92027642, 0.91036379, 0.89830898, 0.89271483, 0.95575894, 0.94462226, 0.93295511, 0.9235495 , 0.98033167, 0.96412311, 0.95247272, 0.93674885, 0.99078257, 0.9756765 , 0.96545844, 0.95588014, 0.99602589, 0.98666841, 0.972869 , 0.9635465 ]), 'split1_test_score': array([ 0.80087851, 0.79917945, 0.80264704, 0.79988607, 0.80374865, 0.800158 , 0.80373277, 0.79130939, 0.79127564, 0.80364147, 0.80661085, 0.78179584, 0.78692677, 0.78991997, 0.79036061, 0.78085302, 0.78124206, 0.77386226, 0.79696233, 0.77971966, 0.79841527, 0.77129978, 0.8000925 , 
0.78926496]), 'split1_train_score': array([ 0.87120242, 0.86666361, 0.86601828, 0.86354586, 0.92061919, 0.9143518 , 0.90965767, 0.89885879, 0.95923236, 0.94287258, 0.94052571, 0.93206341, 0.98074341, 0.96363831, 0.95255943, 0.94647502, 0.99102225, 0.97921803, 0.96812265, 0.95409524, 0.99580569, 0.98611847, 0.9731619 , 0.96292921]), 'split2_test_score': array([ 0.77799678, 0.78258186, 0.78163904, 0.77523183, 0.79288737, 0.79426885, 0.78249849, 0.78640673, 0.80465376, 0.80370498, 0.7928437 , 0.79515808, 0.78876278, 0.78355445, 0.80375659, 0.79261346, 0.79967964, 0.80688874, 0.79119228, 0.79433633, 0.79785752, 0.80018777, 0.80151367, 0.79806394]), 'split2_train_score': array([ 0.86714334, 0.86632929, 0.86495612, 0.8616039 , 0.91754957, 0.91427265, 0.89810478, 0.89679637, 0.95248153, 0.94382396, 0.92439481, 0.92567903, 0.98094835, 0.96515153, 0.95307513, 0.94545206, 0.9926604 , 0.98029842, 0.96971068, 0.95601673, 0.99651765, 0.98676368, 0.97312915, 0.9661503 ]), 'split3_test_score': array([ 0.83055648, 0.83071527, 0.829175 , 0.82860733, 0.83614988, 0.83314675, 0.83231509, 0.83529638, 0.83712644, 0.83983184, 0.83185063, 0.83223371, 0.82318066, 0.82017554, 0.83634043, 0.83131669, 0.83193201, 0.83867664, 0.82928219, 0.84128081, 0.83871237, 0.81992743, 0.83226348, 0.82270627]), 'split3_train_score': array([ 0.86569722, 0.86175636, 0.86113051, 0.85574279, 0.92692938, 0.91676802, 0.90241607, 0.89450507, 0.95932218, 0.94546174, 0.93014278, 0.92215499, 0.98479307, 0.96734061, 0.95563389, 0.942621 , 0.99382987, 0.98099536, 0.96486744, 0.95758627, 0.99731086, 0.98479133, 0.9746033 , 0.9633598 ]), 'split4_test_score': array([ 0.83455602, 0.83620149, 0.83593155, 0.83112217, 0.84116767, 0.84077665, 0.83328173, 0.83551472, 0.84480596, 0.84383733, 0.84213232, 0.84533195, 0.84489131, 0.85545684, 0.8399549 , 0.8511298 , 0.84126691, 0.84757685, 0.85676885, 0.84900597, 0.83378986, 0.84076077, 0.84338478, 0.85325958]), 'split4_train_score': array([ 0.85854842, 0.85668424, 0.85398517, 
0.85298268, 0.91399563, 0.90035266, 0.89711979, 0.89131772, 0.95625156, 0.94288015, 0.93543361, 0.92336602, 0.98283882, 0.96669441, 0.95079251, 0.9452893 , 0.99279028, 0.97744912, 0.96707091, 0.95532735, 0.99677767, 0.98621995, 0.97085807, 0.96139055]), 'std_fit_time': array([ 0.03818128, 0.01167409, 0.00986621, 0.00764233, 0.00829129, 0.00986628, 0.01247991, 0.0154769 , 0.01618613, 0.00764239, 0.01708904, 0.00764245, 0.02116084, 0.0152848 , 0.01213157, 0.01248006, 0.04972466, 0.01561664, 0.00723594, 0.00764243, 0.0453095 , 0.01590906, 0.01167405, 0.00623994]), 'std_score_time': array([ 6.51612114e-03, 6.24008179e-03, 7.64233297e-03, 6.23989105e-03, 7.64235243e-03, 7.64246923e-03, 6.23996258e-03, 7.64250817e-03, 6.24003410e-03, 6.24005795e-03, 6.24001026e-03, 1.16800773e-07, 1.50789149e-07, 1.78416128e-07, 6.23996258e-03, 6.24001026e-03, 1.78416128e-07, 3.76062393e-03, 9.53674316e-08, 1.16800773e-07, 6.24003411e-03, 6.24003411e-03, 9.53674316e-08, 6.24012947e-03]), 'std_test_score': array([ 0.02073003, 0.01985316, 0.01937264, 0.02050678, 0.01843348, 0.01810378, 0.01898077, 0.02089692, 0.02003588, 0.01733484, 0.01772916, 0.02326604, 0.0217777 , 0.02612314, 0.01972845, 0.02575692, 0.02222414, 0.0269634 , 0.02379388, 0.02702376, 0.017755 , 0.02290974, 0.01693249, 0.02258241]), 'std_train_score': array([ 0.00432221, 0.00382542, 0.00431445, 0.00396676, 0.00425332, 0.00580929, 0.00463825, 0.00272077, 0.00253495, 0.00100568, 0.00537205, 0.00353732, 0.00167029, 0.00143085, 0.00156491, 0.00352269, 0.00114992, 0.00193879, 0.0017622 , 0.00112889, 0.00053684, 0.00070571, 0.00119915, 0.00153743])}, {'max_depth': 7, 'min_samples_split': 300}, 0.82137242759146323)
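A two-parameter grid grows multiplicatively, which is worth keeping in mind before widening the ranges. This sketch enumerates the grid from In [6] with itertools and counts the model fits implied by 5-fold cross-validation; `candidates`, `n_candidates` and `n_fits` are illustrative names.

```python
from itertools import product

param_test2 = {'max_depth': list(range(3, 14, 2)),
               'min_samples_split': list(range(100, 801, 200))}
cv_folds = 5

# Cartesian product of all parameter values, in the order the grid lists them.
candidates = [dict(zip(param_test2, values))
              for values in product(*param_test2.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
print("candidate at index 9:", candidates[9])
```

The 24 candidates match the 24 entries in each array of Out[6], and index 9 is the entry with rank_test_score 1. GridSearchCV also refits the best candidate on the full data by default, which adds one more fit.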
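- In the Out[6] result, the best max_depth is 7 and the best min_samples_split is 300.
- A tree depth of 7 is a reasonable value, so we fix it.
- We do not fix min_samples_split yet, because it interacts with the other tree parameters.
- Next we tune min_samples_split together with min_samples_leaf, the minimum number of samples at a leaf node.
In [7]:
param_test3 = {'min_samples_split': list(range(800, 1900, 200)),
               'min_samples_leaf': list(range(60, 101, 10))}
gsearch3 = GridSearchCV(
    estimator=GradientBoostingClassifier(
        learning_rate=0.1, n_estimators=60, max_depth=7,
        max_features='sqrt', subsample=0.8, random_state=10),
    param_grid=param_test3, scoring='roc_auc', cv=5,
    # iid=False is omitted: the argument was removed in scikit-learn 0.24.
    # return_train_score=True keeps the *_train_score entries shown in the output.
    return_train_score=True)
gsearch3.fit(X, y)
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_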
Out[7]:
({'mean_fit_time': array([ 0.98147159, 0.91485825, 0.81120129, 0.79560137, 0.76364188, 0.73632126, 0.84240146, 0.81744151, 0.80808153, 0.79244418, 0.75816131, 0.73712134, 0.84552155, 0.81472139, 0.79560137, 0.76996374, 0.7618021 , 0.73048124, 0.85560212, 0.81432137, 0.79560142, 0.76440134, 0.75192137, 0.74256115, 0.83616142, 0.81120152, 0.77376142, 0.75260158, 0.74256115, 0.73320136]), 'mean_score_time': array([ 0.01524043, 0.01424046, 0.01248012, 0.01248002, 0.01559997, 0.01248002, 0.01247997, 0.01559992, 0.00935984, 0.0156002 , 0.01560006, 0.01248002, 0.01248002, 0.01248012, 0.01248002, 0.01348019, 0.01247997, 0.01560001, 0.01560016, 0.00936007, 0.00935998, 0.01560001, 0.01248002, 0.00624003, 0.01248012, 0.00935988, 0.01559997, 0.01248007, 0.01248016, 0.01559992]), 'mean_test_score': array([ 0.81827919, 0.81731255, 0.8222033 , 0.8144698 , 0.81495371, 0.81528439, 0.81590487, 0.81572702, 0.82021405, 0.81512044, 0.81395095, 0.81586914, 0.82063564, 0.81489972, 0.82009218, 0.81850308, 0.81855231, 0.81665754, 0.81960231, 0.81560198, 0.81936055, 0.81361749, 0.81429473, 0.81299027, 0.81999889, 0.82209294, 0.81820535, 0.81921804, 0.81544596, 0.8170426 ]), 'mean_train_score': array([ 0.92087206, 0.91440802, 0.91027269, 0.90526927, 0.90050771, 0.89715931, 0.91948312, 0.91372936, 0.91031055, 0.90391826, 0.90025277, 0.89628226, 0.91950803, 0.91364818, 0.90869026, 0.90336797, 0.89789352, 0.89484074, 0.91861416, 0.91142069, 0.90642938, 0.9016713 , 0.89810865, 0.89368284, 0.91955286, 0.91187216, 0.9060234 , 0.90194315, 0.89785241, 0.89320016]), 'param_min_samples_leaf': masked_array(data = [60 60 60 60 60 60 70 70 70 70 70 70 80 80 80 80 80 80 90 90 90 90 90 90 100 100 100 100 100 100], mask = [False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'param_min_samples_split': masked_array(data = [800 1000 1200 1400 1600 1800 800 1000 1200 
1400 1600 1800 800 1000 1200 1400 1600 1800 800 1000 1200 1400 1600 1800 800 1000 1200 1400 1600 1800], mask = [False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'params': [{'min_samples_leaf': 60, 'min_samples_split': 800}, {'min_samples_leaf': 60, 'min_samples_split': 1000}, {'min_samples_leaf': 60, 'min_samples_split': 1200}, {'min_samples_leaf': 60, 'min_samples_split': 1400}, {'min_samples_leaf': 60, 'min_samples_split': 1600}, {'min_samples_leaf': 60, 'min_samples_split': 1800}, {'min_samples_leaf': 70, 'min_samples_split': 800}, {'min_samples_leaf': 70, 'min_samples_split': 1000}, {'min_samples_leaf': 70, 'min_samples_split': 1200}, {'min_samples_leaf': 70, 'min_samples_split': 1400}, {'min_samples_leaf': 70, 'min_samples_split': 1600}, {'min_samples_leaf': 70, 'min_samples_split': 1800}, {'min_samples_leaf': 80, 'min_samples_split': 800}, {'min_samples_leaf': 80, 'min_samples_split': 1000}, {'min_samples_leaf': 80, 'min_samples_split': 1200}, {'min_samples_leaf': 80, 'min_samples_split': 1400}, {'min_samples_leaf': 80, 'min_samples_split': 1600}, {'min_samples_leaf': 80, 'min_samples_split': 1800}, {'min_samples_leaf': 90, 'min_samples_split': 800}, {'min_samples_leaf': 90, 'min_samples_split': 1000}, {'min_samples_leaf': 90, 'min_samples_split': 1200}, {'min_samples_leaf': 90, 'min_samples_split': 1400}, {'min_samples_leaf': 90, 'min_samples_split': 1600}, {'min_samples_leaf': 90, 'min_samples_split': 1800}, {'min_samples_leaf': 100, 'min_samples_split': 800}, {'min_samples_leaf': 100, 'min_samples_split': 1000}, {'min_samples_leaf': 100, 'min_samples_split': 1200}, {'min_samples_leaf': 100, 'min_samples_split': 1400}, {'min_samples_leaf': 100, 'min_samples_split': 1600}, {'min_samples_leaf': 100, 'min_samples_split': 1800}], 'rank_test_score': array([12, 14, 1, 26, 24, 22, 17, 19, 4, 23, 28, 18, 3, 25, 5, 11, 
10, 16, 7, 20, 8, 29, 27, 30, 6, 2, 13, 9, 21, 15]), 'split0_test_score': array([ 0.81074338, 0.81265085, 0.82638029, 0.80854611, 0.81629907, 0.81756542, 0.81436579, 0.81356787, 0.81387751, 0.8167417 , 0.81225586, 0.8124603 , 0.82038396, 0.81525303, 0.82509011, 0.82155107, 0.82085437, 0.81130907, 0.8141812 , 0.81352023, 0.82424654, 0.80720036, 0.82315088, 0.81182117, 0.81980437, 0.82184681, 0.81924265, 0.81771429, 0.81448687, 0.81808943]), 'split0_train_score': array([ 0.92057391, 0.91292057, 0.91244929, 0.90219836, 0.90016571, 0.90358281, 0.91885723, 0.91618161, 0.9070643 , 0.90089119, 0.89924696, 0.89898074, 0.91949872, 0.91556109, 0.9079606 , 0.89851702, 0.89702054, 0.89789476, 0.91984 , 0.91114658, 0.90538236, 0.90245143, 0.89943552, 0.89723019, 0.92114394, 0.91206199, 0.90619591, 0.90022228, 0.89914672, 0.89662493]), 'split1_test_score': array([ 0.7951283 , 0.79233756, 0.80273636, 0.79005891, 0.80018579, 0.79693852, 0.79603341, 0.79458048, 0.79516999, 0.795168 , 0.79620808, 0.80128938, 0.80107303, 0.79358208, 0.79414579, 0.80020762, 0.79499532, 0.79650383, 0.8008408 , 0.79762529, 0.79116052, 0.7969663 , 0.78846704, 0.79792897, 0.80096981, 0.80635877, 0.79628747, 0.79954864, 0.79475316, 0.79569002]), 'split1_train_score': array([ 0.92293766, 0.91419474, 0.91336419, 0.91015935, 0.9062108 , 0.90082395, 0.9190685 , 0.91656581, 0.91922258, 0.9111771 , 0.90326077, 0.90058762, 0.91803512, 0.91581515, 0.91237758, 0.90895974, 0.90145415, 0.89436117, 0.9224696 , 0.91450935, 0.90733945, 0.90298846, 0.89767568, 0.89760633, 0.91678551, 0.91257843, 0.90944442, 0.90616998, 0.90128072, 0.89712351]), 'split2_test_score': array([ 0.79557887, 0.79529702, 0.79065239, 0.79493577, 0.79307792, 0.78586882, 0.78244688, 0.78915976, 0.79308983, 0.79021969, 0.78667866, 0.78708953, 0.78209159, 0.7814465 , 0.78694264, 0.78666079, 0.79528709, 0.79119228, 0.79108113, 0.78673423, 0.79084691, 0.78443772, 0.78295104, 0.78006701, 0.78457865, 0.79912983, 0.78818915, 0.78846108, 0.78831817, 
0.78356636]), 'split2_train_score': array([ 0.92102547, 0.91630368, 0.9102349 , 0.90583752, 0.90075758, 0.89876488, 0.91873529, 0.90970829, 0.91140611, 0.90462624, 0.90124078, 0.89356908, 0.9246712 , 0.91307055, 0.9083247 , 0.90967033, 0.90187854, 0.89756452, 0.92094285, 0.90886 , 0.91007202, 0.90490053, 0.89955735, 0.89381644, 0.91946151, 0.91229322, 0.91028452, 0.90675788, 0.90101785, 0.89461734]), 'split3_test_score': array([ 0.84421645, 0.83124127, 0.84106842, 0.83822806, 0.83325394, 0.83313881, 0.8347823 , 0.83651312, 0.84387108, 0.83251358, 0.83471084, 0.83596727, 0.84400803, 0.83999659, 0.84291039, 0.83963732, 0.83339883, 0.83359931, 0.83206102, 0.83171367, 0.83399231, 0.84131058, 0.83491925, 0.83320035, 0.84262656, 0.83396651, 0.83851983, 0.83680489, 0.83238456, 0.83610026]), 'split3_train_score': array([ 0.91985712, 0.91278858, 0.90916989, 0.90917981, 0.8996649 , 0.89287946, 0.92095439, 0.91663491, 0.90861548, 0.90378068, 0.90020727, 0.894334 , 0.9203521 , 0.91438653, 0.90699024, 0.90188499, 0.89512101, 0.8920427 , 0.91853568, 0.91232387, 0.90701095, 0.90172695, 0.89739916, 0.88956123, 0.92465557, 0.91743494, 0.90251656, 0.90173427, 0.89879403, 0.89120223]), 'split4_test_score': array([ 0.84572893, 0.85503605, 0.85017904, 0.84058014, 0.83195185, 0.84291039, 0.85189596, 0.8448139 , 0.85506185, 0.84095925, 0.83990131, 0.84253922, 0.85562159, 0.84422042, 0.85137195, 0.8444586 , 0.84822591, 0.8506832 , 0.8598474 , 0.84841646, 0.85655647, 0.83817248, 0.84198544, 0.84193383, 0.85201505, 0.84916278, 0.84878763, 0.85356128, 0.84728706, 0.85176694]), 'split4_train_score': array([ 0.91996616, 0.91583252, 0.90614517, 0.89897131, 0.89573955, 0.88974545, 0.91980018, 0.9095562 , 0.90524428, 0.89911608, 0.8973081 , 0.89393988, 0.91498299, 0.90940758, 0.90779821, 0.8978078 , 0.89399335, 0.89234056, 0.91128267, 0.91026368, 0.90234214, 0.89628911, 0.89647557, 0.89019999, 0.91571777, 0.9049922 , 0.90167559, 0.89483134, 0.8890227 , 0.8864328 ]), 'std_fit_time': array([ 
3.07595384e-02, 4.92428119e-02, 9.86628483e-03, 9.86636022e-03, 8.46375555e-03, 1.52848801e-02, 9.86620943e-03, 7.64233297e-03, 2.68393017e-02, 3.65854644e-02, 1.59089904e-02, 1.88016029e-02, 6.24008179e-03, 1.15942915e-02, 9.86636022e-03, 5.86673508e-03, 5.50535106e-03, 5.95249423e-03, 1.23117482e-02, 1.16739329e-02, 9.53674316e-08, 9.86636022e-03, 2.29272195e-02, 1.24799252e-02, 7.64250817e-03, 9.86636022e-03, 7.64239137e-03, 6.04519629e-03, 7.64250817e-03, 9.86628483e-03]), 'std_score_time': array([ 1.13410486e-03, 7.49479274e-03, 6.24005795e-03, 6.24001026e-03, 1.50789149e-07, 6.24001026e-03, 6.23998642e-03, 1.78416128e-07, 7.64227456e-03, 0.00000000e+00, 9.86628483e-03, 6.24001026e-03, 6.24001026e-03, 6.24005795e-03, 6.24001026e-03, 4.23979759e-03, 6.23998642e-03, 2.33601546e-07, 9.53674316e-08, 7.64246923e-03, 7.64239137e-03, 1.78416128e-07, 6.24001026e-03, 7.64244977e-03, 6.24005795e-03, 7.64231350e-03, 1.50789149e-07, 6.24003410e-03, 6.24008179e-03, 9.53674316e-08]), 'std_test_score': array([ 0.02251349, 0.02344029, 0.02249624, 0.02125448, 0.01626215, 0.02139638, 0.02517299, 0.02207155, 0.02520756, 0.01995466, 0.02081276, 0.02082155, 0.02697661, 0.02475173, 0.0256756 , 0.02226338, 0.02098782, 0.02248567, 0.024371 , 0.02234824, 0.02541561, 0.02253777, 0.0241664 , 0.02262002, 0.02511487, 0.01815867, 0.02336834, 0.02376507, 0.02220721, 0.0250865 ]), 'std_train_score': array([ 0.00111623, 0.00144937, 0.00255143, 0.00421006, 0.00335113, 0.00510982, 0.00082317, 0.00334919, 0.00489291, 0.00413364, 0.0019854 , 0.00291416, 0.0031628 , 0.00233309, 0.00189464, 0.00505241, 0.00323168, 0.00249223, 0.00388707, 0.0019145 , 0.00253918, 0.00288938, 0.00120143, 0.00337974, 0.00319198, 0.00397468, 0.00349547, 0.00435042, 0.00452325, 0.00397288])}, {'min_samples_leaf': 60, 'min_samples_split': 1200}, 0.82220329966971539)
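When a grid search picks a value at the edge of the range that was tried, the true optimum may lie outside the grid. The sketch below checks this for the result in Out[7]; `params_on_grid_edge` is a hypothetical helper, not part of scikit-learn.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return the parameters whose best value is the smallest or largest value tried."""
    on_edge = []
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        if len(values) > 1 and best in (values[0], values[-1]):
            on_edge.append(name)
    return on_edge


param_test3 = {'min_samples_split': list(range(800, 1900, 200)),
               'min_samples_leaf': list(range(60, 101, 10))}
best_params3 = {'min_samples_leaf': 60, 'min_samples_split': 1200}  # from Out[7]

edge_params = params_on_grid_edge(param_test3, best_params3)
print(edge_params)
```

Here min_samples_leaf=60 is the lowest value in its range, so a follow-up search with smaller values could be worthwhile. The tutorial does not run that search and simply continues with 60.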
- Plug the tuned parameters into the GBDT class and check the result. We now fit the data with the new parameters:
In [8]:
gbm1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, max_depth=7,
                                  min_samples_leaf=60, min_samples_split=1200,
                                  max_features='sqrt', subsample=0.8, random_state=10)
gbm1.fit(X, y)
y_pred = gbm1.predict(X)
y_predprob = gbm1.predict_proba(X)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.908099
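- Compared with the completely untuned fit at the start, training accuracy drops slightly (0.9852 to 0.984). The main reason is the 0.8 subsample ratio: each tree is fitted on a random 80% of the samples, so 20% of the data is left out of every individual fit.
- Next we grid-search the maximum number of features, max_features:
In [9]:
param_test4 = {'max_features': list(range(7, 20, 2))}
gsearch4 = GridSearchCV(
    estimator=GradientBoostingClassifier(
        learning_rate=0.1, n_estimators=60, max_depth=7, min_samples_leaf=60,
        min_samples_split=1200, subsample=0.8, random_state=10),
    param_grid=param_test4, scoring='roc_auc', cv=5,
    # iid=False is omitted: the argument was removed in scikit-learn 0.24.
    # return_train_score=True keeps the *_train_score entries shown in the output.
    return_train_score=True)
gsearch4.fit(X, y)
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_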
Out[9]:
({'mean_fit_time': array([ 0.8245635 , 1.01770296, 1.08365922, 1.13576221, 1.25328522, 1.33576236, 1.4449626 ]), 'mean_score_time': array([ 0.01247993, 0.01312051, 0.01488018, 0.01248002, 0.01560011, 0.00935998, 0.01248012]), 'mean_test_score': array([ 0.8222033 , 0.82241251, 0.82108184, 0.82064239, 0.82198258, 0.81354802, 0.81876866]), 'mean_train_score': array([ 0.91027269, 0.91158373, 0.91466399, 0.91573707, 0.91639903, 0.91679107, 0.91668619]), 'param_max_features': masked_array(data = [7 9 11 13 15 17 19], mask = [False False False False False False False], fill_value = ?), 'params': [{'max_features': 7}, {'max_features': 9}, {'max_features': 11}, {'max_features': 13}, {'max_features': 15}, {'max_features': 17}, {'max_features': 19}], 'rank_test_score': array([2, 1, 4, 5, 3, 7, 6]), 'split0_test_score': array([ 0.82638029, 0.81711088, 0.82152129, 0.81253771, 0.81923074, 0.80582087, 0.81814501]), 'split0_train_score': array([ 0.91244929, 0.91012226, 0.91923163, 0.91558763, 0.92099594, 0.9204073 , 0.91439534]), 'split1_test_score': array([ 0.80273636, 0.79887775, 0.79841328, 0.80308768, 0.80473315, 0.79439787, 0.79727992]), 'split1_train_score': array([ 0.91336419, 0.91048524, 0.91247397, 0.91787496, 0.91871246, 0.9160614 , 0.9178571 ]), 'split2_test_score': array([ 0.79065239, 0.7965336 , 0.79322877, 0.80307974, 0.80786927, 0.79174209, 0.79920525]), 'split2_train_score': array([ 0.9102349 , 0.91210665, 0.91137162, 0.9146365 , 0.91232176, 0.91599317, 0.91339322]), 'split3_test_score': array([ 0.84106842, 0.84023477, 0.83887513, 0.83253343, 0.83357747, 0.8346513 , 0.83625905]), 'split3_train_score': array([ 0.90916989, 0.90980778, 0.91932914, 0.91526199, 0.91442685, 0.91806489, 0.92131489]), 'split4_test_score': array([ 0.85017904, 0.85930553, 0.85337073, 0.85197337, 0.84450227, 0.84112797, 0.84295406]), 'split4_train_score': array([ 0.90614517, 0.91539671, 0.91091361, 0.91532427, 0.91553814, 0.91342858, 0.91647041]), 'std_fit_time': array([ 0.01370075, 
0.04088774, 0.07686664, 0.01820694, 0.02534589, 0.01270362, 0.01270356]), 'std_score_time': array([ 6.23996258e-03, 8.00890142e-03, 1.43980980e-03, 6.24001026e-03, 1.16800773e-07, 7.64239137e-03, 6.24005795e-03]), 'std_test_score': array([ 0.02249624, 0.02420925, 0.02301749, 0.01900172, 0.01513855, 0.0205326 , 0.01863185]), 'std_train_score': array([ 0.00255143, 0.00206441, 0.00380337, 0.00111358, 0.00308993, 0.00233132, 0.00279049])}, {'max_features': 9}, 0.82241250635162599)
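In Out[9], max_features=7 gives a mean test score of 0.8222033, the same value the max_features='sqrt' run produced in Out[7]. That is what one would expect if 'sqrt' resolves to 7 features on this dataset. The sketch below assumes scikit-learn resolves 'sqrt' as max(1, int(sqrt(n_features))) and lists the feature counts consistent with that; it is an inference from the outputs, not something the notebook states, and `feature_counts_matching_sqrt` is a hypothetical helper.

```python
import math


def feature_counts_matching_sqrt(k, upper=200):
    """Feature counts n for which max(1, int(sqrt(n))) equals k.

    Assumption: this mirrors how max_features='sqrt' is turned into an integer.
    """
    return [n for n in range(1, upper + 1)
            if max(1, int(math.sqrt(n))) == k]


matching = feature_counts_matching_sqrt(7)
print(matching[0], "to", matching[-1])
```

Under that assumption the training set would have between 49 and 63 feature columns, so the grid 7, 9, ..., 19 explores values from the 'sqrt' setting upward.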
- We then grid-search the subsample ratio:
In [10]:
param_test5 = {'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]}
gsearch5 = GridSearchCV(
    estimator=GradientBoostingClassifier(
        learning_rate=0.1, n_estimators=60, max_depth=7, min_samples_leaf=60,
        min_samples_split=1200, max_features=9, random_state=10),
    param_grid=param_test5, scoring='roc_auc', cv=5,
    # iid=False is omitted: the argument was removed in scikit-learn 0.24.
    # return_train_score=True keeps the *_train_score entries shown in the output.
    return_train_score=True)
gsearch5.fit(X, y)
gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_
Out[10]:
({'mean_fit_time': array([ 0.83184319, 0.88940153, 0.90480151, 0.91104164, 0.90820169, 0.91936193]), 'mean_score_time': array([ 0.00935998, 0.00936003, 0.01560016, 0.01248007, 0.01248007, 0.00935998]), 'mean_test_score': array([ 0.8182768 , 0.8234379 , 0.81673217, 0.82241251, 0.8228468 , 0.81738003]), 'mean_train_score': array([ 0.8997337 , 0.90444219, 0.90921547, 0.91158373, 0.91429419, 0.91560448]), 'param_subsample': masked_array(data = [0.6 0.7 0.75 0.8 0.85 0.9], mask = [False False False False False False], fill_value = ?), 'params': [{'subsample': 0.6}, {'subsample': 0.7}, {'subsample': 0.75}, {'subsample': 0.8}, {'subsample': 0.85}, {'subsample': 0.9}], 'rank_test_score': array([4, 1, 6, 3, 2, 5]), 'split0_test_score': array([ 0.81290492, 0.82635051, 0.81494141, 0.81711088, 0.82498491, 0.81622563]), 'split0_train_score': array([ 0.89892342, 0.90458059, 0.91068262, 0.91012226, 0.91387183, 0.91733049]), 'split1_test_score': array([ 0.79736527, 0.80361963, 0.79503303, 0.79887775, 0.79882614, 0.78725229]), 'split1_train_score': array([ 0.90234884, 0.91528494, 0.91023316, 0.91048524, 0.91739785, 0.9182774 ]), 'split2_test_score': array([ 0.79080721, 0.78665087, 0.79387187, 0.7965336 , 0.79332404, 0.80056093]), 'split2_train_score': array([ 0.9047034 , 0.9039783 , 0.90889014, 0.91210665, 0.91532365, 0.91634698]), 'split3_test_score': array([ 0.8352805 , 0.83492719, 0.82686658, 0.84023477, 0.83811492, 0.83271405]), 'split3_train_score': array([ 0.89647656, 0.89508305, 0.9073727 , 0.90980778, 0.91132584, 0.91255982]), 'split4_test_score': array([ 0.85502612, 0.86564128, 0.85294795, 0.85930553, 0.85898398, 0.85014728]), 'split4_train_score': array([ 0.89621629, 0.90328409, 0.9088987 , 0.91539671, 0.91355176, 0.91350773]), 'std_fit_time': array([ 0.00880864, 0.01373345, 0.01708891, 0.01590903, 0.01735848, 0.01942842]), 'std_score_time': array([ 7.64239137e-03, 7.64243030e-03, 9.53674316e-08, 6.24003411e-03, 6.24003410e-03, 7.64239137e-03]), 'std_test_score': array([ 
0.02391805, 0.0270838 , 0.02195879, 0.02420925, 0.0244629 , 0.0223639 ]), 'std_train_score': array([ 0.00332188, 0.00643015, 0.00116535, 0.00206441, 0.00201162, 0.00220641])}, {'subsample': 0.7}, 0.82343789697662617)
- We now have essentially all of the tuned parameters. At this point we can halve the step size and double the maximum number of iterations to improve the model's generalization. Fit the model again:
In [11]:
gbm2 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=120, max_depth=7,
                                  min_samples_leaf=60, min_samples_split=1200,
                                  max_features=9, subsample=0.7, random_state=10)
gbm2.fit(X, y)
y_pred = gbm2.predict(X)
y_predprob = gbm2.predict_proba(X)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.905324
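- The training AUC is slightly lower than in the previous version (0.905324 versus 0.908099). To improve generalization and guard against overfitting, we halved the step size, doubled the maximum number of iterations and also lowered the subsample ratio, all of which reduce how closely the model fits the training set.
- Next we shrink the step size by a further factor of 5 and increase the maximum number of iterations fivefold, then fit the model again:
In [12]:
gbm3 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600, max_depth=7,
                                  min_samples_leaf=60, min_samples_split=1200,
                                  max_features=9, subsample=0.7, random_state=10)
gbm3.fit(X, y)
y_pred = gbm3.predict(X)
y_predprob = gbm3.predict_proba(X)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.908581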
- Finally we halve the step size once more and double the maximum number of iterations, then fit the model:
In [13]:
gbm4 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1200, max_depth=7,
                                  min_samples_leaf=60, min_samples_split=1200,
                                  max_features=9, subsample=0.7, random_state=10)
gbm4.fit(X, y)
y_pred = gbm4.predict(X)
y_predprob = gbm4.predict_proba(X)[:, 1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.908232
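The four tuned fits (In [8], In [11], In [12], In [13]) follow one pattern: each time the step size shrinks, the number of iterations grows by the same factor. The sketch below tabulates the settings and training AUC values printed by those cells; the tuple layout is an illustrative choice.

```python
# (model, learning_rate, n_estimators, training AUC) copied from the cell outputs.
runs = [
    ("gbm1", 0.1, 60, 0.908099),
    ("gbm2", 0.05, 120, 0.905324),
    ("gbm3", 0.01, 600, 0.908581),
    ("gbm4", 0.005, 1200, 0.908232),
]

for name, lr, n_trees, train_auc in runs:
    print(f"{name}: learning_rate * n_estimators = {lr * n_trees:.1f}, "
          f"train AUC = {train_auc:.6f}")

products = [lr * n_trees for _, lr, n_trees, _ in runs]
auc_spread = max(r[3] for r in runs) - min(r[3] for r in runs)
print(f"spread in training AUC: {auc_spread:.6f}")
```

The product of learning_rate and n_estimators stays at 6 throughout, and the training AUC moves by less than 0.004. These are training-set scores (and gbm1 also differs in subsample and max_features), so they cannot show which model generalizes best; a cross-validated or held-out score would be needed for that.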
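Random Forest (RF) Parameter Tuning
GBDT has many framework parameters; the important ones are the maximum number of iterations, the step size and the subsample ratio, and tuning them takes effort. RF is simpler, because the weak learners in a bagging framework do not depend on one another, which makes tuning easier. Below we look at the important bagging framework parameters of RF. RandomForestClassifier and RandomForestRegressor share the vast majority of their parameters, so they are covered together.
- Framework parameters
  - n_estimators: the maximum number of weak learners, i.e. the number of trees in the forest.
  - oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. I recommend setting it to True, because the out-of-bag score reflects how well the fitted model generalizes.
  - criterion: the measure a CART tree uses to evaluate candidate splits. The criteria differ between classification and regression models. The CART classification trees in a classification RF default to the Gini index ("gini"); the alternative is information gain ("entropy"). The CART regression trees in a regression RF default to mean squared error ("mse"); the alternative is mean absolute error ("mae"). In scikit-learn 1.0 and later these two are named "squared_error" and "absolute_error". The defaults are usually good enough.
  - RF has few important framework parameters; the main one to watch is n_estimators, the number of decision trees in the forest.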
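The out-of-bag estimate behind oob_score works because each tree is trained on a bootstrap sample, which leaves out roughly a third of the rows. The sketch below computes that fraction for a training set of 20,000 rows (the size shown by value_counts in this notebook); `expected_oob_fraction` is an illustrative helper name.

```python
import math


def expected_oob_fraction(n_samples):
    """Probability that a given row is never picked in n_samples draws with replacement."""
    return (1.0 - 1.0 / n_samples) ** n_samples


n_train = 20000  # assumption: 19680 + 320 rows, as in the value_counts output
oob_fraction = expected_oob_fraction(n_train)

print(f"expected out-of-bag fraction per tree: {oob_fraction:.4f}")
print(f"limit for large n, exp(-1):            {math.exp(-1):.4f}")
```

Each row is therefore unseen by about 37% of the trees, and those trees can score it as if it were held-out data, which is why the out-of-bag score approximates generalization without a separate validation set.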
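- Decision tree parameters
  - max_features: the maximum number of features RF considers when searching for a split
  - max_depth: the maximum depth of each tree
  - min_samples_split: the minimum number of samples an internal node must hold before it can be split further
  - min_samples_leaf: the minimum number of samples required at a leaf node
  - min_weight_fraction_leaf: the minimum fraction of the total sample weight required at a leaf node
  - max_leaf_nodes: the maximum number of leaf nodes
  - min_impurity_split: the impurity threshold below which a node is no longer split
  - Of the tree parameters above, the most important are max_features, max_depth, min_samples_split and min_samples_leaf.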
- Load the libraries
In [14]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# sklearn.cross_validation was removed in scikit-learn 0.20 and is not used below.
from sklearn import metrics
import matplotlib.pylab as plt
%matplotlib inline
- Inspect the data with pandas
In [15]:
train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'
train['Disbursed'].value_counts()
Out[15]:
0    19680
1      320
Name: Disbursed, dtype: int64
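The class counts in Out[15] show a strong imbalance, which matters when reading accuracy-type scores. A small worked calculation, using only the two counts printed above, gives the accuracy of a trivial model that always predicts class 0:

```python
# Class counts copied from Out[15].
counts = {0: 19680, 1: 320}

total = sum(counts.values())
majority_share = max(counts.values()) / total  # accuracy of "always predict 0"
positive_share = counts[1] / total

print("total rows            : %d" % total)
print("majority-class share  : %.4f" % majority_share)
print("positive-class share  : %.4f" % positive_share)
```

Any accuracy close to this majority share says little about how well the positive class is separated, which is one reason the grid searches in this section score candidates with `roc_auc`.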
- Select the training features and the target
In [16]:
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train['Disbursed']
- Default parameters
In [17]:
rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(X, y)
print(rf0.oob_score_)
y_predprob = rf0.predict_proba(X)[:, 1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
0.98005
AUC Score (Train): 0.999833
- Grid search over n_estimators
In [18]:
param_test1 = {'n_estimators': list(range(10, 71, 10))}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20,
                                                         max_depth=8, max_features='sqrt',
                                                         random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(X, y)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
Out[18]:
({'mean_fit_time': array([ 0.10284119, 0.18108449, 0.24648037, 0.37708621, 0.42704334, 0.50544086, 0.55536098]), 'mean_score_time': array([ 0.00623999, 0.01456041, 0.01560016, 0.01508017, 0.02204003, 0.03120017, 0.03431997]), 'mean_test_score': array([ 0.80680934, 0.81600252, 0.81818272, 0.81838438, 0.82034069, 0.82113345, 0.8199191 ]), 'mean_train_score': array([ 0.8902114 , 0.89959868, 0.90359284, 0.90555378, 0.90597112, 0.90670245, 0.90710504]), 'param_n_estimators': masked_array(data = [10 20 30 40 50 60 70], mask = [False False False False False False False], fill_value = ?), 'params': [{'n_estimators': 10}, {'n_estimators': 20}, {'n_estimators': 30}, {'n_estimators': 40}, {'n_estimators': 50}, {'n_estimators': 60}, {'n_estimators': 70}], 'rank_test_score': array([7, 6, 5, 4, 2, 1, 3]), 'split0_test_score': array([ 0.81797431, 0.82673558, 0.8370927 , 0.83676321, 0.8351753 , 0.83643769, 0.83286093]), 'split0_train_score': array([ 0.88936373, 0.89866452, 0.9022285 , 0.90198213, 0.90226423, 0.90337899, 0.90328248]), 'split1_test_score': array([ 0.78064461, 0.78217893, 0.79100967, 0.79112479, 0.7911367 , 0.7932903 , 0.79317319]), 'split1_train_score': array([ 0.89679191, 0.90442019, 0.90866399, 0.91072405, 0.90980517, 0.91011506, 0.91099983]), 'split2_test_score': array([ 0.77967996, 0.77394166, 0.7725582 , 0.77300678, 0.77952514, 0.77912022, 0.7801603 ]), 'split2_train_score': array([ 0.89451909, 0.9047823 , 0.90772365, 0.9090921 , 0.90858509, 0.90862169, 0.90853708]), 'split3_test_score': array([ 0.82203538, 0.83827172, 0.83311103, 0.83438929, 0.83691605, 0.84013156, 0.83880566]), 'split3_train_score': array([ 0.88717552, 0.89682863, 0.90333111, 0.90672588, 0.90813179, 0.90888977, 0.9096568 ]), 'split4_test_score': array([ 0.83371245, 0.85888473, 0.85714201, 0.85663785, 0.85895024, 0.85668747, 0.8545954 ]), 'split4_train_score': array([ 0.88320675, 0.89329777, 0.89601694, 0.89924473, 0.90106933, 0.90250676, 0.903049 ]), 'std_fit_time': array([ 0.01848207, 
0.00961718, 0.00623996, 0.08223287, 0.02709142, 0.04368012, 0.01872001]), 'std_score_time': array([ 7.64239137e-03, 1.27347883e-03, 9.53674316e-08, 1.03971960e-03, 7.48802833e-03, 2.13248060e-07, 6.24008179e-03]), 'std_test_score': array([ 0.02236454, 0.03275104, 0.03136316, 0.03117524, 0.03001429, 0.02966341, 0.02836457]), 'std_train_score': array([ 0.00491649, 0.00443541, 0.00451895, 0.00431708, 0.00357687, 0.00312291, 0.00331044])}, {'n_estimators': 60}, 0.82113344766260166)
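The raw `cv_results_` dictionary above is hard to read. One option, sketched here under assumptions, is to load it into a pandas DataFrame and keep only the summary columns. The data is synthetic (not train_modified.csv), the grid is shortened to three values to keep the run short, and `return_train_score=True` is set explicitly because newer scikit-learn releases do not compute training scores by default.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic imbalanced data standing in for train_modified.csv.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=10)

search = GridSearchCV(
    estimator=RandomForestClassifier(
        min_samples_split=100, min_samples_leaf=20, max_depth=8,
        max_features='sqrt', random_state=10),
    param_grid={'n_estimators': [10, 30, 50]},
    scoring='roc_auc', cv=5, return_train_score=True)
search.fit(X_demo, y_demo)

# Keep one row per candidate and only the columns needed to compare them.
columns = ['param_n_estimators', 'mean_train_score',
           'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')

print(summary.to_string(index=False))
print("best:", search.best_params_, "AUC %.4f" % search.best_score_)
```

The same pattern applies to `gsearch1.cv_results_`: the row with `rank_test_score` equal to 1 corresponds to `best_params_`.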
- Grid search over the maximum tree depth max_depth and the minimum number of samples required to split an internal node, min_samples_split.
In [19]:
param_test2 = {'max_depth': list(range(3, 14, 2)),
               'min_samples_split': list(range(50, 201, 20))}
# The iid argument was removed in scikit-learn 0.24; fold scores are now always
# averaged without weighting, which is what iid=False did.
gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, min_samples_leaf=20,
                                                         max_features='sqrt', oob_score=True,
                                                         random_state=10),
                        param_grid=param_test2, scoring='roc_auc', cv=5)
gsearch2.fit(X, y)
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_
Out[19]:
({'mean_fit_time': array([ 0.45684676, 0.4342411 , 0.43368082, 0.43680077, 0.43680086, 0.44928074, 0.41808085, 0.43368087, 0.55564117, 0.52104096, 0.53120098, 0.5085608 , 0.52104096, 0.52728086, 0.53808246, 0.51344137, 0.57720103, 0.59008093, 0.59904108, 0.58404093, 0.57408099, 0.57720103, 0.570961 , 0.57408099, 0.636481 , 0.63856139, 0.64724126, 0.63044109, 0.63024101, 0.63648114, 0.6333611 , 0.63024116, 0.67080116, 0.6618412 , 0.65520101, 0.6552012 , 0.66456118, 0.65832114, 0.65832114, 0.63648109, 0.68952127, 0.69404125, 0.67724113, 0.67704124, 0.65520124, 0.66456118, 0.65520124, 0.65520105]), 'mean_score_time': array([ 0.01808033, 0.02496004, 0.02184005, 0.01560006, 0.01872001, 0.02496014, 0.02808013, 0.0218399 , 0.02496009, 0.02807999, 0.02183995, 0.02808013, 0.02495999, 0.01872001, 0.02496004, 0.03116035, 0.03120012, 0.03120017, 0.02808003, 0.02184014, 0.03120008, 0.02184005, 0.03120003, 0.02808008, 0.03120012, 0.03120012, 0.02807999, 0.03432021, 0.03120008, 0.02495995, 0.02807999, 0.02808003, 0.03120003, 0.03120003, 0.03120027, 0.03120003, 0.03120008, 0.03120022, 0.03120003, 0.03120017, 0.03432002, 0.03744001, 0.03120017, 0.03744006, 0.03120003, 0.03120003, 0.03120003, 0.03120012]), 'mean_test_score': array([ 0.79379248, 0.79338637, 0.79350308, 0.79366624, 0.79387425, 0.79372817, 0.7937758 , 0.79349474, 0.80960366, 0.80919874, 0.80887759, 0.80922971, 0.80822853, 0.80801019, 0.80792286, 0.80771167, 0.81687627, 0.81872459, 0.81501127, 0.81475522, 0.81557061, 0.81458651, 0.81601007, 0.81703824, 0.82090439, 0.81907989, 0.82035736, 0.81889291, 0.81991314, 0.81787625, 0.81897588, 0.81745943, 0.82395238, 0.82380272, 0.81952728, 0.82253557, 0.81950267, 0.8188687 , 0.81910371, 0.81563969, 0.82290754, 0.82176623, 0.82415365, 0.82420168, 0.8220854 , 0.81852491, 0.81954594, 0.82092345]), 'mean_train_score': array([ 0.82384314, 0.82333406, 0.82317996, 0.82334009, 0.82278031, 0.82227885, 0.82230433, 0.82203672, 0.86685836, 0.866551 , 0.86475867, 0.8633599 , 0.86141916, 
0.86077546, 0.86043597, 0.85972952, 0.9001969 , 0.89648326, 0.89387619, 0.89111177, 0.89000505, 0.88648389, 0.88449675, 0.88333465, 0.92445158, 0.91957807, 0.91649202, 0.91247023, 0.90955617, 0.90597963, 0.90288312, 0.90016919, 0.9396342 , 0.93361663, 0.92901996, 0.92452097, 0.9203458 , 0.91649912, 0.91281736, 0.90914974, 0.94742174, 0.94158583, 0.93517471, 0.93039251, 0.92589124, 0.92226143, 0.91770255, 0.91470253]), 'param_max_depth': masked_array(data = [3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 11 11 11 11 11 11 11 11 13 13 13 13 13 13 13 13], mask = [False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'param_min_samples_split': masked_array(data = [50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190], mask = [False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'params': [{'max_depth': 3, 'min_samples_split': 50}, {'max_depth': 3, 'min_samples_split': 70}, {'max_depth': 3, 'min_samples_split': 90}, {'max_depth': 3, 'min_samples_split': 110}, {'max_depth': 3, 'min_samples_split': 130}, {'max_depth': 3, 'min_samples_split': 150}, {'max_depth': 3, 'min_samples_split': 170}, {'max_depth': 3, 'min_samples_split': 190}, {'max_depth': 5, 'min_samples_split': 50}, {'max_depth': 5, 'min_samples_split': 70}, {'max_depth': 5, 'min_samples_split': 90}, {'max_depth': 5, 'min_samples_split': 110}, {'max_depth': 5, 
'min_samples_split': 130}, {'max_depth': 5, 'min_samples_split': 150}, {'max_depth': 5, 'min_samples_split': 170}, {'max_depth': 5, 'min_samples_split': 190}, {'max_depth': 7, 'min_samples_split': 50}, {'max_depth': 7, 'min_samples_split': 70}, {'max_depth': 7, 'min_samples_split': 90}, {'max_depth': 7, 'min_samples_split': 110}, {'max_depth': 7, 'min_samples_split': 130}, {'max_depth': 7, 'min_samples_split': 150}, {'max_depth': 7, 'min_samples_split': 170}, {'max_depth': 7, 'min_samples_split': 190}, {'max_depth': 9, 'min_samples_split': 50}, {'max_depth': 9, 'min_samples_split': 70}, {'max_depth': 9, 'min_samples_split': 90}, {'max_depth': 9, 'min_samples_split': 110}, {'max_depth': 9, 'min_samples_split': 130}, {'max_depth': 9, 'min_samples_split': 150}, {'max_depth': 9, 'min_samples_split': 170}, {'max_depth': 9, 'min_samples_split': 190}, {'max_depth': 11, 'min_samples_split': 50}, {'max_depth': 11, 'min_samples_split': 70}, {'max_depth': 11, 'min_samples_split': 90}, {'max_depth': 11, 'min_samples_split': 110}, {'max_depth': 11, 'min_samples_split': 130}, {'max_depth': 11, 'min_samples_split': 150}, {'max_depth': 11, 'min_samples_split': 170}, {'max_depth': 11, 'min_samples_split': 190}, {'max_depth': 13, 'min_samples_split': 50}, {'max_depth': 13, 'min_samples_split': 70}, {'max_depth': 13, 'min_samples_split': 90}, {'max_depth': 13, 'min_samples_split': 110}, {'max_depth': 13, 'min_samples_split': 130}, {'max_depth': 13, 'min_samples_split': 150}, {'max_depth': 13, 'min_samples_split': 170}, {'max_depth': 13, 'min_samples_split': 190}], 'rank_test_score': array([42, 48, 46, 45, 41, 44, 43, 47, 33, 35, 36, 34, 37, 38, 39, 40, 26, 21, 30, 31, 29, 32, 27, 25, 10, 17, 11, 19, 12, 23, 18, 24, 3, 4, 14, 6, 15, 20, 16, 28, 5, 8, 2, 1, 7, 22, 13, 9]), 'split0_test_score': array([ 0.80645405, 0.80638458, 0.80678949, 0.80734923, 0.80778193, 0.80785339, 0.80785339, 0.80702768, 0.82106477, 0.81818868, 0.81633281, 0.82047328, 0.81733319, 0.81232732, 0.81567383, 
0.81771429, 0.83755518, 0.82971489, 0.82488567, 0.82656488, 0.83121348, 0.82325807, 0.82694399, 0.82772008, 0.82574711, 0.82930998, 0.8262215 , 0.82408973, 0.83039769, 0.82074322, 0.82633464, 0.8295164 , 0.83015554, 0.82926631, 0.81959794, 0.82797018, 0.83578268, 0.82410363, 0.82615401, 0.82353794, 0.82335731, 0.82299606, 0.83206102, 0.82738464, 0.82751763, 0.81720219, 0.83011981, 0.82965931]), 'split0_train_score': array([ 0.8199056 , 0.81961866, 0.81968342, 0.81994617, 0.82035455, 0.81904242, 0.81904242, 0.81833121, 0.86326425, 0.86446511, 0.8613383 , 0.86196961, 0.85909538, 0.85840725, 0.85987259, 0.85879901, 0.90103087, 0.89808431, 0.89313426, 0.89194805, 0.89041857, 0.88675076, 0.885504 , 0.88236354, 0.92267106, 0.91591973, 0.91719998, 0.9107702 , 0.90953684, 0.90563121, 0.90022638, 0.90042958, 0.93913232, 0.93375328, 0.92778611, 0.92410365, 0.9194728 , 0.91815124, 0.91254468, 0.90882737, 0.94684483, 0.94073127, 0.93489 , 0.93095398, 0.92845861, 0.92187996, 0.91752984, 0.91488685]), 'split1_test_score': array([ 0.77140696, 0.77074203, 0.77089288, 0.77044827, 0.77052369, 0.77080158, 0.7706666 , 0.77084524, 0.78410029, 0.78509472, 0.78282004, 0.7806585 , 0.78206182, 0.78237146, 0.78104556, 0.78182363, 0.78862583, 0.79442565, 0.79122999, 0.79462613, 0.79867727, 0.7938679 , 0.79050948, 0.79537443, 0.79954268, 0.79698417, 0.79467575, 0.79319701, 0.7937111 , 0.79515411, 0.79755383, 0.79668445, 0.79763124, 0.8048979 , 0.80675376, 0.80675178, 0.79414182, 0.79613861, 0.79411601, 0.79871697, 0.80570178, 0.80474903, 0.8041754 , 0.80042199, 0.79921915, 0.80355215, 0.79351856, 0.80418334]), 'split1_train_score': array([ 0.83115046, 0.83001907, 0.82960895, 0.83032388, 0.82878572, 0.82801918, 0.82797688, 0.82815738, 0.87250686, 0.87225801, 0.8691338 , 0.86591556, 0.86598987, 0.86498713, 0.86456733, 0.86448632, 0.89966813, 0.89834483, 0.89620724, 0.8928833 , 0.89248645, 0.88867845, 0.88690483, 0.8861342 , 0.92438203, 0.91925223, 0.91333057, 0.91296672, 0.90777724, 0.90387062, 
0.90171839, 0.89864579, 0.93922375, 0.93070835, 0.93060141, 0.92420625, 0.9191438 , 0.91688649, 0.91232734, 0.9098672 , 0.94586095, 0.94214549, 0.93541575, 0.92989318, 0.92485791, 0.9238115 , 0.9197245 , 0.91617652]), 'split2_test_score': array([ 0.76001969, 0.75840995, 0.75735796, 0.75728849, 0.75699076, 0.75644293, 0.75645087, 0.75569264, 0.77379279, 0.77263362, 0.77588089, 0.77176027, 0.77225649, 0.77308618, 0.77190716, 0.77227237, 0.77339185, 0.78255407, 0.7725185 , 0.77543628, 0.76869561, 0.76960072, 0.77574393, 0.77508694, 0.78276844, 0.78255804, 0.79083897, 0.77801266, 0.77980699, 0.78613083, 0.78205388, 0.77613893, 0.7953645 , 0.79264521, 0.78095028, 0.78752819, 0.78192685, 0.78214121, 0.78205388, 0.77442597, 0.79592027, 0.78565049, 0.78867942, 0.79675392, 0.79044199, 0.78651986, 0.77964026, 0.78055728]), 'split2_train_score': array([ 0.82949172, 0.82895493, 0.82858599, 0.8286407 , 0.82866824, 0.82842459, 0.82851342, 0.82770607, 0.8689497 , 0.86925711, 0.86947297, 0.86757604, 0.86624605, 0.86581421, 0.86477066, 0.86359548, 0.89762854, 0.89460767, 0.89348149, 0.88879097, 0.8874776 , 0.8832091 , 0.88287813, 0.8815227 , 0.92308938, 0.91981246, 0.91850988, 0.91254171, 0.91263524, 0.90891347, 0.90450293, 0.89900059, 0.93784649, 0.93131424, 0.92813185, 0.92386596, 0.92056175, 0.91424337, 0.91250102, 0.90988792, 0.94720633, 0.94109264, 0.9368306 , 0.93117145, 0.92378297, 0.92243213, 0.91752364, 0.91490769]), 'split3_test_score': array([ 0.81382987, 0.81422685, 0.81529472, 0.81539793, 0.81531456, 0.81478262, 0.81520738, 0.81520738, 0.8283215 , 0.82721394, 0.83115195, 0.83476443, 0.83391689, 0.83661435, 0.83385536, 0.83206499, 0.8365409 , 0.83656869, 0.83990925, 0.83462748, 0.83737059, 0.83737456, 0.83847021, 0.83816652, 0.84034593, 0.83634043, 0.8363583 , 0.84499254, 0.83918874, 0.8365151 , 0.8351495 , 0.83379581, 0.83657465, 0.83618164, 0.83311698, 0.83366282, 0.82829768, 0.83853174, 0.83711652, 0.83276169, 0.83462748, 0.83735669, 0.83733883, 0.83343059, 
0.83400025, 0.83702522, 0.8360685 , 0.83452426]), 'split3_train_score': array([ 0.8201583 , 0.81942203, 0.81917653, 0.81918074, 0.8185132 , 0.81832824, 0.81825083, 0.81825083, 0.86795515, 0.86490426, 0.86421005, 0.86303885, 0.86033742, 0.86166171, 0.86140008, 0.85983165, 0.9034007 , 0.8997577 , 0.89679377, 0.89580505, 0.89491222, 0.89132591, 0.88648639, 0.88655413, 0.92975623, 0.92378644, 0.91813163, 0.91625108, 0.91217562, 0.90752764, 0.90686134, 0.90497198, 0.94315096, 0.93769836, 0.93425583, 0.92898584, 0.92501943, 0.91946299, 0.91892261, 0.91114894, 0.95147321, 0.94360885, 0.93726826, 0.93088538, 0.92738702, 0.9263452 , 0.91978157, 0.91679097]), 'split4_test_score': array([ 0.81725181, 0.81716845, 0.81718035, 0.81784728, 0.81876032, 0.81876032, 0.81870077, 0.81870077, 0.84073893, 0.84286276, 0.83820225, 0.83849204, 0.83557427, 0.83565168, 0.8371324 , 0.83468305, 0.84826759, 0.85035966, 0.84651296, 0.84252136, 0.84189612, 0.8488313 , 0.84838272, 0.84884321, 0.85611781, 0.85020682, 0.85369228, 0.85417262, 0.85646119, 0.85083802, 0.85378755, 0.85116155, 0.86003597, 0.85602253, 0.85721743, 0.85676488, 0.85736431, 0.85342829, 0.85607811, 0.84875588, 0.85493085, 0.85807887, 0.85851356, 0.86301726, 0.85924797, 0.84832516, 0.85838256, 0.85569304]), 'split4_train_score': array([ 0.81850961, 0.81865562, 0.81884493, 0.81860897, 0.81757981, 0.81757981, 0.81773811, 0.81773811, 0.86161581, 0.86187049, 0.85963825, 0.85829944, 0.85542707, 0.853007 , 0.8515692 , 0.85193516, 0.89925626, 0.89162178, 0.88976418, 0.88613147, 0.88473039, 0.88245522, 0.88071038, 0.88009867, 0.92235919, 0.91911949, 0.91528804, 0.90982143, 0.9056559 , 0.90395523, 0.90110655, 0.897798 , 0.93881747, 0.93460889, 0.92432459, 0.92144316, 0.91753121, 0.91375149, 0.90779114, 0.90601727, 0.94572337, 0.94035091, 0.93146893, 0.92905854, 0.92496968, 0.91683836, 0.91395321, 0.9107506 ]), 'std_fit_time': array([ 2.35523977e-02, 1.89066728e-02, 6.24015332e-03, 9.53674316e-08, 9.86628483e-03, 3.33125278e-02, 
6.24003410e-03, 1.16739202e-02, 3.65038530e-02, 1.24801159e-02, 2.52062134e-02, 7.64248870e-03, 1.59089717e-02, 1.81926285e-02, 4.04526618e-02, 1.26359008e-02, 9.86628483e-03, 1.23247585e-02, 1.59089998e-02, 7.23586648e-03, 6.24015331e-03, 9.86628483e-03, 7.64246923e-03, 1.16739074e-02, 6.23998642e-03, 2.21601937e-02, 1.81887412e-02, 7.89368088e-03, 7.64239137e-03, 1.16739839e-02, 2.33480442e-02, 1.24800205e-02, 9.86628483e-03, 8.15687068e-03, 1.70889937e-02, 1.39531404e-02, 1.87200387e-02, 6.24012947e-03, 1.16739584e-02, 1.16740986e-02, 6.24003410e-03, 1.37768858e-02, 7.48801101e-03, 7.64250817e-03, 9.86636022e-03, 2.89337553e-02, 9.86628483e-03, 9.86643562e-03]), 'std_score_time': array([ 4.03559881e-03, 7.64237190e-03, 7.64250817e-03, 1.16800773e-07, 6.23996258e-03, 7.64244977e-03, 6.23996258e-03, 7.64243030e-03, 7.64231350e-03, 6.23989105e-03, 7.64239137e-03, 6.23996258e-03, 7.64243030e-03, 6.23996258e-03, 7.64237190e-03, 7.92742313e-05, 1.78416128e-07, 1.50789149e-07, 6.23991489e-03, 7.64243030e-03, 1.90734863e-07, 7.64231350e-03, 1.16800773e-07, 6.24005795e-03, 9.53674316e-08, 1.78416128e-07, 6.24001026e-03, 6.23996258e-03, 1.90734863e-07, 7.64248870e-03, 6.24001026e-03, 6.24003410e-03, 1.16800773e-07, 1.16800773e-07, 1.16800773e-07, 1.16800773e-07, 1.90734863e-07, 9.53674316e-08, 1.16800773e-07, 1.50789149e-07, 6.23993874e-03, 7.64250817e-03, 2.13248060e-07, 7.64246923e-03, 1.16800773e-07, 1.16800773e-07, 1.16800773e-07, 1.78416128e-07]), 'std_test_score': array([ 0.02346855, 0.02410387, 0.02461588, 0.0249264 , 0.02521138, 0.0252398 , 0.02532165, 0.02542409, 0.02601523, 0.02629313, 0.02521682, 0.02776687, 0.02634107, 0.02637392, 0.02685252, 0.02587171, 0.02996235, 0.02584075, 0.02856905, 0.02552055, 0.0279128 , 0.02905229, 0.0280843 , 0.02757279, 0.02665364, 0.02527265, 0.02421783, 0.02927227, 0.02867851, 0.0243564 , 0.02588331, 0.02715535, 0.02453526, 0.02258047, 0.02552089, 0.0236628 , 0.02768036, 0.02635895, 0.02734355, 0.02621898, 0.02091603, 0.02512813, 
0.0247973 , 0.02416943, 0.02480605, 0.02227363, 0.02885471, 0.02599954]), 'std_train_score': array([ 0.00534476, 0.00504539, 0.0048498 , 0.00506107, 0.00493701, 0.00487615, 0.00487138, 0.00481966, 0.00394675, 0.00371114, 0.00398961, 0.0032234 , 0.0041633 , 0.00468764, 0.00481091, 0.00445424, 0.00193498, 0.00296319, 0.00251241, 0.00334825, 0.00359496, 0.00332526, 0.00235494, 0.00256516, 0.00274037, 0.0025086 , 0.0019342 , 0.00221146, 0.00263512, 0.00198349, 0.00245035, 0.0025473 , 0.00182488, 0.00250691, 0.00329463, 0.00245399, 0.00253055, 0.00220463, 0.00354828, 0.00173048, 0.00210305, 0.00117511, 0.00204924, 0.00079921, 0.00174302, 0.00311975, 0.00212274, 0.00210846])}, {'max_depth': 13, 'min_samples_split': 110}, 0.82420168000508132)
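For a search over two parameters, the mean test scores are easier to compare as a two-dimensional table. The sketch below assumes synthetic data and a reduced 2 x 2 grid (both are illustrative choices, not the settings used above) and pivots `cv_results_` so that rows are max_depth values and columns are min_samples_split values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic imbalanced data standing in for train_modified.csv.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=10)

search2 = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=20, min_samples_leaf=20,
        max_features='sqrt', random_state=10),
    param_grid={'max_depth': [3, 7], 'min_samples_split': [50, 150]},
    scoring='roc_auc', cv=3)
search2.fit(X_demo, y_demo)

results = pd.DataFrame(search2.cv_results_)
# Cast the parameter columns to int so the table axes sort numerically.
results = results.assign(
    max_depth=results['param_max_depth'].astype(int),
    min_samples_split=results['param_min_samples_split'].astype(int))

score_grid = results.pivot(
    index='max_depth', columns='min_samples_split', values='mean_test_score')

print(score_grid.round(4))
print("best:", search2.best_params_, "AUC %.4f" % search2.best_score_)
```

Reading across a row shows the effect of min_samples_split at a fixed depth, and reading down a column shows the effect of depth.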
- Out-of-bag score of the model
In [20]:
rf1 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=110,
                             min_samples_leaf=20, max_features='sqrt',
                             oob_score=True, random_state=10)
rf1.fit(X, y)
print(rf1.oob_score_)
0.984
- Tune min_samples_split, the minimum number of samples required to split an internal node, together with min_samples_leaf, the minimum number of samples per leaf
In [21]:
param_test3 = {'min_samples_split': list(range(80, 150, 20)),
               'min_samples_leaf': list(range(10, 60, 10))}
# The iid argument was removed in scikit-learn 0.24; fold scores are now always
# averaged without weighting, which is what iid=False did.
gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, max_depth=13,
                                                         max_features='sqrt', oob_score=True,
                                                         random_state=10),
                        param_grid=param_test3, scoring='roc_auc', cv=5)
gsearch3.fit(X, y)
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
Out[21]:
({'mean_fit_time': array([ 0.74672794, 0.72512503, 0.68016124, 0.67744122, 0.67080116, 0.68016138, 0.69264107, 0.68328109, 0.66848121, 0.66972165, 0.66456127, 0.66516123, 0.65832119, 0.65832114, 0.64584126, 0.64624114, 0.63336105, 0.63648114, 0.64312096, 0.6333612 ]), 'mean_score_time': array([ 0.0388402 , 0.03432016, 0.03119998, 0.03432002, 0.03119998, 0.03119998, 0.03120012, 0.03120003, 0.03120003, 0.03120003, 0.03120003, 0.03120008, 0.03120022, 0.03120012, 0.03120003, 0.03119993, 0.03120008, 0.03119998, 0.03120012, 0.03119998]), 'mean_test_score': array([ 0.8209294 , 0.81913348, 0.82048399, 0.8179751 , 0.8209429 , 0.82097426, 0.82486503, 0.82169239, 0.82352087, 0.82164475, 0.82069876, 0.82141332, 0.82278249, 0.82141411, 0.82042881, 0.82162093, 0.82224975, 0.82224975, 0.81890403, 0.81916643]), 'mean_train_score': array([ 0.94798589, 0.9403695 , 0.93560421, 0.93083079, 0.93868034, 0.93154815, 0.92847774, 0.92446736, 0.92980166, 0.92590865, 0.92153901, 0.91734287, 0.91917779, 0.91832166, 0.91390083, 0.91164489, 0.91021282, 0.91021282, 0.90766924, 0.90718711]), 'param_min_samples_leaf': masked_array(data = [10 10 10 10 20 20 20 20 30 30 30 30 40 40 40 40 50 50 50 50], mask = [False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'param_min_samples_split': masked_array(data = [80 100 120 140 80 100 120 140 80 100 120 140 80 100 120 140 80 100 120 140], mask = [False False False False False False False False False False False False False False False False False False False False], fill_value = ?), 'params': [{'min_samples_leaf': 10, 'min_samples_split': 80}, {'min_samples_leaf': 10, 'min_samples_split': 100}, {'min_samples_leaf': 10, 'min_samples_split': 120}, {'min_samples_leaf': 10, 'min_samples_split': 140}, {'min_samples_leaf': 20, 'min_samples_split': 80}, {'min_samples_leaf': 20, 'min_samples_split': 100}, {'min_samples_leaf': 20, 'min_samples_split': 120}, {'min_samples_leaf': 
20, 'min_samples_split': 140}, {'min_samples_leaf': 30, 'min_samples_split': 80}, {'min_samples_leaf': 30, 'min_samples_split': 100}, {'min_samples_leaf': 30, 'min_samples_split': 120}, {'min_samples_leaf': 30, 'min_samples_split': 140}, {'min_samples_leaf': 40, 'min_samples_split': 80}, {'min_samples_leaf': 40, 'min_samples_split': 100}, {'min_samples_leaf': 40, 'min_samples_split': 120}, {'min_samples_leaf': 40, 'min_samples_split': 140}, {'min_samples_leaf': 50, 'min_samples_split': 80}, {'min_samples_leaf': 50, 'min_samples_split': 100}, {'min_samples_leaf': 50, 'min_samples_split': 120}, {'min_samples_leaf': 50, 'min_samples_split': 140}], 'rank_test_score': array([13, 18, 15, 20, 12, 11, 1, 6, 2, 7, 14, 10, 3, 9, 16, 8, 4, 4, 19, 17]), 'split0_test_score': array([ 0.82845449, 0.82321638, 0.83177123, 0.82597934, 0.81923669, 0.82899438, 0.83293834, 0.83292842, 0.83152709, 0.82994712, 0.82897453, 0.82060626, 0.82249389, 0.8246058 , 0.82937151, 0.82820241, 0.83519118, 0.83519118, 0.83139013, 0.82688842]), 'split0_train_score': array([ 0.94573813, 0.93956205, 0.93367277, 0.93046781, 0.93598318, 0.93093922, 0.92845625, 0.92546317, 0.93011251, 0.92690593, 0.9225145 , 0.91691937, 0.91901516, 0.91750429, 0.91413321, 0.91021505, 0.91247373, 0.91247373, 0.90874884, 0.90758272]), 'split1_test_score': array([ 0.7987527 , 0.79892737, 0.80212899, 0.80147794, 0.8070376 , 0.80161292, 0.79838748, 0.80480858, 0.80123579, 0.80077331, 0.80549932, 0.8038082 , 0.81101134, 0.80173598, 0.80329213, 0.80533259, 0.80047756, 0.80047756, 0.79987813, 0.79849466]), 'split1_train_score': array([ 0.94726922, 0.94095916, 0.93712895, 0.93209169, 0.94063698, 0.93314405, 0.92806709, 0.92445684, 0.93197508, 0.92699128, 0.9227182 , 0.91803686, 0.91820321, 0.91551295, 0.91598771, 0.91267494, 0.90997389, 0.90997389, 0.90736153, 0.90782476]), 'split2_test_score': array([ 0.79033878, 0.79120419, 0.78602563, 0.78658735, 0.7861348 , 0.7875004 , 0.80341916, 0.78557109, 0.79268491, 0.78703593, 0.7803707 , 
0.78780607, 0.78845314, 0.78461041, 0.7780722 , 0.78574179, 0.78640871, 0.78640871, 0.78120038, 0.78213327]), 'split2_train_score': array([ 0.94528025, 0.9384319 , 0.9343882 , 0.92999776, 0.93773149, 0.93141795, 0.92908794, 0.92318279, 0.92840899, 0.92484711, 0.91954636, 0.91712704, 0.91932802, 0.91898067, 0.91216433, 0.91036429, 0.90849391, 0.90849391, 0.90786855, 0.90896309]), 'split3_test_score': array([ 0.83617966, 0.83349609, 0.83135837, 0.83130875, 0.82981811, 0.83488551, 0.8346374 , 0.83021905, 0.84109621, 0.84205491, 0.83554449, 0.83417095, 0.83508003, 0.84203705, 0.83420668, 0.83663022, 0.84062381, 0.84062381, 0.8338236 , 0.83752938]), 'split3_train_score': array([ 0.95388955, 0.94436348, 0.93799734, 0.93502944, 0.9404736 , 0.93425075, 0.93156185, 0.92803893, 0.93146558, 0.92710045, 0.92384884, 0.9203382 , 0.92191023, 0.9206048 , 0.91815831, 0.91558565, 0.91336419, 0.91336419, 0.91151565, 0.90790986]), 'split4_test_score': array([ 0.85092138, 0.84882336, 0.85113575, 0.84452212, 0.8624873 , 0.8518781 , 0.85494276, 0.85493482, 0.85106032, 0.84841249, 0.85310475, 0.8606751 , 0.85687405, 0.85408132, 0.85720155, 0.85219766, 0.84854746, 0.84854746, 0.8482279 , 0.85078641]), 'split4_train_score': array([ 0.94775229, 0.9385309 , 0.93483381, 0.92656726, 0.93857643, 0.92798881, 0.92521556, 0.92119505, 0.92704612, 0.92369849, 0.91906713, 0.91429287, 0.91743234, 0.9190056 , 0.90906059, 0.9093845 , 0.90675838, 0.90675838, 0.90285163, 0.90365514]), 'std_fit_time': array([ 0.06113817, 0.05438415, 0.01590904, 0.01280165, 0. 
, 0.01248009, 0.04478048, 0.0206959 , 0.0110957 , 0.02208383, 0.0159089 , 0.01208126, 0.00623999, 0.01167396, 0.00764216, 0.00815686, 0.00764233, 0.01167398, 0.01104286, 0.00764245]), 'std_score_time': array([ 9.61536962e-03, 6.23998642e-03, 9.53674316e-08, 6.24005795e-03, 9.53674316e-08, 9.53674316e-08, 1.78416128e-07, 1.16800773e-07, 1.90734863e-07, 1.90734863e-07, 1.16800773e-07, 1.16800773e-07, 1.78416128e-07, 9.86628483e-03, 1.16800773e-07, 0.00000000e+00, 1.16800773e-07, 9.53674316e-08, 1.78416128e-07, 9.53674316e-08]), 'std_test_score': array([ 0.02287491, 0.0214139 , 0.02327861, 0.02099498, 0.02534789, 0.02327339, 0.02110134, 0.02405753, 0.02271077, 0.02381345, 0.02528402, 0.02507702, 0.02293736, 0.02547305, 0.02723889, 0.02347831, 0.02431158, 0.02431158, 0.02458429, 0.02528014]), 'std_train_score': array([ 0.00309174, 0.00219482, 0.0016646 , 0.00276485, 0.00174528, 0.00214045, 0.00203445, 0.00228499, 0.00185049, 0.00138555, 0.00188458, 0.00194844, 0.00151734, 0.00171299, 0.00312981, 0.00225319, 0.00244899, 0.00244899, 0.00280372, 0.00182835])}, {'min_samples_leaf': 20, 'min_samples_split': 120}, 0.82486502794715444)
- Tune the maximum number of features, max_features:
In [22]:
param_test4 = {'max_features': list(range(3, 11, 2))}
# The iid argument was removed in scikit-learn 0.24; fold scores are now always
# averaged without weighting, which is what iid=False did.
gsearch4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, max_depth=13,
                                                         min_samples_split=120, min_samples_leaf=20,
                                                         oob_score=True, random_state=10),
                        param_grid=param_test4, scoring='roc_auc', cv=5)
gsearch4.fit(X, y)
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
Out[22]:
({'mean_fit_time': array([ 0.51988993, 0.59320097, 0.71820383, 0.83897982]), 'mean_score_time': array([ 0.02948017, 0.03119993, 0.03432007, 0.03436146]), 'mean_test_score': array([ 0.81981191, 0.8163868 , 0.82486503, 0.81703506]), 'mean_train_score': array([ 0.90445415, 0.91814913, 0.92847774, 0.9330581 ]), 'param_max_features': masked_array(data = [3 5 7 9], mask = [False False False False], fill_value = ?), 'params': [{'max_features': 3}, {'max_features': 5}, {'max_features': 7}, {'max_features': 9}], 'rank_test_score': array([2, 4, 1, 3]), 'split0_test_score': array([ 0.81893102, 0.82697972, 0.83293834, 0.81775994]), 'split0_train_score': array([ 0.8989037 , 0.91926364, 0.92845625, 0.93346386]), 'split1_test_score': array([ 0.79912387, 0.79626763, 0.79838748, 0.80414563]), 'split1_train_score': array([ 0.90922633, 0.91967004, 0.92806709, 0.93307582]), 'split2_test_score': array([ 0.78474935, 0.77782211, 0.80341916, 0.78332023]), 'split2_train_score': array([ 0.90551398, 0.92046462, 0.92908794, 0.93042327]), 'split3_test_score': array([ 0.84169565, 0.83561793, 0.8346374 , 0.83346434]), 'split3_train_score': array([ 0.90697771, 0.91720246, 0.93156185, 0.93631056]), 'split4_test_score': array([ 0.85455967, 0.8452466 , 0.85494276, 0.84648517]), 'split4_train_score': array([ 0.90164904, 0.91414487, 0.92521556, 0.93201701]), 'std_fit_time': array([ 0.04146562, 0.01943034, 0.04183679, 0.04340904]), 'std_score_time': array([ 0.00343976, 0. , 0.00624003, 0.00178675]), 'std_test_score': array([ 0.02586294, 0.02532568, 0.02110134, 0.02209336]), 'std_train_score': array([ 0.00371326, 0.00227363, 0.00203445, 0.00193751])}, {'max_features': 7}, 0.82486502794715444)
- Fit the final model with the best parameters found:
In [23]:
rf2 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=120,
                             min_samples_leaf=20, max_features=7,
                             oob_score=True, random_state=10)
rf2.fit(X, y)
print(rf2.oob_score_)
0.984
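The out-of-bag score has essentially not improved. 0.984 is already a high out-of-bag score, and improving the model's generalization further would require more data. Note, however, that 0.984 is also exactly the share of class 0 in the data (19680 of 20000 rows in Out[15]), so out-of-bag accuracy is a weak signal on this imbalanced set. The cross-validated AUC, which rose from about 0.821 to about 0.825 over the four grid searches, is the more informative measure of progress.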
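Because class 0 makes up 98.4% of the rows (19680 of 20000 in Out[15]), out-of-bag accuracy can stay near 0.984 whatever the tuning does. One alternative, sketched here on synthetic data (an assumption, as are the parameter values), is to compute an out-of-bag AUC from `oob_decision_function_`, which holds the class probabilities each row received from the trees that did not train on it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Assumption: synthetic imbalanced data standing in for train_modified.csv.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=10)

rf_oob = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, max_features='sqrt',
    oob_score=True, random_state=10)
rf_oob.fit(X_demo, y_demo)

# Column 1 holds the out-of-bag probability of the positive class.
oob_proba = rf_oob.oob_decision_function_[:, 1]
oob_auc = roc_auc_score(y_demo, oob_proba)

print("OOB accuracy: %.4f" % rf_oob.oob_score_)
print("OOB AUC     : %.4f" % oob_auc)
```

With enough trees every row is out-of-bag for at least some of them, so `oob_decision_function_` contains no missing values; with very few trees that may not hold.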