[Python Machine Learning] --- Parameter tuning for gradient boosted trees (GBDT) and random forests (RF) in sklearn

Copyright notice: reposting is permitted; please credit the author. https://blog.csdn.net/kepengs/article/details/86597105

GBDT parameter tuning

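  • Framework parameters
      • n_estimators: the maximum number of boosting iterations, i.e. the maximum number of weak learners.
      • learning_rate: the shrinkage coefficient ν applied to each weak learner's contribution, with 0 < ν ≤ 1.
      • subsample: the fraction of samples used to fit each weak learner, in (0, 1].
      • init: the initial weak learner used at the start of boosting.
      • loss: the loss function optimized by the GBDT algorithm.
      • alpha: only GradientBoostingRegressor has this parameter. When using the Huber loss "huber" or the quantile loss "quantile", you need to specify the quantile value. The default is 0.9; if the data contains many noisy points, this value can be lowered.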
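
As a minimal sketch of how these framework parameters are passed, the snippet below fits a GradientBoostingRegressor with the quantile loss on synthetic data. The data, seed, and parameter values are illustrative assumptions, not values from this article. With loss="quantile", alpha is the quantile being estimated, so roughly 90% of the training targets would be expected to fall at or below the predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (an assumption for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=500)

reg = GradientBoostingRegressor(
    n_estimators=100,    # number of boosting stages (weak learners)
    learning_rate=0.1,   # shrinkage applied to each stage
    subsample=0.8,       # fraction of rows drawn for each stage
    loss="quantile",     # alpha is only used with "huber" or "quantile"
    alpha=0.9,           # target quantile for the quantile loss
    random_state=10,
)
reg.fit(X_demo, y_demo)

# Fraction of training targets at or below the predicted 0.9 quantile.
coverage = np.mean(y_demo <= reg.predict(X_demo))
print("stages fitted:", reg.estimators_.shape[0])
print("training coverage: %.3f" % coverage)
```

With loss="huber", the same alpha instead sets the quantile of the residuals at which the loss switches from squared to absolute error.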

GBDT uses CART regression trees as its weak learners, so most of its remaining parameters come from the decision tree class.
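
A quick way to check this is to inspect a fitted model. The sketch below uses a small synthetic data set (an assumption for illustration) and shows that the weak learners stored in `estimators_` are regression trees even for a classifier, and that tree parameters such as `max_depth` are passed through to them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

# Small synthetic binary classification problem (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

clf = GradientBoostingClassifier(n_estimators=5, max_depth=2, random_state=10)
clf.fit(X_demo, y_demo)

# estimators_ has shape (n_estimators, trees per stage);
# a binary problem uses one tree per stage.
first_tree = clf.estimators_[0, 0]
print(type(first_tree).__name__)
print(clf.estimators_.shape)
print(first_tree.get_params()["max_depth"])
```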

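  • Decision tree parameters
      • max_features: the maximum number of features considered at each split
      • max_depth: the maximum depth of each tree
      • min_samples_split: the minimum number of samples an internal node needs before it can be split
      • min_samples_leaf: the minimum number of samples required at a leaf node
      • min_weight_fraction_leaf: the minimum fraction of the total sample weight required at a leaf node
      • max_leaf_nodes: the maximum number of leaf nodes
      • min_impurity_split: the impurity threshold for splitting; a node is split only if its impurity is above this value (later scikit-learn releases deprecated it in favor of min_impurity_decrease)

  • Load the libraries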

In [1]:

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import GridSearchCV

import matplotlib.pylab as plt
%matplotlib inline
  • Inspect the data with pandas

In [2]:

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'                       # binary classification target
IDcol = 'ID'
train['Disbursed'].value_counts()

Out[2]:

0    19680
1      320
Name: Disbursed, dtype: int64
  • Build the training set. The last column, Disbursed, is the class label. All other columns (excluding the ID column) are sample features.

In [3]:

x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train['Disbursed']
  • No tuning: fit with the default parameters

In [4]:

gbm0 = GradientBoostingClassifier(random_state=10)
gbm0.fit(X,y)
y_pred = gbm0.predict(X)
y_predprob = gbm0.predict_proba(X)[:,1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.9852
AUC Score (Train): 0.900531
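
Because the classes are heavily imbalanced, accuracy is a weak signal for this data set. The short calculation below uses only the class counts printed by value_counts() earlier in this article and computes the accuracy of a trivial model that always predicts class 0. This is one reason to prefer a ranking metric such as roc_auc when comparing candidates.

```python
# Class counts copied from the value_counts() output in this article.
n_negative = 19680
n_positive = 320

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = n_negative / (n_negative + n_positive)
positive_rate = n_positive / (n_negative + n_positive)

print("majority-class accuracy: %.4f" % baseline_accuracy)
print("positive rate: %.4f" % positive_rate)
```

The baseline works out to 0.984, so an accuracy near 0.985 says little about how well the positive class is separated.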

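Tuning

  • We start with the learning rate (learning_rate) and the number of iterations (n_estimators).
  • First we fix a fairly small learning rate and grid search for the best number of iterations.
  • We set the initial learning rate to 0.1 and grid search the number of iterations as follows: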

In [5]:

# Note: the iid argument was removed from GridSearchCV in scikit-learn 0.24; drop it on newer versions.
param_test1 = {'n_estimators':list(range(20,81,10))}
gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=300,
                       min_samples_leaf=20,max_depth=8,max_features='sqrt', subsample=0.8,random_state=10),
                       param_grid = param_test1, scoring='roc_auc',iid=False,cv=5)
gsearch1.fit(X,y)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_

Out[5]:

({'mean_fit_time': array([ 0.46405239,  0.60372586,  0.75624413,  0.89232144,  1.07256217,
          1.20764227,  1.36656241]),
  'mean_score_time': array([ 0.00360017,  0.00936003,  0.00311999,  0.00936007,  0.01247988,
          0.01560001,  0.02184005]),
  'mean_test_score': array([ 0.81284735,  0.81437929,  0.81451108,  0.81618196,  0.81778932,
          0.81533362,  0.81321535]),
  'mean_train_score': array([ 0.92266699,  0.93547209,  0.94317783,  0.95054641,  0.95735062,
          0.96186315,  0.96634598]),
  'param_n_estimators': masked_array(data = [20 30 40 50 60 70 80],
               mask = [False False False False False False False],
         fill_value = ?),
  'params': [{'n_estimators': 20},
   {'n_estimators': 30},
   {'n_estimators': 40},
   {'n_estimators': 50},
   {'n_estimators': 60},
   {'n_estimators': 70},
   {'n_estimators': 80}],
  'rank_test_score': array([7, 5, 4, 2, 1, 3, 6]),
  'split0_test_score': array([ 0.80985614,  0.8153225 ,  0.81237297,  0.81722204,  0.81813905,
          0.81521731,  0.81381201]),
  'split0_train_score': array([ 0.92169041,  0.93513613,  0.94237537,  0.94910189,  0.95769705,
          0.96291767,  0.96594151]),
  'split1_test_score': array([ 0.79491394,  0.79054918,  0.79181156,  0.79243879,  0.7998821 ,
          0.79331412,  0.79071392]),
  'split1_train_score': array([ 0.92540536,  0.93807301,  0.94369035,  0.95120239,  0.95651828,
          0.96082188,  0.96626927]),
  'split2_test_score': array([ 0.7933955 ,  0.80116036,  0.80013219,  0.80208532,  0.79902066,
          0.79699608,  0.79585279]),
  'split2_train_score': array([ 0.91999321,  0.93661623,  0.94265511,  0.94995018,  0.95538442,
          0.95924315,  0.96427458]),
  'split3_test_score': array([ 0.81871268,  0.81660871,  0.82047526,  0.82303774,  0.82663634,
          0.82647358,  0.82416317]),
  'split3_train_score': array([ 0.92746034,  0.93637333,  0.9459489 ,  0.95354778,  0.95929489,
          0.9638652 ,  0.96732064]),
  'split4_test_score': array([ 0.84735852,  0.84825568,  0.84776343,  0.84612591,  0.84526844,
          0.84466702,  0.84153487]),
  'split4_train_score': array([ 0.91878565,  0.93116177,  0.94121942,  0.94892983,  0.95785845,
          0.96246784,  0.96792392]),
  'std_fit_time': array([ 0.03545224,  0.03637669,  0.03127342,  0.00623996,  0.02454179,
          0.02329797,  0.01590896]),
  'std_score_time': array([  4.58717312e-03,   7.64243030e-03,   6.23998642e-03,
           7.64246923e-03,   6.23993874e-03,   1.78416128e-07,
           7.64231350e-03]),
  'std_test_score': array([ 0.01966902,  0.01947349,  0.01932813,  0.01847797,  0.01735756,
          0.0190036 ,  0.01860096]),
  'std_train_score': array([ 0.00327544,  0.00234852,  0.00159337,  0.00170259,  0.00132036,
          0.00163918,  0.00125698])},
 {'n_estimators': 60},
 0.81778931656504061)
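
The raw cv_results_ dictionary is hard to read. One option is to load it into a pandas DataFrame and keep only the columns needed to compare candidates. The sketch below does this on a small synthetic data set; the data and the reduced grid are assumptions chosen so that the example runs quickly, not the article's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the article's data (illustrative only).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, random_state=10),
    param_grid={"n_estimators": [10, 20, 30]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

# Keep one row per candidate with its mean score, spread, and rank.
summary = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
print(search.best_params_)
```
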
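  • We have found a suitable number of iterations (60, as shown above), so we now start tuning the decision tree parameters.
  • First we grid search the maximum tree depth max_depth together with min_samples_split, the minimum number of samples an internal node needs before it can be split.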

In [6]:

param_test2 = {'max_depth':list(range(3,14,2)), 'min_samples_split':list(range(100,801,200))}
gsearch2 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, min_samples_leaf=20, 
      max_features='sqrt', subsample=0.8, random_state=10), 
   param_grid = param_test2, scoring='roc_auc',iid=False, cv=5)
gsearch2.fit(X,y)
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_

Out[6]:

({'mean_fit_time': array([ 0.5040072 ,  0.4648807 ,  0.45240068,  0.45864086,  0.72764158,
          0.68640118,  0.66456118,  0.66424136,  1.03916183,  0.94536152,
          0.88920159,  0.8642415 ,  1.3946425 ,  1.16688209,  1.06724191,
          1.02024179,  1.70352292,  1.39584255,  1.2236423 ,  1.14816208,
          2.01656666,  1.50384254,  1.29792228,  1.19808207]),
  'mean_score_time': array([ 0.00512018,  0.00312004,  0.00623994,  0.00311995,  0.00935993,
          0.00936007,  0.01247993,  0.00624008,  0.01248007,  0.01248012,
          0.01248002,  0.01560006,  0.01559997,  0.01559992,  0.01247993,
          0.01248002,  0.01560001,  0.01748028,  0.01559992,  0.01560006,
          0.0187201 ,  0.0187201 ,  0.01560001,  0.01872015]),
  'mean_test_score': array([ 0.81198869,  0.81267388,  0.81237654,  0.80924956,  0.81845981,
          0.81630145,  0.81314548,  0.81261631,  0.8180676 ,  0.82137243,
          0.8170295 ,  0.81383067,  0.81107287,  0.80944487,  0.81476158,
          0.81600927,  0.81100776,  0.81309427,  0.81712994,  0.81346505,
          0.81483581,  0.8082464 ,  0.81923074,  0.81382074]),
  'mean_train_score': array([ 0.86494381,  0.8622539 ,  0.86106945,  0.85794241,  0.91987404,
          0.91122178,  0.90112146,  0.89483856,  0.95660931,  0.94393214,
          0.9326904 ,  0.92536259,  0.98193106,  0.96538959,  0.95290674,
          0.94331725,  0.99221707,  0.97872749,  0.96704603,  0.95578115,
          0.99648755,  0.98611237,  0.97292428,  0.96347527]),
  'param_max_depth': masked_array(data = [3 3 3 3 5 5 5 5 7 7 7 7 9 9 9 9 11 11 11 11 13 13 13 13],
               mask = [False False False False False False False False False False False False
   False False False False False False False False False False False False],
         fill_value = ?),
  'param_min_samples_split': masked_array(data = [100 300 500 700 100 300 500 700 100 300 500 700 100 300 500 700 100 300
   500 700 100 300 500 700],
               mask = [False False False False False False False False False False False False
   False False False False False False False False False False False False],
         fill_value = ?),
  'params': [{'max_depth': 3, 'min_samples_split': 100},
   {'max_depth': 3, 'min_samples_split': 300},
   {'max_depth': 3, 'min_samples_split': 500},
   {'max_depth': 3, 'min_samples_split': 700},
   {'max_depth': 5, 'min_samples_split': 100},
   {'max_depth': 5, 'min_samples_split': 300},
   {'max_depth': 5, 'min_samples_split': 500},
   {'max_depth': 5, 'min_samples_split': 700},
   {'max_depth': 7, 'min_samples_split': 100},
   {'max_depth': 7, 'min_samples_split': 300},
   {'max_depth': 7, 'min_samples_split': 500},
   {'max_depth': 7, 'min_samples_split': 700},
   {'max_depth': 9, 'min_samples_split': 100},
   {'max_depth': 9, 'min_samples_split': 300},
   {'max_depth': 9, 'min_samples_split': 500},
   {'max_depth': 9, 'min_samples_split': 700},
   {'max_depth': 11, 'min_samples_split': 100},
   {'max_depth': 11, 'min_samples_split': 300},
   {'max_depth': 11, 'min_samples_split': 500},
   {'max_depth': 11, 'min_samples_split': 700},
   {'max_depth': 13, 'min_samples_split': 100},
   {'max_depth': 13, 'min_samples_split': 300},
   {'max_depth': 13, 'min_samples_split': 500},
   {'max_depth': 13, 'min_samples_split': 700}],
  'rank_test_score': array([19, 16, 18, 23,  3,  7, 14, 17,  4,  1,  6, 11, 20, 22, 10,  8, 21,
         15,  5, 13,  9, 24,  2, 12]),
  'split0_test_score': array([ 0.81595568,  0.81469131,  0.81249008,  0.81140037,  0.81834548,
          0.813157  ,  0.81389934,  0.81455435,  0.81247618,  0.81584651,
          0.81171002,  0.81463375,  0.81160283,  0.79811754,  0.80339534,
          0.8241334 ,  0.80091821,  0.79846688,  0.81144404,  0.80298249,
          0.80540404,  0.80905623,  0.81889926,  0.80580896]),
  'split0_train_score': array([ 0.86212766,  0.85983599,  0.85925715,  0.85583682,  0.92027642,
          0.91036379,  0.89830898,  0.89271483,  0.95575894,  0.94462226,
          0.93295511,  0.9235495 ,  0.98033167,  0.96412311,  0.95247272,
          0.93674885,  0.99078257,  0.9756765 ,  0.96545844,  0.95588014,
          0.99602589,  0.98666841,  0.972869  ,  0.9635465 ]),
  'split1_test_score': array([ 0.80087851,  0.79917945,  0.80264704,  0.79988607,  0.80374865,
          0.800158  ,  0.80373277,  0.79130939,  0.79127564,  0.80364147,
          0.80661085,  0.78179584,  0.78692677,  0.78991997,  0.79036061,
          0.78085302,  0.78124206,  0.77386226,  0.79696233,  0.77971966,
          0.79841527,  0.77129978,  0.8000925 ,  0.78926496]),
  'split1_train_score': array([ 0.87120242,  0.86666361,  0.86601828,  0.86354586,  0.92061919,
          0.9143518 ,  0.90965767,  0.89885879,  0.95923236,  0.94287258,
          0.94052571,  0.93206341,  0.98074341,  0.96363831,  0.95255943,
          0.94647502,  0.99102225,  0.97921803,  0.96812265,  0.95409524,
          0.99580569,  0.98611847,  0.9731619 ,  0.96292921]),
  'split2_test_score': array([ 0.77799678,  0.78258186,  0.78163904,  0.77523183,  0.79288737,
          0.79426885,  0.78249849,  0.78640673,  0.80465376,  0.80370498,
          0.7928437 ,  0.79515808,  0.78876278,  0.78355445,  0.80375659,
          0.79261346,  0.79967964,  0.80688874,  0.79119228,  0.79433633,
          0.79785752,  0.80018777,  0.80151367,  0.79806394]),
  'split2_train_score': array([ 0.86714334,  0.86632929,  0.86495612,  0.8616039 ,  0.91754957,
          0.91427265,  0.89810478,  0.89679637,  0.95248153,  0.94382396,
          0.92439481,  0.92567903,  0.98094835,  0.96515153,  0.95307513,
          0.94545206,  0.9926604 ,  0.98029842,  0.96971068,  0.95601673,
          0.99651765,  0.98676368,  0.97312915,  0.9661503 ]),
  'split3_test_score': array([ 0.83055648,  0.83071527,  0.829175  ,  0.82860733,  0.83614988,
          0.83314675,  0.83231509,  0.83529638,  0.83712644,  0.83983184,
          0.83185063,  0.83223371,  0.82318066,  0.82017554,  0.83634043,
          0.83131669,  0.83193201,  0.83867664,  0.82928219,  0.84128081,
          0.83871237,  0.81992743,  0.83226348,  0.82270627]),
  'split3_train_score': array([ 0.86569722,  0.86175636,  0.86113051,  0.85574279,  0.92692938,
          0.91676802,  0.90241607,  0.89450507,  0.95932218,  0.94546174,
          0.93014278,  0.92215499,  0.98479307,  0.96734061,  0.95563389,
          0.942621  ,  0.99382987,  0.98099536,  0.96486744,  0.95758627,
          0.99731086,  0.98479133,  0.9746033 ,  0.9633598 ]),
  'split4_test_score': array([ 0.83455602,  0.83620149,  0.83593155,  0.83112217,  0.84116767,
          0.84077665,  0.83328173,  0.83551472,  0.84480596,  0.84383733,
          0.84213232,  0.84533195,  0.84489131,  0.85545684,  0.8399549 ,
          0.8511298 ,  0.84126691,  0.84757685,  0.85676885,  0.84900597,
          0.83378986,  0.84076077,  0.84338478,  0.85325958]),
  'split4_train_score': array([ 0.85854842,  0.85668424,  0.85398517,  0.85298268,  0.91399563,
          0.90035266,  0.89711979,  0.89131772,  0.95625156,  0.94288015,
          0.93543361,  0.92336602,  0.98283882,  0.96669441,  0.95079251,
          0.9452893 ,  0.99279028,  0.97744912,  0.96707091,  0.95532735,
          0.99677767,  0.98621995,  0.97085807,  0.96139055]),
  'std_fit_time': array([ 0.03818128,  0.01167409,  0.00986621,  0.00764233,  0.00829129,
          0.00986628,  0.01247991,  0.0154769 ,  0.01618613,  0.00764239,
          0.01708904,  0.00764245,  0.02116084,  0.0152848 ,  0.01213157,
          0.01248006,  0.04972466,  0.01561664,  0.00723594,  0.00764243,
          0.0453095 ,  0.01590906,  0.01167405,  0.00623994]),
  'std_score_time': array([  6.51612114e-03,   6.24008179e-03,   7.64233297e-03,
           6.23989105e-03,   7.64235243e-03,   7.64246923e-03,
           6.23996258e-03,   7.64250817e-03,   6.24003410e-03,
           6.24005795e-03,   6.24001026e-03,   1.16800773e-07,
           1.50789149e-07,   1.78416128e-07,   6.23996258e-03,
           6.24001026e-03,   1.78416128e-07,   3.76062393e-03,
           9.53674316e-08,   1.16800773e-07,   6.24003411e-03,
           6.24003411e-03,   9.53674316e-08,   6.24012947e-03]),
  'std_test_score': array([ 0.02073003,  0.01985316,  0.01937264,  0.02050678,  0.01843348,
          0.01810378,  0.01898077,  0.02089692,  0.02003588,  0.01733484,
          0.01772916,  0.02326604,  0.0217777 ,  0.02612314,  0.01972845,
          0.02575692,  0.02222414,  0.0269634 ,  0.02379388,  0.02702376,
          0.017755  ,  0.02290974,  0.01693249,  0.02258241]),
  'std_train_score': array([ 0.00432221,  0.00382542,  0.00431445,  0.00396676,  0.00425332,
          0.00580929,  0.00463825,  0.00272077,  0.00253495,  0.00100568,
          0.00537205,  0.00353732,  0.00167029,  0.00143085,  0.00156491,
          0.00352269,  0.00114992,  0.00193879,  0.0017622 ,  0.00112889,
          0.00053684,  0.00070571,  0.00119915,  0.00153743])},
 {'max_depth': 7, 'min_samples_split': 300},
 0.82137242759146323)
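
For a two-parameter search, laying out the mean test scores as a table makes the result easier to read. The sketch below copies the mean_test_score values from Out[6] and reshapes them into a max_depth by min_samples_split grid; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

depths = [3, 5, 7, 9, 11, 13]
splits = [100, 300, 500, 700]

# mean_test_score copied from Out[6]; candidates are ordered by
# max_depth first, then min_samples_split.
mean_test_score = np.array([
    0.81198869, 0.81267388, 0.81237654, 0.80924956,
    0.81845981, 0.81630145, 0.81314548, 0.81261631,
    0.8180676, 0.82137243, 0.8170295, 0.81383067,
    0.81107287, 0.80944487, 0.81476158, 0.81600927,
    0.81100776, 0.81309427, 0.81712994, 0.81346505,
    0.81483581, 0.8082464, 0.81923074, 0.81382074,
])

grid = pd.DataFrame(
    mean_test_score.reshape(len(depths), len(splits)),
    index=depths,
    columns=splits,
)
grid.index.name = "max_depth"
grid.columns.name = "min_samples_split"
print(grid.round(4))

# Locate the best cell in the grid.
row, col = np.unravel_index(np.argmax(grid.values), grid.shape)
best = (depths[row], splits[col])
print("best (max_depth, min_samples_split):", best)
```

The spread between candidates here (about 0.808 to 0.821) is smaller than the per-candidate std_test_score in the same output (about 0.017 to 0.027), so the ranking is probably best read with some caution.
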
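  • In the output above, the best max_depth is 7 and the best min_samples_split is 300.
  • A depth of 7 is a reasonable value, so we fix it. We cannot fix min_samples_split yet, because it interacts with other decision tree parameters.
  • Next we tune min_samples_split together with min_samples_leaf, the minimum number of samples at a leaf node.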

In [7]:

param_test3 = {'min_samples_split':list(range(800,1900,200)), 'min_samples_leaf':list(range(60,101,10))}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=7,
                                     max_features='sqrt', subsample=0.8, random_state=10), 
                       param_grid = param_test3, scoring='roc_auc',iid=False, cv=5)
gsearch3.fit(X,y)
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_

Out[7]:

({'mean_fit_time': array([ 0.98147159,  0.91485825,  0.81120129,  0.79560137,  0.76364188,
          0.73632126,  0.84240146,  0.81744151,  0.80808153,  0.79244418,
          0.75816131,  0.73712134,  0.84552155,  0.81472139,  0.79560137,
          0.76996374,  0.7618021 ,  0.73048124,  0.85560212,  0.81432137,
          0.79560142,  0.76440134,  0.75192137,  0.74256115,  0.83616142,
          0.81120152,  0.77376142,  0.75260158,  0.74256115,  0.73320136]),
  'mean_score_time': array([ 0.01524043,  0.01424046,  0.01248012,  0.01248002,  0.01559997,
          0.01248002,  0.01247997,  0.01559992,  0.00935984,  0.0156002 ,
          0.01560006,  0.01248002,  0.01248002,  0.01248012,  0.01248002,
          0.01348019,  0.01247997,  0.01560001,  0.01560016,  0.00936007,
          0.00935998,  0.01560001,  0.01248002,  0.00624003,  0.01248012,
          0.00935988,  0.01559997,  0.01248007,  0.01248016,  0.01559992]),
  'mean_test_score': array([ 0.81827919,  0.81731255,  0.8222033 ,  0.8144698 ,  0.81495371,
          0.81528439,  0.81590487,  0.81572702,  0.82021405,  0.81512044,
          0.81395095,  0.81586914,  0.82063564,  0.81489972,  0.82009218,
          0.81850308,  0.81855231,  0.81665754,  0.81960231,  0.81560198,
          0.81936055,  0.81361749,  0.81429473,  0.81299027,  0.81999889,
          0.82209294,  0.81820535,  0.81921804,  0.81544596,  0.8170426 ]),
  'mean_train_score': array([ 0.92087206,  0.91440802,  0.91027269,  0.90526927,  0.90050771,
          0.89715931,  0.91948312,  0.91372936,  0.91031055,  0.90391826,
          0.90025277,  0.89628226,  0.91950803,  0.91364818,  0.90869026,
          0.90336797,  0.89789352,  0.89484074,  0.91861416,  0.91142069,
          0.90642938,  0.9016713 ,  0.89810865,  0.89368284,  0.91955286,
          0.91187216,  0.9060234 ,  0.90194315,  0.89785241,  0.89320016]),
  'param_min_samples_leaf': masked_array(data = [60 60 60 60 60 60 70 70 70 70 70 70 80 80 80 80 80 80 90 90 90 90 90 90
   100 100 100 100 100 100],
               mask = [False False False False False False False False False False False False
   False False False False False False False False False False False False
   False False False False False False],
         fill_value = ?),
  'param_min_samples_split': masked_array(data = [800 1000 1200 1400 1600 1800 800 1000 1200 1400 1600 1800 800 1000 1200
   1400 1600 1800 800 1000 1200 1400 1600 1800 800 1000 1200 1400 1600 1800],
               mask = [False False False False False False False False False False False False
   False False False False False False False False False False False False
   False False False False False False],
         fill_value = ?),
  'params': [{'min_samples_leaf': 60, 'min_samples_split': 800},
   {'min_samples_leaf': 60, 'min_samples_split': 1000},
   {'min_samples_leaf': 60, 'min_samples_split': 1200},
   {'min_samples_leaf': 60, 'min_samples_split': 1400},
   {'min_samples_leaf': 60, 'min_samples_split': 1600},
   {'min_samples_leaf': 60, 'min_samples_split': 1800},
   {'min_samples_leaf': 70, 'min_samples_split': 800},
   {'min_samples_leaf': 70, 'min_samples_split': 1000},
   {'min_samples_leaf': 70, 'min_samples_split': 1200},
   {'min_samples_leaf': 70, 'min_samples_split': 1400},
   {'min_samples_leaf': 70, 'min_samples_split': 1600},
   {'min_samples_leaf': 70, 'min_samples_split': 1800},
   {'min_samples_leaf': 80, 'min_samples_split': 800},
   {'min_samples_leaf': 80, 'min_samples_split': 1000},
   {'min_samples_leaf': 80, 'min_samples_split': 1200},
   {'min_samples_leaf': 80, 'min_samples_split': 1400},
   {'min_samples_leaf': 80, 'min_samples_split': 1600},
   {'min_samples_leaf': 80, 'min_samples_split': 1800},
   {'min_samples_leaf': 90, 'min_samples_split': 800},
   {'min_samples_leaf': 90, 'min_samples_split': 1000},
   {'min_samples_leaf': 90, 'min_samples_split': 1200},
   {'min_samples_leaf': 90, 'min_samples_split': 1400},
   {'min_samples_leaf': 90, 'min_samples_split': 1600},
   {'min_samples_leaf': 90, 'min_samples_split': 1800},
   {'min_samples_leaf': 100, 'min_samples_split': 800},
   {'min_samples_leaf': 100, 'min_samples_split': 1000},
   {'min_samples_leaf': 100, 'min_samples_split': 1200},
   {'min_samples_leaf': 100, 'min_samples_split': 1400},
   {'min_samples_leaf': 100, 'min_samples_split': 1600},
   {'min_samples_leaf': 100, 'min_samples_split': 1800}],
  'rank_test_score': array([12, 14,  1, 26, 24, 22, 17, 19,  4, 23, 28, 18,  3, 25,  5, 11, 10,
         16,  7, 20,  8, 29, 27, 30,  6,  2, 13,  9, 21, 15]),
  'split0_test_score': array([ 0.81074338,  0.81265085,  0.82638029,  0.80854611,  0.81629907,
          0.81756542,  0.81436579,  0.81356787,  0.81387751,  0.8167417 ,
          0.81225586,  0.8124603 ,  0.82038396,  0.81525303,  0.82509011,
          0.82155107,  0.82085437,  0.81130907,  0.8141812 ,  0.81352023,
          0.82424654,  0.80720036,  0.82315088,  0.81182117,  0.81980437,
          0.82184681,  0.81924265,  0.81771429,  0.81448687,  0.81808943]),
  'split0_train_score': array([ 0.92057391,  0.91292057,  0.91244929,  0.90219836,  0.90016571,
          0.90358281,  0.91885723,  0.91618161,  0.9070643 ,  0.90089119,
          0.89924696,  0.89898074,  0.91949872,  0.91556109,  0.9079606 ,
          0.89851702,  0.89702054,  0.89789476,  0.91984   ,  0.91114658,
          0.90538236,  0.90245143,  0.89943552,  0.89723019,  0.92114394,
          0.91206199,  0.90619591,  0.90022228,  0.89914672,  0.89662493]),
  'split1_test_score': array([ 0.7951283 ,  0.79233756,  0.80273636,  0.79005891,  0.80018579,
          0.79693852,  0.79603341,  0.79458048,  0.79516999,  0.795168  ,
          0.79620808,  0.80128938,  0.80107303,  0.79358208,  0.79414579,
          0.80020762,  0.79499532,  0.79650383,  0.8008408 ,  0.79762529,
          0.79116052,  0.7969663 ,  0.78846704,  0.79792897,  0.80096981,
          0.80635877,  0.79628747,  0.79954864,  0.79475316,  0.79569002]),
  'split1_train_score': array([ 0.92293766,  0.91419474,  0.91336419,  0.91015935,  0.9062108 ,
          0.90082395,  0.9190685 ,  0.91656581,  0.91922258,  0.9111771 ,
          0.90326077,  0.90058762,  0.91803512,  0.91581515,  0.91237758,
          0.90895974,  0.90145415,  0.89436117,  0.9224696 ,  0.91450935,
          0.90733945,  0.90298846,  0.89767568,  0.89760633,  0.91678551,
          0.91257843,  0.90944442,  0.90616998,  0.90128072,  0.89712351]),
  'split2_test_score': array([ 0.79557887,  0.79529702,  0.79065239,  0.79493577,  0.79307792,
          0.78586882,  0.78244688,  0.78915976,  0.79308983,  0.79021969,
          0.78667866,  0.78708953,  0.78209159,  0.7814465 ,  0.78694264,
          0.78666079,  0.79528709,  0.79119228,  0.79108113,  0.78673423,
          0.79084691,  0.78443772,  0.78295104,  0.78006701,  0.78457865,
          0.79912983,  0.78818915,  0.78846108,  0.78831817,  0.78356636]),
  'split2_train_score': array([ 0.92102547,  0.91630368,  0.9102349 ,  0.90583752,  0.90075758,
          0.89876488,  0.91873529,  0.90970829,  0.91140611,  0.90462624,
          0.90124078,  0.89356908,  0.9246712 ,  0.91307055,  0.9083247 ,
          0.90967033,  0.90187854,  0.89756452,  0.92094285,  0.90886   ,
          0.91007202,  0.90490053,  0.89955735,  0.89381644,  0.91946151,
          0.91229322,  0.91028452,  0.90675788,  0.90101785,  0.89461734]),
  'split3_test_score': array([ 0.84421645,  0.83124127,  0.84106842,  0.83822806,  0.83325394,
          0.83313881,  0.8347823 ,  0.83651312,  0.84387108,  0.83251358,
          0.83471084,  0.83596727,  0.84400803,  0.83999659,  0.84291039,
          0.83963732,  0.83339883,  0.83359931,  0.83206102,  0.83171367,
          0.83399231,  0.84131058,  0.83491925,  0.83320035,  0.84262656,
          0.83396651,  0.83851983,  0.83680489,  0.83238456,  0.83610026]),
  'split3_train_score': array([ 0.91985712,  0.91278858,  0.90916989,  0.90917981,  0.8996649 ,
          0.89287946,  0.92095439,  0.91663491,  0.90861548,  0.90378068,
          0.90020727,  0.894334  ,  0.9203521 ,  0.91438653,  0.90699024,
          0.90188499,  0.89512101,  0.8920427 ,  0.91853568,  0.91232387,
          0.90701095,  0.90172695,  0.89739916,  0.88956123,  0.92465557,
          0.91743494,  0.90251656,  0.90173427,  0.89879403,  0.89120223]),
  'split4_test_score': array([ 0.84572893,  0.85503605,  0.85017904,  0.84058014,  0.83195185,
          0.84291039,  0.85189596,  0.8448139 ,  0.85506185,  0.84095925,
          0.83990131,  0.84253922,  0.85562159,  0.84422042,  0.85137195,
          0.8444586 ,  0.84822591,  0.8506832 ,  0.8598474 ,  0.84841646,
          0.85655647,  0.83817248,  0.84198544,  0.84193383,  0.85201505,
          0.84916278,  0.84878763,  0.85356128,  0.84728706,  0.85176694]),
  'split4_train_score': array([ 0.91996616,  0.91583252,  0.90614517,  0.89897131,  0.89573955,
          0.88974545,  0.91980018,  0.9095562 ,  0.90524428,  0.89911608,
          0.8973081 ,  0.89393988,  0.91498299,  0.90940758,  0.90779821,
          0.8978078 ,  0.89399335,  0.89234056,  0.91128267,  0.91026368,
          0.90234214,  0.89628911,  0.89647557,  0.89019999,  0.91571777,
          0.9049922 ,  0.90167559,  0.89483134,  0.8890227 ,  0.8864328 ]),
  'std_fit_time': array([  3.07595384e-02,   4.92428119e-02,   9.86628483e-03,
           9.86636022e-03,   8.46375555e-03,   1.52848801e-02,
           9.86620943e-03,   7.64233297e-03,   2.68393017e-02,
           3.65854644e-02,   1.59089904e-02,   1.88016029e-02,
           6.24008179e-03,   1.15942915e-02,   9.86636022e-03,
           5.86673508e-03,   5.50535106e-03,   5.95249423e-03,
           1.23117482e-02,   1.16739329e-02,   9.53674316e-08,
           9.86636022e-03,   2.29272195e-02,   1.24799252e-02,
           7.64250817e-03,   9.86636022e-03,   7.64239137e-03,
           6.04519629e-03,   7.64250817e-03,   9.86628483e-03]),
  'std_score_time': array([  1.13410486e-03,   7.49479274e-03,   6.24005795e-03,
           6.24001026e-03,   1.50789149e-07,   6.24001026e-03,
           6.23998642e-03,   1.78416128e-07,   7.64227456e-03,
           0.00000000e+00,   9.86628483e-03,   6.24001026e-03,
           6.24001026e-03,   6.24005795e-03,   6.24001026e-03,
           4.23979759e-03,   6.23998642e-03,   2.33601546e-07,
           9.53674316e-08,   7.64246923e-03,   7.64239137e-03,
           1.78416128e-07,   6.24001026e-03,   7.64244977e-03,
           6.24005795e-03,   7.64231350e-03,   1.50789149e-07,
           6.24003410e-03,   6.24008179e-03,   9.53674316e-08]),
  'std_test_score': array([ 0.02251349,  0.02344029,  0.02249624,  0.02125448,  0.01626215,
          0.02139638,  0.02517299,  0.02207155,  0.02520756,  0.01995466,
          0.02081276,  0.02082155,  0.02697661,  0.02475173,  0.0256756 ,
          0.02226338,  0.02098782,  0.02248567,  0.024371  ,  0.02234824,
          0.02541561,  0.02253777,  0.0241664 ,  0.02262002,  0.02511487,
          0.01815867,  0.02336834,  0.02376507,  0.02220721,  0.0250865 ]),
  'std_train_score': array([ 0.00111623,  0.00144937,  0.00255143,  0.00421006,  0.00335113,
          0.00510982,  0.00082317,  0.00334919,  0.00489291,  0.00413364,
          0.0019854 ,  0.00291416,  0.0031628 ,  0.00233309,  0.00189464,
          0.00505241,  0.00323168,  0.00249223,  0.00388707,  0.0019145 ,
          0.00253918,  0.00288938,  0.00120143,  0.00337974,  0.00319198,
          0.00397468,  0.00349547,  0.00435042,  0.00452325,  0.00397288])},
 {'min_samples_leaf': 60, 'min_samples_split': 1200},
 0.82220329966971539)
  • Put the tuned parameters into the GBDT class to see their effect. We now fit the data with the new parameters:

In [8]:

gbm1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features='sqrt', subsample=0.8, random_state=10)
gbm1.fit(X,y)
y_pred = gbm1.predict(X)
y_predprob = gbm1.predict_proba(X)[:,1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.908099
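
The two scores printed above are computed on the same rows the model was fit on, so they tend to be optimistic. In this article the same parameters scored about 0.8222 in 5-fold cross-validation (Out[7]) against 0.908099 on the training set. The sketch below shows one way to get both numbers side by side; it uses imbalanced synthetic data as a stand-in, which is an assumption, so the values will differ from the article's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data (an assumption; not the article's train_modified.csv).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)

model = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=60, max_depth=7, subsample=0.8, random_state=10
)

# AUC measured on the same rows used for fitting.
model.fit(X_demo, y_demo)
train_auc = roc_auc_score(y_demo, model.predict_proba(X_demo)[:, 1])

# AUC measured on held-out folds; cross_val_score refits clones of the model.
cv_auc = cross_val_score(model, X_demo, y_demo, scoring="roc_auc", cv=5)

print("train AUC: %.4f" % train_auc)
print("5-fold CV AUC: %.4f (+/- %.4f)" % (cv_auc.mean(), cv_auc.std()))
```
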
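  • Compared with the initial untuned fit, training accuracy drops slightly (0.9852 to 0.984). The main reason is that we use a subsample of 0.8, so 20% of the data is left out when each tree is fit.
  • Next we grid search the maximum number of features, max_features: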

In [9]:

param_test4 = {'max_features':list(range(7,20,2))}
gsearch4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, subsample=0.8, random_state=10), 
                       param_grid = param_test4, scoring='roc_auc',iid=False, cv=5)
gsearch4.fit(X,y)
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_

Out[9]:

({'mean_fit_time': array([ 0.8245635 ,  1.01770296,  1.08365922,  1.13576221,  1.25328522,
          1.33576236,  1.4449626 ]),
  'mean_score_time': array([ 0.01247993,  0.01312051,  0.01488018,  0.01248002,  0.01560011,
          0.00935998,  0.01248012]),
  'mean_test_score': array([ 0.8222033 ,  0.82241251,  0.82108184,  0.82064239,  0.82198258,
          0.81354802,  0.81876866]),
  'mean_train_score': array([ 0.91027269,  0.91158373,  0.91466399,  0.91573707,  0.91639903,
          0.91679107,  0.91668619]),
  'param_max_features': masked_array(data = [7 9 11 13 15 17 19],
               mask = [False False False False False False False],
         fill_value = ?),
  'params': [{'max_features': 7},
   {'max_features': 9},
   {'max_features': 11},
   {'max_features': 13},
   {'max_features': 15},
   {'max_features': 17},
   {'max_features': 19}],
  'rank_test_score': array([2, 1, 4, 5, 3, 7, 6]),
  'split0_test_score': array([ 0.82638029,  0.81711088,  0.82152129,  0.81253771,  0.81923074,
          0.80582087,  0.81814501]),
  'split0_train_score': array([ 0.91244929,  0.91012226,  0.91923163,  0.91558763,  0.92099594,
          0.9204073 ,  0.91439534]),
  'split1_test_score': array([ 0.80273636,  0.79887775,  0.79841328,  0.80308768,  0.80473315,
          0.79439787,  0.79727992]),
  'split1_train_score': array([ 0.91336419,  0.91048524,  0.91247397,  0.91787496,  0.91871246,
          0.9160614 ,  0.9178571 ]),
  'split2_test_score': array([ 0.79065239,  0.7965336 ,  0.79322877,  0.80307974,  0.80786927,
          0.79174209,  0.79920525]),
  'split2_train_score': array([ 0.9102349 ,  0.91210665,  0.91137162,  0.9146365 ,  0.91232176,
          0.91599317,  0.91339322]),
  'split3_test_score': array([ 0.84106842,  0.84023477,  0.83887513,  0.83253343,  0.83357747,
          0.8346513 ,  0.83625905]),
  'split3_train_score': array([ 0.90916989,  0.90980778,  0.91932914,  0.91526199,  0.91442685,
          0.91806489,  0.92131489]),
  'split4_test_score': array([ 0.85017904,  0.85930553,  0.85337073,  0.85197337,  0.84450227,
          0.84112797,  0.84295406]),
  'split4_train_score': array([ 0.90614517,  0.91539671,  0.91091361,  0.91532427,  0.91553814,
          0.91342858,  0.91647041]),
  'std_fit_time': array([ 0.01370075,  0.04088774,  0.07686664,  0.01820694,  0.02534589,
          0.01270362,  0.01270356]),
  'std_score_time': array([  6.23996258e-03,   8.00890142e-03,   1.43980980e-03,
           6.24001026e-03,   1.16800773e-07,   7.64239137e-03,
           6.24005795e-03]),
  'std_test_score': array([ 0.02249624,  0.02420925,  0.02301749,  0.01900172,  0.01513855,
          0.0205326 ,  0.01863185]),
  'std_train_score': array([ 0.00255143,  0.00206441,  0.00380337,  0.00111358,  0.00308993,
          0.00233132,  0.00279049])},
 {'max_features': 9},
 0.82241250635162599)
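
The earlier searches used max_features='sqrt', while this one uses integers. A fitted model exposes the resolved integer as `max_features_`, which helps when comparing the two settings. The sketch below uses synthetic data with 49 features; that feature count is an illustrative assumption, since the article does not state how many features its data has.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# 49 features is an illustrative assumption, chosen so that sqrt gives 7.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=49, n_informative=10, random_state=0
)

by_name = GradientBoostingClassifier(n_estimators=5, max_features="sqrt", random_state=10)
by_name.fit(X_demo, y_demo)

by_count = GradientBoostingClassifier(n_estimators=5, max_features=9, random_state=10)
by_count.fit(X_demo, y_demo)

# max_features_ holds the resolved number of features tried per split.
print("sqrt resolves to:", by_name.max_features_)
print("integer setting:", by_count.max_features_)
```

In Out[9], max_features=7 reproduces the best score of the previous search (0.8222033), which is consistent with 'sqrt' having resolved to 7 on the article's data. That reading is an inference from the printed scores, not something the article states.
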
  • Next we grid search the subsample ratio:

In [10]:

param_test5 = {'subsample':[0.6,0.7,0.75,0.8,0.85,0.9]}
gsearch5 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, random_state=10), 
                       param_grid = param_test5, scoring='roc_auc',iid=False, cv=5)
gsearch5.fit(X,y)
gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_

Out[10]:

({'mean_fit_time': array([ 0.83184319,  0.88940153,  0.90480151,  0.91104164,  0.90820169,
          0.91936193]),
  'mean_score_time': array([ 0.00935998,  0.00936003,  0.01560016,  0.01248007,  0.01248007,
          0.00935998]),
  'mean_test_score': array([ 0.8182768 ,  0.8234379 ,  0.81673217,  0.82241251,  0.8228468 ,
          0.81738003]),
  'mean_train_score': array([ 0.8997337 ,  0.90444219,  0.90921547,  0.91158373,  0.91429419,
          0.91560448]),
  'param_subsample': masked_array(data = [0.6 0.7 0.75 0.8 0.85 0.9],
               mask = [False False False False False False],
         fill_value = ?),
  'params': [{'subsample': 0.6},
   {'subsample': 0.7},
   {'subsample': 0.75},
   {'subsample': 0.8},
   {'subsample': 0.85},
   {'subsample': 0.9}],
  'rank_test_score': array([4, 1, 6, 3, 2, 5]),
  'split0_test_score': array([ 0.81290492,  0.82635051,  0.81494141,  0.81711088,  0.82498491,
          0.81622563]),
  'split0_train_score': array([ 0.89892342,  0.90458059,  0.91068262,  0.91012226,  0.91387183,
          0.91733049]),
  'split1_test_score': array([ 0.79736527,  0.80361963,  0.79503303,  0.79887775,  0.79882614,
          0.78725229]),
  'split1_train_score': array([ 0.90234884,  0.91528494,  0.91023316,  0.91048524,  0.91739785,
          0.9182774 ]),
  'split2_test_score': array([ 0.79080721,  0.78665087,  0.79387187,  0.7965336 ,  0.79332404,
          0.80056093]),
  'split2_train_score': array([ 0.9047034 ,  0.9039783 ,  0.90889014,  0.91210665,  0.91532365,
          0.91634698]),
  'split3_test_score': array([ 0.8352805 ,  0.83492719,  0.82686658,  0.84023477,  0.83811492,
          0.83271405]),
  'split3_train_score': array([ 0.89647656,  0.89508305,  0.9073727 ,  0.90980778,  0.91132584,
          0.91255982]),
  'split4_test_score': array([ 0.85502612,  0.86564128,  0.85294795,  0.85930553,  0.85898398,
          0.85014728]),
  'split4_train_score': array([ 0.89621629,  0.90328409,  0.9088987 ,  0.91539671,  0.91355176,
          0.91350773]),
  'std_fit_time': array([ 0.00880864,  0.01373345,  0.01708891,  0.01590903,  0.01735848,
          0.01942842]),
  'std_score_time': array([  7.64239137e-03,   7.64243030e-03,   9.53674316e-08,
           6.24003411e-03,   6.24003410e-03,   7.64239137e-03]),
  'std_test_score': array([ 0.02391805,  0.0270838 ,  0.02195879,  0.02420925,  0.0244629 ,
          0.0223639 ]),
  'std_train_score': array([ 0.00332188,  0.00643015,  0.00116535,  0.00206441,  0.00201162,
          0.00220641])},
 {'subsample': 0.7},
 0.82343789697662617)
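  • We now have essentially all of the tuned parameter values. At this point we can halve the learning rate and double the maximum number of iterations to improve the model's generalization, then refit the model: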

In [11]:

gbm2 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=120,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10)
gbm2.fit(X,y)
y_pred = gbm2.predict(X)
y_predprob = gbm2.predict_proba(X)[:,1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.905324
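  • The AUC score is slightly lower than that of the previous version. The reason is that, to improve generalization and guard against overfitting, we halved the learning rate, doubled the maximum number of iterations, and also reduced the subsampling ratio, all of which lower the degree of fit to the training set.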

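  • Next we shrink the learning rate by a further factor of 5 and increase the maximum number of iterations fivefold, then refit the model: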

In [12]:

gbm3 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10)
gbm3.fit(X,y)
y_pred = gbm3.predict(X)
y_predprob = gbm3.predict_proba(X)[:,1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.908581
  • Finally, we halve the learning rate once more and double the maximum number of iterations, then fit the model:

In [13]:

gbm4 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1200,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10)
gbm4.fit(X,y)
y_pred = gbm4.predict(X)
y_predprob = gbm4.predict_proba(X)[:,1]
print("Accuracy : %.4g" % metrics.accuracy_score(y.values, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
Accuracy : 0.984
AUC Score (Train): 0.908232
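
The runs above report AUC on the training set only. As a rough sketch of how the learning rate / iteration count trade-off could be checked on held-out data, the block below uses `staged_predict_proba` to score a validation split after every boosting stage. The synthetic data from `make_classification`, the 70/30 split, and the helper name `staged_auc` are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for train_modified.csv (an assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=8,
                                     weights=[0.9, 0.1], random_state=10)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=10)

def staged_auc(learning_rate, n_estimators):
    """Fit once, then compute validation AUC after each boosting stage."""
    model = GradientBoostingClassifier(learning_rate=learning_rate,
                                       n_estimators=n_estimators,
                                       max_depth=3, subsample=0.7, random_state=10)
    model.fit(X_tr, y_tr)
    return [roc_auc_score(y_va, proba[:, 1])
            for proba in model.staged_predict_proba(X_va)]

# Same "halve the rate, double the iterations" step as in the text.
curves = {(0.1, 60): staged_auc(0.1, 60), (0.05, 120): staged_auc(0.05, 120)}
for (lr, n), aucs in curves.items():
    best_stage = max(range(len(aucs)), key=aucs.__getitem__) + 1
    print("learning_rate=%.2f n_estimators=%d final AUC=%.4f best stage=%d"
          % (lr, n, aucs[-1], best_stage))
```

If the best stage is well below `n_estimators`, the extra iterations are probably not helping on unseen data.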

Random Forest (RF) Parameter Tuning

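GBDT has quite a few framework parameters. The important ones are the maximum number of iterations, the learning rate, and the subsampling ratio, so tuning takes some effort. RF is simpler, because the weak learners in a bagging framework do not depend on one another, which makes tuning easier. Below we look at the important bagging-framework parameters of RF. Since RandomForestClassifier and RandomForestRegressor share the vast majority of their parameters, they are covered together here.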

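  • Framework parameters
  • n_estimators: the maximum number of weak-learner iterations, i.e. the maximum number of weak learners.
  • oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. I recommend setting it to True, because the out-of-bag score reflects how well the fitted model generalizes.
  • criterion: the measure a CART tree uses to evaluate features when splitting. Classification and regression models use different loss functions. The CART classification trees in a classification RF default to the Gini index (gini); the alternative is information gain (entropy). The CART regression trees in a regression RF default to mean squared error (mse); the alternative is mean absolute error (mae). In general the default is already a good choice.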

  • RF has few important framework parameters. The main one to watch is n_estimators, the maximum number of decision trees in the forest.
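
To illustrate the `oob_score` recommendation above, here is a minimal sketch that compares the out-of-bag accuracy with the accuracy on a separately held-out split. The synthetic data, the 70/30 split, and the name `rf_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=8, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=10)

# With oob_score=True, each training sample is scored only by the trees
# whose bootstrap sample did not contain it.
rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=10)
rf_demo.fit(X_tr, y_tr)

oob_acc = rf_demo.oob_score_            # accuracy estimated from out-of-bag samples
test_acc = rf_demo.score(X_te, y_te)    # accuracy on data the forest never saw
print("OOB accuracy : %.4f" % oob_acc)
print("Test accuracy: %.4f" % test_acc)
```

The two numbers are usually close, which is why the out-of-bag score can serve as a generalization estimate without a separate validation set.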

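  • Decision tree parameters
  • max_features: the maximum number of features RF considers when splitting
  • max_depth: the maximum depth of a decision tree
  • min_samples_split: the minimum number of samples an internal node needs in order to be split further
  • min_samples_leaf: the minimum number of samples in a leaf node
  • min_weight_fraction_leaf: the minimum total sample weight in a leaf node
  • max_leaf_nodes: the maximum number of leaf nodes
  • min_impurity_split: the minimum impurity for splitting a node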

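  • Of the decision tree parameters above, the most important are the maximum number of features max_features, the maximum depth max_depth, the minimum number of samples required to split an internal node min_samples_split, and the minimum number of samples in a leaf min_samples_leaf.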
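
The effect of these tree parameters can be checked directly on a fitted forest. The sketch below, on synthetic data (an assumption), fits a small forest and reads each tree's depth and smallest leaf from the fitted `tree_` attribute. The names `rf_small`, `depths`, and `min_leaf_sizes` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=8, random_state=10)

rf_small = RandomForestClassifier(n_estimators=10, max_depth=4,
                                  min_samples_split=100, min_samples_leaf=20,
                                  max_features='sqrt', random_state=10)
rf_small.fit(X_demo, y_demo)

# Depth actually reached by each tree; it can never exceed max_depth.
depths = [est.tree_.max_depth for est in rf_small.estimators_]

# Smallest leaf in each tree; children_left == -1 marks a leaf node.
min_leaf_sizes = [
    int(est.tree_.n_node_samples[est.tree_.children_left == -1].min())
    for est in rf_small.estimators_
]

print("tree depths        :", depths)
print("smallest leaf sizes:", min_leaf_sizes)
```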

  • Load the libraries

In [14]:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics  # sklearn.cross_validation is unused here and no longer exists in current scikit-learn

import matplotlib.pylab as plt
%matplotlib inline
  • Inspect the data with pandas

In [15]:

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'
train['Disbursed'].value_counts()

Out[15]:

0    19680
1      320
Name: Disbursed, dtype: int64
  • Select the training features and the output

In [16]:

x_columns = [x for x in train.columns if x not in[target, IDcol]]
X = train[x_columns]
y = train['Disbursed']
  • Default parameters

In [17]:

rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(X,y)
print(rf0.oob_score_)
y_predprob = rf0.predict_proba(X)[:,1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
0.98005
AUC Score (Train): 0.999833
  • Grid search over n_estimators

In [18]:

param_test1 = {'n_estimators':list(range(10,71,10))}
gsearch1 = GridSearchCV(estimator = RandomForestClassifier(min_samples_split=100,
                                  min_samples_leaf=20,max_depth=8,max_features='sqrt' ,random_state=10), 
                       param_grid = param_test1, scoring='roc_auc',cv=5)
gsearch1.fit(X,y)
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_

Out[18]:

({'mean_fit_time': array([ 0.10284119,  0.18108449,  0.24648037,  0.37708621,  0.42704334,
          0.50544086,  0.55536098]),
  'mean_score_time': array([ 0.00623999,  0.01456041,  0.01560016,  0.01508017,  0.02204003,
          0.03120017,  0.03431997]),
  'mean_test_score': array([ 0.80680934,  0.81600252,  0.81818272,  0.81838438,  0.82034069,
          0.82113345,  0.8199191 ]),
  'mean_train_score': array([ 0.8902114 ,  0.89959868,  0.90359284,  0.90555378,  0.90597112,
          0.90670245,  0.90710504]),
  'param_n_estimators': masked_array(data = [10 20 30 40 50 60 70],
               mask = [False False False False False False False],
         fill_value = ?),
  'params': [{'n_estimators': 10},
   {'n_estimators': 20},
   {'n_estimators': 30},
   {'n_estimators': 40},
   {'n_estimators': 50},
   {'n_estimators': 60},
   {'n_estimators': 70}],
  'rank_test_score': array([7, 6, 5, 4, 2, 1, 3]),
  'split0_test_score': array([ 0.81797431,  0.82673558,  0.8370927 ,  0.83676321,  0.8351753 ,
          0.83643769,  0.83286093]),
  'split0_train_score': array([ 0.88936373,  0.89866452,  0.9022285 ,  0.90198213,  0.90226423,
          0.90337899,  0.90328248]),
  'split1_test_score': array([ 0.78064461,  0.78217893,  0.79100967,  0.79112479,  0.7911367 ,
          0.7932903 ,  0.79317319]),
  'split1_train_score': array([ 0.89679191,  0.90442019,  0.90866399,  0.91072405,  0.90980517,
          0.91011506,  0.91099983]),
  'split2_test_score': array([ 0.77967996,  0.77394166,  0.7725582 ,  0.77300678,  0.77952514,
          0.77912022,  0.7801603 ]),
  'split2_train_score': array([ 0.89451909,  0.9047823 ,  0.90772365,  0.9090921 ,  0.90858509,
          0.90862169,  0.90853708]),
  'split3_test_score': array([ 0.82203538,  0.83827172,  0.83311103,  0.83438929,  0.83691605,
          0.84013156,  0.83880566]),
  'split3_train_score': array([ 0.88717552,  0.89682863,  0.90333111,  0.90672588,  0.90813179,
          0.90888977,  0.9096568 ]),
  'split4_test_score': array([ 0.83371245,  0.85888473,  0.85714201,  0.85663785,  0.85895024,
          0.85668747,  0.8545954 ]),
  'split4_train_score': array([ 0.88320675,  0.89329777,  0.89601694,  0.89924473,  0.90106933,
          0.90250676,  0.903049  ]),
  'std_fit_time': array([ 0.01848207,  0.00961718,  0.00623996,  0.08223287,  0.02709142,
          0.04368012,  0.01872001]),
  'std_score_time': array([  7.64239137e-03,   1.27347883e-03,   9.53674316e-08,
           1.03971960e-03,   7.48802833e-03,   2.13248060e-07,
           6.24008179e-03]),
  'std_test_score': array([ 0.02236454,  0.03275104,  0.03136316,  0.03117524,  0.03001429,
          0.02966341,  0.02836457]),
  'std_train_score': array([ 0.00491649,  0.00443541,  0.00451895,  0.00431708,  0.00357687,
          0.00312291,  0.00331044])},
 {'n_estimators': 60},
 0.82113344766260166)
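
The raw `cv_results_` dictionary is hard to scan. One possible way to condense it, sketched here on synthetic data (an assumption), is to load it into a pandas DataFrame and keep only the columns of interest. The names `search`, `cols`, and `summary` are illustrative, and `return_train_score=True` is passed explicitly because recent scikit-learn versions do not compute training scores by default.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (an assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=8,
                                     weights=[0.9, 0.1], random_state=10)

search = GridSearchCV(
    estimator=RandomForestClassifier(max_depth=8, max_features='sqrt', random_state=10),
    param_grid={'n_estimators': [10, 30, 50]},
    scoring='roc_auc', cv=5, return_train_score=True)
search.fit(X_demo, y_demo)

# One row per parameter setting, best-ranked first (stable sort keeps ties in grid order).
cols = ['param_n_estimators', 'mean_train_score', 'mean_test_score',
        'std_test_score', 'rank_test_score']
summary = (pd.DataFrame(search.cv_results_)[cols]
           .sort_values('rank_test_score', kind='mergesort'))
print(summary.to_string(index=False))
```

Showing `mean_train_score` next to `mean_test_score` also makes the gap between training and validation AUC visible at a glance.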
  • Grid search over the maximum tree depth max_depth and the minimum number of samples required to split an internal node min_samples_split.

In [19]:

param_test2 = {'max_depth':list(range(3,14,2)), 'min_samples_split':list(range(50,201,20))}
gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, 
                                  min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10),
   param_grid = param_test2, scoring='roc_auc',iid=False, cv=5)
gsearch2.fit(X,y)
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_

Out[19]:

({'mean_fit_time': array([ 0.45684676,  0.4342411 ,  0.43368082,  0.43680077,  0.43680086,
          0.44928074,  0.41808085,  0.43368087,  0.55564117,  0.52104096,
          0.53120098,  0.5085608 ,  0.52104096,  0.52728086,  0.53808246,
          0.51344137,  0.57720103,  0.59008093,  0.59904108,  0.58404093,
          0.57408099,  0.57720103,  0.570961  ,  0.57408099,  0.636481  ,
          0.63856139,  0.64724126,  0.63044109,  0.63024101,  0.63648114,
          0.6333611 ,  0.63024116,  0.67080116,  0.6618412 ,  0.65520101,
          0.6552012 ,  0.66456118,  0.65832114,  0.65832114,  0.63648109,
          0.68952127,  0.69404125,  0.67724113,  0.67704124,  0.65520124,
          0.66456118,  0.65520124,  0.65520105]),
  'mean_score_time': array([ 0.01808033,  0.02496004,  0.02184005,  0.01560006,  0.01872001,
          0.02496014,  0.02808013,  0.0218399 ,  0.02496009,  0.02807999,
          0.02183995,  0.02808013,  0.02495999,  0.01872001,  0.02496004,
          0.03116035,  0.03120012,  0.03120017,  0.02808003,  0.02184014,
          0.03120008,  0.02184005,  0.03120003,  0.02808008,  0.03120012,
          0.03120012,  0.02807999,  0.03432021,  0.03120008,  0.02495995,
          0.02807999,  0.02808003,  0.03120003,  0.03120003,  0.03120027,
          0.03120003,  0.03120008,  0.03120022,  0.03120003,  0.03120017,
          0.03432002,  0.03744001,  0.03120017,  0.03744006,  0.03120003,
          0.03120003,  0.03120003,  0.03120012]),
  'mean_test_score': array([ 0.79379248,  0.79338637,  0.79350308,  0.79366624,  0.79387425,
          0.79372817,  0.7937758 ,  0.79349474,  0.80960366,  0.80919874,
          0.80887759,  0.80922971,  0.80822853,  0.80801019,  0.80792286,
          0.80771167,  0.81687627,  0.81872459,  0.81501127,  0.81475522,
          0.81557061,  0.81458651,  0.81601007,  0.81703824,  0.82090439,
          0.81907989,  0.82035736,  0.81889291,  0.81991314,  0.81787625,
          0.81897588,  0.81745943,  0.82395238,  0.82380272,  0.81952728,
          0.82253557,  0.81950267,  0.8188687 ,  0.81910371,  0.81563969,
          0.82290754,  0.82176623,  0.82415365,  0.82420168,  0.8220854 ,
          0.81852491,  0.81954594,  0.82092345]),
  'mean_train_score': array([ 0.82384314,  0.82333406,  0.82317996,  0.82334009,  0.82278031,
          0.82227885,  0.82230433,  0.82203672,  0.86685836,  0.866551  ,
          0.86475867,  0.8633599 ,  0.86141916,  0.86077546,  0.86043597,
          0.85972952,  0.9001969 ,  0.89648326,  0.89387619,  0.89111177,
          0.89000505,  0.88648389,  0.88449675,  0.88333465,  0.92445158,
          0.91957807,  0.91649202,  0.91247023,  0.90955617,  0.90597963,
          0.90288312,  0.90016919,  0.9396342 ,  0.93361663,  0.92901996,
          0.92452097,  0.9203458 ,  0.91649912,  0.91281736,  0.90914974,
          0.94742174,  0.94158583,  0.93517471,  0.93039251,  0.92589124,
          0.92226143,  0.91770255,  0.91470253]),
  'param_max_depth': masked_array(data = [3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 11 11 11
   11 11 11 11 11 13 13 13 13 13 13 13 13],
               mask = [False False False False False False False False False False False False
   False False False False False False False False False False False False
   False False False False False False False False False False False False
   False False False False False False False False False False False False],
         fill_value = ?),
  'param_min_samples_split': masked_array(data = [50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190 50 70 90 110 130
   150 170 190 50 70 90 110 130 150 170 190 50 70 90 110 130 150 170 190 50
   70 90 110 130 150 170 190],
               mask = [False False False False False False False False False False False False
   False False False False False False False False False False False False
   False False False False False False False False False False False False
   False False False False False False False False False False False False],
         fill_value = ?),
  'params': [{'max_depth': 3, 'min_samples_split': 50},
   {'max_depth': 3, 'min_samples_split': 70},
   {'max_depth': 3, 'min_samples_split': 90},
   {'max_depth': 3, 'min_samples_split': 110},
   {'max_depth': 3, 'min_samples_split': 130},
   {'max_depth': 3, 'min_samples_split': 150},
   {'max_depth': 3, 'min_samples_split': 170},
   {'max_depth': 3, 'min_samples_split': 190},
   {'max_depth': 5, 'min_samples_split': 50},
   {'max_depth': 5, 'min_samples_split': 70},
   {'max_depth': 5, 'min_samples_split': 90},
   {'max_depth': 5, 'min_samples_split': 110},
   {'max_depth': 5, 'min_samples_split': 130},
   {'max_depth': 5, 'min_samples_split': 150},
   {'max_depth': 5, 'min_samples_split': 170},
   {'max_depth': 5, 'min_samples_split': 190},
   {'max_depth': 7, 'min_samples_split': 50},
   {'max_depth': 7, 'min_samples_split': 70},
   {'max_depth': 7, 'min_samples_split': 90},
   {'max_depth': 7, 'min_samples_split': 110},
   {'max_depth': 7, 'min_samples_split': 130},
   {'max_depth': 7, 'min_samples_split': 150},
   {'max_depth': 7, 'min_samples_split': 170},
   {'max_depth': 7, 'min_samples_split': 190},
   {'max_depth': 9, 'min_samples_split': 50},
   {'max_depth': 9, 'min_samples_split': 70},
   {'max_depth': 9, 'min_samples_split': 90},
   {'max_depth': 9, 'min_samples_split': 110},
   {'max_depth': 9, 'min_samples_split': 130},
   {'max_depth': 9, 'min_samples_split': 150},
   {'max_depth': 9, 'min_samples_split': 170},
   {'max_depth': 9, 'min_samples_split': 190},
   {'max_depth': 11, 'min_samples_split': 50},
   {'max_depth': 11, 'min_samples_split': 70},
   {'max_depth': 11, 'min_samples_split': 90},
   {'max_depth': 11, 'min_samples_split': 110},
   {'max_depth': 11, 'min_samples_split': 130},
   {'max_depth': 11, 'min_samples_split': 150},
   {'max_depth': 11, 'min_samples_split': 170},
   {'max_depth': 11, 'min_samples_split': 190},
   {'max_depth': 13, 'min_samples_split': 50},
   {'max_depth': 13, 'min_samples_split': 70},
   {'max_depth': 13, 'min_samples_split': 90},
   {'max_depth': 13, 'min_samples_split': 110},
   {'max_depth': 13, 'min_samples_split': 130},
   {'max_depth': 13, 'min_samples_split': 150},
   {'max_depth': 13, 'min_samples_split': 170},
   {'max_depth': 13, 'min_samples_split': 190}],
  'rank_test_score': array([42, 48, 46, 45, 41, 44, 43, 47, 33, 35, 36, 34, 37, 38, 39, 40, 26,
         21, 30, 31, 29, 32, 27, 25, 10, 17, 11, 19, 12, 23, 18, 24,  3,  4,
         14,  6, 15, 20, 16, 28,  5,  8,  2,  1,  7, 22, 13,  9]),
  'split0_test_score': array([ 0.80645405,  0.80638458,  0.80678949,  0.80734923,  0.80778193,
          0.80785339,  0.80785339,  0.80702768,  0.82106477,  0.81818868,
          0.81633281,  0.82047328,  0.81733319,  0.81232732,  0.81567383,
          0.81771429,  0.83755518,  0.82971489,  0.82488567,  0.82656488,
          0.83121348,  0.82325807,  0.82694399,  0.82772008,  0.82574711,
          0.82930998,  0.8262215 ,  0.82408973,  0.83039769,  0.82074322,
          0.82633464,  0.8295164 ,  0.83015554,  0.82926631,  0.81959794,
          0.82797018,  0.83578268,  0.82410363,  0.82615401,  0.82353794,
          0.82335731,  0.82299606,  0.83206102,  0.82738464,  0.82751763,
          0.81720219,  0.83011981,  0.82965931]),
  'split0_train_score': array([ 0.8199056 ,  0.81961866,  0.81968342,  0.81994617,  0.82035455,
          0.81904242,  0.81904242,  0.81833121,  0.86326425,  0.86446511,
          0.8613383 ,  0.86196961,  0.85909538,  0.85840725,  0.85987259,
          0.85879901,  0.90103087,  0.89808431,  0.89313426,  0.89194805,
          0.89041857,  0.88675076,  0.885504  ,  0.88236354,  0.92267106,
          0.91591973,  0.91719998,  0.9107702 ,  0.90953684,  0.90563121,
          0.90022638,  0.90042958,  0.93913232,  0.93375328,  0.92778611,
          0.92410365,  0.9194728 ,  0.91815124,  0.91254468,  0.90882737,
          0.94684483,  0.94073127,  0.93489   ,  0.93095398,  0.92845861,
          0.92187996,  0.91752984,  0.91488685]),
  'split1_test_score': array([ 0.77140696,  0.77074203,  0.77089288,  0.77044827,  0.77052369,
          0.77080158,  0.7706666 ,  0.77084524,  0.78410029,  0.78509472,
          0.78282004,  0.7806585 ,  0.78206182,  0.78237146,  0.78104556,
          0.78182363,  0.78862583,  0.79442565,  0.79122999,  0.79462613,
          0.79867727,  0.7938679 ,  0.79050948,  0.79537443,  0.79954268,
          0.79698417,  0.79467575,  0.79319701,  0.7937111 ,  0.79515411,
          0.79755383,  0.79668445,  0.79763124,  0.8048979 ,  0.80675376,
          0.80675178,  0.79414182,  0.79613861,  0.79411601,  0.79871697,
          0.80570178,  0.80474903,  0.8041754 ,  0.80042199,  0.79921915,
          0.80355215,  0.79351856,  0.80418334]),
  'split1_train_score': array([ 0.83115046,  0.83001907,  0.82960895,  0.83032388,  0.82878572,
          0.82801918,  0.82797688,  0.82815738,  0.87250686,  0.87225801,
          0.8691338 ,  0.86591556,  0.86598987,  0.86498713,  0.86456733,
          0.86448632,  0.89966813,  0.89834483,  0.89620724,  0.8928833 ,
          0.89248645,  0.88867845,  0.88690483,  0.8861342 ,  0.92438203,
          0.91925223,  0.91333057,  0.91296672,  0.90777724,  0.90387062,
          0.90171839,  0.89864579,  0.93922375,  0.93070835,  0.93060141,
          0.92420625,  0.9191438 ,  0.91688649,  0.91232734,  0.9098672 ,
          0.94586095,  0.94214549,  0.93541575,  0.92989318,  0.92485791,
          0.9238115 ,  0.9197245 ,  0.91617652]),
  'split2_test_score': array([ 0.76001969,  0.75840995,  0.75735796,  0.75728849,  0.75699076,
          0.75644293,  0.75645087,  0.75569264,  0.77379279,  0.77263362,
          0.77588089,  0.77176027,  0.77225649,  0.77308618,  0.77190716,
          0.77227237,  0.77339185,  0.78255407,  0.7725185 ,  0.77543628,
          0.76869561,  0.76960072,  0.77574393,  0.77508694,  0.78276844,
          0.78255804,  0.79083897,  0.77801266,  0.77980699,  0.78613083,
          0.78205388,  0.77613893,  0.7953645 ,  0.79264521,  0.78095028,
          0.78752819,  0.78192685,  0.78214121,  0.78205388,  0.77442597,
          0.79592027,  0.78565049,  0.78867942,  0.79675392,  0.79044199,
          0.78651986,  0.77964026,  0.78055728]),
  'split2_train_score': array([ 0.82949172,  0.82895493,  0.82858599,  0.8286407 ,  0.82866824,
          0.82842459,  0.82851342,  0.82770607,  0.8689497 ,  0.86925711,
          0.86947297,  0.86757604,  0.86624605,  0.86581421,  0.86477066,
          0.86359548,  0.89762854,  0.89460767,  0.89348149,  0.88879097,
          0.8874776 ,  0.8832091 ,  0.88287813,  0.8815227 ,  0.92308938,
          0.91981246,  0.91850988,  0.91254171,  0.91263524,  0.90891347,
          0.90450293,  0.89900059,  0.93784649,  0.93131424,  0.92813185,
          0.92386596,  0.92056175,  0.91424337,  0.91250102,  0.90988792,
          0.94720633,  0.94109264,  0.9368306 ,  0.93117145,  0.92378297,
          0.92243213,  0.91752364,  0.91490769]),
  'split3_test_score': array([ 0.81382987,  0.81422685,  0.81529472,  0.81539793,  0.81531456,
          0.81478262,  0.81520738,  0.81520738,  0.8283215 ,  0.82721394,
          0.83115195,  0.83476443,  0.83391689,  0.83661435,  0.83385536,
          0.83206499,  0.8365409 ,  0.83656869,  0.83990925,  0.83462748,
          0.83737059,  0.83737456,  0.83847021,  0.83816652,  0.84034593,
          0.83634043,  0.8363583 ,  0.84499254,  0.83918874,  0.8365151 ,
          0.8351495 ,  0.83379581,  0.83657465,  0.83618164,  0.83311698,
          0.83366282,  0.82829768,  0.83853174,  0.83711652,  0.83276169,
          0.83462748,  0.83735669,  0.83733883,  0.83343059,  0.83400025,
          0.83702522,  0.8360685 ,  0.83452426]),
  'split3_train_score': array([ 0.8201583 ,  0.81942203,  0.81917653,  0.81918074,  0.8185132 ,
          0.81832824,  0.81825083,  0.81825083,  0.86795515,  0.86490426,
          0.86421005,  0.86303885,  0.86033742,  0.86166171,  0.86140008,
          0.85983165,  0.9034007 ,  0.8997577 ,  0.89679377,  0.89580505,
          0.89491222,  0.89132591,  0.88648639,  0.88655413,  0.92975623,
          0.92378644,  0.91813163,  0.91625108,  0.91217562,  0.90752764,
          0.90686134,  0.90497198,  0.94315096,  0.93769836,  0.93425583,
          0.92898584,  0.92501943,  0.91946299,  0.91892261,  0.91114894,
          0.95147321,  0.94360885,  0.93726826,  0.93088538,  0.92738702,
          0.9263452 ,  0.91978157,  0.91679097]),
  'split4_test_score': array([ 0.81725181,  0.81716845,  0.81718035,  0.81784728,  0.81876032,
          0.81876032,  0.81870077,  0.81870077,  0.84073893,  0.84286276,
          0.83820225,  0.83849204,  0.83557427,  0.83565168,  0.8371324 ,
          0.83468305,  0.84826759,  0.85035966,  0.84651296,  0.84252136,
          0.84189612,  0.8488313 ,  0.84838272,  0.84884321,  0.85611781,
          0.85020682,  0.85369228,  0.85417262,  0.85646119,  0.85083802,
          0.85378755,  0.85116155,  0.86003597,  0.85602253,  0.85721743,
          0.85676488,  0.85736431,  0.85342829,  0.85607811,  0.84875588,
          0.85493085,  0.85807887,  0.85851356,  0.86301726,  0.85924797,
          0.84832516,  0.85838256,  0.85569304]),
  'split4_train_score': array([ 0.81850961,  0.81865562,  0.81884493,  0.81860897,  0.81757981,
          0.81757981,  0.81773811,  0.81773811,  0.86161581,  0.86187049,
          0.85963825,  0.85829944,  0.85542707,  0.853007  ,  0.8515692 ,
          0.85193516,  0.89925626,  0.89162178,  0.88976418,  0.88613147,
          0.88473039,  0.88245522,  0.88071038,  0.88009867,  0.92235919,
          0.91911949,  0.91528804,  0.90982143,  0.9056559 ,  0.90395523,
          0.90110655,  0.897798  ,  0.93881747,  0.93460889,  0.92432459,
          0.92144316,  0.91753121,  0.91375149,  0.90779114,  0.90601727,
          0.94572337,  0.94035091,  0.93146893,  0.92905854,  0.92496968,
          0.91683836,  0.91395321,  0.9107506 ]),
  'std_fit_time': array([  2.35523977e-02,   1.89066728e-02,   6.24015332e-03,
           9.53674316e-08,   9.86628483e-03,   3.33125278e-02,
           6.24003410e-03,   1.16739202e-02,   3.65038530e-02,
           1.24801159e-02,   2.52062134e-02,   7.64248870e-03,
           1.59089717e-02,   1.81926285e-02,   4.04526618e-02,
           1.26359008e-02,   9.86628483e-03,   1.23247585e-02,
           1.59089998e-02,   7.23586648e-03,   6.24015331e-03,
           9.86628483e-03,   7.64246923e-03,   1.16739074e-02,
           6.23998642e-03,   2.21601937e-02,   1.81887412e-02,
           7.89368088e-03,   7.64239137e-03,   1.16739839e-02,
           2.33480442e-02,   1.24800205e-02,   9.86628483e-03,
           8.15687068e-03,   1.70889937e-02,   1.39531404e-02,
           1.87200387e-02,   6.24012947e-03,   1.16739584e-02,
           1.16740986e-02,   6.24003410e-03,   1.37768858e-02,
           7.48801101e-03,   7.64250817e-03,   9.86636022e-03,
           2.89337553e-02,   9.86628483e-03,   9.86643562e-03]),
  'std_score_time': array([  4.03559881e-03,   7.64237190e-03,   7.64250817e-03,
           1.16800773e-07,   6.23996258e-03,   7.64244977e-03,
           6.23996258e-03,   7.64243030e-03,   7.64231350e-03,
           6.23989105e-03,   7.64239137e-03,   6.23996258e-03,
           7.64243030e-03,   6.23996258e-03,   7.64237190e-03,
           7.92742313e-05,   1.78416128e-07,   1.50789149e-07,
           6.23991489e-03,   7.64243030e-03,   1.90734863e-07,
           7.64231350e-03,   1.16800773e-07,   6.24005795e-03,
           9.53674316e-08,   1.78416128e-07,   6.24001026e-03,
           6.23996258e-03,   1.90734863e-07,   7.64248870e-03,
           6.24001026e-03,   6.24003410e-03,   1.16800773e-07,
           1.16800773e-07,   1.16800773e-07,   1.16800773e-07,
           1.90734863e-07,   9.53674316e-08,   1.16800773e-07,
           1.50789149e-07,   6.23993874e-03,   7.64250817e-03,
           2.13248060e-07,   7.64246923e-03,   1.16800773e-07,
           1.16800773e-07,   1.16800773e-07,   1.78416128e-07]),
  'std_test_score': array([ 0.02346855,  0.02410387,  0.02461588,  0.0249264 ,  0.02521138,
          0.0252398 ,  0.02532165,  0.02542409,  0.02601523,  0.02629313,
          0.02521682,  0.02776687,  0.02634107,  0.02637392,  0.02685252,
          0.02587171,  0.02996235,  0.02584075,  0.02856905,  0.02552055,
          0.0279128 ,  0.02905229,  0.0280843 ,  0.02757279,  0.02665364,
          0.02527265,  0.02421783,  0.02927227,  0.02867851,  0.0243564 ,
          0.02588331,  0.02715535,  0.02453526,  0.02258047,  0.02552089,
          0.0236628 ,  0.02768036,  0.02635895,  0.02734355,  0.02621898,
          0.02091603,  0.02512813,  0.0247973 ,  0.02416943,  0.02480605,
          0.02227363,  0.02885471,  0.02599954]),
  'std_train_score': array([ 0.00534476,  0.00504539,  0.0048498 ,  0.00506107,  0.00493701,
          0.00487615,  0.00487138,  0.00481966,  0.00394675,  0.00371114,
          0.00398961,  0.0032234 ,  0.0041633 ,  0.00468764,  0.00481091,
          0.00445424,  0.00193498,  0.00296319,  0.00251241,  0.00334825,
          0.00359496,  0.00332526,  0.00235494,  0.00256516,  0.00274037,
          0.0025086 ,  0.0019342 ,  0.00221146,  0.00263512,  0.00198349,
          0.00245035,  0.0025473 ,  0.00182488,  0.00250691,  0.00329463,
          0.00245399,  0.00253055,  0.00220463,  0.00354828,  0.00173048,
          0.00210305,  0.00117511,  0.00204924,  0.00079921,  0.00174302,
          0.00311975,  0.00212274,  0.00210846])},
 {'max_depth': 13, 'min_samples_split': 110},
 0.82420168000508132)
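
The 48 `mean_test_score` values above are easier to read as a grid. The sketch below copies them from the output and reshapes them into a max_depth by min_samples_split table; the row-major ordering follows the `params` list shown in the output. The variable names are illustrative.

```python
import numpy as np

max_depths = list(range(3, 14, 2))      # 3, 5, ..., 13
min_splits = list(range(50, 201, 20))   # 50, 70, ..., 190

# mean_test_score copied from Out[19]; max_depth varies slowest.
mean_test_score = np.array([
    0.79379248, 0.79338637, 0.79350308, 0.79366624, 0.79387425, 0.79372817, 0.7937758, 0.79349474,
    0.80960366, 0.80919874, 0.80887759, 0.80922971, 0.80822853, 0.80801019, 0.80792286, 0.80771167,
    0.81687627, 0.81872459, 0.81501127, 0.81475522, 0.81557061, 0.81458651, 0.81601007, 0.81703824,
    0.82090439, 0.81907989, 0.82035736, 0.81889291, 0.81991314, 0.81787625, 0.81897588, 0.81745943,
    0.82395238, 0.82380272, 0.81952728, 0.82253557, 0.81950267, 0.8188687, 0.81910371, 0.81563969,
    0.82290754, 0.82176623, 0.82415365, 0.82420168, 0.8220854, 0.81852491, 0.81954594, 0.82092345,
])
grid = mean_test_score.reshape(len(max_depths), len(min_splits))

# Print one row per max_depth, one column per min_samples_split.
print("max_depth  " + "  ".join("%6d" % s for s in min_splits))
for depth, row in zip(max_depths, grid):
    print("%9d  " % depth + "  ".join("%.4f" % v for v in row))

best_row, best_col = np.unravel_index(grid.argmax(), grid.shape)
best_depth, best_split = max_depths[best_row], min_splits[best_col]
print("best: max_depth=%d, min_samples_split=%d, AUC=%.6f"
      % (best_depth, best_split, grid.max()))
```

In this layout the scores rise mainly with max_depth, and the best cell lies on the largest depth tried, so extending the depth range might be worth a further check.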
  • The model's out-of-bag score

In [20]:

rf1 = RandomForestClassifier(n_estimators= 60, max_depth=13, min_samples_split=110,
                                  min_samples_leaf=20,max_features='sqrt' ,oob_score=True, random_state=10)
rf1.fit(X,y)
print(rf1.oob_score_)
0.984
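
For a classifier, `oob_score_` is an accuracy, so on data this imbalanced it helps to compare it with the majority-class rate. The sketch below computes that rate from the class counts printed earlier (19680 and 320). It then shows, on synthetic data (an assumption), one way an out-of-bag AUC could be obtained from `oob_decision_function_`. The names `rf_demo` and `oob_auc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Class counts taken from the value_counts() output of the tutorial data.
n_neg, n_pos = 19680, 320
majority_baseline = n_neg / (n_neg + n_pos)
print("Accuracy of always predicting class 0: %.3f" % majority_baseline)

# Synthetic, imbalanced stand-in data (an assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=8,
                                     weights=[0.95, 0.05], random_state=10)
rf_demo = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=110,
                                 min_samples_leaf=20, max_features='sqrt',
                                 oob_score=True, random_state=10)
rf_demo.fit(X_demo, y_demo)

# Out-of-bag class probabilities: column 1 is the positive class.
oob_proba = rf_demo.oob_decision_function_[:, 1]
oob_auc = roc_auc_score(y_demo, oob_proba)
print("OOB accuracy: %.4f" % rf_demo.oob_score_)
print("OOB AUC     : %.4f" % oob_auc)
```

Because the majority-class rate for the tutorial data is also 0.984, an out-of-bag AUC is likely the more informative number to track while tuning.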
  • Tune the minimum number of samples required to split an internal node min_samples_split together with the minimum number of samples in a leaf min_samples_leaf

In [21]:

param_test3 = {'min_samples_split':list(range(80,150,20)), 'min_samples_leaf':list(range(10,60,10))}
gsearch3 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= 60, max_depth=13,
                                  max_features='sqrt' ,oob_score=True, random_state=10),
   param_grid = param_test3, scoring='roc_auc',iid=False, cv=5)
gsearch3.fit(X,y)
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_

Out[21]:

({'mean_fit_time': array([ 0.74672794,  0.72512503,  0.68016124,  0.67744122,  0.67080116,
          0.68016138,  0.69264107,  0.68328109,  0.66848121,  0.66972165,
          0.66456127,  0.66516123,  0.65832119,  0.65832114,  0.64584126,
          0.64624114,  0.63336105,  0.63648114,  0.64312096,  0.6333612 ]),
  'mean_score_time': array([ 0.0388402 ,  0.03432016,  0.03119998,  0.03432002,  0.03119998,
          0.03119998,  0.03120012,  0.03120003,  0.03120003,  0.03120003,
          0.03120003,  0.03120008,  0.03120022,  0.03120012,  0.03120003,
          0.03119993,  0.03120008,  0.03119998,  0.03120012,  0.03119998]),
  'mean_test_score': array([ 0.8209294 ,  0.81913348,  0.82048399,  0.8179751 ,  0.8209429 ,
          0.82097426,  0.82486503,  0.82169239,  0.82352087,  0.82164475,
          0.82069876,  0.82141332,  0.82278249,  0.82141411,  0.82042881,
          0.82162093,  0.82224975,  0.82224975,  0.81890403,  0.81916643]),
  'mean_train_score': array([ 0.94798589,  0.9403695 ,  0.93560421,  0.93083079,  0.93868034,
          0.93154815,  0.92847774,  0.92446736,  0.92980166,  0.92590865,
          0.92153901,  0.91734287,  0.91917779,  0.91832166,  0.91390083,
          0.91164489,  0.91021282,  0.91021282,  0.90766924,  0.90718711]),
  'param_min_samples_leaf': masked_array(data = [10 10 10 10 20 20 20 20 30 30 30 30 40 40 40 40 50 50 50 50],
               mask = [False False False False False False False False False False False False
   False False False False False False False False],
         fill_value = ?),
  'param_min_samples_split': masked_array(data = [80 100 120 140 80 100 120 140 80 100 120 140 80 100 120 140 80 100 120 140],
               mask = [False False False False False False False False False False False False
   False False False False False False False False],
         fill_value = ?),
  'params': [{'min_samples_leaf': 10, 'min_samples_split': 80},
   {'min_samples_leaf': 10, 'min_samples_split': 100},
   {'min_samples_leaf': 10, 'min_samples_split': 120},
   {'min_samples_leaf': 10, 'min_samples_split': 140},
   {'min_samples_leaf': 20, 'min_samples_split': 80},
   {'min_samples_leaf': 20, 'min_samples_split': 100},
   {'min_samples_leaf': 20, 'min_samples_split': 120},
   {'min_samples_leaf': 20, 'min_samples_split': 140},
   {'min_samples_leaf': 30, 'min_samples_split': 80},
   {'min_samples_leaf': 30, 'min_samples_split': 100},
   {'min_samples_leaf': 30, 'min_samples_split': 120},
   {'min_samples_leaf': 30, 'min_samples_split': 140},
   {'min_samples_leaf': 40, 'min_samples_split': 80},
   {'min_samples_leaf': 40, 'min_samples_split': 100},
   {'min_samples_leaf': 40, 'min_samples_split': 120},
   {'min_samples_leaf': 40, 'min_samples_split': 140},
   {'min_samples_leaf': 50, 'min_samples_split': 80},
   {'min_samples_leaf': 50, 'min_samples_split': 100},
   {'min_samples_leaf': 50, 'min_samples_split': 120},
   {'min_samples_leaf': 50, 'min_samples_split': 140}],
  'rank_test_score': array([13, 18, 15, 20, 12, 11,  1,  6,  2,  7, 14, 10,  3,  9, 16,  8,  4,
          4, 19, 17]),
  'split0_test_score': array([ 0.82845449,  0.82321638,  0.83177123,  0.82597934,  0.81923669,
          0.82899438,  0.83293834,  0.83292842,  0.83152709,  0.82994712,
          0.82897453,  0.82060626,  0.82249389,  0.8246058 ,  0.82937151,
          0.82820241,  0.83519118,  0.83519118,  0.83139013,  0.82688842]),
  'split0_train_score': array([ 0.94573813,  0.93956205,  0.93367277,  0.93046781,  0.93598318,
          0.93093922,  0.92845625,  0.92546317,  0.93011251,  0.92690593,
          0.9225145 ,  0.91691937,  0.91901516,  0.91750429,  0.91413321,
          0.91021505,  0.91247373,  0.91247373,  0.90874884,  0.90758272]),
  'split1_test_score': array([ 0.7987527 ,  0.79892737,  0.80212899,  0.80147794,  0.8070376 ,
          0.80161292,  0.79838748,  0.80480858,  0.80123579,  0.80077331,
          0.80549932,  0.8038082 ,  0.81101134,  0.80173598,  0.80329213,
          0.80533259,  0.80047756,  0.80047756,  0.79987813,  0.79849466]),
  'split1_train_score': array([ 0.94726922,  0.94095916,  0.93712895,  0.93209169,  0.94063698,
          0.93314405,  0.92806709,  0.92445684,  0.93197508,  0.92699128,
          0.9227182 ,  0.91803686,  0.91820321,  0.91551295,  0.91598771,
          0.91267494,  0.90997389,  0.90997389,  0.90736153,  0.90782476]),
  'split2_test_score': array([ 0.79033878,  0.79120419,  0.78602563,  0.78658735,  0.7861348 ,
          0.7875004 ,  0.80341916,  0.78557109,  0.79268491,  0.78703593,
          0.7803707 ,  0.78780607,  0.78845314,  0.78461041,  0.7780722 ,
          0.78574179,  0.78640871,  0.78640871,  0.78120038,  0.78213327]),
  'split2_train_score': array([ 0.94528025,  0.9384319 ,  0.9343882 ,  0.92999776,  0.93773149,
          0.93141795,  0.92908794,  0.92318279,  0.92840899,  0.92484711,
          0.91954636,  0.91712704,  0.91932802,  0.91898067,  0.91216433,
          0.91036429,  0.90849391,  0.90849391,  0.90786855,  0.90896309]),
  'split3_test_score': array([ 0.83617966,  0.83349609,  0.83135837,  0.83130875,  0.82981811,
          0.83488551,  0.8346374 ,  0.83021905,  0.84109621,  0.84205491,
          0.83554449,  0.83417095,  0.83508003,  0.84203705,  0.83420668,
          0.83663022,  0.84062381,  0.84062381,  0.8338236 ,  0.83752938]),
  'split3_train_score': array([ 0.95388955,  0.94436348,  0.93799734,  0.93502944,  0.9404736 ,
          0.93425075,  0.93156185,  0.92803893,  0.93146558,  0.92710045,
          0.92384884,  0.9203382 ,  0.92191023,  0.9206048 ,  0.91815831,
          0.91558565,  0.91336419,  0.91336419,  0.91151565,  0.90790986]),
  'split4_test_score': array([ 0.85092138,  0.84882336,  0.85113575,  0.84452212,  0.8624873 ,
          0.8518781 ,  0.85494276,  0.85493482,  0.85106032,  0.84841249,
          0.85310475,  0.8606751 ,  0.85687405,  0.85408132,  0.85720155,
          0.85219766,  0.84854746,  0.84854746,  0.8482279 ,  0.85078641]),
  'split4_train_score': array([ 0.94775229,  0.9385309 ,  0.93483381,  0.92656726,  0.93857643,
          0.92798881,  0.92521556,  0.92119505,  0.92704612,  0.92369849,
          0.91906713,  0.91429287,  0.91743234,  0.9190056 ,  0.90906059,
          0.9093845 ,  0.90675838,  0.90675838,  0.90285163,  0.90365514]),
  'std_fit_time': array([ 0.06113817,  0.05438415,  0.01590904,  0.01280165,  0.        ,
          0.01248009,  0.04478048,  0.0206959 ,  0.0110957 ,  0.02208383,
          0.0159089 ,  0.01208126,  0.00623999,  0.01167396,  0.00764216,
          0.00815686,  0.00764233,  0.01167398,  0.01104286,  0.00764245]),
  'std_score_time': array([  9.61536962e-03,   6.23998642e-03,   9.53674316e-08,
           6.24005795e-03,   9.53674316e-08,   9.53674316e-08,
           1.78416128e-07,   1.16800773e-07,   1.90734863e-07,
           1.90734863e-07,   1.16800773e-07,   1.16800773e-07,
           1.78416128e-07,   9.86628483e-03,   1.16800773e-07,
           0.00000000e+00,   1.16800773e-07,   9.53674316e-08,
           1.78416128e-07,   9.53674316e-08]),
  'std_test_score': array([ 0.02287491,  0.0214139 ,  0.02327861,  0.02099498,  0.02534789,
          0.02327339,  0.02110134,  0.02405753,  0.02271077,  0.02381345,
          0.02528402,  0.02507702,  0.02293736,  0.02547305,  0.02723889,
          0.02347831,  0.02431158,  0.02431158,  0.02458429,  0.02528014]),
  'std_train_score': array([ 0.00309174,  0.00219482,  0.0016646 ,  0.00276485,  0.00174528,
          0.00214045,  0.00203445,  0.00228499,  0.00185049,  0.00138555,
          0.00188458,  0.00194844,  0.00151734,  0.00171299,  0.00312981,
          0.00225319,  0.00244899,  0.00244899,  0.00280372,  0.00182835])},
 {'min_samples_leaf': 20, 'min_samples_split': 120},
 0.82486502794715444)
  • Tune the maximum number of features, max_features:

In [22]:

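# Search max_features over 3, 5, 7, 9 while keeping the previously tuned parameters fixed.
param_test4 = {'max_features': list(range(3, 11, 2))}
gsearch4 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=120,
                                     min_samples_leaf=20, oob_score=True, random_state=10),
    # The iid argument was removed in scikit-learn 0.24; return_train_score=True
    # keeps the *_train_score entries in cv_results_ on current versions.
    param_grid=param_test4, scoring='roc_auc', return_train_score=True, cv=5)
gsearch4.fit(X, y)
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_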

Out[22]:

({'mean_fit_time': array([ 0.51988993,  0.59320097,  0.71820383,  0.83897982]),
  'mean_score_time': array([ 0.02948017,  0.03119993,  0.03432007,  0.03436146]),
  'mean_test_score': array([ 0.81981191,  0.8163868 ,  0.82486503,  0.81703506]),
  'mean_train_score': array([ 0.90445415,  0.91814913,  0.92847774,  0.9330581 ]),
  'param_max_features': masked_array(data = [3 5 7 9],
               mask = [False False False False],
         fill_value = ?),
  'params': [{'max_features': 3},
   {'max_features': 5},
   {'max_features': 7},
   {'max_features': 9}],
  'rank_test_score': array([2, 4, 1, 3]),
  'split0_test_score': array([ 0.81893102,  0.82697972,  0.83293834,  0.81775994]),
  'split0_train_score': array([ 0.8989037 ,  0.91926364,  0.92845625,  0.93346386]),
  'split1_test_score': array([ 0.79912387,  0.79626763,  0.79838748,  0.80414563]),
  'split1_train_score': array([ 0.90922633,  0.91967004,  0.92806709,  0.93307582]),
  'split2_test_score': array([ 0.78474935,  0.77782211,  0.80341916,  0.78332023]),
  'split2_train_score': array([ 0.90551398,  0.92046462,  0.92908794,  0.93042327]),
  'split3_test_score': array([ 0.84169565,  0.83561793,  0.8346374 ,  0.83346434]),
  'split3_train_score': array([ 0.90697771,  0.91720246,  0.93156185,  0.93631056]),
  'split4_test_score': array([ 0.85455967,  0.8452466 ,  0.85494276,  0.84648517]),
  'split4_train_score': array([ 0.90164904,  0.91414487,  0.92521556,  0.93201701]),
  'std_fit_time': array([ 0.04146562,  0.01943034,  0.04183679,  0.04340904]),
  'std_score_time': array([ 0.00343976,  0.        ,  0.00624003,  0.00178675]),
  'std_test_score': array([ 0.02586294,  0.02532568,  0.02110134,  0.02209336]),
  'std_train_score': array([ 0.00371326,  0.00227363,  0.00203445,  0.00193751])},
 {'max_features': 7},
 0.82486502794715444)
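
The raw cv_results_ dictionary is hard to read at a glance. As a rough sketch (not part of the original notebook), the snippet below copies the mean and standard-deviation arrays from Out[22] into plain Python lists and prints them as a small ranked table. The variable names are illustrative only. In the notebook itself, `pd.DataFrame(gsearch4.cv_results_)` would give a similar view directly.

```python
# Values copied by hand from Out[22] above (5-fold CV, scoring='roc_auc').
max_features = [3, 5, 7, 9]
mean_test = [0.81981191, 0.8163868, 0.82486503, 0.81703506]
std_test = [0.02586294, 0.02532568, 0.02110134, 0.02209336]
mean_train = [0.90445415, 0.91814913, 0.92847774, 0.9330581]

# Rank the candidates by mean test AUC, best first.
rows = sorted(zip(max_features, mean_test, std_test, mean_train),
              key=lambda r: r[1], reverse=True)

print(f"{'max_features':>12} {'test AUC':>9} {'std':>7} {'train-test gap':>15}")
for mf, test, std, train in rows:
    print(f"{mf:>12} {test:>9.4f} {std:>7.4f} {train - test:>15.4f}")

best_mf = rows[0][0]
# Difference between the best and worst candidate, to compare with fold noise.
spread = max(mean_test) - min(mean_test)
print(f"best max_features: {best_mf}")
print(f"spread of mean test AUC: {spread:.4f}; smallest fold std: {min(std_test):.4f}")
```

Read this way, the four candidates differ by less than one fold-to-fold standard deviation, so the preference for max_features=7 is fairly weak, while the gap between train and test AUC widens as max_features grows.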
  • Fit the final model with the best parameters found by the search:

In [23]:

# Refit on the full training set with the tuned parameters and report the out-of-bag score.
rf2 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=120,
                             min_samples_leaf=20, max_features=7, oob_score=True, random_state=10)
rf2.fit(X, y)
print(rf2.oob_score_)
0.984

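The out-of-bag score (reported as accuracy for a classifier) has barely improved at this point. The main reason is that 0.984 is already a very high out-of-bag score; to improve the model's generalization further, we would need more data.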
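
One check worth making before reading too much into 0.984: the class counts printed in Out[2] are heavily imbalanced, so accuracy has a high floor. The short sketch below is an added illustration, not part of the original notebook. It computes the accuracy of a hypothetical model that always predicts the majority class, using only the counts shown in Out[2].

```python
# Class counts copied from Out[2]: 19680 negatives, 320 positives.
counts = {0: 19680, 1: 320}

total = sum(counts.values())
# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_baseline = max(counts.values()) / total

print(f"majority-class accuracy: {majority_baseline:.3f}")
```

The baseline matches the reported out-of-bag score to three decimals, which suggests accuracy is not a very sensitive yardstick on this data. Assuming the fitted rf2 from the cell above, something like `metrics.roc_auc_score(y, rf2.oob_decision_function_[:, 1])` would give an out-of-bag estimate on the same AUC scale that the grid searches used.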


Reposted from blog.csdn.net/kepengs/article/details/86597105