Model Tuning | Grid Search (GridSearchCV) and Randomized Search (RandomizedSearchCV)

Model Fine-Tuning

1. Grid Search

One way to fine-tune would be to fiddle with the hyperparameters manually until you find a good combination. This would be very tedious work, and you may not have time to explore many combinations.
Instead, you should get Scikit-Learn's GridSearchCV to do the searching for you. All you need to do is tell GridSearchCV which hyperparameters you want it to experiment with and what values to try out, and it will use cross-validation to evaluate all the possible combinations of hyperparameter values. For example, the following code searches for the best combination of hyperparameter values for a RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# five-fold cross-validation: (12 + 6) × 5 = 90 rounds of training in total
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
# housing_prepared / housing_labels come from the housing price dataset
grid_search.fit(housing_prepared, housing_labels)

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of the n_estimators and max_features hyperparameter values specified in the first dict, then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, this time with the bootstrap hyperparameter set to False instead of True (True is the default value of this hyperparameter).
All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are using five-fold cross-validation). In other words, there will be 18 × 5 = 90 rounds of training! The K-fold process may take quite a while, but when it is done you can get the best combination of parameters, like this:

grid_search.best_params_ 
{'max_features': 8, 'n_estimators': 30}

Since 8 and 30 are the maximum values of max_features and n_estimators that were evaluated, you should probably try searching again with higher values, since the score may continue to improve.
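For instance, a follow-up search along the following lines could probe larger values. This is a minimal sketch reusing the forest_reg, housing_prepared, and housing_labels variables from above; the exact ranges are illustrative assumptions, not values from the original example:

# Hypothetical follow-up grid: probe beyond the previous maxima.
param_grid_larger = [
    {'n_estimators': [30, 50, 100], 'max_features': [6, 8, 10]},
]
grid_search_larger = GridSearchCV(forest_reg, param_grid_larger, cv=5,
                                  scoring='neg_mean_squared_error',
                                  return_train_score=True)
grid_search_larger.fit(housing_prepared, housing_labels)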
You can also get the best estimator directly:

grid_search.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False, random_state=42,
           verbose=0, warm_start=False)

If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea, since feeding it more data will likely improve its performance.
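As a minimal sketch (reusing the housing_prepared variable from above), the refit estimator is immediately usable for prediction:

# The refit best estimator can make predictions right away.
final_model = grid_search.best_estimator_
print(final_model.predict(housing_prepared[:5]))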
Of course, you can also get the evaluation scores:

import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
52006.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.612956065226 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.51445842374 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

In this example, the best solution is obtained by setting the max_features hyperparameter to 8 and the n_estimators hyperparameter to 30. The RMSE for this combination is about 49682, the lowest score in the list above.
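The same number can also be read off programmatically; a minimal sketch, assuming the grid_search object from above:

# best_score_ holds the negated MSE of the best combination;
# negate it and take the square root to recover the RMSE.
print(np.sqrt(-grid_search.best_score_))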

2. Randomized Search

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as GridSearchCV, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:

  • If you let the randomized search run for, say, 1,000 iterations, it will explore 1,000 different values for each hyperparameter (instead of just the few values per hyperparameter that a grid search tries).
  • You have more control over the computing budget you want to allocate to the hyperparameter search, simply by setting the number of iterations.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
# housing_prepared / housing_labels come from the housing price dataset
rnd_search.fit(housing_prepared, housing_labels)
RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1210939e8>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x121093710>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
49150.657232934034 {'max_features': 7, 'n_estimators': 180}
51389.85295710133 {'max_features': 5, 'n_estimators': 15}
50796.12045980556 {'max_features': 3, 'n_estimators': 72}
50835.09932039744 {'max_features': 5, 'n_estimators': 21}
49280.90117886215 {'max_features': 7, 'n_estimators': 122}
50774.86679035961 {'max_features': 3, 'n_estimators': 75}
50682.75001237282 {'max_features': 3, 'n_estimators': 88}
49608.94061293652 {'max_features': 5, 'n_estimators': 100}
50473.57642831875 {'max_features': 3, 'n_estimators': 150}
64429.763804893395 {'max_features': 5, 'n_estimators': 2}
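As with grid search, the best combination and its RMSE can be pulled out directly; a minimal sketch, assuming the rnd_search object from above:

# Best combination found by the randomized search, and its RMSE.
print(rnd_search.best_params_)
print(np.sqrt(-rnd_search.best_score_))
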
With the best estimator in hand, you can also inspect the relative importance of each feature:

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
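
These raw numbers are easier to read next to their attribute names. A minimal sketch, assuming a hypothetical attributes list holding the 16 column names of housing_prepared in the same order (the real names depend on your preparation pipeline):

# attributes is a hypothetical list of the 16 column names of
# housing_prepared, aligned with feature_importances.
sorted(zip(feature_importances, attributes), reverse=True)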

Reference: Hands-On Machine Learning with Scikit-Learn and TensorFlow

Reprinted from blog.csdn.net/SanyHo/article/details/107443406