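Why tune hyperparameters?

One of the hardest parts of machine learning is finding the best hyperparameters for a model, and a model's performance depends heavily on them.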

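Which parameters get tuned?

Random forest:
==n_estimators:== number of decision trees; too few tends to underfit, while beyond a certain point more trees add training cost without much improvement
==bootstrap:== whether to build each tree from a sample drawn with replacement, True or False
==oob_score:== whether to evaluate the model on out-of-bag samples, True or False (requires bootstrap=True)
==max_depth:== maximum depth of each decision tree
==min_samples_leaf:== minimum number of samples a leaf node must contain
==min_samples_split:== minimum number of samples a node needs before it can be split
==min_weight_fraction_leaf:== minimum weighted fraction of the total sample weight required at a leaf node
==max_leaf_nodes:== maximum number of leaf nodes
==criterion:== function used to measure the quality of a split
==max_features:== maximum number of features considered when looking for the best split.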

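How to tune?

Manual tuning

As the name suggests, you train the algorithm repeatedly and pick the best parameter set by hand. This is very time-consuming, and there is no guarantee of finding the best parameters.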
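A minimal sketch of what manual tuning amounts to in practice: try a handful of hand-picked values for one parameter and compare validation scores. The synthetic data, the held-out split, and the candidate `max_depth` values are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data and a single train/validation split (illustrative only).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in [2, 5, 10, None]:  # hand-picked candidate values
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    results[depth] = model.score(X_val, y_val)  # R^2 on the validation split

best_depth = max(results, key=results.get)
print(results)
print("best max_depth:", best_depth)
```

Every additional parameter multiplies the number of runs to inspect by hand, which is why the automated searches below are usually preferred.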

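Grid search

Grid search is a basic hyperparameter tuning technique: it builds a model for every combination of the hyperparameter values specified in the grid, evaluates each one, and selects the best.
Because it tries every combination and picks the best one by cross-validation score, it is comparatively slow.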

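from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# GridSearchCV takes param_grid and tries every combination,
# so it has no n_iter or random_state arguments.
# The small grid below is an illustrative assumption; x_train / y_train are your training data.
param_grid = {'n_estimators': [200, 600, 1000], 'max_depth': [10, 50, None]}
clf = RandomForestRegressor()
grid = GridSearchCV(estimator=clf, param_grid=param_grid,
                    cv=3, verbose=2, n_jobs=1)
grid.fit(x_train, y_train)
print(grid.best_params_)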

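Random search

Random search often does better than grid search for the same budget: in many cases the hyperparameters are not all equally important, and random search samples combinations at random from the hyperparameter space instead of enumerating all of them.
Random search does not guarantee that the combination it returns is the best one.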

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# 'mse' / 'mae' were renamed in scikit-learn 1.0; use the old names on older versions.
criterion = ['squared_error', 'absolute_error']
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# 'auto' is no longer accepted for max_features; for a regressor it meant all features (1.0).
max_features = [1.0, 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 100, num=10)]
max_depth.append(None)  # None lets trees grow without a depth limit
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {
    'criterion': criterion,
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap,
}
clf = RandomForestRegressor()
# n_iter=10 samples 10 random combinations from random_grid.
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                n_iter=10,
                                cv=3, verbose=2, random_state=42, n_jobs=1)
clf_random.fit(x_train, y_train)
print(clf_random.best_params_)

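Bayesian search

Bayesian optimization belongs to a class of optimization algorithms called sequential model-based optimization (SMBO). These algorithms use previous observations of the loss to decide the next best point to sample. The algorithm can be summarized as follows:

Using the previously evaluated points x_1, ..., x_n, compute the posterior expectation of the loss f.
Sample the loss f at a new point x that maximizes some utility of the expectation of f; the utility specifies which regions of the domain of f are most worth sampling.
Repeat these steps until some convergence criterion is met.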
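The three steps above can be sketched on a toy problem. This is only an illustration under stated assumptions: the surrogate is a Gaussian process, the utility is expected improvement, and `loss` is a hypothetical one-dimensional function standing in for a cross-validation error. None of these choices come from the text above, and a fixed iteration count replaces a real convergence test.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def loss(x):
    # Hypothetical 1-D loss; in real tuning this would be a CV error.
    return (x - 2.0) ** 2 + 1.0

def expected_improvement(candidates, gp, best_loss):
    # Utility for minimization: E[max(best_loss - f(x), 0)] under the GP posterior.
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    z = (best_loss - mu) / sigma
    return (best_loss - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, size=3)       # a few initial random evaluations
y = loss(X)
candidates = np.linspace(-5.0, 5.0, 201)  # points the utility is evaluated on

for _ in range(10):  # fixed budget instead of a convergence criterion
    # Step 1: fit the surrogate to all points evaluated so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6)
    gp.fit(X.reshape(-1, 1), y)
    # Step 2: evaluate the loss where the utility is highest.
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)]
    X = np.append(X, x_next)
    y = np.append(y, loss(x_next))

best_x = X[np.argmin(y)]
print("best x:", best_x, "best loss:", y.min())
```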

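If the hyperparameter space is very large, use random search first to find promising combinations, then run a grid search locally around them to select the best values.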
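The coarse-to-fine idea might look like the sketch below: a random search over a wide range, followed by a small grid around the best point it found. The synthetic data, the two tuned parameters, and the "plus or minus one" neighbourhood are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data so the example runs on its own.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Stage 1: coarse random search over a wide range.
coarse = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_distributions={
        "max_depth": list(range(2, 21)),
        "min_samples_leaf": list(range(1, 11)),
    },
    n_iter=8, cv=3, random_state=0,
)
coarse.fit(X, y)
d = coarse.best_params_["max_depth"]
m = coarse.best_params_["min_samples_leaf"]

# Stage 2: exhaustive grid in a small neighbourhood of the stage-1 winner.
fine = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={
        "max_depth": sorted({max(2, d - 1), d, d + 1}),
        "min_samples_leaf": sorted({max(1, m - 1), m, m + 1}),
    },
    cv=3,
)
fine.fit(X, y)
print("coarse:", coarse.best_params_, coarse.best_score_)
print("fine:  ", fine.best_params_, fine.best_score_)
```

Because the fine grid contains the stage-1 winner and uses the same folds and seed, its best score should be no worse than the coarse one.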

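# BayesSearchCV is not part of scikit-learn; it comes from the separate
# scikit-optimize package (pip install scikit-optimize).
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestRegressor

# BayesSearchCV takes search_spaces; (low, high) tuples define integer ranges.
# The ranges below are illustrative assumptions.
search_spaces = {'n_estimators': (200, 2000), 'max_depth': (10, 100),
                 'min_samples_split': (2, 10), 'min_samples_leaf': (1, 4)}
clf = RandomForestRegressor()
Bayes = BayesSearchCV(estimator=clf, search_spaces=search_spaces,
                      n_iter=10,
                      cv=3, verbose=2, random_state=42, n_jobs=1)
Bayes.fit(x_train, y_train)
print(Bayes.best_params_)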

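K-fold cross-validation

A model that is too complex overfits, and one that is too simple underfits. To keep an unlucky data split from skewing the final result, K-fold cross-validation is generally used. The original data is split into K parts; in each round one part serves as the test set and the rest as the training set. Repeating this K times, so that each part is the test set exactly once, gives K models and K evaluation results, and their average is taken as the final performance estimate.
K = 10 is a common choice (scikit-learn's own default is 5 folds). scikit-learn provides a ready-made function for this, so you can call it directly.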

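from sklearn.model_selection import cross_val_score
# pipe_lr is assumed to be an estimator or pipeline defined earlier.
# The keyword is a capital X, not x.
scores = cross_val_score(estimator=pipe_lr, X=x_train, y=y_train, cv=5, n_jobs=1)
print('CV accuracy scores: %s' % scores)
print('CV accuracy mean: %.3f' % scores.mean())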
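To make the mechanics explicit, here is a sketch of the same procedure written out by hand with `KFold`. The synthetic dataset and the logistic regression model are assumptions chosen only so the snippet is self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Each fold is the test set exactly once; the other folds are used for training.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold

print("fold accuracies:", scores)
print("mean: %.3f, std: %.3f" % (np.mean(scores), np.std(scores)))
```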

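How to tell whether the parameters are suitable

Models are generally compared by looking at a few metrics, and the same metrics are used before and after tuning to judge whether the model has improved.

Accuracy
Recall
F1-score
ROC curve
Confusion matrix
These are usually computed by calling library functions directly.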

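from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Precision, recall and F1 are classification metrics, so a classifier is used here
# instead of a regressor (this assumes binary labels in y_train / y_test).
# The parameter name is min_samples_split, and 'mae' is not a classification criterion.
rf = RandomForestClassifier(max_depth=100, min_samples_split=5, min_samples_leaf=4)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('precision:%.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('recall:%.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1:%.3f' % f1_score(y_true=y_test, y_pred=y_pred))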
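The remaining metrics from the list (accuracy, the confusion matrix, and the area under the ROC curve) can be computed in the same way. The sketch below assumes a synthetic binary classification dataset and a default random forest, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with a stratified hold-out split (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)               # hard class labels
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_te, y_pred)      # rows: true class, columns: predicted class
acc = accuracy_score(y_te, y_pred)
auc = roc_auc_score(y_te, y_score)       # ROC AUC needs scores, not hard labels

print(cm)
print("accuracy: %.3f, ROC AUC: %.3f" % (acc, auc))
```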
