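Choosing hyperparameters for many machine learning algorithms is tedious, and tuning them by hand takes a lot of time. Fortunately, sklearn provides a grid search method for this. It is essentially brute force: we specify candidate values for each parameter, and GridSearchCV tries every combination, scores each one with cross-validation, and reports the combination that performs best.
We will keep using the Titanic survival prediction dataset from the previous section, again with the random forest algorithm, and see how GridSearchCV is used.
First, set the parameters to tune and their candidate values: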
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}
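To get a feel for how much work this brute-force search implies, the grid can be enumerated before fitting anything. The sketch below uses sklearn's ParameterGrid and assumes, for illustration only, that X has 7 columns (the value max_features takes in the results later in this post); n_features is a stand-in name, not part of the original code.

```python
from sklearn.model_selection import ParameterGrid

# Assumption: stand-in for len(X.columns), so the sketch runs without the dataset
n_features = 7

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [n_features],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}

# ParameterGrid expands the dict into every combination of candidate values
grid = ParameterGrid(param_grid)
n_candidates = len(grid)          # 1 * 3 * 1 * 3 * 2 * 3 combinations
cv_folds = 3                      # same cv value as used with GridSearchCV in this post
n_fits = n_candidates * cv_folds  # one model fit per candidate per fold

print("candidates:", n_candidates)
print("model fits with cv=3:", n_fits)

# Each candidate is a plain dict that could be passed to RandomForestClassifier(**params)
first_candidate = list(grid)[0]
print(first_candidate)
```

Adding one more value to any list multiplies the number of fits, so it is usually worth checking this count before starting a large search.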
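Next, initialize the algorithm we want to use and run the grid search to find the best parameters:
# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
# Show the best parameter combination
print(grid_search.best_params_)
Finally, take the model with the parameters found by grid search and train it. With the default refit=True, GridSearchCV has already refit best_estimator_ on the whole training set, so the explicit fit call here is optional:
# Train the model with the best parameters found by grid search
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)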
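Beyond best_params_, the fitted search object also exposes the cross-validation score of every candidate. The following is a minimal sketch on synthetic data from make_classification (an assumption, since processed_titanic.csv is not included here), with a smaller grid so it runs quickly; X_demo, y_demo and small_grid are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic stand-in for the Titanic features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# A deliberately small grid: 2 * 2 = 4 candidates
small_grid = {
    'max_depth': [3, 10],
    'n_estimators': [5, 10],
}

search = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print("mean CV accuracy of best candidate: {0:.3f}".format(search.best_score_))

# cv_results_ holds one row per candidate; sort by rank to compare them
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))

# refit=True is the default, so best_estimator_ is already trained on all of X_demo
demo_pred = search.best_estimator_.predict(X_demo[:5])
```

Looking at the whole table, rather than only the winner, shows whether the best candidate is clearly ahead or whether several combinations score almost the same.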
The complete code is as follows:
# -*- coding: utf-8 -*-
# @Time : 2018/12/14 9:59 AM
# @Author : yangchen
# @FileName: gridsearch.py
# @Software: PyCharm
# @Blog :https://blog.csdn.net/opp003/article
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
# Load the data
df = pd.read_csv('processed_titanic.csv', header=0)
# Separate the features (X) from the target (y)
X = df.drop(["survived"], axis=1)
y = df["survived"]
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)
# Build the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}
# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
# Show the best parameter combination
print(grid_search.best_params_)
# Train the model with the best parameters found by grid search
# (optional: with the default refit=True, best_estimator_ is already fitted on X_train)
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)
# Predict
pred_train = best_forest.predict(X_train)
pred_test = best_forest.predict(X_test)
# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print("Training accuracy: {0:.2f}, test accuracy: {1:.2f}".format(train_acc, test_acc))
# Other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average="binary")
print("precision: {0:.2f}, recall: {1:.2f}, F1: {2:.2f}".format(precision, recall, F1))
# Feature importances
features = list(X_test.columns)
importances = best_forest.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)
# Show the feature importances as a bar chart
plt.figure()
plt.title("Feature importances")
plt.bar(range(num_features), importances[indices], color="g", align="center")
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)
plt.xlim([-1, num_features])
plt.show()
# Print the importance of each feature, from most to least important
for i in indices:
    print("{0} - {1:.3f}".format(features[i], importances[i]))
The results:
{'bootstrap': True, 'max_depth': 20, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 5}
Training accuracy: 0.86, test accuracy: 0.76
precision: 0.86, recall: 0.79, F1: 0.82
sex - 0.428
age - 0.294
fare - 0.204
sibsp - 0.036
embarked - 0.030
parch - 0.008
pclass - 0.000
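We can see that the results are slightly better than those from the previous section. Grid search makes tuning more convenient, but it still relies on the modeler having some tuning experience: it can only pick the best combination among the candidate values we put into the grid.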
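To check whether tuning really helps, one option is to score an untuned model and the tuned one on the same held-out test set, with fixed random seeds so the comparison is repeatable. Below is a sketch on synthetic data (an assumption; the names are illustrative and the numbers will differ from the Titanic results above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic stand-in for the Titanic data
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# Baseline: default hyperparameters, fixed seed
baseline = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Tuned: grid search on the training set only; the test set is never seen during tuning
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'max_depth': [3, 10],
                                  'min_samples_leaf': [1, 4]},
                      cv=3)
search.fit(X_tr, y_tr)

# With refit=True (the default), predict() uses the refitted best estimator
tuned_acc = accuracy_score(y_te, search.predict(X_te))

print("baseline test accuracy: {0:.2f}".format(baseline_acc))
print("tuned test accuracy:    {0:.2f}".format(tuned_acc))
```

A small difference on a single split may simply be noise, so it is safer to read the two numbers as a rough comparison rather than as proof that one setting is better.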