Machine Learning: Choosing sklearn Algorithm Parameters with Grid Search

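Choosing parameters for many machine learning algorithms is tedious, and manual tuning is time-consuming. Fortunately, sklearn provides grid search for this. It works much like a brute-force attack: you specify candidate values for each parameter, and GridSearchCV evaluates every combination with cross-validation and reports the one that performs best.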

We again use the Titanic survivor prediction dataset from the previous section, and again the random forest algorithm, to see how GridSearchCV is used.

First, set the parameters to tune and their candidate values:

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}
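
To make the brute-force nature of the search concrete, the sketch below expands the same grid with only the standard library and counts how many models would be trained. It is an illustration, not how sklearn implements the search; the helper `expand_grid` is a hypothetical name, and the literal `7` stands in for `len(X.columns)` on the assumption that the dataset has the seven features listed in the results further down.

```python
from itertools import product

# Same candidate values as the grid above; 7 stands in for len(X.columns)
# (an assumption based on the seven features in this dataset).
param_grid = {
    "bootstrap": [True],
    "max_depth": [10, 20, 50],
    "max_features": [7],
    "min_samples_leaf": [3, 4, 5],
    "min_samples_split": [4, 8],
    "n_estimators": [5, 10, 50],
}


def expand_grid(grid):
    """Return every parameter combination as a list of dicts (hypothetical helper)."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
cv_folds = 3  # matches cv=3 used in this post

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # prints: 54 162
```

Every extra value in any list multiplies the total, so the grid grows quickly; that is the main cost to keep in mind when adding candidates.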

Next, initialize the algorithm we want to use, then run the grid search to find the best parameters:

# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Show the best parameter combination
print(grid_search.best_params_)
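
Besides `best_params_`, a fitted search also exposes `best_score_` and `cv_results_`, which help judge how close the runner-up combinations were. The following is a minimal sketch on synthetic data, not the Titanic set used in this post; the names `X_demo`, `y_demo`, `small_grid` and `search` are made up for the illustration, and the grid is deliberately tiny so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the Titanic data from this post).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# A deliberately small grid: 2 x 2 = 4 candidates.
small_grid = {"max_depth": [3, 10], "n_estimators": [5, 20]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

# Mean cross-validated score (accuracy by default for classifiers)
# of the winning combination.
print(search.best_score_)

# cv_results_ holds one entry per candidate; sort so that rank 1 comes first.
results = pd.DataFrame(search.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

If the top few rows differ by less than their standard deviation, the "best" combination may simply be noise, which is worth checking before reading too much into `best_params_`.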

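Finally, take the model trained with the best parameters found by the grid search:

# With the default refit=True, GridSearchCV has already retrained this
# estimator on all of X_train with the best parameters, so no further
# fit call is needed
best_forest = grid_search.best_estimator_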
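
A note on `best_estimator_`: with the default `refit=True`, GridSearchCV retrains the winning combination on all the data passed to `fit`, so the returned estimator can predict straight away. The sketch below checks this on synthetic data (again an assumption, not the Titanic set); `X_demo`, `y_demo` and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in data (assumption: NOT the Titanic data from this post).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 10], "n_estimators": [5, 20]},
    cv=3,
)  # refit is left at its default value, True
search.fit(X_demo, y_demo)

best = search.best_estimator_
check_is_fitted(best)  # would raise NotFittedError for an unfitted model

# The refitted estimator predicts without any extra fit call ...
preds = best.predict(X_demo)
# ... and predict on the search object delegates to best_estimator_.
same = (search.predict(X_demo) == preds).all()
print(search.refit, same)
```

Calling `fit` on `best_estimator_` again is harmless, but without a fixed `random_state` it trains a slightly different forest than the one the search selected.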

The complete code is as follows:

# -*- coding: utf-8 -*-
# @Time    : 2018/12/14 9:59 AM
# @Author  : yangchen
# @FileName: gridsearch.py
# @Software: PyCharm
# @Blog    :https://blog.csdn.net/opp003/article

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


# Load the data
df = pd.read_csv('processed_titanic.csv', header=0)

# Separate the features X from the target y
X = df.drop(["survived"], axis=1)
y = df["survived"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)


# Build the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}

# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Show the best parameter combination
print(grid_search.best_params_)

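# With the default refit=True, GridSearchCV has already retrained this
# estimator on all of X_train with the best parameters, so no further
# fit call is needed
best_forest = grid_search.best_estimator_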

# Predict
pred_train = best_forest.predict(X_train)
pred_test = best_forest.predict(X_test)

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print("Training accuracy: {0:.2f}, test accuracy: {1:.2f}".format(train_acc, test_acc))

# Other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average="binary")
print("precision: {0:.2f}, recall: {1:.2f}, F1: {2:.2f}".format(precision, recall, F1))

# Feature importances
features = list(X_test.columns)
importances = best_forest.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)


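# Show the feature importances as a bar chart
plt.figure()
plt.title("Feature importances")
plt.bar(range(num_features), importances[indices], color="g", align="center")
# rotation must be a number; recent matplotlib versions reject the string '45'
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)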
plt.xlim([-1, num_features])
plt.show()

# Print the importance of each feature
for i in indices:
    print("{0} - {1:.3f}".format(features[i], importances[i]))

The results:


{'bootstrap': True, 'max_depth': 20, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 5}
Training accuracy: 0.86, test accuracy: 0.76
precision: 0.86, recall: 0.79, F1: 0.82
sex - 0.428
age - 0.294
fare - 0.204
sibsp - 0.036
embarked - 0.030
parch - 0.008
pclass - 0.000

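We can see that the results are slightly better than those from the previous section. Grid search makes tuning more convenient, but it only evaluates the candidate values you supply, so the modeler still needs some tuning experience to choose a sensible grid.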

Reposted from blog.csdn.net/opp003/article/details/84998740