Building a simple model with xgboost

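In this post we build a simple model with xgboost, first through its native API and then through its scikit-learn interface. The data set is the mushroom data set from the UCI Machine Learning Repository; we use its 22 features to decide whether a mushroom is poisonous. The steps are as follows:
1. Import the packages the model needs. Here we use xgboost, sklearn, matplotlib, time and graphviz.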

import xgboost as xgb
import time
import graphviz
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

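2. Read the data sets into DMatrix objects: the training set used for training and the test set used for evaluation. The data is text in libsvm format, where a line looks like: 1 101:1.3 202:0.03 1:2.1. The leading 1 is the label of the sample; 101, 202 and 1 are feature indices; 1.3, 0.03 and 2.1 are the corresponding feature values.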
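
To make the libsvm layout concrete, here is a minimal sketch that splits one such line into its label and its index/value pairs. The helper name parse_libsvm_line is made up for this illustration; xgboost.DMatrix does its own parsing internally, and this is not how it is implemented.

```python
def parse_libsvm_line(line):
    """Split one libsvm-format line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # the first token is the sample's label
    features = {}
    for token in tokens[1:]:
        # every remaining token has the form index:value
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 101:1.3 202:0.03 1:2.1")
print(label)     # 1.0
print(features)  # {101: 1.3, 202: 0.03, 1: 2.1}
```

Features that do not appear on a line are simply absent from the dictionary, which is why the format suits sparse data such as the one-hot encoded mushroom features.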

# read in data
my_workpath = '/Users/huoshirui/Desktop/xyworking/pythonData/'
dtrain = xgb.DMatrix(my_workpath + 'agaricus.txt.train')
dtest = xgb.DMatrix(my_workpath + 'agaricus.txt.test')

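3. Set the training parameters. xgboost has many parameters; we set four of them here, plus the number of boosting rounds.
max_depth: the maximum depth of a tree. The default is 6 and any value from 1 upward is allowed; deeper trees generally give a more complex model.
eta: the learning rate (shrinkage step size). The range is 0 to 1 and the default is 0.3.
silent: 0 prints running messages, 1 runs in silent mode. The printed messages let us follow the intermediate results of a run. The default is 0. (Newer xgboost releases replace this parameter with verbosity.)
objective: the learning task and its learning objective. Our problem is binary classification (is the mushroom poisonous or not) with logistic regression (the output is the probability of being poisonous), so we use binary:logistic.
Number of rounds (num_round): how many trees the model will use.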
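
As a rough sketch of what objective and eta mean together: with binary:logistic, the model adds up the leaf value each tree assigns to a sample and passes the sum through the logistic (sigmoid) function to get a probability, and eta shrinks each tree's contribution. The function name and the leaf weights below are invented for illustration, and the sketch assumes an initial margin of 0; it is a conceptual picture, not xgboost's actual code.

```python
import math


def margin_to_probability(raw_leaf_weights, eta=1.0):
    """Sum the shrunken leaf weights and squash the sum with the sigmoid."""
    margin = sum(eta * w for w in raw_leaf_weights)
    return 1.0 / (1.0 + math.exp(-margin))


# hypothetical leaf weights reached by one sample in a 2-tree model (num_round = 2)
prob = margin_to_probability([1.2, -0.4], eta=1.0)
print(round(prob, 2))  # 0.69

# a smaller eta pulls the same sample's probability toward 0.5
prob_small_eta = margin_to_probability([1.2, -0.4], eta=0.3)
print(prob_small_eta < prob)  # True
```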

# specify parameters
param = {'max_depth':2, 'eta':1, 'silent':0, 'objective': 'binary:logistic'}
num_round = 2

4. Train the model. We call xgboost's train function; its arguments are the training parameters param defined above, the training data, and the number of rounds.

bst = xgb.train(param, dtrain, num_round)

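5. Predict with the trained model. Here we evaluate it on the test set by calling the model's predict function with the test data as its argument. The output is a predicted probability between 0 and 1, so we round it to 0 or 1 before comparing it with the actual labels of the test samples.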

test_preds = bst.predict(dtest)
test_predictions = [round(value) for value in test_preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test, test_predictions)
print("Test accuracy: %.2f%%" % (test_accuracy * 100.0))
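
The rounding step can also be written as an explicit threshold, which makes the decision rule visible. The sketch below uses made-up probabilities and labels rather than real model output, and the helper names to_labels and accuracy are invented for this example; accuracy mirrors what accuracy_score computes for this simple case.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


probs = [0.92, 0.08, 0.51, 0.30]  # made-up predicted probabilities
labels = [1, 0, 0, 0]             # made-up true labels
preds = to_labels(probs)
print(preds)  # [1, 0, 1, 0]
print("Test accuracy: %.2f%%" % (accuracy(labels, preds) * 100.0))  # Test accuracy: 75.00%
```

One difference from round is worth knowing: Python's built-in round sends exact halves to the nearest even integer, so round(0.5) is 0, whereas a threshold states explicitly which side a probability of exactly 0.5 falls on.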

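6. Once the model is built we can look at the structure of its trees. The three arguments passed to plot_tree are the model to visualize, the index of the tree to draw, and the layout direction (horizontal or vertical).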

xgb.plot_tree(bst, num_trees=0, rankdir='LR')
pyplot.show()

Running the code draws the following tree:
[Figure: the first tree of the model, as drawn by plot_tree]
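
To read such a plot, it may help to see how a sample travels through a tree: at each split node it goes to the "yes" branch if its feature value is below the threshold, to the "no" branch otherwise, and to the node's "missing" branch if the feature is absent, until it reaches a leaf. The sketch below walks a toy depth-2 tree; the feature indices, thresholds and leaf values are all hypothetical and are not taken from the trained mushroom model.

```python
# A toy depth-2 tree; every number in it is hypothetical.
tree = {
    "feature": 28, "threshold": 0.5, "missing": "yes",
    "yes": {
        "feature": 55, "threshold": 0.5, "missing": "yes",
        "yes": {"leaf": 1.7},
        "no": {"leaf": -1.9},
    },
    "no": {"leaf": -2.0},
}


def evaluate(node, features):
    """Follow the splits from the root until a leaf is reached."""
    while "leaf" not in node:
        value = features.get(node["feature"])
        if value is None:
            branch = node["missing"]  # feature absent from the sparse sample
        elif value < node["threshold"]:
            branch = "yes"
        else:
            branch = "no"
        node = node[branch]
    return node["leaf"]


print(evaluate(tree, {28: 1.0}))           # -2.0
print(evaluate(tree, {28: 0.0, 55: 1.0}))  # -1.9
print(evaluate(tree, {}))                  # 1.7
```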

The following code shows xgboost used together with scikit-learn:

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
import time

# read in data
my_workpath = '/Users/huoshirui/Desktop/xyworking/pythonData/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

num_round = 2
bst = XGBClassifier(max_depth=5, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')
bst.fit(X_train, y_train)

# evaluate the model on the test set
test_preds = bst.predict(X_test)
test_predictions = [round(value) for value in test_preds]
test_accuracy = accuracy_score(y_test, test_predictions)
print("Test accuracy: %.2f%%" % (test_accuracy * 100))
