Learning sklearn: XGBoost in Practice

XGBoost is the powerhouse created by Tianqi Chen. I spent ages trying to install it on my Mac and went through all kinds of install guides, until I found a one-line install via another powerhouse, Anaconda. So good~

Once installed, it works out of the box. XGBoost is an upgraded version of GBDT: it is more powerful and supports parallel training. For the past couple of years it has basically dominated Kaggle, crushing other algorithms.
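For reference, the one-line installs mentioned above (assuming conda from Anaconda is on your PATH; the plain pip route also works on most setups):

```shell
# via Anaconda / conda (the "one-line install" from above)
conda install -c conda-forge xgboost

# or, without conda:
pip install xgboost
```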

import os
import pandas as pd
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Work around the duplicate-OpenMP-runtime crash on macOS
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


column_names = ['uin', 'gender', 'age', 'play_cnt', 'share_cnt', 'influence_pv', 'ds1', 'ds2', 'ds3', 'label']

data = pd.read_csv('lr_feature.csv', usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], names=column_names)
print(data.head(10))

# Split into train and test sets.
# Columns 1-5 (gender ... influence_pv) are the features; column 9 (label) is the target.
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:6]], data[column_names[9]],
                                                    test_size=0.25, random_state=3)


model = XGBClassifier(learning_rate=0.01,
                      n_estimators=10,           # number of trees: build xgboost with 10 trees
                      max_depth=3,               # depth of each tree
                      min_child_weight=1,        # minimum sum of instance weight in a leaf
                      gamma=0.,                  # coefficient on the leaf count in the penalty term
                      subsample=1,               # use all samples for each tree
                      colsample_bytree=1,        # use all features for each tree (fixed typo: was colsample_btree)
                      scale_pos_weight=1,        # balance classes when sample counts are skewed
                      random_state=27,           # random seed
                      silent=0                   # fixed typo: was slient (deprecated in newer xgboost; use verbosity)
                      )
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
print("Accuracy: %.4g" % metrics.accuracy_score(y_test, y_pred))
print("F1_score: %.4g" % metrics.f1_score(y_test, y_pred))
print("Recall: %.4g" % metrics.recall_score(y_test, y_pred))
y_train_proba = model.predict_proba(X_train)[:, 1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, y_train_proba))
y_proba = model.predict_proba(X_test)[:, 1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, y_proba))

Output:

       uin  gender  age  play_cnt  share_cnt  influence_pv  ds1  ds2  ds3  label
0  1889812       2   67         2          1             0    0    2    2    0.0
1  1966339       2   69       747         92           194   15   15   30    1.0
2  1982539       2   66      1165        104            40   12   12   24    1.0
3  2131170       3   78        53        146           117    9    3   12    1.0
4  4471700       3   81         2          0             0    1    3    4    0.0
5  4921331       3   79      1634        176           178   15   15   30    1.0
6  5441180       3   68         0          4             0    0    4    4    0.0
7  6144422       2   79       109         23            25   10   14   24    1.0
8  6807020       3   72       418         54            90   11   11   22    1.0
9  7015648       3   76       144          9            15   11    7   18    1.0
Accuracy: 0.9668
F1_score: 0.97
Recall: 0.9693
AUC Score (Train): 0.989206
AUC Score (Test): 0.988982

Since we use only a few features, the tree depth is set to just 3 and the number of trees to 10; the other parameters are mostly left at their defaults. When there are many features, tuning matters much more: a well-chosen set of parameters can easily beat the time and effort spent on feature engineering. See references 1 and 2 for tuning details; there are also plenty of blog posts online covering it.

References:

  1. https://zhuanlan.zhihu.com/p/52501965
  2. https://zhuanlan.zhihu.com/p/68864414
  3. https://blog.csdn.net/sinat_20177327/article/details/81090324
  4. https://blog.csdn.net/han_xiaoyang/article/details/52665396

Reposted from blog.csdn.net/zuolixiangfisher/article/details/104259214