Task 5

1. Importing the Data
The data loaded here is the feature-engineered dataset produced earlier (Task 2, feature engineering).
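The snippets in this post rely on the following imports; a minimal sketch using the standard module names (adjust to your own environment):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier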
data_del = pd.read_csv('data_del.csv')
data_del.head()
2. Splitting the Data
Use sklearn to split the dataset into a training set and a test set at a 7:3 ratio, with random seed 2018:
X_train, X_test, y_train, y_test = train_test_split(data_del.drop(['status'], axis=1).values,
                                                    data_del['status'].values,
                                                    test_size=0.3, random_state=2018)

Check the sizes of the resulting training set and test set:
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3133, 12), (3133,), (1343, 12), (1343,)]
3. Model Fusion (Stacking)
The data is split into two parts: a training set and a test set.
Training set: run K-fold cross-validation on the training set. Taking five folds as an example, the training set is split into 5 parts; in each round one part serves as validation data and the other four as training data. After fitting the model on the training data, predict on the validation data; these predictions are the orange "Predict" blocks in the usual stacking diagram. Concatenating the predictions from the five rounds gives the orange "Predictions", which become the training set for the next-level model.
Test set: in each cross-validation round, after fitting the model on the training data, predict not only on the validation data but also on the test set; these are the green "Predict" blocks in the diagram. Averaging the five rounds of test-set predictions gives the test set for the next-level model (the green "Predictions").
def get_stacking_data(models, X_train, y_train, X_test, y_test, k=5):
    '''Build the training and test sets for the next-level model.
    models:  list of base models
    X_train: current training data
    y_train: current training labels
    X_test:  current test data
    y_test:  current test labels
    k:       number of folds for K-fold cross-validation
    return:  next_train: training set for the next-level model
             next_test:  test set for the next-level model
    '''
    kfold = KFold(n_splits=k, random_state=2018, shuffle=True)
    next_train = np.zeros((X_train.shape[0], len(models)))
    next_test = np.zeros((X_test.shape[0], len(models)))

    for j, model in enumerate(models):
        next_test_temp = np.zeros((X_test.shape[0], k))
        ksplit = kfold.split(X_train)
        for i, (train_index, val_index) in enumerate(ksplit):
            X_train_fold, y_train_fold = X_train[train_index], y_train[train_index]
            X_val = X_train[val_index]
            model.fit(X_train_fold, y_train_fold)
            # out-of-fold predictions fill one column of the next-level training set
            next_train[val_index, j] = model.predict(X_val)
            # test-set predictions from each fold are averaged below
            next_test_temp[:, i] = model.predict(X_test)
        next_test[:, j] = np.mean(next_test_temp, axis=1)

    return next_train, next_test

4. Choosing the Models for Fusion
First look at the evaluation results of the seven models under default parameters (see the previous post, Task 4, for the evaluation code):
Model  AUC  Accuracy  F1-score  Precision  Recall
Random Forest  train: 90.82%; test: 79.88%  train: 84.33%; test: 79.60%  train: 58.07%; test: 46.69%  train: 43.59%; test: 34.68%  train: 86.96%; test: 71.43%
GBDT  train: 87.87%; test: 79.15%  train: 84.26%; test: 78.78%  train: 57.17%; test: 44.01%  train: 42.18%; test: 32.37%  train: 88.68%; test: 68.71%
XGBoost  train: 90.41%; test: 79.28%  train: 85.06%; test: 79.23%  train: 63.03%; test: 49.18%  train: 51.15%; test: 39.02%  train: 82.10%; test: 66.50%
LightGBM  train: 86.70%; test: 79.53%  train: 82.41%; test: 78.93%  train: 49.77%; test: 41.41%  train: 35.00%; test: 28.90%  train: 86.12%; test: 72.99%
Logistic Regression  train: 76.33%; test: 78.34%  train: 78.77%; test: 78.70%  train: 37.68%; test: 38.89%  train: 25.77%; test: 26.30%  train: 70.03%; test: 74.59%
SVM  train: 80.23%; test: 74.26%  train: 80.82%; test: 77.96%  train: 43.14%; test: 34.80%  train: 29.23%; test: 22.83%  train: 82.31%; test: 73.15%
Decision Tree  train: 76.63%; test: 74.19%  train: 79.29%; test: 77.14%  train: 46.41%; test: 43.46%  train: 36.03%; test: 34.10%  train: 65.20%; test: 59.90%
The four ensemble models (Random Forest, GBDT, XGBoost, LightGBM) and logistic regression clearly perform better, so Random Forest, GBDT, logistic regression and LightGBM are used as the base models, with XGBoost as the second-level model. For now all models keep their default parameters.
rnd_clf = RandomForestClassifier(random_state=2018)
gbdt = GradientBoostingClassifier(random_state=2018)
xgb = XGBClassifier(random_state=2018)
lgbm = LGBMClassifier(random_state=2018)
log = LogisticRegression(random_state=2018, max_iter=1000)
svc = SVC(random_state=2018, probability=True)
tree = DecisionTreeClassifier(random_state=2018)
base_models = [rnd_clf, gbdt, lgbm, log]
next_train, next_test = get_stacking_data(base_models, X_train, y_train, X_test, y_test, k=10)
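The stacked features have one column per base model, so with four base models their shapes follow directly from the split sizes above:
[next_train.shape, next_test.shape]
# expected: [(3133, 4), (1343, 4)]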
Training and evaluating the fused (stacking) model
stacking_model = XGBClassifier(random_state=2018)
stacking_model.fit(next_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic',
random_state=2018, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=True, subsample=1)
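The evaluation code is the same helper used in the Task 4 post; as a minimal stand-in, an equivalent evaluation step could look like the sketch below (the evaluate function is illustrative, not the original helper):
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, precision_score, recall_score

def evaluate(model, X, y):
    # report the five metrics used throughout this series
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)[:, 1]
    print('AUC: {:.2%}  Accuracy: {:.2%}  F1: {:.2%}  Precision: {:.2%}  Recall: {:.2%}'.format(
        roc_auc_score(y, y_proba), accuracy_score(y, y_pred),
        f1_score(y, y_pred), precision_score(y, y_pred), recall_score(y, y_pred)))

evaluate(stacking_model, next_train, y_train)   # training-set metrics
evaluate(stacking_model, next_test, y_test)     # test-set metrics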
Model  AUC  Accuracy  F1-score  Precision  Recall
Stacking model  train: 64.89%; test: 79.04%  train: 78.58%; test: 83.54%  train: 39.06%; test: 59.15%  train: 27.56%; test: 46.24%  train: 66.98%; test: 82.05%
5. Summary
Compared with the individual models, the fused model improves on every metric except AUC, and it does so with default parameters. Tuning the hyperparameters should improve the results further.
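As a possible next step, the second-level XGBoost model could be tuned with a grid search over its main hyperparameters. A minimal sketch, assuming a small illustrative parameter grid (not a tuned choice from this post):
from sklearn.model_selection import GridSearchCV

# illustrative grid; the actual search space would need to be chosen experimentally
param_grid = {
    'max_depth': [2, 3, 4],
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.2],
}
grid = GridSearchCV(XGBClassifier(random_state=2018), param_grid, scoring='roc_auc', cv=5)
grid.fit(next_train, y_train)
print(grid.best_params_, grid.best_score_)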

Reposted from blog.csdn.net/weixin_41741008/article/details/88387192