Machine Learning scikit-learn 5 - Predicting Whether Loan Users Will Become Overdue - Model Performance Evaluation

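Core code

Code path: https://github.com/spareribs/kaggleSpareribs/blob/master/Overdue/ml/code/sklearn_train_all.py

How to use the code

  1. [Required] Set the file storage paths in config.py.
  2. [Required] Run base.py in features first to preprocess the data. [PS: adjust it to your own setup.]
  3. [Optional] Use sklearn_gcv.py in code to search for the best model configuration.
  4. [Required] Finally, use sklearn_train_auc.py in code to train the model and output the results.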
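
Step 3 is described only by file name, so the contents of sklearn_gcv.py are not shown in this post. As a rough illustration of what such a search might look like, the sketch below runs scikit-learn's GridSearchCV with ROC AUC scoring. The synthetic data, the LogisticRegression estimator and the C grid are all assumptions made for the example, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data (assumption: the real features come from base.py)
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Hypothetical search space; the real grid depends on the model being tuned
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
    cv=5,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV roc_auc: {0:.4f}".format(search.best_score_))
```

The best estimator refitted on the full data is then available as search.best_estimator_ and could be passed on to the training step.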

Code that prints the metrics

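from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             roc_auc_score, roc_curve)


def model_metrics(clf, x_train, x_vali, y_train, y_vali):
    """
    Print accuracy, recall, F1, ROC AUC and KS for a fitted binary classifier.

    :param clf: fitted model; it must implement predict and predict_proba
    :param x_train: training set features
    :param x_vali: test set features
    :param y_train: training set labels
    :param y_vali: test set labels
    :return: the data needed for plotting, returned as a dict
    """
    print("Model under test & its parameters:\n{0}".format(clf))
    # pre_train = clf.predict(x_train)
    # print("Train accuracy: {0:.4f}".format(clf.score(x_train, y_train)))
    # print("Train f1 score: {0:.4f}".format(f1_score(y_train, pre_train)))
    # print("Train auc score: {0:.4f}".format(roc_auc_score(y_train, pre_train)))
    # Hard class labels for accuracy / recall / F1
    y_train_pred = clf.predict(x_train)
    y_vali_pred = clf.predict(x_vali)
    # Positive-class probabilities for ROC AUC and KS
    y_train_pred_proba = clf.predict_proba(x_train)[:, 1]
    y_vali_pred_proba = clf.predict_proba(x_vali)[:, 1]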

    print("=" * 20)
    # Accuracy
    print("Accuracy: \nTrain: {0:.4f}\nTest: {1:.4f}".format(
        accuracy_score(y_train, y_train_pred),
        accuracy_score(y_vali, y_vali_pred)
    ))
    print("-" * 20)
    # Recall
    print("Recall: \nTrain: {0:.4f}\nTest: {1:.4f}".format(
        recall_score(y_train, y_train_pred),
        recall_score(y_vali, y_vali_pred)
    ))
    print("-" * 20)
    # f1_score
    print("f1_score: \nTrain: {0:.4f}\nTest: {1:.4f}".format(
        f1_score(y_train, y_train_pred),
        f1_score(y_vali, y_vali_pred)
    ))
    print("-" * 20)
    # roc_auc (computed from probabilities, not hard labels)
    roc_auc_train = roc_auc_score(y_train, y_train_pred_proba)
    roc_auc_vali = roc_auc_score(y_vali, y_vali_pred_proba)

    print("roc_auc: \nTrain: {0:.4f}\nTest: {1:.4f}".format(roc_auc_train, roc_auc_vali))
    print("-" * 20)
    # Points of the ROC curve
    fpr_tr, tpr_tr, _ = roc_curve(y_train, y_train_pred_proba)
    fpr_te, tpr_te, _ = roc_curve(y_vali, y_vali_pred_proba)
    print("ROC curve points: \nTrain: fpr_tr {0} tpr_tr {1}\nTest: fpr_te {2} tpr_te {3}".format(
        len(fpr_tr), len(tpr_tr),
        len(fpr_te), len(tpr_te)
    ))
    print("-" * 20)
    # KS: the largest gap between TPR and FPR over all thresholds
    ks_train = max(abs(fpr_tr - tpr_tr))
    ks_vali = max(abs(fpr_te - tpr_te))
    print("KS: \nTrain: {0:.4f}\nTest: {1:.4f}".format(
        ks_train,
        ks_vali
    ))
    print("=" * 20)
    rou_auc = {
        "roc_auc_train": roc_auc_train,
        "roc_auc_vali": roc_auc_vali,
        "ks_train": ks_train,
        "ks_vali": ks_vali,
        "fpr_tr": fpr_tr,
        "tpr_tr": tpr_tr,
        "fpr_te": fpr_te,
        "tpr_te": tpr_te,
    }
    return rou_auc
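
The KS value computed in model_metrics is the largest vertical gap between the TPR and FPR curves over all score thresholds. To make that concrete, here is a small pure-Python sketch that sweeps the thresholds by hand. The helper ks_statistic and the four-sample data are hypothetical and exist only for illustration; they are not part of the project code.

```python
def ks_statistic(y_true, y_score):
    """Return max |TPR - FPR| over all score thresholds (binary labels 0/1)."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Visit samples from the highest score to the lowest
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples tied at the same score cross the threshold together
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        best = max(best, abs(tp / n_pos - fp / n_neg))
    return best


# Toy data: one negative (score 0.4) outranks one positive (score 0.35)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
ks = ks_statistic(y_true, y_score)
print("KS = {0:.2f}".format(ks))  # prints "KS = 0.50"
```

With perfectly separated scores the gap reaches 1.0, and a model that ranks at random stays close to 0, which is why KS is read alongside ROC AUC as a measure of ranking quality.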

Plotting code

import matplotlib.pyplot as plt

# rou_auc is the dict returned by model_metrics; clf_name is the model's name,
# both defined by the calling script
plt.plot(rou_auc.get("fpr_tr"), rou_auc.get("tpr_tr"), 'r-',
         label="Train:AUC: {:.3f} KS:{:.3f}".format(rou_auc.get("roc_auc_train"), rou_auc.get("ks_train")))
plt.plot(rou_auc.get("fpr_te"), rou_auc.get("tpr_te"), 'g-',
         label="Test:AUC: {:.3f} KS:{:.3f}".format(rou_auc.get("roc_auc_vali"), rou_auc.get("ks_vali")))
# Diagonal reference line: the ROC curve of a random classifier
plt.plot([0, 1], [0, 1], 'd--')
plt.legend(loc='best')
plt.title("{0} ROC curve".format(clf_name))
# plt.savefig("{0}_roc_auc.jpg".format(clf_name))
plt.show()

Model performance evaluation

Model                Split  Accuracy  Recall  f1_score  ROC_AUC  KS
Logistic regression  train  0.8010    0.3337  0.4538    0.8003   0.4574
                     test   0.7905    0.3333  0.4514    0.7901   0.4550
Linear SVM           train  0.7977    0.2621  0.3910    0.8104   0.4824
                     test   0.7884    0.2547  0.3837    0.7915   0.4369
Polynomial SVM       train  0.8206    0.2816  0.4373    0.9439   0.7957
                     test   0.7526    0.1247  0.2067    0.7461   0.4003
Decision tree        train  0.9856    0.9417  0.9700    0.9996   0.9806
                     test   0.7821    0.3252  0.4356    0.7336   0.3672
xgboost              train  0.8515    0.4769  0.6141    0.9119   0.6561
                     test   0.7898    0.3388  0.4545    0.7958   0.4587
lightgbm             train  1.0000    1.0000  1.0000    1.0000   1.0000
                     test   0.7849    0.3930  0.4858    0.7811   0.4344

[ROC curve column: one plot per model, images not included]
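
One way to read the table is to look at the gap between the train and test scores of each model: a large gap suggests overfitting. The short sketch below does this for ROC_AUC, using the values copied from the table above. The variable names are hypothetical, and the gap is only a rough indicator, not a formal test.

```python
# (train, test) ROC_AUC pairs copied from the results table
roc_auc = {
    "logistic regression": (0.8003, 0.7901),
    "linear SVM": (0.8104, 0.7915),
    "polynomial SVM": (0.9439, 0.7461),
    "decision tree": (0.9996, 0.7336),
    "xgboost": (0.9119, 0.7958),
    "lightgbm": (1.0000, 0.7811),
}

# Train minus test: the larger the gap, the stronger the hint of overfitting
gaps = {name: round(train - test, 4) for name, (train, test) in roc_auc.items()}
ranked = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

for name, gap in ranked:
    print("{0:<20} gap = {1:.4f}".format(name, gap))

# Model with the highest test ROC_AUC
best_test = max(roc_auc, key=lambda name: roc_auc[name][1])
print("best test ROC_AUC:", best_test)
```

On these numbers the decision tree and lightgbm show the largest gaps, while xgboost has the highest test ROC_AUC with a moderate gap.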

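Open questions

  1. Fitting the SVM took an extremely long time.
    My first guess was that the step size in gradient descent was too large, so the model could not find the optimum and failed to converge. After adding max_iter the model did output a result, but the ROC curve looked very strange, as shown in the figure below:
    [Figure: ROC curve obtained after adding max_iter]
    Comparing against the standardization code showed that the data had never been normalized: I had used the wrong variable.
    [Figure]
  2. lightgbm scores 1.0 on every training-set metric, which looks like overfitting; this needs further analysis.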
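
For the first question, the root cause was that the unscaled data reached the SVM. One way to make that mistake harder to repeat is to put the scaler and the model in a single scikit-learn Pipeline, so there is no separate "scaled" variable to mix up. The sketch below is an illustration under assumptions: the synthetic data, the linear kernel and the split ratio are made up for the example and are not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature's scale, as raw amounts often are

x_train, x_vali, y_train, y_vali = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The scaler is fitted on the training data only, inside fit();
# predict / predict_proba apply the same scaling automatically
clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", probability=True, random_state=0),
)
clf.fit(x_train, y_train)

proba = clf.predict_proba(x_vali)[:, 1]
print("validation samples scored:", len(proba))
```

Because the pipeline exposes predict and predict_proba, it can be passed to model_metrics like any other classifier.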

Reposted from blog.csdn.net/q370835062/article/details/84436428