Study Notes (7): A Simple Application of Grid Search and Cross-Validation for Model Tuning

The data is financial data, and the task is to predict whether a loan customer will become overdue. In the table, status is the label: 0 means not overdue, 1 means overdue.
Mission 1 - Build a logistic regression model for prediction
Mission 2 - Build SVM and decision tree models for prediction
Mission 3 - Build XGBoost and LightGBM models for prediction
Mission 4 - Record a table of precision, recall, F1, and AUC scores for the five models, and plot the ROC curves
Mission 5 - Handle data type conversion and missing values (try different imputation strategies and compare the results), plus any data exploration worth borrowing
Mission 6 - Perform simple hyperparameter tuning with grid search and cross-validation.

Data overview

For the earlier data preprocessing, see the previous post.
The only difference is that here the data just needs to be split into two parts: the labels and the training features.


import pandas as pd
from sklearn.preprocessing import StandardScaler

# Combine the processed features and the date-derived features from the previous post
datafinal = pd.concat([datanew, date_temp], axis=1)

# Split into labels and features
data_train = datafinal['status']
datafinal.drop(["status"], axis=1, inplace=True)

X_train = datafinal
y_train = data_train

# Standardize the feature matrix
standardScaler = StandardScaler()
X_train_fit = standardScaler.fit_transform(X_train)

Cross-validation

The basic idea of cross-validation is to partition the original dataset into groups: one part is used as the training set and the other as the validation set (or test set). The classifier is first trained on the training set and then tested on the validation set, and the resulting score serves as a performance measure of the classifier.
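
As a point of reference for the sections below, a single hold-out split looks like this. This is a minimal sketch: it assumes the datafinal features and data_train labels prepared above, uses LogisticRegression only as a placeholder model, and the random_state value is arbitrary.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# One random split: the score depends on which rows happen to land in the test set
X_tr, X_te, y_tr, y_te = train_test_split(datafinal, data_train, test_size=0.3, random_state=2018)
clf = LogisticRegression()
clf.fit(X_tr, y_tr)
print("hold-out accuracy:", clf.score(X_te, y_te))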

1. Cross-validation with cross_val_score

Advantages of cross-validation: with the train_test_split method used previously, the data split is a matter of chance; by splitting the data multiple times, cross-validation greatly reduces the randomness introduced by a single split. At the same time, through multiple splits and multiple training runs, the model also gets to see a wider variety of data, which improves its generalization ability.
First, initialize the models:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

log_reg = LogisticRegression()
lsvc = SVC()
dtc = DecisionTreeClassifier()
xgbc_model = XGBClassifier()
lgbm_model = LGBMClassifier()
models = [log_reg, lsvc, dtc, xgbc_model, lgbm_model]

Using cross-validation:

from sklearn.model_selection import cross_val_score

for model in models:
    # Plain 5-fold cross-validation (stratified folds are used for classifiers by default)
    score = cross_val_score(model, datafinal, data_train, cv=5)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))

2. K-fold cross-validation (KFold)

K-fold cross-validation: sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
Idea: split the training/test dataset into n_splits mutually exclusive subsets. In each round, one subset is used as the validation set and the remaining n_splits-1 subsets as the training set; training and testing are repeated n_splits times, producing n_splits results.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
for model in models:
    score = cross_val_score(model, datafinal, data_train, cv=kf)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))
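
To make the splitting mechanics concrete, the sketch below prints the fold sizes that KFold produces; it only assumes the datafinal feature table from above.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, val_idx) in enumerate(kf.split(datafinal)):
    # Every row appears in exactly one validation fold across the 5 iterations
    print("fold {}: {} training rows, {} validation rows".format(fold, len(train_idx), len(val_idx)))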

3. Leave-one-out cross-validation (LeaveOneOut)

Leave-one-out cross-validation is a special form of cross-validation. As the name suggests, if the sample size is n, then k = n: n-fold cross-validation is performed, leaving out exactly one sample for validation each time. It is mainly intended for small datasets.

from sklearn.model_selection import LeaveOneOut

loout = LeaveOneOut()
for model in models:
    # Note: leave-one-out trains each model n times, so it is slow on larger datasets
    score = cross_val_score(model, datafinal, data_train, cv=loout)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))

4. Shuffle-split cross-validation (ShuffleSplit)

This gives more flexible control: you can set the number of split iterations and the proportions of the test and training sets in each split (which means some samples may end up in neither the training set nor the test set).

from sklearn.model_selection import ShuffleSplit

shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=5)
for model in models:
    score = cross_val_score(model, datafinal, data_train, cv=shufspl)
    print("\n{} scores: {}".format(model, score))
    print("\n{} mean score: {}".format(model, score.mean()))
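
To check the claim above that some rows can fall in neither set: with train_size=.5 and test_size=.4, roughly 10% of the rows are left unused in each split. A minimal sketch, again assuming the datafinal table from above (the random_state is arbitrary):

from sklearn.model_selection import ShuffleSplit

shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=5, random_state=2018)
for train_idx, test_idx in shufspl.split(datafinal):
    # Rows that land in neither the training nor the test indices are simply skipped this round
    unused = len(datafinal) - len(train_idx) - len(test_idx)
    print("train: {}, test: {}, unused: {}".format(len(train_idx), len(test_idx), unused))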

Grid search

Exhaustive search: loop over all candidate parameter settings, try every possible combination, and take the best-performing parameters as the final result. The principle is like finding the maximum value in an array. (Why is it called grid search? Take a model with two parameters as an example: if parameter a has 3 possible values and parameter b has 4, listing all possibilities gives a 3*4 table, where each cell is one grid point, and the loop walks through and searches every cell.) Here we use grid search with cross-validation.
For the results before tuning, see: https://blog.csdn.net/zhangyunpeng0922/article/details/84257426
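
The 3*4 grid in the example above can be enumerated directly. The sketch below uses sklearn's ParameterGrid with two made-up parameters a and b, purely to illustrate what GridSearchCV loops over:

from sklearn.model_selection import ParameterGrid

# 3 candidate values for a and 4 for b give 3*4 = 12 grid cells to evaluate
grid = ParameterGrid({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3, 0.4]})
for params in grid:
    print(params)
print("total combinations:", len(grid))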

1. Grid search for logistic regression

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import make_scorer, accuracy_score

param_grid = [
    {
        'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
        'penalty': ['l2'],
        'tol': [1e-4, 1e-5, 1e-6]
    },
    {
        'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
        # note: with newer scikit-learn, penalty='l1' requires solver='liblinear' or 'saga'
        'penalty': ['l1'],
        'tol': [1e-4, 1e-5, 1e-6]
    }
]
score = make_scorer(accuracy_score)
print("\n{}Logistic regression prediction{}".format("*" * 20, "*" * 20))
kf = KFold(n_splits=5, shuffle=False)
grid_search = GridSearchCV(log_reg, param_grid, scoring=score, cv=kf)
grid_search.fit(X_train, y_train)

The results:
grid_search.best_score_: 0.8019236549443943
grid_search.best_params_: {'C': 0.1, 'penalty': 'l1', 'tol': 1e-05}

Final model evaluation:
predict: 0.6811492641906096
roc_auc_score: 0.7046975097838823
precision_score: 0.44631901840490795
recall_score: 0.7558441558441559
f1_score: 0.6554107807490491

(Not sure why this is slightly worse than last time; the parameters may not have been tuned well.)
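
The evaluation numbers above come from a held-out test set. Below is a minimal sketch of that step; X_test and y_test are hypothetical names here (the actual split comes from the previous post), the exact way the original AUC was computed may differ, and the same pattern applies to the other four models.

from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score

# GridSearchCV refits the best parameters on the full training data by default (refit=True),
# so the tuned model is available as best_estimator_
best_lr = grid_search.best_estimator_
y_pred = best_lr.predict(X_test)

print("predict:", accuracy_score(y_test, y_pred))
print("roc_auc_score:", roc_auc_score(y_test, best_lr.predict_proba(X_test)[:, 1]))
print("precision_score:", precision_score(y_test, y_pred))
print("recall_score:", recall_score(y_test, y_pred))
print("f1_score:", f1_score(y_test, y_pred))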

2. Grid search for the decision tree

"""
2 使用决策树预测
"""
param_grid_tree = [
    {
        'max_depth': [m for m in range(5,10)],
        'class_weight': ['balanced',None]
    }
]
decisions = {}
print("\n{}使用决策树预测{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_tree = GridSearchCV(dtc,param_grid_tree,score,cv=kf)
grid_search_tree.fit(X_train_fit, y_train)

The results:
grid_search_tree.best_score_: 0.7664562669071235
grid_search_tree.best_params_: {'class_weight': None, 'max_depth': 5}

Model evaluation:
DecisionTreeClassifier accuracy: 0.7561317449194114
roc_auc_score: 0.580806142034549
precision_score: 0.6581196581196581
recall_score: 0.2
f1_score: 0.30677290836653387

(Some improvement.)

3. Grid search for SVM

"""
2 使用SVC预测
"""
param_grid_svc = [
    {
        'C':  [4.5,5,5.5,6],
        'gamma': [0.0009,0.001,0.0011,0.002],
        'class_weight': ['balanced',None] 
    }
]
print("\n{}使用SVC预测{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_svc = GridSearchCV(lsvc,param_grid_svc,score,cv=kf)
grid_search_svc.fit(X_train_fit, y_train)

The results:
grid_search_svc.best_score_: 0.8001202284340246
grid_search_svc.best_params_: {'C': 4.5, 'class_weight': None, 'gamma': 0.001}

Model evaluation:
linear_svc accuracy: 0.7757533286615277
roc_auc_score: 0.60652466535384
precision_score: 0.773109243697479
recall_score: 0.23896103896103896
f1_score: 0.3650793650793651

4. Grid search for XGBoost

"""
2 使用xgboost预测
   
"""
param_grid_xgb = [
    {
        "max_depth": [10,30,50],
        "min_child_weight" : [1,3,6],
        "n_estimators": [200],
        "learning_rate": [0.05, 0.1,0.16]
         }
]
decisions = {}
print("\n{}使用xgboost预测{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_xgb = GridSearchCV(xgbc_model,param_grid_xgb,score,cv=kf)
# sgd = SGDClassifier()
grid_search_xgb.fit(X_train_fit, y_train)

The results:
grid_search_xgb.best_score_: 0.7923053802224226
grid_search_xgb.best_params_: {'learning_rate': 0.05, 'max_depth': 30, 'min_child_weight': 6, 'n_estimators': 200}

Model evaluation:
xgbc_model accuracy: 0.7785564120532585
roc_auc_score: 0.6420170999825511
precision_score: 0.6751269035532995
recall_score: 0.34545454545454546
f1_score: 0.45704467353951883

(It feels like these parameters were not set well either; the results show no real improvement.)

5. Grid search for LightGBM

"""
2 使用lightgmb预测
"""
param_grid_lgb = [
    {        
        "max_depth": [5,10, 15],
        "learning_rate" : [0.01,0.05,0.1],
        "num_leaves": [30,90,120],
        "n_estimators": [20]        
    }
]
decisions = {}
print("\n{}使用lightgmb预测{}".format("*"*20, "*"*20))
score = make_scorer(accuracy_score)
kf = KFold(n_splits=5,shuffle=False)
grid_search_lgb = GridSearchCV(lgbm_model,param_grid_lgb,score,cv=kf)
# sgd = SGDClassifier()
grid_search_lgb.fit(X_train_fit, y_train)

The results:
grid_search_lgb.best_score_: 0.7953110910730388
grid_search_lgb.best_params_: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 20, 'num_leaves': 30}

Model evaluation:
lgbm_model accuracy: 0.7813594954449895
roc_auc_score: 0.6283782436373607
precision_score: 0.7354838709677419
recall_score: 0.2961038961038961
f1_score: 0.4222222222222222

Problems

  1. Good parameter ranges for each model are not easy to find.
  2. After many rounds of tuning, the results are sometimes a bit worse than before tuning.
  3. When the parameter ranges are set poorly, it both wastes time and gives poor results.
  4. This is skilled work, and it also takes patience.
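
One common way to ease problems 1-3 is to replace the exhaustive grid with a random search over the same ranges, so that only a subset of the combinations is evaluated. The sketch below is only an illustration, reusing the XGBoost grid from above (param_grid_xgb, xgbc_model, score, X_train_fit and y_train are assumed to be defined as in the earlier sections; the n_iter and random_state values are arbitrary).

from sklearn.model_selection import RandomizedSearchCV, KFold

kf = KFold(n_splits=5, shuffle=False)
# Sample 10 of the 27 combinations instead of trying all of them
random_search = RandomizedSearchCV(xgbc_model, param_distributions=param_grid_xgb[0],
                                   n_iter=10, scoring=score, cv=kf, random_state=2018)
random_search.fit(X_train_fit, y_train)
print(random_search.best_score_, random_search.best_params_)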

References

Cross-validation:
https://blog.csdn.net/sinat_32547403/article/details/73008127
https://blog.csdn.net/kancy110/article/details/74910185/
http://www.itdaan.com/keywords/sklearn中的超参数调节.html

Grid search:
https://blog.csdn.net/QFire/article/details/77601901
https://blog.csdn.net/qq_30490125/article/details/80387414
https://blog.csdn.net/owenfy/article/details/79631144
