[One-Week Algorithm Practice] 2. Model Building: Ensemble Models

Model Building: Ensemble Models

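Build four models (RF, GBDT, XGBoost and LightGBM) and score each one with accuracy and AUC. The previous task used three simple models (LR, SVM and DecisionTree) to predict and evaluate the samples; see https://blog.csdn.net/wxq_1993/article/details/85703936.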

# 1. Import the required modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
import time
import warnings
warnings.filterwarnings('ignore')
# 2. Split the data into X and y and take a quick look at it
data_original=pd.read_csv("data_all.csv")
data_original.head(5)
data_original.describe() 
#data_original.info()
low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility ... consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day reg_preference_for_trad latest_query_time_month latest_query_time_weekday loans_latest_time_month loans_latest_time_weekday
count 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 ... 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.000000 4754.00000 4754.000000 4754.000000
mean 0.021801 0.901332 1940.197728 14.152318 0.804493 0.365356 17.503155 29.004628 21.748422 2.678797 ... 5.088347 16418.973496 7507.426378 24.041649 51.984013 0.372949 4.273875 3.42196 4.542701 3.025873
std 0.041519 0.144837 3923.971494 693.961441 0.196920 0.170194 4.474686 22.711659 16.472031 0.890198 ... 3.344794 13885.107357 5830.674623 36.500344 53.249364 0.687382 1.333778 1.93213 2.987731 1.895870
min 0.000000 0.000000 0.000000 0.000000 0.120000 0.033000 2.000000 0.000000 4.000000 1.000000 ... 0.000000 0.000000 0.000000 -2.000000 -2.000000 0.000000 1.000000 0.00000 1.000000 0.000000
25% 0.010000 0.880000 0.000000 0.620000 0.670000 0.233000 15.000000 16.000000 12.000000 2.000000 ... 3.000000 7800.000000 4200.000000 6.000000 7.000000 0.000000 4.000000 2.00000 3.000000 2.000000
50% 0.010000 0.960000 500.000000 0.970000 0.860000 0.350000 17.000000 23.000000 17.000000 3.000000 ... 4.000000 14400.000000 6750.000000 16.000000 29.000000 0.000000 4.000000 4.00000 4.000000 3.000000
75% 0.020000 0.990000 2000.000000 1.600000 1.000000 0.479500 20.000000 32.000000 26.750000 3.000000 ... 7.000000 20400.000000 9696.250000 23.000000 86.000000 1.000000 5.000000 5.00000 5.000000 5.000000
max 1.000000 1.000000 68000.000000 47596.740000 1.000000 0.941000 42.000000 285.000000 234.000000 5.000000 ... 20.000000 266400.000000 82800.000000 360.000000 323.000000 4.000000 12.000000 6.00000 12.000000 6.000000

8 rows × 85 columns

y=data_original['status'].copy()
X=data_original.drop(['status'],axis=1).copy()
print("the X shape is:", X.shape)
print("the y shape is:" ,y.shape)
print("the nums of label 1 in y are",len(y[y==1]))
print("the nums of label 0 in y are",len(y[y==0]))
df_ret=pd.DataFrame(columns=('Model','Accuracy','AUC','Time'))
row=0
the X shape is: (4754, 84)
the y shape is: (4754,)
the nums of label 1 in y are 1193
the nums of label 0 in y are 3561

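There are 4754 samples with 84 features each. Of the labels, 1193 are 1 and 3561 are 0, so the positive and negative classes are clearly imbalanced (roughly 1:3), which later processing should take into account.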
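One simple way to take the imbalance into account is to stratify the train/test split so that both parts keep the same share of positives. The sketch below is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins (random features with the same class counts as above), not the contents of data_all.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the post (1193 ones, 3561 zeros).
# The feature values are random noise; only the label ratio matters here.
rng = np.random.RandomState(2018)
X_demo = rng.normal(size=(4754, 5))
y_demo = np.array([1] * 1193 + [0] * 3561)

# stratify=y_demo asks train_test_split to preserve the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate overall: %.4f" % overall_rate)
print("positive rate in train: %.4f" % train_rate)
print("positive rate in test:  %.4f" % test_rate)
```

With stratification the three rates should agree to within rounding, whereas a plain random split can drift by a fraction of a percent.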

# 3. Split the dataset 70/30 into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
print('the proportion of label 1 in y_test: %.2f%%'%(len(y_test[y_test==1])/len(y_test)*100))
the proportion of label 1 in y_test: 25.16%
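# 4. Define an evaluation function
def evaluate(y_pre,y):
    # y_pre holds hard 0/1 predictions, so the AUC below comes from a single threshold
    acc=accuracy_score(y,y_pre)
    auc=roc_auc_score(y,y_pre)
    return acc,auc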

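Because the first assignment called accuracy_score() and f1_score() over and over, this second assignment wraps the scoring in a single evaluation function that is easier to call.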
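Note that evaluate() receives the output of predict(), i.e. hard 0/1 labels. For binary hard labels roc_auc_score reduces to the balanced accuracy (the mean of the true-positive and true-negative rates), while AUC is more commonly computed from predict_proba() scores. The sketch below illustrates the difference on synthetic data; the dataset and the names `X_demo`, `y_demo` and `clf` are assumptions for illustration only, so the numbers are not comparable to the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 25% positives); not the post's dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25], random_state=2018
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=2018)
clf.fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)             # 0/1 predictions, as passed to evaluate()
pos_scores = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from hard labels:   %.4f" % auc_from_labels)
print("balanced accuracy:      %.4f" % bal_acc)
print("AUC from probabilities: %.4f" % auc_from_scores)
```

The first two printed values coincide, which is why an AUC computed from hard labels says little about how well the model ranks samples.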

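One problem came up: when I copied the default parameters of the four classifiers (RF, GBDT, XGBoost and LightGBM) straight from the official documentation, the code raised an error complaining about Chinese characters or spaces, so I fell back to typing in just the few parameters shown below.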
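Errors of this kind are often caused by full-width punctuation or non-breaking spaces picked up when copying from a web page, although that is only a guess here. A way to avoid the copy step altogether is to ask the estimator for its own defaults with get_params(), which every scikit-learn estimator provides. A minimal sketch using the two scikit-learn classifiers (the printed values depend on the installed scikit-learn version):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# get_params() returns a dict of constructor parameters and their current values;
# on a freshly constructed estimator these are the defaults.
rf_defaults = RandomForestClassifier().get_params()
gbdt_defaults = GradientBoostingClassifier().get_params()

for name, defaults in (("RF", rf_defaults), ("GBDT", gbdt_defaults)):
    print(name)
    for key in sorted(defaults):
        print("   ", key, "=", repr(defaults[key]))
```

XGBClassifier and LGBMClassifier follow the scikit-learn estimator interface, so the same call should work for them as well.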

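# 5. Build the models for prediction
# Use RF, GBDT, XGBoost and LightGBM; since I am not yet familiar with these models, every parameter is left at its default value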

rf_model=RandomForestClassifier(n_estimators=100,max_depth=None,criterion='gini')
gbdt_model=GradientBoostingClassifier(n_estimators=100,max_depth=3,learning_rate=0.1)
xgb_model=XGBClassifier(n_estimators=100,learning_rate=0.1,max_depth=3)
lgbm_model=LGBMClassifier(n_estimators=100,learning_rate=0.1,max_depth=-1)
# 6. Train the models
models=[('RF',rf_model),('gbdt',gbdt_model),('xgb',xgb_model),('lgbm',lgbm_model)]
for name,model in models:
    print(name,'start training.....')
    startTime=time.perf_counter()  # time.clock() was removed in Python 3.8
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    endTime=time.perf_counter()
    print(name,'using time is %.4f'%(endTime-startTime))
    acc,auc=evaluate(y_pred,y_test)
    print(name,'accuracy_score:',round(acc,4),'auc_score: ',round(auc,4))
    df_ret.loc[row]=[name,acc,auc,(endTime-startTime)]
    row+=1
    print('\n')
print(df_ret)
RF start training.....
RF using time is 1.3224
RF accuracy_score: 0.7849 auc_score:  0.6076


gbdt start training.....
gbdt using time is 1.3351
gbdt accuracy_score: 0.78 auc_score:  0.6376


xgb start training.....
xgb using time is 0.7749
xgb accuracy_score: 0.7856 auc_score:  0.6432


lgbm start training.....
lgbm using time is 0.7061
lgbm accuracy_score: 0.7701 auc_score:  0.631


  Model  Accuracy       AUC      Time
0    RF  0.784863  0.607558  1.322362
1  gbdt  0.779958  0.637566  1.335071
2   xgb  0.785564  0.643161  0.774934
3  lgbm  0.770147  0.631012  0.706147

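According to the results, the four ensemble models clearly outperform the three models used in the first task. **XGBoost performs best (accuracy 0.7856, AUC 0.6432) and LightGBM is the fastest (about 0.71 s).** Because copying the default parameters raised an error, only three parameters were set for each model during training; this will be improved in later rounds. In addition, XGBoost and GBDT are asked about frequently in interviews and deserve particular attention.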
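One possible direction for that later improvement is a small cross-validated grid search over the three parameters set above. The sketch below is an assumption about how this could look, not what the post actually ran: it uses synthetic data, and the grid values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with X_train / y_train in the real workflow
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=2018
)

# Hypothetical grid over the same three parameters the post sets by hand
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# scoring="roc_auc" ranks candidates by AUC computed from predicted scores
search = GridSearchCV(
    GradientBoostingClassifier(random_state=2018),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV AUC: %.4f" % search.best_score_)
```

The same pattern should apply to the other three classifiers by swapping the estimator and adjusting the grid.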



Reposted from blog.csdn.net/wxq_1993/article/details/85853808