Model Building: Ensemble Models

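Build four models (RF, GBDT, XGBoost and LightGBM) and score each one with accuracy and AUC. The previous task used three simple models (LR, SVM and DecisionTree) to predict and evaluate the samples; see https://blog.csdn.net/wxq_1993/article/details/85703936.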

# 1. Import the required modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
import time
import warnings
warnings.filterwarnings('ignore')

# 2. Split into X and y and take a quick look at the data
data_original=pd.read_csv("data_all.csv")
data_original.head(5)
data_original.describe() 
#data_original.info()

	low_volume_percent	middle_volume_percent	take_amount_in_later_12_month_highest	trans_amount_increase_rate_lately	trans_activity_month	trans_activity_day	transd_mcc	trans_days_interval_filter	trans_days_interval	regional_mobility	...	consfin_product_count	consfin_max_limit	consfin_avg_limit	latest_query_day	loans_latest_day	reg_preference_for_trad	latest_query_time_month	latest_query_time_weekday	loans_latest_time_month	loans_latest_time_weekday
count	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	...	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.000000	4754.00000	4754.000000	4754.000000
mean	0.021801	0.901332	1940.197728	14.152318	0.804493	0.365356	17.503155	29.004628	21.748422	2.678797	...	5.088347	16418.973496	7507.426378	24.041649	51.984013	0.372949	4.273875	3.42196	4.542701	3.025873
std	0.041519	0.144837	3923.971494	693.961441	0.196920	0.170194	4.474686	22.711659	16.472031	0.890198	...	3.344794	13885.107357	5830.674623	36.500344	53.249364	0.687382	1.333778	1.93213	2.987731	1.895870
min	0.000000	0.000000	0.000000	0.000000	0.120000	0.033000	2.000000	0.000000	4.000000	1.000000	...	0.000000	0.000000	0.000000	-2.000000	-2.000000	0.000000	1.000000	0.00000	1.000000	0.000000
25%	0.010000	0.880000	0.000000	0.620000	0.670000	0.233000	15.000000	16.000000	12.000000	2.000000	...	3.000000	7800.000000	4200.000000	6.000000	7.000000	0.000000	4.000000	2.00000	3.000000	2.000000
50%	0.010000	0.960000	500.000000	0.970000	0.860000	0.350000	17.000000	23.000000	17.000000	3.000000	...	4.000000	14400.000000	6750.000000	16.000000	29.000000	0.000000	4.000000	4.00000	4.000000	3.000000
75%	0.020000	0.990000	2000.000000	1.600000	1.000000	0.479500	20.000000	32.000000	26.750000	3.000000	...	7.000000	20400.000000	9696.250000	23.000000	86.000000	1.000000	5.000000	5.00000	5.000000	5.000000
max	1.000000	1.000000	68000.000000	47596.740000	1.000000	0.941000	42.000000	285.000000	234.000000	5.000000	...	20.000000	266400.000000	82800.000000	360.000000	323.000000	4.000000	12.000000	6.00000	12.000000	6.000000

8 rows × 85 columns

y=data_original['status'].copy()
X=data_original.drop(['status'],axis=1).copy()
print("the X shape is:", X.shape)
print("the y shape is:", y.shape)
print("the nums of label 1 in y are",len(y[y==1]))
print("the nums of label 0 in y are",len(y[y==0]))
df_ret=pd.DataFrame(columns=('Model','Accuracy','AUC','Time'))
row=0

the X shape is: (4754, 84)
the y shape is: (4754,)
the nums of label 1 in y are 1193
the nums of label 0 in y are 3561

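There are 4754 samples in total, each with 84 features. Of the labels, 1193 are 1 and 3561 are 0, so the positive and negative classes are clearly imbalanced (roughly 1:3), which later steps should take into account.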
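One simple way to account for the imbalance at split time is to stratify on the label, so that the training and test sets keep the same share of label 1. The sketch below is not part of the original run: it uses synthetic features (`X_demo`, `y_demo` are made-up names) with the same label counts as above, purely to show the effect of the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption for illustration only):
# same label counts as reported above, random features.
rng = np.random.RandomState(2018)
y_demo = np.array([1] * 1193 + [0] * 3561)
X_demo = rng.normal(size=(len(y_demo), 5))

# stratify=y_demo keeps the share of label 1 (about 25%) equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print("label 1 share in train: %.4f" % train_pos_rate)
print("label 1 share in test:  %.4f" % test_pos_rate)
```

Without `stratify`, the two shares can drift apart by chance, which matters more as the minority class gets smaller.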

# 3. Split the dataset 70/30 into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
print('the proportion of label 1 in y_test: %.2f%%'%(len(y_test[y_test==1])/len(y_test)*100))

the proportion of label 1 in y_test: 25.16%

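# 4. Define an evaluation function
# Note: y_pre holds the hard 0/1 labels returned by predict(),
# so the AUC here is computed from labels rather than from predicted probabilities.
def evaluate(y_pre,y):
    acc=accuracy_score(y,y_pre)
    auc=roc_auc_score(y,y_pre)
    return acc,auc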

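Because the first assignment called accuracy_score() and f1_score() over and over, this second assignment wraps the metric calls in a single evaluation function that is easier to reuse.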
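`roc_auc_score` is normally given a continuous score (for example the predicted probability of class 1) rather than hard 0/1 labels, because AUC measures how well the model ranks samples. The following is a minimal sketch of such a variant, not the function used for the results in this post. The name `evaluate_with_proba` and the synthetic data from `make_classification` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_with_proba(model, X_eval, y_eval):
    """Accuracy from hard labels, AUC from the predicted probability of class 1."""
    y_label = model.predict(X_eval)
    y_score = model.predict_proba(X_eval)[:, 1]
    return accuracy_score(y_eval, y_label), roc_auc_score(y_eval, y_score)


# Synthetic, imbalanced data (roughly 75% / 25%), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.75, 0.25], random_state=2018
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018
)
demo_model = RandomForestClassifier(n_estimators=100, random_state=2018)
demo_model.fit(X_tr, y_tr)

acc_demo, auc_proba = evaluate_with_proba(demo_model, X_te, y_te)
# Same model, but AUC computed from hard labels for comparison.
auc_labels = roc_auc_score(y_te, demo_model.predict(X_te))
print("accuracy:", round(acc_demo, 4))
print("AUC from probabilities:", round(auc_proba, 4))
print("AUC from hard labels:  ", round(auc_labels, 4))
```

The two AUC values usually differ, since hard labels collapse the ROC curve to a single operating point.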

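Here a problem came up: I copied the default parameters of the four classifiers (RF, GBDT, XGBoost, LightGBM) straight from the official documentation, and running the code raised an error complaining about Chinese characters or spaces. So I fell back to the simple form below.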
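Errors like this often come from full-width or otherwise non-ASCII characters picked up when copying from a web page, though that is only a guess since the exact message is not shown. One way to avoid copying altogether is to ask the estimator for its defaults with `get_params()`. The sketch below does this for the two scikit-learn models; `XGBClassifier` and `LGBMClassifier` follow the scikit-learn estimator interface, so the same call should work for them as well.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Print the default parameters of each estimator instead of copying them from a web page.
for estimator_cls in (RandomForestClassifier, GradientBoostingClassifier):
    defaults = estimator_cls().get_params()
    print(estimator_cls.__name__)
    for key in sorted(defaults):
        print("   ", key, "=", repr(defaults[key]))

# The returned dict can also be inspected or reused programmatically.
gbdt_defaults = GradientBoostingClassifier().get_params()
rf_defaults = RandomForestClassifier().get_params()
print(gbdt_defaults["learning_rate"], gbdt_defaults["max_depth"])
```

The exact defaults depend on the installed library version, so the printed values are the ones that actually apply to the current environment.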

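# 5. Build the models used for prediction
# Use RF, GBDT, XGBoost and LightGBM; since I am not yet familiar with these models, the parameters are left at their default values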

rf_model=RandomForestClassifier(n_estimators=100,max_depth=None,criterion='gini')
gbdt_model=GradientBoostingClassifier(n_estimators=100,max_depth=3,learning_rate=0.1)
xgb_model=XGBClassifier(n_estimators=100,learning_rate=0.1,max_depth=3)
lgbm_model=LGBMClassifier(n_estimators=100,learning_rate=0.1,max_depth=-1)

# 6. Train the models
models=[('RF',rf_model),('gbdt',gbdt_model),('xgb',xgb_model),('lgbm',lgbm_model)]
for name,model in models:
    print(name,'start training.....')
    # time.clock() was removed in Python 3.8; time.perf_counter() is the replacement
    startTime=time.perf_counter()
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    endTime=time.perf_counter()
    print(name,'using time is %.4f'%(endTime-startTime))
    acc,auc=evaluate(y_pred,y_test)
    print(name,'accuracy_score:',round(acc,4),'auc_score: ',round(auc,4))
    df_ret.loc[row]=[name,acc,auc,(endTime-startTime)]
    row+=1
    print('\n')
print(df_ret)

RF start training.....
RF using time is 1.3224
RF accuracy_score: 0.7849 auc_score:  0.6076


gbdt start training.....
gbdt using time is 1.3351
gbdt accuracy_score: 0.78 auc_score:  0.6376


xgb start training.....
xgb using time is 0.7749
xgb accuracy_score: 0.7856 auc_score:  0.6432


lgbm start training.....
lgbm using time is 0.7061
lgbm accuracy_score: 0.7701 auc_score:  0.631


  Model  Accuracy       AUC      Time
0    RF  0.784863  0.607558  1.322362
1  gbdt  0.779958  0.637566  1.335071
2   xgb  0.785564  0.643161  0.774934
3  lgbm  0.770147  0.631012  0.706147

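According to the results, these four ensemble models perform clearly better than the three models used in the first task. **XGBoost performs best (accuracy 0.7856, AUC 0.6432) and LightGBM is the fastest (0.7061 s).** Because copying the default parameters raised an error, only three parameters were set explicitly for each model; this will be improved in later training. In addition, XGBoost and GBDT come up frequently in interviews and deserve particular attention.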
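To put the accuracy figures in context, it helps to compare them with a trivial baseline. Since 25.16% of the test labels are 1, a classifier that always predicts 0 would already reach about 74.84% accuracy on this split. The sketch below reproduces that number with `DummyClassifier`. The label vector is reconstructed from the reported figures (an assumption: 1427 test rows from the 30% split of 4754, of which 359 are label 1), not read from the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Reconstructed test labels (assumption): 359 of 1427 rows are label 1, i.e. 25.16%.
y_test_demo = np.array([1] * 359 + [0] * 1068)
X_placeholder = np.zeros((len(y_test_demo), 1))  # DummyClassifier ignores the features

# Fitting here only lets the dummy model learn which class is the majority.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test_demo)
baseline_pred = baseline.predict(X_placeholder)

baseline_acc = accuracy_score(y_test_demo, baseline_pred)
baseline_auc = roc_auc_score(y_test_demo, baseline_pred)
print("majority-class accuracy: %.4f" % baseline_acc)
print("majority-class AUC:      %.4f" % baseline_auc)
```

A constant prediction carries no ranking information, so its AUC is 0.5; accuracy and AUC are best read against these two reference values.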

