Answers to an Ant Financial Written Test

(1) Describe a data-analysis project you have completed or participated in

  1. How large was the dataset, and how many variables/fields did it contain? What is the largest dataset you have ever worked with? How was the sampling done? If you did the sampling yourself, what language/algorithm/software did you use?
  2. What factors did you need to consider during data processing? Was data cleaning required? Was there sampling bias in the data?
  3. What analysis/modeling methods did you use? Do they follow industry standards? Are they optimal for your data?
  4. How did you implement these methods?

Answer:

  1. The largest dataset I have worked with was about 2 TB of one quarter's user ad-click data, containing the following fields

A. Basic data preparation stage

User behavior data: impressions, exposures, page views, favorites, searches, likes, comments, clicks, shares, add-to-cart, purchases

User attribute data: age, gender, region, education, occupation, income, user-profile data (interests, has children, owns a home, working hours, travels often for work, petit bourgeois, wage earner......)

Context data: the scenario in which the app was opened, time of day, whether the date is a holiday, the user's current state, the content context of the page being browsed

Ad creative data: ad-group profile data (light luxury, gaming, artsy, big spender, petit bourgeois......), creative text, creative type, creative poster quality score, creative image embedding vector, creative historical exposure rate, creative historical impression rate, creative click-through rate, creative conversion rate, creative average payment......

Platform data: platform daily GMV, keywords from external ADX requests to the platform......

Data sampling: statistics on the positive/negative label distribution, per-feature sampling, stratified sampling by feature

The purpose of sampling is to get an overall feel for the data. For a classification problem you can compute statistics over all classes to see the class distribution and whether the samples are balanced;

Sampling per feature lets you check the distribution of missing and abnormal values, each feature's value range, how many categories each discrete feature has, and how those categories are distributed;

You can draw scatter plots of feature distributions against the class labels to inspect feature-label correlation, and also run simple correlation analysis between pairs of features;

Commonly used tools include Hive, Pandas, and Spark SQL.
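
For illustration, a minimal Pandas sketch of the label-stratified sampling described above; the input file and the 'label' column name are hypothetical:

import pandas as pd

df = pd.read_csv("clicks.csv")  # hypothetical ad-click log with a binary 'label' column
print(df["label"].value_counts(normalize=True))  # class distribution: are the samples balanced?
# stratified 1% sample: sampling inside each label group preserves the class ratio
sample = df.groupby("label", group_keys=False).apply(lambda g: g.sample(frac=0.01, random_state=0))
print(sample.isna().mean())  # missing-value ratio per column
print(sample.describe())  # value ranges of the numeric features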

B. Data processing

Data processing mainly involves data cleaning and assessing whether bias in the data will hurt the final model's prediction accuracy.

Data cleaning

Missing-value handling

Normalization

Outlier handling and value clipping

Nonlinear transformations
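
A minimal NumPy/Pandas sketch of these cleaning steps; the columns and thresholds are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, None, 41, 230], "income": [3000.0, 8000.0, None, 1e7]})  # hypothetical data
df["age"] = df["age"].fillna(df["age"].median())  # missing values: median imputation
df["age"] = df["age"].clip(upper=df["age"].quantile(0.99))  # outliers: clip at the 99th percentile
df["income"] = np.log1p(df["income"].fillna(0))  # nonlinear transform to compress the heavy tail
df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())  # min-max normalization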

Feature processing:

Discrete features

One-hot encoding

Hash encoding

Count encoding

Crosses between discrete features

Crosses between discrete and continuous features

Continuous features

Use directly

Discretization

Feature crosses

Spatio-temporal features

Convert to numeric values

Discretize time

Administrative-district representation

Latitude/longitude representation

Distance representation

Text features: convert text into high-dimensional vector representations, e.g. word2vec, BERT

Images and rich text: embed into vector representations
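
A short sketch of the discrete-feature encodings listed above, on hypothetical toy data:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"city": ["bj", "sh", "bj", "gz"], "hour": [9, 21, 9, 14]})  # hypothetical
onehot = pd.get_dummies(df["city"], prefix="city")  # one-hot encoding
df["city_count"] = df["city"].map(df["city"].value_counts())  # count encoding
hashed = FeatureHasher(n_features=8, input_type="string").transform([[c] for c in df["city"]])  # hash encoding for high-cardinality fields
df["city_x_hour"] = df["city"] + "_" + df["hour"].astype(str)  # cross of a discrete and a (discretized) continuous feature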

Feature selection:

Statistics-based methods:

Select high-variance features

Pearson correlation coefficient

Coverage

Hypothesis testing

Mutual information

Model-based methods:

Selection by model parameters

Subset selection
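
A minimal scikit-learn sketch of variance-, mutual-information- and model-parameter-based selection, on synthetic stand-in data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic stand-in
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)  # keep high-variance features
mi = mutual_info_classif(X, y, random_state=0)  # mutual information between each feature and the label
print(np.argsort(mi)[::-1][:5])  # indices of the 5 most informative features
lr = LogisticRegression(max_iter=1000).fit(X, y)
X_sel = SelectFromModel(lr, prefit=True).transform(X)  # keep features with large model coefficients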

Sample bias:

Sample bias mainly comes in two forms: a gap between the sampled data and the true sample distribution, and a large imbalance between classes.

Gap between the sample and the true distribution:

This mainly happens when a system has just launched or when the service undergoes an abrupt shift (flash sales, Double 12). For a newly launched system, sampling bias can be corrected with experience data from similar sites; for abrupt shifts, correct with historical data from comparable events or with simulated data.

Large class imbalance:

Use data augmentation to moderately increase the minority class (typically by 3-6x),

or moderately downsample the majority class.
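
A minimal imbalanced-learn sketch of both directions, on synthetic data (the notebook code below applies RandomOverSampler in the same way):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)  # 95/5 class imbalance
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)  # duplicate minority-class samples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop majority-class samples
print(Counter(y), Counter(y_over), Counter(y_under))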

C. With features selected and samples constructed, we move to the modeling stage

Choose suitable models: LR, GBDT+LR, XGBoost, wide&deep

Model hyperparameter tuning: grid search, bad-case-driven optimization, iterative tuning

Model evaluation: AUC, ROC, NDCG, learning curves to check for overfitting, online A/B testing

Model ensembling: tree model + LR stacking (RF+LR, GBDT+LR, XGBoost+LR; see the code below)

D. Model updating

Batch training: retrain on each day's or hour's data and update the online parameters fully or incrementally (an incremental update is typically old parameters + new parameters * a small weight)

Online training: FTRL
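
A tiny sketch of the incremental-update rule in parentheses above; the blending weight alpha is a hypothetical choice:

import numpy as np

w_online = np.array([0.8, -1.2, 0.3])  # parameters currently serving online
w_batch = np.array([0.9, -1.0, 0.2])  # parameters from the latest batch retrain
alpha = 0.1  # small weight (hypothetical value)
w_online = w_online + alpha * w_batch  # old parameters + new parameters * small weight
# a common variant is the exponential moving average: (1 - alpha) * w_online + alpha * w_batch
print(w_online)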

(2) What methods do you commonly use to clean/analyze data?

Answer: Data cleaning

Missing-value handling

Normalization

Outlier handling and value clipping

Nonlinear transformations

For text data, you can also remove function words, punctuation, and very high- and low-frequency words.
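
A minimal sketch of that text cleaning; the stopword list and frequency band are hypothetical:

import re
from collections import Counter

docs = ["the cat sat on the mat!", "a cat and a dog..."]  # hypothetical corpus
stopwords = {"the", "a", "and", "on"}  # hypothetical function-word list
tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]  # strip punctuation and tokenize
freq = Counter(t for doc in tokens for t in doc)
cleaned = [[t for t in doc if t not in stopwords and 1 < freq[t] < 100] for doc in tokens]  # keep mid-frequency content words
print(cleaned)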

(3) Data analysis and modeling testing

Answer: By modeling stage, data analysis can be split into three parts:
Raw-data analysis: analyze the overall distribution of the data and mine the relationship between the data and the labels

Feature-correlation analysis: mine correlations between features and labels, and remove mutually redundant features

Model-effect analysis: guides hyperparameter tuning and improvements to model precision and recall

Modeling testing:

Sufficient sample training: N-fold cross-validation

Preventing overfitting: F1 score, learning curves

Consistency between modeling data and the real scenario: online testing
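
A minimal N-fold cross-validation sketch, on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())  # stable scores across folds suggest the samples are trained sufficiently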

Read the following text and answer the question

 Field Descriptions:

isbuyer - Past purchaser of product
buy_freq - How many times purchased in the past
visit_freq - How many times visited website in the past
buy_interval - Average time between purchases
sv_interval - Average time between website visits
expected_time_buy - ?
expected_time_visit - ?
last_buy - Days since last purchase.
last_visit - Days since last website visit.
multiple_buy - ?
multiple_visit - ?
uniq_url - Number of unique urls we observed web browser on.
num_checkins - Number of times we observed web browser.
y_buy - Outcome variable of interest, Did they purchase in period of interest.

Question: 

Each observation in the provided training/test dataset is a web browser (or cookie) in our observed Universe. The goal is to model the behavior of a future purchase and classify cookies into those that will purchase in the future and those that will not. y_buy is the outcome variable that represents if a cookie made a purchase in the period of interest. All of the rest of the columns in the data set were recorded prior to this purchase and may be used to predict purchase. Please use ‘ads_train.csv’ as training data to create at least two different classes of models (e.g. logistic regression, random forest, etc.) to classify these cookies into future buyers or not. Explain your choice of model, how you did model selection, how you validated the quality of the model, and which variables are most informative of purchase. Also, comment on any general trends or anomalies in the data you can identify as well as propose a meaning for those fields not defined. The deliverable is a document with text and figures illustrating your thought process, how you began to explore the data, and a comparison of the models that you created. When evaluating your models, consider metrics such as AUC of Precision-Recall Curve, precision, recall. This should take about 6 hours and can be done using any programming language or statistical package (R or Python are preferred). Finally, perform prediction on test dataset ‘ads_test.csv’ using your chosen model(s) and report predicted probabilities of future purchase and predicted labels of future purchase. 

Please also include code with your document (Python/R is recommended)

import pandas as pd
import numpy as np
%matplotlib inline
# Load the data
train_csv = pd.read_csv("./ads_train.csv",index_col=0)
test_csv = pd.read_csv("./ads_test.csv",index_col=0)
# Overview of the data
test_csv.head()
train_csv.head()
train_csv.info()
train_csv.describe()
'''Data processing & data analysis
One-hot encode the categorical fields (isbuyer, multiple_buy, multiple_visit, y_buy)
ID-like field: uniq_urls
Numeric fields: buy_freq, visit_freq, buy_interval, sv_interval, expected_time_buy, expected_time_visit, last_buy, last_visit, num_checkins
Outlier handling'''
target = train_csv.groupby(['y_buy']).size().sort_values(ascending=False)
target
# Positive and negative samples are imbalanced; training will need rebalancing
target.plot.bar()
train_csv.dtypes
# Feature construction
# Class-distribution analysis, continuous-field analysis, feature processing
train_csv = train_csv.drop('uniq_urls', axis=1)
test_csv = test_csv.drop('uniq_urls', axis=1)
train_csv[['isbuyer','multiple_buy','multiple_visit']] = train_csv[['isbuyer','multiple_buy','multiple_visit']].astype(object)
X = train_csv.drop('y_buy', axis=1)
y = train_csv['y_buy']
# The classes are imbalanced; oversample so the training samples are balanced
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resample, y_resample = ros.fit_resample(X, y)
print(Counter(y_resample))
'''Modeling & hyperparameter tuning
RF, GBDT and KNeighborsClassifier are chosen as the classification models.
The data mixes continuous values, count statistics, and discrete categorical/ID
fields whose ranges differ widely; tree models and nearest-neighbour models
perform well on this kind of mixed, wide-range data.'''
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resample,y_resample, test_size=0.2)
Counter(y)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.compose import ColumnTransformer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features =train_csv.select_dtypes(include=['int','float']).drop(['y_buy'], axis=1).columns
categorical_features = train_csv.select_dtypes(include=['object']).columns 

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
numeric_features
categorical_features 
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])
# Automated hyperparameter search
param_grid = { 
    'classifier__n_estimators': [200, 500],
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth' : [4,5,6,7,8],
    'classifier__criterion' :['gini', 'entropy']}

CV = GridSearchCV(rf, param_grid, n_jobs= 1)
                  
CV.fit(X_train, y_train)  
print(CV.best_params_)    
print(CV.best_score_)
# RandomForest results & visualization
from sklearn.metrics import precision_score,recall_score,f1_score,roc_auc_score,roc_curve
import matplotlib.pyplot as plt

best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(criterion='entropy', max_depth= 8, max_features='sqrt', n_estimators= 500))])
best_model.fit(X_train, y_train)
best_H = best_model.predict(X_test)
best_yH = best_model.predict_proba(X_test)
# Print precision, recall, F1 and AUC on the held-out split, then plot the ROC curve
print('precision:', precision_score(y_test,best_H))
print('recall:', recall_score(y_test,best_H))
print('F1:', f1_score(y_test,best_H))
print('AUC:', roc_auc_score(y_test,best_yH[:,-1]))
fpr,tpr,theta = roc_curve(y_test,best_yH[:,-1])
print('fpr:\n', fpr)
print('tpr:\n', tpr)
print('theta:\n', theta)
# Plot the ROC curve
plt.plot(fpr,tpr)
plt.show()
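# The question also asks to consider the AUC of the precision-recall curve;
# a short sketch reusing y_test and best_yH from the evaluation above
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, _ = precision_recall_curve(y_test, best_yH[:,-1])
print('PR-AUC (average precision):', average_precision_score(y_test, best_yH[:,-1]))
plt.plot(recall, precision)  # PR curves are more informative than ROC on imbalanced data
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()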
# Tuning is iterative: first grid-search a few parameters, fix the best values found,
# then grid-search the remaining parameters; finally fine-tune the full combination with one more grid search
params={'classifier__learning_rate':[0.01,0.05,0.25,0.5],  'classifier__max_depth':[1,2,3,5,7,8], 'classifier__min_samples_leaf':[1,2,3,4,5],  'classifier__n_estimators':[50,60,70,80,90,100]}

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', GradientBoostingClassifier())])
grid = GridSearchCV(clf, params)
grid.fit(X_train, y_train)
print(grid.best_params_)    
print(grid.best_score_)
# GradientBoosting results & visualization
from sklearn.metrics import precision_score,recall_score,f1_score,roc_auc_score,roc_curve
import matplotlib.pyplot as plt

best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', GradientBoostingClassifier(learning_rate=0.05, n_estimators=1200,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10))])
best_model.fit(X_train, y_train)
best_H = best_model.predict(X_test)
best_yH = best_model.predict_proba(X_test)
# Print precision, recall, F1 and AUC on the held-out split, then plot the ROC curve
print('precision:', precision_score(y_test,best_H))
print('recall:', recall_score(y_test,best_H))
print('F1:', f1_score(y_test,best_H))
print('AUC:', roc_auc_score(y_test,best_yH[:,-1]))
fpr,tpr,theta = roc_curve(y_test,best_yH[:,-1])
print('fpr:\n', fpr)
print('tpr:\n', tpr)
print('theta:\n', theta)
# Plot the ROC curve
plt.plot(fpr,tpr)
plt.show()
from sklearn.neighbors import KNeighborsClassifier
k_range = range(1, 5)  # search range for k
weight_options = ['uniform', 'distance']  # candidate weighting schemes: 'uniform' weights neighbours equally, 'distance' by inverse distance
# Build the parameter grid: a dict mapping parameter names to lists of candidate values
param_grid = {'classifier__n_neighbors':k_range,'classifier__weights':weight_options}  # keys must match the pipeline's classifier parameter names
print(param_grid)
 
#KNeighborsClassifier: k-nearest-neighbour classifier (default n_neighbors=5)
# Wrap it in the preprocessing pipeline so the 'classifier__' prefixes in param_grid resolve correctly
knn = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', KNeighborsClassifier())])
 
 
# ================================ Grid search =======================================
# GridSearchCV takes much the same arguments as cross_val_score; param_grid defines the search space.
# Setting n_jobs=-1 in GridSearchCV parallelizes the search if the machine supports it.
grid = GridSearchCV(estimator=knn, param_grid=param_grid, cv=10, scoring='accuracy')  # 10-fold CV per parameter pair; accuracy as the metric (more metrics can be added)
grid.fit(X, y)
 
print('grid search - cv results:', grid.cv_results_)  # details of every fit
print('grid search - best score:', grid.best_score_)
print('grid search - best params:', grid.best_params_)  # dict of parameter values at the best score
print('grid search - best estimator:', grid.best_estimator_)  # the classifier at the best score
# KNeighbors results & visualization
#from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score,recall_score,f1_score,roc_auc_score,roc_curve
import matplotlib.pyplot as plt

best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', KNeighborsClassifier(3))])
best_model.fit(X_train, y_train)
best_H = best_model.predict(X_test)
best_yH = best_model.predict_proba(X_test)
# Print precision, recall, F1 and AUC on the held-out split, then plot the ROC curve
print('precision:', precision_score(y_test,best_H))
print('recall:', recall_score(y_test,best_H))
print('F1:', f1_score(y_test,best_H))
print('AUC:', roc_auc_score(y_test,best_yH[:,-1]))
fpr,tpr,theta = roc_curve(y_test,best_yH[:,-1])
print('fpr:\n', fpr)
print('tpr:\n', tpr)
print('theta:\n', theta)
# Plot the ROC curve
plt.plot(fpr,tpr)
plt.show()
# After grid search has found good hyperparameters, plot learning curves to check whether the model overfits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator,title,X,y,ylim=None,cv=None,n_jobs=1,train_sizes=np.linspace(0.1,1.0,5)):
    plt.title(title)  # figure title
    if ylim is not None:  # clamp the y axis if requested
        plt.ylim(*ylim)
    plt.xlabel("Training examples")  # axis labels
    plt.ylabel("Score")
    train_sizes,train_scores,test_scores=learning_curve(estimator,X,y,cv=cv,n_jobs=n_jobs,train_sizes=train_sizes)  # training-set sizes, train scores, CV scores
    train_scores_mean=np.mean(train_scores,axis=1)  # mean train score per training size
    train_scores_std=np.std(train_scores,axis=1)  # std of the train scores
    test_scores_mean=np.mean(test_scores,axis=1)
    test_scores_std=np.std(test_scores,axis=1)
    plt.grid()  # grid background

    # fill_between shades a one-standard-deviation band around each mean score
    plt.fill_between(train_sizes,train_scores_mean-train_scores_std,train_scores_mean+train_scores_std,alpha=0.1,color='r')
    plt.fill_between(train_sizes,test_scores_mean-test_scores_std,test_scores_mean+test_scores_std,alpha=0.1,color='g')
    plt.plot(train_sizes,train_scores_mean,'o-',color='r',label='Training score')
    plt.plot(train_sizes,test_scores_mean,'o-',color='g',label='Cross-validation score')  # CV curve, for comparison against the training curve
    plt.legend(loc='best')

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
# ShuffleSplit: 10 cross-validation iterations, holding out 20% as the test fold each time
titles = ['GBDT_Learning Curves', 'KNN_Learning Curves', 'RF_Learning Curves']
degrees = [1, 2, 3]  # three subplot slots (only the length is used below)
plt.figure(figsize=(18, 4), dpi=200)  # canvas size; dpi is pixels per inch
for i in range(len(degrees)):  # one subplot per model
    plt.subplot(1, 3, i + 1)  # subplot i+1 of 3
    if(i==0):
        best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', GradientBoostingClassifier(learning_rate=0.05, n_estimators=1200,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10))])
        plot_learning_curve(best_model, titles[i], X_train, y_train, ylim=(0.75, 1.01), cv=cv)  # draw the curve
    elif(i==1):
        best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', KNeighborsClassifier(3))])
        plot_learning_curve(best_model, titles[i], X_train, y_train, ylim=(0.75, 1.01), cv=cv)  # draw the curve
    else:
        best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(criterion='entropy', max_depth= 8, max_features='sqrt', n_estimators= 500))])
        plot_learning_curve(best_model, titles[i], X_train, y_train, ylim=(0.75, 1.01), cv=cv)  # draw the curve

plt.show()

#Multi-model comparison and selection
#With the best hyperparameters per model found by grid search, compare the tuned models on the held-out split and pick the best one
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    #SVC(kernel="rbf", C=0.025, probability=True),
    #NuSVC(probability=True),
    #DecisionTreeClassifier(),
    RandomForestClassifier(criterion='entropy', max_depth= 8, max_features='sqrt', n_estimators= 500),
    #AdaBoostClassifier(),
    GradientBoostingClassifier(learning_rate=0.05, n_estimators=1200,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10)
    ]
for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', classifier)])
    pipe.fit(X_train, y_train)   
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test))
    #print("model auc: %.3f" % pipe.accuracy_score(X_test, y_test))

#Model ensembling & visualization
values = {'buy_freq':1.0}
X_resample1 = X_resample.fillna(value = values)
X_resample1[['isbuyer','multiple_buy','multiple_visit']] = X_resample1[['isbuyer','multiple_buy','multiple_visit']].astype(int)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.pipeline import make_pipeline
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
 
np.random.seed(10)
n_estimator = 10
 
#X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X_resample1,y_resample, test_size=0.2)
#Hold out a separate split for the LR stage so the stacking LR is not fit on the same data as the tree model (avoids overfitting)
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train, y_train, test_size=0.2)
 
def RandomForestLR():
	rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
	#rf = Pipeline(steps=[('preprocessor', preprocessor),('classifier', RandomForestClassifier(criterion='entropy', max_depth= 8, max_features='sqrt', n_estimators= 500))])
	rf_enc = OneHotEncoder()
	rf_lr = LogisticRegression()
	rf.fit(X_train, y_train)
	rf_enc.fit(rf.apply(X_train))
	rf_lr.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)
	y_pred_rf_lr = rf_lr.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
	fpr_rf_lr, tpr_rf_lr, _ = roc_curve(y_test, y_pred_rf_lr)
	auc = roc_auc_score(y_test, y_pred_rf_lr)
	print("RF+LR:", auc)
	return fpr_rf_lr, tpr_rf_lr
 
def GdbtLR():
	grd = GradientBoostingClassifier(n_estimators=n_estimator)
	#grd = Pipeline(steps=[('preprocessor', preprocessor),('classifier', GradientBoostingClassifier(learning_rate=0.05, n_estimators=1200,max_depth=7, min_samples_leaf =60, min_samples_split =1200, max_features=9, subsample=0.7, random_state=10))])
	grd_enc = OneHotEncoder()
	grd_lr = LogisticRegression()
	grd.fit(X_train, y_train)
	grd_enc.fit(grd.apply(X_train)[:, :, 0])
	grd_lr.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
	y_pred_grd_lr = grd_lr.predict_proba(grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
	fpr_grd_lr, tpr_grd_lr, _ = roc_curve(y_test, y_pred_grd_lr)
	auc = roc_auc_score(y_test, y_pred_grd_lr) 
	print("GDBT+LR:", auc)
	return fpr_grd_lr, tpr_grd_lr
 
def Xgboost():
	xgboost = xgb.XGBClassifier(nthread=4, learning_rate=0.08,n_estimators=50, max_depth=5, gamma=0, subsample=0.9, colsample_bytree=0.5)
	#xgboost = Pipeline(steps=[('preprocessor', preprocessor),('classifier', xgb.XGBClassifier(nthread=4, learning_rate=0.08,n_estimators=50, max_depth=5, gamma=0, subsample=0.9, colsample_bytree=0.5))])
	xgboost.fit(X_train, y_train)
	y_xgboost_test = xgboost.predict_proba(X_test)[:, 1]
	fpr_xgboost, tpr_xgboost, _ = roc_curve(y_test, y_xgboost_test)
	auc = roc_auc_score(y_test, y_xgboost_test)
	print("Xgboost:", auc)
	return fpr_xgboost, tpr_xgboost
 
def Lr():
	lm = LogisticRegression(n_jobs=4, C=0.1, penalty='l2')
	lm.fit(X_train, y_train)
	y_lr_test = lm.predict_proba(X_test)[:, 1]
	fpr_lr, tpr_lr, _ = roc_curve(y_test, y_lr_test)
	auc = roc_auc_score(y_test, y_lr_test)
	print("LR:", auc)
	return fpr_lr, tpr_lr
 
def XgboostLr():
	xgboost = xgb.XGBClassifier(nthread=4, learning_rate=0.08, n_estimators=50, max_depth=5, gamma=0, subsample=0.9, colsample_bytree=0.5)
	#xgboost = Pipeline(steps=[('preprocessor', preprocessor),('classifier', xgb.XGBClassifier(nthread=4, learning_rate=0.08,n_estimators=50, max_depth=5, gamma=0, subsample=0.9, colsample_bytree=0.5))])
	xgb_enc = OneHotEncoder()
	xgb_lr = LogisticRegression(n_jobs=4, C=0.1, penalty='l2')
	xgboost.fit(X_train, y_train)
 
	xgb_enc.fit(xgboost.apply(X_train)[:, :])
	xgb_lr.fit(xgb_enc.transform(xgboost.apply(X_train_lr)[:, :]), y_train_lr)
	y_xgb_lr_test = xgb_lr.predict_proba(xgb_enc.transform(xgboost.apply(X_test)[:,:]))[:, 1]
	fpr_xgb_lr, tpr_xgb_lr, _ = roc_curve(y_test, y_xgb_lr_test)
	auc = roc_auc_score(y_test, y_xgb_lr_test)
	print("Xgboost + LR:", auc)
	return fpr_xgb_lr, tpr_xgb_lr
 
if __name__ == '__main__':
	fpr_rf_lr, tpr_rf_lr = RandomForestLR()
	fpr_grd_lr, tpr_grd_lr = GdbtLR()
	fpr_xgboost, tpr_xgboost = Xgboost()
	#fpr_lr, tpr_lr = Lr()
	fpr_xgb_lr, tpr_xgb_lr = XgboostLr()
 
	plt.figure(1)
	plt.plot([0, 1], [0, 1], 'k--')
	plt.plot(fpr_rf_lr, tpr_rf_lr, label='RF + LR')
	plt.plot(fpr_grd_lr, tpr_grd_lr, label='GBT + LR')
	plt.plot(fpr_xgboost, tpr_xgboost, label='XGB')
	#plt.plot(fpr_lr, tpr_lr, label='LR')
	plt.plot(fpr_xgb_lr, tpr_xgb_lr, label='XGB + LR')
	plt.xlabel('False positive rate')
	plt.ylabel('True positive rate')
	plt.title('ROC curve')
	plt.legend(loc='best')
	plt.show()
 
	plt.figure(2)
	plt.xlim(0, 0.2)
	plt.ylim(0.8, 1)
	plt.plot([0, 1], [0, 1], 'k--')
	plt.plot(fpr_rf_lr, tpr_rf_lr, label='RF + LR')
	plt.plot(fpr_grd_lr, tpr_grd_lr, label='GBT + LR')
	plt.plot(fpr_xgboost, tpr_xgboost, label='XGB')
	#plt.plot(fpr_lr, tpr_lr, label='LR')
	plt.plot(fpr_xgb_lr, tpr_xgb_lr, label='XGB + LR')
	plt.xlabel('False positive rate')
	plt.ylabel('True positive rate')
	plt.title('ROC curve (zoomed in at top left)')
	plt.legend(loc='best')
	plt.show()

#Predict on the test data & save the predictions
test_csv[['isbuyer','multiple_buy','multiple_visit']] = test_csv[['isbuyer','multiple_buy','multiple_visit']].astype(object)
best_model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', GradientBoostingClassifier(learning_rate=0.05, n_estimators=1200,max_depth=7, min_samples_leaf =60, 
               min_samples_split =1200, max_features=9, subsample=0.7, random_state=10))])
best_model.fit(X_train, y_train)
best_predict = best_model.predict(test_csv)
best_predprob = best_model.predict_proba(test_csv)[:,1]  # score before appending the prediction columns
test_csv['predict'] = best_predict
test_csv['predict_prob'] = best_predprob
test_csv.to_csv('./ads_test_predict.csv',header = True)
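# The heading above also mentions saving the model; a minimal joblib sketch
# (the output filename is a hypothetical choice)
import joblib
joblib.dump(best_model, './gbdt_pipeline.joblib')  # persist the fitted preprocessing+GBDT pipeline
# best_model = joblib.load('./gbdt_pipeline.joblib')  # reload later for scoring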