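Preface
I had long wanted to find a relatively simple competition to get a feel for applied machine learning. I have some background: in graduate school I focused on sentiment analysis in NLP, but I never studied machine learning systematically. So I picked a relatively simple competition on Tianchi for practice. I hope this helps readers who are just getting started. A source-code walkthrough follows below. This post mainly covers feature extraction and will be improved over time.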
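Competition overview
In short: predict the probability that a user will use a coupon.
See the official site for details on the data. The training set has several hundred thousand records, which is fairly large. Here I only describe the output.
Output: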
Field | Description |
---|---|
User_id | User ID |
Coupon_id | Coupon ID |
Date_received | Date the coupon was received |
Probability | Probability that the coupon is used within 15 days, supplied by the participant |
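The User_id, Coupon_id and Date_received columns in the output are provided by the organizers and effectively form the test set. Probability is what your model predicts for the test set, one value per record. You submit this file on the official site to have your results evaluated.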
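To make the output format concrete, here is a minimal sketch of assembling such a file with pandas. The rows and the `scores` list are made-up placeholders, not competition data, and whether the upload should carry a header row is an assumption to check against the official rules.

```python
import pandas as pd

# Made-up test rows; the real ones come from the official test file.
test = pd.DataFrame({
    "User_id": [4129537, 6949378],
    "Coupon_id": [9983, 3429],
    "Date_received": [20160712, 20160706],
})

# Placeholder scores standing in for the model's predicted probabilities.
scores = [0.12, 0.87]

# One probability per test record, appended as the fourth column.
submission = test.assign(Probability=scores)

# Serialize without index; dropping the header here is an assumption.
csv_text = submission.to_csv(index=False, header=False)
print(csv_text)
```

Each line of the resulting text holds the three identifying fields followed by the predicted probability.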
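Approach
First, note that prediction here differs from the usual setup, for two reasons:
These two conditions make it hard to treat this like a conventional task: you have to think carefully about how to extract features and how to set up the prediction.
Here is the approach of the first-place team. The author splits the data by time: roughly three and a half months of the training data are used to extract features, and the following month of the training data supplies the coupon-usage outcomes to predict. The training data is thus cut into two parts for training the model. For the final prediction, three and a half months of training data are again used for feature extraction, this time to predict coupon usage in the test set. The key point is how the outcome is labeled. The author divides it into three classes. Judging from the get_label function in the code below, these are: 1 if the coupon was used within 15 days of being received, 0 if it was never used, and -1 if it was used more than 15 days after being received.
Although the author encodes the label as discrete values, this does not affect the model's predictions. The code below is my own reworking: it follows the logic of the first-place solution as I summarized it after reading the code, with some blocks wrapped into functions to make the code more robust and readable.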
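The labeling rule can be sketched in isolation as follows. This is a simplified restatement of what get_label does later in the post, not the author's code: `label_record` and its `window_days` parameter are hypothetical names, and the string 'null' is assumed to mark a coupon that was never used, as in the data-loading code.

```python
from datetime import datetime

def label_record(date_used, date_received, window_days=15):
    """Return 0 if the coupon was never used, 1 if used within the window, -1 if used later."""
    if date_used == "null":
        return 0
    fmt = "%Y%m%d"
    gap = (datetime.strptime(date_used, fmt) - datetime.strptime(date_received, fmt)).days
    return 1 if gap <= window_days else -1

# A coupon received on 2016-05-01 under three different outcomes.
print(label_record("null", "20160501"))
print(label_record("20160510", "20160501"))
print(label_record("20160601", "20160501"))
```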
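Feature extraction
Given that the organizers provide this much labeled training data, supervised learning is the obvious choice. Below I describe the first-place team's feature-extraction ideas, which impressed me. Once you have looked at the training data, you can see roughly the following groups of relationships.
1. User offline data
2. User online data
3. Merchant coupon-issuing data
4. User-merchant relationship data
5. Information about the coupon itself
6. Other features
I admire how thoroughly these features were worked out. Broadly they are organized along two axes, entity and channel: the entities are user, merchant, coupon and user-merchant pair, and the channels are online and offline. From these relationships you could also derive user-coupon features, which is worth considering. The features are described in detail below: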
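User offline features
- Number of coupons the user received
- Number of coupons the user received but did not use
- Number of coupons the user received and redeemed
- User's redemption rate for received coupons
- User's redemption rate for coupons with a minimum spend of 0~50/50~200/200~500
- Share of coupons with a minimum spend of 0~50/50~200/200~500 among all coupons the user redeemed
- Mean/minimum/maximum discount rate of the coupons the user redeemed
- Number of distinct merchants at which the user redeemed coupons, and its share of all distinct merchants
- Number of distinct coupons the user redeemed, and its share of all distinct coupons
- Average number of coupons the user redeemed per merchant
- Mean/maximum/minimum user-merchant distance over the coupons the user redeemed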
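As an illustration of the first few items in this list, the sketch below derives per-user received, redeemed and unredeemed counts plus the redemption rate from a toy offline log. The data is invented, the output column names are my own, and 'null' is assumed to mark a missing value as in the loading code later in the post.

```python
import pandas as pd

# Toy offline log: one row per record, 'null' marks a missing value.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "coupon_id": ["11", "12", "null", "13", "14"],
    "date": ["20160105", "null", "20160110", "null", "null"],
})

received = log[log.coupon_id != "null"]      # rows where a coupon was received
used = received[received.date != "null"]     # received and later redeemed

user_stats = pd.DataFrame({
    "coupon_received": received.groupby("user_id").size(),
    "coupon_used": used.groupby("user_id").size(),
}).fillna({"coupon_used": 0})                # users with no redemption get 0

user_stats["coupon_unused"] = user_stats.coupon_received - user_stats.coupon_used
user_stats["redeem_rate"] = user_stats.coupon_used / user_stats.coupon_received
print(user_stats)
```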
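User online features
- Number of online actions by the user
- User's online click rate
- User's online purchase rate
- User's online coupon-receiving rate
- Number of times the user did not purchase online
- Number of coupons the user redeemed online
- User's online coupon redemption rate
- Share of the user's offline non-purchases in the total (online plus offline) non-purchases
- Share of the user's offline coupon redemptions in the total (online plus offline) redemptions
- Share of the user's offline coupon-receiving records in the total number of records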
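Merchant features
- Number of times the merchant's coupons were received
- Number of the merchant's coupons received but not redeemed
- Number of the merchant's coupons received and redeemed
- Redemption rate of the merchant's received coupons
- Mean/minimum/maximum discount rate of the merchant's redeemed coupons
- Number of distinct users who redeemed the merchant's coupons, and its share of the distinct users who received them
- Average number of the merchant's coupons redeemed per user
- Number of distinct coupons of the merchant that were redeemed
- Share of the merchant's distinct redeemed coupons among all its distinct received coupons
- Average number of redemptions per coupon type for the merchant
- Average redemption time rate of the merchant's redeemed coupons
- Mean/minimum/maximum user-merchant distance over the merchant's redeemed coupons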
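User-merchant interaction features
- Number of times the user received the merchant's coupons
- Number of the merchant's coupons the user received but did not redeem
- Number of the merchant's coupons the user received and redeemed
- User's redemption rate for the merchant's coupons
- Share of the user's non-redemptions at this merchant in the user's total non-redemptions
- Share of the user's redemptions at this merchant in the user's total redemptions
- Share of the user's non-redemptions at this merchant in the merchant's total non-redemptions
- Share of the user's redemptions at this merchant in the merchant's total redemptions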
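The "share of total" items in this list all follow the same pattern: divide a per-pair count by a per-user or per-merchant total. A minimal sketch with invented numbers (the `used` column and the two output column names are hypothetical):

```python
import pandas as pd

# One row per (user, merchant) pair with the number of coupons redeemed.
pairs = pd.DataFrame({
    "user_id": [1, 1, 2],
    "merchant_id": [100, 200, 100],
    "used": [3, 1, 1],
})

# transform("sum") broadcasts each group's total back onto its rows,
# so the division stays row-aligned without an explicit merge.
pairs["share_of_user"] = pairs["used"] / pairs.groupby("user_id")["used"].transform("sum")
pairs["share_of_merchant"] = pairs["used"] / pairs.groupby("merchant_id")["used"].transform("sum")
print(pairs)
```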
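Coupon features
- Coupon type (0 for a direct discount, 1 for a spend-threshold coupon)
- Coupon discount rate
- Minimum spend of a spend-threshold coupon
- Number of historical occurrences
- Number of historical redemptions
- Historical redemption rate
- Historical redemption time rate
- Day of the week on which the coupon was received
- Day of the month on which the coupon was received
- Number of times the user received this coupon historically
- Number of times the user used this coupon historically
- User's historical redemption rate for this coupon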
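Other features
These features exploit a leakage in the competition data: they are all extracted from the prediction interval.
- Total number of coupons the user received
- Number of a specific coupon the user received
- Number of all coupons the user received after/before this one
- Number of the specific coupon the user received after/before this one
- Time gap to the user's previous/next receipt
- Number of coupons the user received from a specific merchant
- Number of distinct merchants from which the user received coupons
- Number of coupons the user received on the same day
- Number of the specific coupon the user received on the same day
- Number of distinct coupon types the user received
- Number of the merchant's coupons that were received
- Number of a specific coupon of the merchant that was received
- Number of distinct users who received the merchant's coupons
- Number of distinct coupon types issued by the merchant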
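A few of these counts can be sketched with groupby on the receive records of the prediction window alone, which is why they count as leakage. The data and the output column names below are invented. The before/after counts here follow row order after sorting, so receipts on the same day are ordered arbitrarily among themselves, which is a simplification.

```python
import pandas as pd

# Receive records inside the prediction window (toy data).
recv = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "coupon_id": [10, 10, 20, 10],
    "date_received": [20160701, 20160705, 20160705, 20160703],
})
recv = recv.sort_values(["user_id", "date_received"]).reset_index(drop=True)

g = recv.groupby("user_id")
recv["user_total"] = g["coupon_id"].transform("count")    # all coupons the user received
recv["received_before"] = g.cumcount()                    # receipts earlier in the sorted order
recv["received_after"] = recv["user_total"] - recv["received_before"] - 1
recv["same_day"] = recv.groupby(["user_id", "date_received"])["coupon_id"].transform("count")
print(recv)
```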
Feature extraction code in Python
import pandas as pd
import numpy as np
from datetime import date

def get_day_gap_before(s):
    # s looks like 'YYYYMMDD-YYYYMMDD:YYYYMMDD:...': the current receive date,
    # then every date on which the same user received the same coupon
    date_received,dates = s.split('-')
    dates = dates.split(':')
    gaps = []
    for d in dates:
        this_gap = (date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))-date(int(d[0:4]),int(d[4:6]),int(d[6:8]))).days
        if this_gap>0:
            gaps.append(this_gap)
    if len(gaps)==0:
        return -1
    else:
        return min(gaps)

def get_day_gap_after(s):
    # same input format as get_day_gap_before, but looks at later receipts
    date_received,dates = s.split('-')
    dates = dates.split(':')
    gaps = []
    for d in dates:
        this_gap = (date(int(d[0:4]),int(d[4:6]),int(d[6:8]))-date(int(date_received[0:4]),int(date_received[4:6]),int(date_received[6:8]))).days
        if this_gap>0:
            gaps.append(this_gap)
    if len(gaps)==0:
        return -1
    else:
        return min(gaps)

def is_firstlastone(x):
    if x==0:
        return 1
    elif x>0:
        return 0
    else:
        return -1 #those only receive once
def extract_feature(dataset,flag):
    # other features, all computed inside the prediction window:
    # 1. this_month_user_receive_all_coupon_count
    # 2. this_month_user_receive_same_coupon_count
    # 3. this_month_user_receive_same_coupon_lastone
    # 4. this_month_user_receive_same_coupon_firstone
    # 5. this_day_user_receive_all_coupon_count
    # 6. this_day_user_receive_same_coupon_count
    # 7. day_gap_before, day_gap_after (receive the same coupon)
    # flag == 1: date_received is already numeric; otherwise it is cast to int before subtracting
    t = dataset[['user_id']].copy()
    t['this_month_user_receive_all_coupon_count'] = 1 # add a column of ones; the groupby sum below turns it into a count
    t = t.groupby('user_id').agg('sum').reset_index() # group by user_id; the sum is the number of rows for that user

    t1 = dataset[['user_id','coupon_id']].copy()
    t1['this_month_user_receive_same_coupon_count'] = 1
    t1 = t1.groupby(['user_id','coupon_id']).agg('sum').reset_index()

    t2 = dataset[['user_id','coupon_id','date_received']].copy()
    t2.date_received = t2.date_received.astype('str')
    t2 = t2.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
    t2['receive_number'] = t2.date_received.apply(lambda s:len(s.split(':')))
    t2 = t2[t2.receive_number>1].copy()
    t2['max_date_received'] = t2.date_received.apply(lambda s:max([int(d) for d in s.split(':')]))
    t2['min_date_received'] = t2.date_received.apply(lambda s:min([int(d) for d in s.split(':')]))
    t2 = t2[['user_id','coupon_id','max_date_received','min_date_received']]

    t3 = dataset[['user_id','coupon_id','date_received']]
    t3 = pd.merge(t3,t2,on=['user_id','coupon_id'],how='left')
    if flag == 1:
        t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received
        t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received - t3.min_date_received
    else :
        t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - t3.date_received.astype('int')
        t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype('int') - t3.min_date_received
    t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(is_firstlastone)
    t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(is_firstlastone)
    t3 = t3[['user_id','coupon_id','date_received','this_month_user_receive_same_coupon_lastone','this_month_user_receive_same_coupon_firstone']]

    t4 = dataset[['user_id','date_received']].copy()
    t4['this_day_user_receive_all_coupon_count'] = 1
    t4 = t4.groupby(['user_id','date_received']).agg('sum').reset_index()

    t5 = dataset[['user_id','coupon_id','date_received']].copy()
    t5['this_day_user_receive_same_coupon_count'] = 1
    t5 = t5.groupby(['user_id','coupon_id','date_received']).agg('sum').reset_index()

    t6 = dataset[['user_id','coupon_id','date_received']].copy()
    t6.date_received = t6.date_received.astype('str')
    t6 = t6.groupby(['user_id','coupon_id'])['date_received'].agg(lambda x:':'.join(x)).reset_index()
    t6.rename(columns={'date_received':'dates'},inplace=True)

    t7 = dataset[['user_id','coupon_id','date_received']]
    t7 = pd.merge(t7,t6,on=['user_id','coupon_id'],how='left')
    t7['date_received_date'] = t7.date_received.astype('str') + '-' + t7.dates
    t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
    t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
    t7 = t7[['user_id','coupon_id','date_received','day_gap_before','day_gap_after']]

    other_feature = pd.merge(t1,t,on='user_id')
    other_feature = pd.merge(other_feature,t3,on=['user_id','coupon_id'])
    other_feature = pd.merge(other_feature,t4,on=['user_id','date_received'])
    other_feature = pd.merge(other_feature,t5,on=['user_id','coupon_id','date_received'])
    other_feature = pd.merge(other_feature,t7,on=['user_id','coupon_id','date_received'])
    return other_feature
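The day_gap_before and day_gap_after features are easier to follow on a small example. The sketch below restates the same idea with a list of dates instead of the 'date-d1:d2:...' string that the functions above parse. `parse_yyyymmdd` and `nearest_gaps` are hypothetical helper names, not part of the original code.

```python
from datetime import date

def parse_yyyymmdd(s):
    """Turn a 'YYYYMMDD' string into a date object."""
    return date(int(s[0:4]), int(s[4:6]), int(s[6:8]))

def nearest_gaps(received, all_dates):
    """Return (days since the previous receipt, days until the next receipt).

    -1 is used when there is no earlier or later receipt, mirroring the
    convention of get_day_gap_before / get_day_gap_after.
    """
    cur = parse_yyyymmdd(received)
    deltas = [(parse_yyyymmdd(d) - cur).days for d in all_dates]
    before = [-x for x in deltas if x < 0]
    after = [x for x in deltas if x > 0]
    return (min(before) if before else -1, min(after) if after else -1)

# The same coupon received three times by one user.
dates = ["20160501", "20160510", "20160520"]
print(nearest_gaps("20160510", dates))
print(nearest_gaps("20160501", dates))
```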
fileDir="D:/workspace/gitWorkSpace/O2O-Coupon-Usage-Forecast-master/O2O-Coupon-Usage-Forecast-master/code/wepon/season one/"
fileDir="D:/workspace/gitWorkSpace/O2O-Coupon-Usage-Forecast-master/O2O-Coupon-Usage-Forecast-master/code/wepon/data1/"
# Load offline data: offline purchases and coupon-receiving behavior
off_train = pd.read_csv(fileDir+'data/ccf_offline_stage1_train.csv',header=None,keep_default_na = False)
off_train.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']
# Load online data: online clicks/purchases and coupon-receiving behavior
on_train = pd.read_csv(fileDir+'data/ccf_online_stage1_train.csv',header=None,keep_default_na = False)
on_train.columns = ['user_id','merchant_id','action','coupon_id','discount_rate','date_received','date']
# Offline samples for which coupon usage is to be predicted (the test set)
off_test = pd.read_csv(fileDir+'data/ccf_offline_stage1_test_revised.csv',header=None,keep_default_na = False)
off_test.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received']
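Split the dataset so that it can be used for cross-validation.
# Extract features from three and a half months of data (Jan 1 - Apr 13); use coupon usage in the following month (Apr 14 - May 14) to train the prediction model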
dataset1 = off_train[(off_train.date_received>='20160414')&(off_train.date_received<='20160514')]
feature1 = off_train[(off_train.date>='20160101')&(off_train.date<='20160413')|((off_train.date=='null')&(off_train.date_received>='20160101')&(off_train.date_received<='20160413'))]
#print(feature1)
#print("------------------------111-------------------------")
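# Extract features from three and a half months of data (Feb 1 - May 14); use coupon usage in the following month (May 15 - Jun 15) to train the prediction model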
feature2 = off_train[(off_train.date>='20160201')&(off_train.date<='20160514')|((off_train.date=='null')&(off_train.date_received>='20160201')&(off_train.date_received<='20160514'))]
dataset2 = off_train[(off_train.date_received>='20160515')&(off_train.date_received<='20160615')]
#print(feature2)
#print("------------------------222-------------------------")
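# Extract features from three and a half months of data (Mar 15 - Jun 30) to predict coupon usage in the following month (Jul 1 - Jul 31), which is the test set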
feature3 = off_train[((off_train.date>='20160315')&(off_train.date<='20160630'))|((off_train.date=='null')&(off_train.date_received>='20160315')&(off_train.date_received<='20160630'))]
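dataset3 = off_test # one month of test data to predict on
# NOTE: extract_feature is defined above but never called in this post, although
# other_feature1/2/3.csv are read back later. The calls below are an assumed
# reconstruction: flag=1 is used only for the test set, on the assumption that its
# date_received column is already numeric.
extract_feature(dataset1,0).to_csv(fileDir+'data/other_feature1.csv',index=None)
extract_feature(dataset2,0).to_csv(fileDir+'data/other_feature2.csv',index=None)
extract_feature(dataset3,1).to_csv(fileDir+'data/other_feature3.csv',index=None)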
#print(feature3)
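The three feature/label pairs above all follow one pattern, which can be sketched as a single helper. `split_window` is a hypothetical function, not part of the original code. It assumes, as the code above does, that dates are 'YYYYMMDD' strings and that 'null' marks a missing value.

```python
import pandas as pd

def split_window(df, feat_start, feat_end, label_start, label_end):
    """Return (feature_rows, label_rows) for one sliding window.

    Feature rows: purchases dated inside the feature window, plus coupons
    received inside it that were never used. Label rows: coupons received
    inside the label window.
    """
    bought = (df.date != "null") & df.date.between(feat_start, feat_end)
    unused = (df.date == "null") & df.date_received.between(feat_start, feat_end)
    feature = df[bought | unused]
    label = df[df.date_received.between(label_start, label_end)]
    return feature, label

toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "date_received": ["20160110", "20160301", "20160420", "null"],
    "date": ["20160115", "null", "20160425", "20160201"],
})
feature, label = split_window(toy, "20160101", "20160413", "20160414", "20160514")
print(feature.user_id.tolist(), label.user_id.tolist())
```

Comparing 'YYYYMMDD' strings works here because lexicographic order matches chronological order for zero-padded dates.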
Coupon features
############# coupon related feature #############
# """
# 2.coupon related:
# discount_rate. discount_man. discount_jian. is_man_jian
# day_of_week,day_of_month. (date_received)
# """
def calc_discount_rate(s):
    # '0.9' stays 0.9; 'man:jian' such as '200:20' becomes 1 - jian/man
    s =str(s)
    s = s.split(':')
    if len(s)==1:
        return float(s[0])
    else:
        return 1.0-float(s[1])/float(s[0])

def get_discount_man(s):
    s =str(s)
    s = s.split(':')
    if len(s)==1:
        return 'null'
    else:
        return int(s[0])

def get_discount_jian(s):
    s =str(s)
    s = s.split(':')
    if len(s)==1:
        return 'null'
    else:
        return int(s[1])

def is_man_jian(s):
    s =str(s)
    s = s.split(':')
    if len(s)==1:
        return 0
    else:
        return 1

def coupon_process(dataset):
    dataset = dataset.copy() # work on a copy so the caller's frame is not modified
    dataset['day_of_week'] = dataset.date_received.astype('str').apply(lambda x:date(int(x[0:4]),int(x[4:6]),int(x[6:8])).weekday()+1)
    dataset['day_of_month'] = dataset.date_received.astype('str').apply(lambda x:int(x[6:8]))
    # days_distance is measured from 2016-05-14 for every dataset passed in
    dataset['days_distance'] = dataset.date_received.astype('str').apply(lambda x:(date(int(x[0:4]),int(x[4:6]),int(x[6:8]))-date(2016,5,14)).days)
    dataset['discount_man'] = dataset.discount_rate.apply(get_discount_man)
    dataset['discount_jian'] = dataset.discount_rate.apply(get_discount_jian)
    dataset['is_man_jian'] = dataset.discount_rate.apply(is_man_jian)
    dataset['discount_rate'] = dataset.discount_rate.apply(calc_discount_rate)
    d = dataset[['coupon_id']].copy()
    d['coupon_count'] = 1
    d = d.groupby('coupon_id').agg('sum').reset_index()
    return pd.merge(dataset,d,on='coupon_id',how='left')
dataset1=coupon_process(dataset1)
dataset1.to_csv(fileDir+'data/coupon1_feature.csv',index=None)
dataset2=coupon_process(dataset2)
dataset2.to_csv(fileDir+'data/coupon2_feature.csv',index=None)
#print(dataset3)
#print("--------------------")
dataset3=coupon_process(dataset3)
dataset3.to_csv(fileDir+'data/coupon3_feature.csv',index=None)
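The four discount helpers (calc_discount_rate, get_discount_man, get_discount_jian, is_man_jian) all parse the same raw discount_rate string. The sketch below folds them into one hypothetical function, `parse_discount`, purely to show the two input formats side by side. It returns None where the original helpers return the string 'null'.

```python
def parse_discount(raw):
    """Return (rate, threshold, reduction, is_threshold_coupon) for a raw discount string.

    '0.9'    -> a direct discount rate
    '200:50' -> spend at least 200, get 50 off
    """
    parts = str(raw).split(":")
    if len(parts) == 1:
        return float(parts[0]), None, None, 0
    threshold, reduction = int(parts[0]), int(parts[1])
    return 1.0 - reduction / threshold, threshold, reduction, 1

print(parse_discount("0.9"))
print(parse_discount("200:50"))
```

Converting the threshold form into an equivalent rate puts both coupon types on one numeric scale, while the flag keeps the original type available to the model.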
Merchant features
############# merchant related feature #############
#"""
#1.merchant related:
# total_sales. sales_use_coupon. total_coupon
# coupon_rate = sales_use_coupon/total_sales.
# transfer_rate = sales_use_coupon/total_coupon.
# merchant_avg_distance,merchant_min_distance,merchant_max_distance of those use coupon
#"""
def merchant_related(feature):
    merchant = feature[['merchant_id','coupon_id','distance','date_received','date']]
    t = merchant[['merchant_id']].drop_duplicates() # one row per merchant

    # total_sales: number of purchase records
    t1 = merchant[merchant.date!='null'][['merchant_id']].copy()
    t1['total_sales'] = 1
    t1 = t1.groupby('merchant_id').agg('sum').reset_index()

    # sales_use_coupon: purchases made with a coupon
    t2 = merchant[(merchant.date!='null')&(merchant.coupon_id!='null')][['merchant_id']].copy()
    t2['sales_use_coupon'] = 1
    t2 = t2.groupby('merchant_id').agg('sum').reset_index()

    # total_coupon: coupons received from this merchant
    t3 = merchant[merchant.coupon_id!='null'][['merchant_id']].copy()
    t3['total_coupon'] = 1
    t3 = t3.groupby('merchant_id').agg('sum').reset_index()

    # distance statistics over redeemed coupons; 'null' distances become NaN
    t4 = merchant[(merchant.date!='null')&(merchant.coupon_id!='null')][['merchant_id','distance']].copy()
    t4.replace('null',-1,inplace=True)
    t4.distance = t4.distance.astype('int')
    t4.replace(-1,np.nan,inplace=True)
    t5 = t4.groupby('merchant_id').agg('min').reset_index()
    t5.rename(columns={'distance':'merchant_min_distance'},inplace=True)
    t6 = t4.groupby('merchant_id').agg('max').reset_index()
    t6.rename(columns={'distance':'merchant_max_distance'},inplace=True)
    t7 = t4.groupby('merchant_id').agg('mean').reset_index()
    t7.rename(columns={'distance':'merchant_mean_distance'},inplace=True)
    t8 = t4.groupby('merchant_id').agg('median').reset_index()
    t8.rename(columns={'distance':'merchant_median_distance'},inplace=True)

    merchant_feature = pd.merge(t,t1,on='merchant_id',how='left')
    merchant_feature = pd.merge(merchant_feature,t2,on='merchant_id',how='left')
    merchant_feature = pd.merge(merchant_feature,t3,on='merchant_id',how='left')
    merchant_feature = pd.merge(merchant_feature,t5,on='merchant_id',how='left')
    merchant_feature = pd.merge(merchant_feature,t6,on='merchant_id',how='left')
    merchant_feature = pd.merge(merchant_feature,t7,on='merchant_id',how='left')
    merchant_feature = pd.merge(merchant_feature,t8,on='merchant_id',how='left')
    merchant_feature.sales_use_coupon = merchant_feature.sales_use_coupon.replace(np.nan,0) #fillna with 0
    merchant_feature['merchant_coupon_transfer_rate'] = merchant_feature.sales_use_coupon.astype('float') / merchant_feature.total_coupon
    merchant_feature['coupon_rate'] = merchant_feature.sales_use_coupon.astype('float') / merchant_feature.total_sales
    merchant_feature.total_coupon = merchant_feature.total_coupon.replace(np.nan,0) #fillna with 0
    return merchant_feature
merchant_feature1 = merchant_related(feature1)
merchant_feature1.to_csv(fileDir+'data/merchant1_feature.csv',index=None)
merchant_feature2 = merchant_related(feature2)
merchant_feature2.to_csv(fileDir+'data/merchant2_feature.csv',index=None)
merchant_feature3 = merchant_related(feature3)
merchant_feature3.to_csv(fileDir+'data/merchant3_feature.csv',index=None)
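The min/max/mean/median distance columns in merchant_related are built with four separate groupby calls and four merges. As an alternative sketch (not the original author's code), the same statistics can come from a single aggregation. The toy data is invented, and `pd.to_numeric(..., errors="coerce")` replaces the replace/astype/replace sequence for turning 'null' into NaN.

```python
import numpy as np
import pandas as pd

# Redeemed-coupon records with a distance column that may contain 'null'.
used = pd.DataFrame({
    "merchant_id": [1, 1, 1, 2],
    "distance": ["0", "2", "null", "null"],
})

# Non-numeric entries such as 'null' become NaN and are skipped by the aggregations.
used["distance"] = pd.to_numeric(used["distance"], errors="coerce")

stats = used.groupby("merchant_id")["distance"].agg(["min", "max", "mean", "median"])
stats.columns = ["merchant_" + c + "_distance" for c in stats.columns]
stats = stats.reset_index()
print(stats)
```

A merchant whose distances are all missing keeps NaN in every statistic, which matches what the left merges in merchant_related produce.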
User features:
############# user related feature #############
#"""
# 2.user related:
# count_merchant.
# user_avg_distance, user_min_distance,user_max_distance.
# buy_use_coupon. buy_total. coupon_received.
# buy_use_coupon/coupon_received.
# buy_use_coupon/buy_total
# user_date_datereceived_gap
#
#"""
def get_user_date_datereceived_gap(s):
    # s has the form 'date:date_received', both as YYYYMMDD; returns the days between them
    s = s.split(':')
    return (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8])) - date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days

def user_related(feature):
    user = feature[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']]
    t = user[['user_id']].drop_duplicates() # keep one row per distinct user_id (the first occurrence keeps its index)

    # count_merchant: number of distinct merchants the user bought from
    t1 = user[user.date!='null'][['user_id','merchant_id']].drop_duplicates()
    t1.merchant_id = 1
    t1 = t1.groupby('user_id').agg('sum').reset_index()
    t1.rename(columns={'merchant_id':'count_merchant'},inplace=True)

    # distance statistics over redeemed coupons; 'null' distances become NaN
    t2 = user[(user.date!='null')&(user.coupon_id!='null')][['user_id','distance']].copy()
    t2.replace('null',-1,inplace=True)
    t2.distance = t2.distance.astype('int')
    t2.replace(-1,np.nan,inplace=True)
    t3 = t2.groupby('user_id').agg('min').reset_index()
    t3.rename(columns={'distance':'user_min_distance'},inplace=True)
    t4 = t2.groupby('user_id').agg('max').reset_index()
    t4.rename(columns={'distance':'user_max_distance'},inplace=True)
    t5 = t2.groupby('user_id').agg('mean').reset_index()
    t5.rename(columns={'distance':'user_mean_distance'},inplace=True)
    t6 = t2.groupby('user_id').agg('median').reset_index()
    t6.rename(columns={'distance':'user_median_distance'},inplace=True)

    t7 = user[(user.date!='null')&(user.coupon_id!='null')][['user_id']].copy()
    t7['buy_use_coupon'] = 1
    t7 = t7.groupby('user_id').agg('sum').reset_index()

    t8 = user[user.date!='null'][['user_id']].copy()
    t8['buy_total'] = 1
    t8 = t8.groupby('user_id').agg('sum').reset_index()

    t9 = user[user.coupon_id!='null'][['user_id']].copy()
    t9['coupon_received'] = 1
    t9 = t9.groupby('user_id').agg('sum').reset_index()

    # days between receiving a coupon and using it
    t10 = user[(user.date_received!='null')&(user.date!='null')][['user_id','date_received','date']].copy()
    t10['user_date_datereceived_gap'] = t10.date + ':' + t10.date_received
    t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap)
    t10 = t10[['user_id','user_date_datereceived_gap']]
    t11 = t10.groupby('user_id').agg('mean').reset_index()
    t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True)
    t12 = t10.groupby('user_id').agg('min').reset_index()
    t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True)
    t13 = t10.groupby('user_id').agg('max').reset_index()
    t13.rename(columns={'user_date_datereceived_gap':'max_user_date_datereceived_gap'},inplace=True)

    user_feature = pd.merge(t,t1,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t3,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t4,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t5,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t6,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t7,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t8,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t9,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t11,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t12,on='user_id',how='left')
    user_feature = pd.merge(user_feature,t13,on='user_id',how='left')
    user_feature.count_merchant = user_feature.count_merchant.replace(np.nan,0)
    user_feature.buy_use_coupon = user_feature.buy_use_coupon.replace(np.nan,0)
    user_feature['buy_use_coupon_rate'] = user_feature.buy_use_coupon.astype('float') / user_feature.buy_total.astype('float')
    user_feature['user_coupon_transfer_rate'] = user_feature.buy_use_coupon.astype('float') / user_feature.coupon_received.astype('float')
    user_feature.buy_total = user_feature.buy_total.replace(np.nan,0)
    user_feature.coupon_received = user_feature.coupon_received.replace(np.nan,0)
    return user_feature
print(feature1)
print('----')
user_feature1 = user_related(feature1)
user_feature1.to_csv(fileDir+'data/user1_feature.csv',index=None)
user_feature2 = user_related(feature2)
user_feature2.to_csv(fileDir+'data/user2_feature.csv',index=None)
user_feature3 = user_related(feature3)
user_feature3.to_csv(fileDir+'data/user3_feature.csv',index=None)
User-merchant features:
#"""
#4.user_merchant:
# times_user_buy_merchant_before.
#"""
def user_merchant(feature):
    all_user_merchant = feature[['user_id','merchant_id']].drop_duplicates()

    # purchases by this user at this merchant
    t = feature[['user_id','merchant_id','date']]
    t = t[t.date!='null'][['user_id','merchant_id']].copy()
    t['user_merchant_buy_total'] = 1
    t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t.drop_duplicates(inplace=True)

    # coupons of this merchant received by this user
    t1 = feature[['user_id','merchant_id','coupon_id']]
    t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']].copy()
    t1['user_merchant_received'] = 1
    t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t1.drop_duplicates(inplace=True)

    # purchases made with a coupon
    t2 = feature[['user_id','merchant_id','date','date_received']]
    t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']].copy()
    t2['user_merchant_buy_use_coupon'] = 1
    t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t2.drop_duplicates(inplace=True)

    # all records for this pair
    t3 = feature[['user_id','merchant_id']].copy()
    t3['user_merchant_any'] = 1
    t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t3.drop_duplicates(inplace=True)

    # purchases made without a coupon
    t4 = feature[['user_id','merchant_id','date','coupon_id']]
    t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']].copy()
    t4['user_merchant_buy_common'] = 1
    t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t4.drop_duplicates(inplace=True)

    user_merchant = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left')
    user_merchant = pd.merge(user_merchant,t1,on=['user_id','merchant_id'],how='left')
    user_merchant = pd.merge(user_merchant,t2,on=['user_id','merchant_id'],how='left')
    user_merchant = pd.merge(user_merchant,t3,on=['user_id','merchant_id'],how='left')
    user_merchant = pd.merge(user_merchant,t4,on=['user_id','merchant_id'],how='left')
    user_merchant.user_merchant_buy_use_coupon = user_merchant.user_merchant_buy_use_coupon.replace(np.nan,0)
    user_merchant.user_merchant_buy_common = user_merchant.user_merchant_buy_common.replace(np.nan,0)
    user_merchant['user_merchant_coupon_transfer_rate'] = user_merchant.user_merchant_buy_use_coupon.astype('float') / user_merchant.user_merchant_received.astype('float')
    user_merchant['user_merchant_coupon_buy_rate'] = user_merchant.user_merchant_buy_use_coupon.astype('float') / user_merchant.user_merchant_buy_total.astype('float')
    user_merchant['user_merchant_rate'] = user_merchant.user_merchant_buy_total.astype('float') / user_merchant.user_merchant_any.astype('float')
    user_merchant['user_merchant_common_buy_rate'] = user_merchant.user_merchant_buy_common.astype('float') / user_merchant.user_merchant_buy_total.astype('float')
    return user_merchant
user_merchant1=user_merchant(feature1)
user_merchant1.to_csv(fileDir+'data/user_merchant1.csv',index=None)
user_merchant2=user_merchant(feature2)
user_merchant2.to_csv(fileDir+'data/user_merchant2.csv',index=None)
user_merchant3=user_merchant(feature3)
user_merchant3.to_csv(fileDir+'data/user_merchant3.csv',index=None)
Generating the final features:
################## generate training and testing set ################
def get_label(s):
    # s has the form 'date:date_received'
    # 0: never used, 1: used within 15 days of receipt, -1: used after more than 15 days
    s = s.split(':')
    if s[0]=='null':
        return 0
    elif (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8]))-date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days<=15:
        return 1
    else:
        return -1

def total_feature(dataset):
    dataset.user_merchant_buy_total = dataset.user_merchant_buy_total.replace(np.nan,0)
    dataset.user_merchant_any = dataset.user_merchant_any.replace(np.nan,0)
    dataset.user_merchant_received = dataset.user_merchant_received.replace(np.nan,0)
    dataset['is_weekend'] = dataset.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
    weekday_dummies = pd.get_dummies(dataset.day_of_week)
    weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
    dataset = pd.concat([dataset,weekday_dummies],axis=1)
    dataset['label'] = dataset.date.astype('str') + ':' + dataset.date_received.astype('str')
    dataset.label = dataset.label.apply(get_label)
    dataset.drop(['merchant_id','day_of_week','date','date_received','coupon_id','coupon_count'],axis=1,inplace=True)
    dataset = dataset.replace('null',np.nan)
    return dataset
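One detail of total_feature deserves a note: the weekday dummy columns are renamed by position, which is only correct when all seven weekdays occur in the data. The sketch below is a variation, not the original code, and it names the columns from the actual weekday values and pads the missing days with zeros. The toy frame is invented.

```python
import pandas as pd

# Only three distinct weekdays appear in this toy frame (1 = Monday ... 7 = Sunday).
df = pd.DataFrame({"day_of_week": [1, 6, 7, 6]})

df["is_weekend"] = df.day_of_week.isin([6, 7]).astype(int)

# Column names come from the values themselves: weekday1, weekday6, weekday7.
dummies = pd.get_dummies(df.day_of_week, prefix="weekday", prefix_sep="")

# Guarantee the full weekday1..weekday7 layout, filling absent days with 0.
all_days = ["weekday" + str(i) for i in range(1, 8)]
dummies = dummies.reindex(columns=all_days, fill_value=0).astype(int)

df = pd.concat([df, dummies], axis=1)
print(df)
```

With positional renaming, the same toy frame would label Saturday as weekday2, so naming by value keeps the training and test columns aligned even when a window lacks some weekdays.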
## dataset1
coupon1 = pd.read_csv(fileDir+'data/coupon1_feature.csv',keep_default_na=False)
merchant1 = pd.read_csv(fileDir+'data/merchant1_feature.csv',keep_default_na=False)
user1 = pd.read_csv(fileDir+'data/user1_feature.csv',keep_default_na=False)
user_merchant1 = pd.read_csv(fileDir+'data/user_merchant1.csv',keep_default_na=False)
other_feature1 = pd.read_csv(fileDir+'data/other_feature1.csv',keep_default_na=False)
dataset1 = pd.merge(coupon1,merchant1,on='merchant_id',how='left')
dataset1 = pd.merge(dataset1,user1,on='user_id',how='left')
dataset1 = pd.merge(dataset1,user_merchant1,on=['user_id','merchant_id'],how='left')
dataset1 = pd.merge(dataset1,other_feature1,on=['user_id','coupon_id','date_received'],how='left')
dataset1.drop_duplicates(inplace=True)
print(dataset1.shape)
dataset1 = total_feature(dataset1)
dataset1.to_csv(fileDir+'data/dataset1.csv',index=None)
## dataset2
coupon2 = pd.read_csv(fileDir+'data/coupon2_feature.csv',keep_default_na=False)
merchant2 = pd.read_csv(fileDir+'data/merchant2_feature.csv',keep_default_na=False)
user2 = pd.read_csv(fileDir+'data/user2_feature.csv',keep_default_na=False)
user_merchant2 = pd.read_csv(fileDir+'data/user_merchant2.csv',keep_default_na=False)
other_feature2 = pd.read_csv(fileDir+'data/other_feature2.csv',keep_default_na=False)
dataset2 = pd.merge(coupon2,merchant2,on='merchant_id',how='left')
dataset2 = pd.merge(dataset2,user2,on='user_id',how='left')
dataset2 = pd.merge(dataset2,user_merchant2,on=['user_id','merchant_id'],how='left')
dataset2 = pd.merge(dataset2,other_feature2,on=['user_id','coupon_id','date_received'],how='left')
dataset2.drop_duplicates(inplace=True)
print(dataset2.shape)
dataset2 = total_feature(dataset2)
dataset2.to_csv(fileDir+'data/dataset2.csv',index=None)
## dataset3
coupon3 = pd.read_csv(fileDir+'data/coupon3_feature.csv',keep_default_na=False)
merchant3 = pd.read_csv(fileDir+'data/merchant3_feature.csv',keep_default_na=False)
user3 = pd.read_csv(fileDir+'data/user3_feature.csv',keep_default_na=False)
user_merchant3 = pd.read_csv(fileDir+'data/user_merchant3.csv',keep_default_na=False)
other_feature3 = pd.read_csv(fileDir+'data/other_feature3.csv',keep_default_na=False)
dataset3 = pd.merge(coupon3,merchant3,on='merchant_id',how='left')
dataset3 = pd.merge(dataset3,user3,on='user_id',how='left')
dataset3 = pd.merge(dataset3,user_merchant3,on=['user_id','merchant_id'],how='left')
dataset3 = pd.merge(dataset3,other_feature3,on=['user_id','coupon_id','date_received'],how='left')
dataset3.drop_duplicates(inplace=True)
print(dataset3.shape)
dataset3.user_merchant_buy_total = dataset3.user_merchant_buy_total.replace(np.nan,0)
dataset3.user_merchant_any = dataset3.user_merchant_any.replace(np.nan,0)
dataset3.user_merchant_received = dataset3.user_merchant_received.replace(np.nan,0)
dataset3['is_weekend'] = dataset3.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
weekday_dummies = pd.get_dummies(dataset3.day_of_week)
weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
dataset3 = pd.concat([dataset3,weekday_dummies],axis=1)
dataset3.drop(['merchant_id','day_of_week','coupon_count'],axis=1,inplace=True)
dataset3 = dataset3.replace('null',np.nan)
dataset3.to_csv(fileDir+'data/dataset3.csv',index=None)
print('------')