Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/LY_ysys629/article/details/74012491
Competition Overview
1) Tools: Python, pandas, sklearn
2) Approach: use the previous day's interaction statistics for each user-item pair to predict whether that pair is purchased the next day.
Note: the feature window is one day, and each classification sample is a user-item pair, i.e. (user_id, item_id). This post mainly walks through the competition workflow; it does not go into detail on feature construction, feature processing, or feature selection.
From this post you get the full pipeline: downloading the data from the official site, processing it, extracting the daily counts of the four interaction types (behavior_type 1-4: browse, collect, add-to-cart, purchase) for every user-item pair, assembling the training and test sets, and then training, testing, and predicting with a model. It also shows how to handle class imbalance with sklearn, so you get a genuine hands-on feel for the Tmall mobile recommendation competition.
step1: Inspect and process the user table
import pandas as pd
import numpy as np
%time userAll = pd.read_csv('E:/python/gbdt/fresh_comp_offline/tianchi_fresh_comp_train_user.csv',\
usecols = ['user_id','item_id','behavior_type','time'])
Wall time: 14.2 s
userAll.head()
    user_id    item_id  behavior_type           time
0  10001082  285259775              1  2014-12-08 18
1  10001082    4368907              1  2014-12-12 12
2  10001082    4368907              1  2014-12-12 12
3  10001082   53616768              1  2014-12-02 15
4  10001082  151466952              1  2014-12-12 11
userAll.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23291027 entries, 0 to 23291026
Data columns (total 4 columns):
user_id int64
item_id int64
behavior_type int64
time object
dtypes: int64(3), object(1)
memory usage: 710.8+ MB
userAll.duplicated().sum()
11505107
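Over 11.5 million duplicate rows look alarming, but they are expected: time only has hour resolution, so a user who, say, clicks the same item twice within one hour produces two identical rows. These are real repeat interactions and are kept. To eyeball a few (a sketch, not part of the original session):
# rows that occur more than once: same user, item, behavior type and hour
userAll[userAll.duplicated(keep = False)].head()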
step2: Download, inspect, and process the item-subset table
%time itemSub = pd.read_csv('tianchi_fresh_comp_train_item.csv',usecols = ['item_id'])
Wall time: 428 ms
itemSub.item_id.is_unique
False
itemSub.item_id.value_counts().head()
25013404 8724
311093202 5999
228198932 5597
238357777 5522
313822206 4517
Name: item_id, dtype: int64
itemSub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620918 entries, 0 to 620917
Data columns (total 1 columns):
item_id 620918 non-null int64
dtypes: int64(1)
memory usage: 4.7 MB
itemSub.duplicated().sum()
198060
itemSet = itemSub[['item_id']].drop_duplicates()
itemSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 422858 entries, 0 to 620917
Data columns (total 1 columns):
item_id 422858 non-null int64
dtypes: int64(1)
memory usage: 6.5 MB
step3: Intersect the user table with the item subset
Since the user-item predictions (which users will buy which items) are scored only on the item subset, we can restrict attention to users' interactions with items in that subset when predicting user-item pairs.
Alternatively, one could use the full user table and analyze each user's behavior across item categories to predict user-item pairs; a sketch of that idea follows.
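For reference, the raw user CSV also carries an item_category column (dropped by usecols in step1), so a hedged sketch of that category-level alternative might look like this (userCat and userCatStats are illustrative names, not part of the original session):
# reload with the category column instead of the item id
userCat = pd.read_csv('E:/python/gbdt/fresh_comp_offline/tianchi_fresh_comp_train_user.csv',
                      usecols = ['user_id','item_category','behavior_type'])
# count each behavior type per (user, category) pair
catDummies = pd.get_dummies(userCat['behavior_type'], prefix = 'type')
userCatStats = pd.concat([userCat[['user_id','item_category']], catDummies], axis = 1)\
                 .groupby(['user_id','item_category'], as_index = False).sum()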
%time userSub = pd.merge(userAll,itemSet,on = 'item_id',how = 'inner')
Wall time: 4.4 s
userSub.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2084859 entries, 0 to 2084858
Data columns (total 4 columns):
user_id int64
item_id int64
behavior_type int64
time object
dtypes: int64(3), object(1)
memory usage: 79.5+ MB
userSub.head()
    user_id    item_id  behavior_type           time
0  10001082  275221686              1  2014-12-03 01
1  10001082  275221686              1  2014-12-13 14
2  10001082  275221686              1  2014-12-08 07
3  10001082  275221686              1  2014-12-08 07
4  10001082  275221686              1  2014-12-08 00
Save this dataset to a CSV file:
%time userSub.to_csv('userSub.csv')
Wall time: 4.53 s
step4: Process the time data
Reload userSub. (userSub is the main dataset for the predictions that follow, so it needs to be saved anyway; saving and re-reading it is also an indirect route to re-indexing it by time, which we do explicitly below.)
%time userSub = pd.read_csv('userSub.csv',usecols = ['user_id','item_id','behavior_type','time'],parse_dates = True)
Wall time: 14 s
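If the goal were just a time index, pd.read_csv could set it directly at read time (a sketch; the session below keeps time as a plain column instead, since the groupby in step5 needs it):
userSub = pd.read_csv('userSub.csv', usecols = ['user_id','item_id','behavior_type','time'],
                      index_col = 'time', parse_dates = True)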
userSub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2084859 entries, 0 to 2084858
Data columns (total 4 columns):
user_id int64
item_id int64
behavior_type int64
time object
dtypes: int64(3), object(1)
memory usage: 63.6+ MB
userSub.head()
    user_id    item_id  behavior_type           time
0  10001082  275221686              1  2014-12-03 01
1  10001082  275221686              1  2014-12-13 14
2  10001082  275221686              1  2014-12-08 07
3  10001082  275221686              1  2014-12-08 07
4  10001082  275221686              1  2014-12-08 00
# parse_dates=True above only applies to the index, so 'time' was read back as a
# plain string column; to inspect the time span, index a sorted copy by time
# (userSub itself keeps 'time' as a column for the groupby in step5)
userSubTime = userSub.set_index(pd.to_datetime(userSub['time'])).drop('time', axis = 1).sort_index()
userSubTime.index
DatetimeIndex(['2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
...
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00'],
dtype='datetime64[ns]', name=u'time', length=2084859, freq=None)
userSubTime.head()
              user_id    item_id  behavior_type
time
2014-11-18  129403050   52900329              1
2014-11-18   23246977  353606633              1
2014-11-18  140763800  369393023              1
2014-11-18  140763800  187769381              1
2014-11-18   32363170  134081514              1
step5: Feature processing
Feature processing has two parts:
1) one-hot (dummy) encode the interaction type of each user-item record;
2) set a time window and aggregate the interaction counts within it (see the crosstab sketch right below for a compact equivalent).
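Before the cell-by-cell walkthrough, note that parts 1) and 2) can be collapsed into a single call with pd.crosstab, which counts how often each behavior_type occurs per (time, user_id, item_id) group. A sketch (the resulting columns are named by the raw codes 1-4 rather than type_1..type_4):
# one-shot equivalent of get_dummies + groupby().sum()
counts = pd.crosstab(index = [userSub['time'], userSub['user_id'], userSub['item_id']],
                     columns = userSub['behavior_type']).reset_index()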
pd.get_dummies(userSub['behavior_type'],prefix = 'type').head()
   type_1  type_2  type_3  type_4
0     1.0     0.0     0.0     0.0
1     1.0     0.0     0.0     0.0
2     1.0     0.0     0.0     0.0
3     1.0     0.0     0.0     0.0
4     1.0     0.0     0.0     0.0
typeDummies = pd.get_dummies(userSub['behavior_type'],prefix = 'type')
userSubOneHot = pd.concat([userSub[['user_id','item_id','time']],typeDummies],axis = 1)
usertem = pd.concat([userSub[['user_id','item_id']],typeDummies,userSub[['time']]],axis = 1)
usertem.head()
    user_id    item_id  type_1  type_2  type_3  type_4           time
0  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-03 01
1  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-13 14
2  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-08 07
3  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-08 07
4  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-08 00
usertem.groupby(['time','user_id','item_id'],as_index = False).sum().head()
            time  user_id    item_id  type_1  type_2  type_3  type_4
0  2014-11-18 00  1409053   58649567     2.0     0.0     0.0     0.0
1  2014-11-18 00  1446949    2432119     3.0     0.0     0.0     0.0
2  2014-11-18 00  1446949  206833072     2.0     0.0     0.0     0.0
3  2014-11-18 00  1446949  347745633     1.0     0.0     0.0     0.0
4  2014-11-18 00  2903578  395200199     2.0     0.0     0.0     0.0
userSubOneHot.head()
    user_id    item_id           time  type_1  type_2  type_3  type_4
0  10001082  275221686  2014-12-03 01     1.0     0.0     0.0     0.0
1  10001082  275221686  2014-12-13 14     1.0     0.0     0.0     0.0
2  10001082  275221686  2014-12-08 07     1.0     0.0     0.0     0.0
3  10001082  275221686  2014-12-08 07     1.0     0.0     0.0     0.0
4  10001082  275221686  2014-12-08 00     1.0     0.0     0.0     0.0
userSubOneHot.info()
userSubOneHotGroup = userSubOneHot.groupby(['time','user_id','item_id'],as_index = False).sum()
userSubOneHotGroup.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 968243 entries, 0 to 968242
Data columns (total 7 columns):
time 968243 non-null object
user_id 968243 non-null int64
item_id 968243 non-null int64
type_1 968243 non-null float64
type_2 968243 non-null float64
type_3 968243 non-null float64
type_4 968243 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 59.1+ MB
userSubOneHotGroup.head()
            time  user_id    item_id  type_1  type_2  type_3  type_4
0  2014-11-18 00  1409053   58649567     2.0     0.0     0.0     0.0
1  2014-11-18 00  1446949    2432119     3.0     0.0     0.0     0.0
2  2014-11-18 00  1446949  206833072     2.0     0.0     0.0     0.0
3  2014-11-18 00  1446949  347745633     1.0     0.0     0.0     0.0
4  2014-11-18 00  2903578  395200199     2.0     0.0     0.0     0.0
Split the time into day and hour
# .date / .time on a DatetimeIndex return the day and clock-time parts as object arrays
userSubOneHotGroup['time_day'] = pd.to_datetime(userSubOneHotGroup.time.values).date
userSubOneHotGroup['time_hour'] = pd.to_datetime(userSubOneHotGroup.time.values).time
userSubOneHotGroup.head()
            time  user_id    item_id  type_1  type_2  type_3  type_4    time_day time_hour
0  2014-11-18 00  1409053   58649567     2.0     0.0     0.0     0.0  2014-11-18  00:00:00
1  2014-11-18 00  1446949    2432119     3.0     0.0     0.0     0.0  2014-11-18  00:00:00
2  2014-11-18 00  1446949  206833072     2.0     0.0     0.0     0.0  2014-11-18  00:00:00
3  2014-11-18 00  1446949  347745633     1.0     0.0     0.0     0.0  2014-11-18  00:00:00
4  2014-11-18 00  2903578  395200199     2.0     0.0     0.0     0.0  2014-11-18  00:00:00
dataHour = userSubOneHotGroup.ix[:,0:7]   # keep the original hourly columns, drop time_day/time_hour
dataHour.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 968243 entries, 0 to 968242
Data columns (total 7 columns):
time 968243 non-null object
user_id 968243 non-null int64
item_id 968243 non-null int64
type_1 968243 non-null float64
type_2 968243 non-null float64
type_3 968243 non-null float64
type_4 968243 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 59.1+ MB
dataHour.to_csv('dataHour.csv')
dataHour.duplicated().sum()
0
# re-aggregate to daily counts per user-item pair
dataDay = userSubOneHotGroup.groupby(['time_day','user_id','item_id'],as_index = False).sum()
dataDay.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 904397 entries, 0 to 904396
Data columns (total 7 columns):
time_day 904397 non-null object
user_id 904397 non-null int64
item_id 904397 non-null int64
type_1 904397 non-null float64
type_2 904397 non-null float64
type_3 904397 non-null float64
type_4 904397 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 55.2+ MB
dataDay.head()
     time_day  user_id    item_id  type_1  type_2  type_3  type_4
0  2014-11-18      492   76093985     1.0     0.0     0.0     0.0
1  2014-11-18      492  110036513     2.0     0.0     0.0     0.0
2  2014-11-18      492  176404510     1.0     0.0     0.0     0.0
3  2014-11-18      492  178412255     2.0     0.0     0.0     0.0
4  2014-11-18      492  335961429     1.0     0.0     0.0     0.0
dataDay.to_csv('dataDay.csv')
dataDay.duplicated().sum()
0
dataDay.type_4.max()
20.0
step6: Build the training and test sets
This post works from the daily table and classifies, for each user-item pair, whether a purchase occurs: pairs with a purchase get label 1, all others get label 0.
dataDay_load = pd.read_csv('dataDay.csv',usecols = ['time_day','user_id','item_id','type_1',\
'type_2','type_3','type_4'], index_col = 'time_day',parse_dates = True)
dataDay_load.head()
            user_id    item_id  type_1  type_2  type_3  type_4
time_day
2014-11-18      492   76093985     1.0     0.0     0.0     0.0
2014-11-18      492  110036513     2.0     0.0     0.0     0.0
2014-11-18      492  176404510     1.0     0.0     0.0     0.0
2014-11-18      492  178412255     2.0     0.0     0.0     0.0
2014-11-18      492  335961429     1.0     0.0     0.0     0.0
dataDay_load.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 904397 entries, 2014-11-18 to 2014-12-18
Data columns (total 6 columns):
user_id 904397 non-null int64
item_id 904397 non-null int64
type_1 904397 non-null float64
type_2 904397 non-null float64
type_3 904397 non-null float64
type_4 904397 non-null float64
dtypes: float64(4), int64(2)
memory usage: 48.3 MB
# training features: the behavior counts of each user-item pair on Dec 16
train_x = dataDay_load.ix['2014-12-16',:]
train_x.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 30183 entries, 2014-12-16 to 2014-12-16
Data columns (total 6 columns):
user_id 30183 non-null int64
item_id 30183 non-null int64
type_1 30183 non-null float64
type_2 30183 non-null float64
type_3 30183 non-null float64
type_4 30183 non-null float64
dtypes: float64(4), int64(2)
memory usage: 1.6 MB
train_x.describe()
            user_id       item_id        type_1        type_2        type_3        type_4
count  3.018300e+04  3.018300e+04  30183.000000  30183.000000  30183.000000  30183.000000
mean   7.186918e+07  2.032869e+08      2.181890      0.036776      0.058278      0.026803
std    4.595509e+07  1.172341e+08      1.352044      0.189442      0.241651      0.174335
min    5.943600e+04  1.540200e+04      0.000000      0.000000      0.000000      0.000000
25%    3.000949e+07  1.014034e+08      1.000000      0.000000      0.000000      0.000000
50%    5.858117e+07  2.036895e+08      2.000000      0.000000      0.000000      0.000000
75%    1.178801e+08  3.056362e+08      3.000000      0.000000      0.000000      0.000000
max    1.424396e+08  4.045617e+08     17.000000      2.000000      4.000000      5.000000
# training labels come from the purchases (type_4) on Dec 17
train_y = dataDay_load.ix['2014-12-17',['user_id','item_id','type_4']]
train_y.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
Data columns (total 3 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_4 29749 non-null float64
dtypes: float64(1), int64(2)
memory usage: 929.7 KB
train_y.describe()
            user_id       item_id        type_4
count  2.974900e+04  2.974900e+04  29749.000000
mean   6.997416e+07  2.016876e+08      0.024202
std    4.685978e+07  1.170012e+08      0.165070
min    5.943600e+04  6.619000e+03      0.000000
25%    2.783149e+07  9.903570e+07      0.000000
50%    5.562218e+07  2.005868e+08      0.000000
75%    1.176616e+08  3.039699e+08      0.000000
max    1.424157e+08  4.045616e+08      4.000000
# left join: pairs with no purchase record on Dec 17 get type_4_y = 0
dataSet = pd.merge(train_x,train_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)
dataSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30183 entries, 0 to 30182
Data columns (total 7 columns):
user_id 30183 non-null int64
item_id 30183 non-null int64
type_1 30183 non-null float64
type_2 30183 non-null float64
type_3 30183 non-null float64
type_4_x 30183 non-null float64
type_4_y 30183 non-null float64
dtypes: float64(5), int64(2)
memory usage: 1.8 MB
dataSet.describe()
            user_id       item_id        type_1        type_2        type_3      type_4_x      type_4_y
count  3.018300e+04  3.018300e+04  30183.000000  30183.000000  30183.000000  30183.000000  30183.000000
mean   7.186918e+07  2.032869e+08      2.181890      0.036776      0.058278      0.026803      0.004705
std    4.595509e+07  1.172341e+08      1.352044      0.189442      0.241651      0.174335      0.075343
min    5.943600e+04  1.540200e+04      0.000000      0.000000      0.000000      0.000000      0.000000
25%    3.000949e+07  1.014034e+08      1.000000      0.000000      0.000000      0.000000      0.000000
50%    5.858117e+07  2.036895e+08      2.000000      0.000000      0.000000      0.000000      0.000000
75%    1.178801e+08  3.056362e+08      3.000000      0.000000      0.000000      0.000000      0.000000
max    1.424396e+08  4.045617e+08     17.000000      2.000000      4.000000      5.000000      3.000000
np.sign(dataSet.type_4_y.values).sum()
129.0
np.sign(0.0)
0.0
Only 129 of the 30,183 pairs are positive (about 0.4%), which is the class imbalance that step7 has to contend with.
# binary label: 1.0 if the pair was purchased on the label day, else 0.0
dataSet['labels'] = dataSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
dataSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30183 entries, 0 to 30182
Data columns (total 8 columns):
user_id 30183 non-null int64
item_id 30183 non-null int64
type_1 30183 non-null float64
type_2 30183 non-null float64
type_3 30183 non-null float64
type_4_x 30183 non-null float64
type_4_y 30183 non-null float64
labels 30183 non-null float64
dtypes: float64(6), int64(2)
memory usage: 2.1 MB
dataSet.head()
   user_id    item_id  type_1  type_2  type_3  type_4_x  type_4_y  labels
0    59436  184081436     4.0     0.0     0.0       0.0       0.0     0.0
1    61797   83261906     3.0     0.0     0.0       0.0       0.0     0.0
2   134211    6491625     2.0     0.0     0.0       0.0       0.0     0.0
3   134211   79679783     2.0     0.0     0.0       0.0       0.0     0.0
4   134211   96616269     2.0     0.0     0.0       0.0       0.0     0.0
np.sign(dataSet.type_3.values).sum()
1713.0
trainSet = dataSet.copy()
trainSet.to_csv('trainSet.csv')
# test features: the counts on Dec 17 (labels will come from Dec 18)
test_x = dataDay_load.ix['2014-12-17',:]
test_x.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
Data columns (total 6 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_1 29749 non-null float64
type_2 29749 non-null float64
type_3 29749 non-null float64
type_4 29749 non-null float64
dtypes: float64(4), int64(2)
memory usage: 1.6 MB
test_x.head()
            user_id    item_id  type_1  type_2  type_3  type_4
time_day
2014-12-17    59436  238861461     2.0     0.0     0.0     0.0
2014-12-17    60723  202829025     2.0     0.0     0.0     0.0
2014-12-17    60723  371933634     2.0     0.0     0.0     0.0
2014-12-17   106362   38830684     1.0     0.0     0.0     0.0
2014-12-17   106362  149517272     2.0     0.0     0.0     0.0
test_y = dataDay_load.ix['2014-12-18',['user_id','item_id','type_4']]
testSet = pd.merge(test_x,test_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)
testSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29749 entries, 0 to 29748
Data columns (total 7 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_1 29749 non-null float64
type_2 29749 non-null float64
type_3 29749 non-null float64
type_4_x 29749 non-null float64
type_4_y 29749 non-null float64
dtypes: float64(5), int64(2)
memory usage: 1.8 MB
testSet.describe()
            user_id       item_id        type_1        type_2        type_3      type_4_x      type_4_y
count  2.974900e+04  2.974900e+04  29749.000000  29749.000000  29749.000000  29749.000000  29749.000000
mean   6.997416e+07  2.016876e+08      2.168241      0.038153      0.059296      0.024202      0.004336
std    4.685978e+07  1.170012e+08      1.334966      0.192093      0.242364      0.165070      0.069681
min    5.943600e+04  6.619000e+03      0.000000      0.000000      0.000000      0.000000      0.000000
25%    2.783149e+07  9.903570e+07      1.000000      0.000000      0.000000      0.000000      0.000000
50%    5.562218e+07  2.005868e+08      2.000000      0.000000      0.000000      0.000000      0.000000
75%    1.176616e+08  3.039699e+08      3.000000      0.000000      0.000000      0.000000      0.000000
max    1.424157e+08  4.045616e+08     17.000000      2.000000      3.000000      4.000000      3.000000
testSet['labels'] = testSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
testSet.describe()
            user_id       item_id        type_1        type_2        type_3      type_4_x      type_4_y        labels
count  2.974900e+04  2.974900e+04  29749.000000  29749.000000  29749.000000  29749.000000  29749.000000  29749.000000
mean   6.997416e+07  2.016876e+08      2.168241      0.038153      0.059296      0.024202      0.004336      0.004101
std    4.685978e+07  1.170012e+08      1.334966      0.192093      0.242364      0.165070      0.069681      0.063909
min    5.943600e+04  6.619000e+03      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
25%    2.783149e+07  9.903570e+07      1.000000      0.000000      0.000000      0.000000      0.000000      0.000000
50%    5.562218e+07  2.005868e+08      2.000000      0.000000      0.000000      0.000000      0.000000      0.000000
75%    1.176616e+08  3.039699e+08      3.000000      0.000000      0.000000      0.000000      0.000000      0.000000
max    1.424157e+08  4.045616e+08     17.000000      2.000000      3.000000      4.000000      3.000000      1.000000
testSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29749 entries, 0 to 29748
Data columns (total 8 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_1 29749 non-null float64
type_2 29749 non-null float64
type_3 29749 non-null float64
type_4_x 29749 non-null float64
type_4_y 29749 non-null float64
labels 29749 non-null float64
dtypes: float64(6), int64(2)
memory usage: 2.0 MB
testSet['labels'].values.sum()
122.0
testSet.to_csv('testSet.csv')
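Since the train and test blocks above repeat the same day-pair pattern, it can be folded into a small helper; a sketch (make_day_dataset is illustrative, not part of the original session, and reproduces the trainSet/testSet built above):
def make_day_dataset(df, feature_day, label_day):
    # features: the four behavior counts on feature_day;
    # label: whether the pair was purchased (type_4 > 0) on label_day
    x = df.ix[feature_day, :]
    y = df.ix[label_day, ['user_id','item_id','type_4']]
    d = pd.merge(x, y, on = ['user_id','item_id'], suffixes = ('_x','_y'), how = 'left').fillna(0.0)
    d['labels'] = (d['type_4_y'] > 0.0).astype(float)
    return d

trainSet = make_day_dataset(dataDay_load, '2014-12-16', '2014-12-17')
testSet  = make_day_dataset(dataDay_load, '2014-12-17', '2014-12-18')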
step7: Train models
Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
model.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99565980850147429
train_y_est =model.predict(trainSet.ix[:,2:6])
train_y_est.sum()
2.0
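An accuracy of 0.9957 sounds impressive but is no better than the trivial baseline: with only 129 positives among 30,183 rows, always predicting 0 scores essentially the same (in fact the plain model scores slightly below it while flagging just 2 positives; the all-zero baseline is exactly the score the GBDT and SVM report further below):
# accuracy of the all-zero baseline
1 - trainSet['labels'].mean()
0.99572607096710064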
Weighted logistic regression (cost-sensitive handling of the class imbalance)
lrW = LogisticRegression(class_weight = 'auto')   # renamed to 'balanced' in newer sklearn
lrW.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])
trainLRW_y = lrW.predict(trainSet.ix[:,2:6])
trainLRW_y.sum()
4792.0
lrW.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.84292482523274692
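class_weight is not the only remedy; another common one is to rebalance the training sample itself, e.g. by randomly undersampling the majority class. A sketch (lrBal and the sampling are illustrative, not part of the original session):
# keep all positives, draw an equally sized random set of negatives
pos = trainSet[trainSet['labels'] == 1.0]
neg = trainSet[trainSet['labels'] == 0.0].sample(n = len(pos), random_state = 0)
balanced = pd.concat([pos, neg])
lrBal = LogisticRegression()
lrBal.fit(balanced.ix[:,2:6], balanced['labels'])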
Compute training precision and recall
from sklearn.cross_validation import train_test_split,cross_val_score   # sklearn.model_selection in newer versions
precisions = cross_val_score(lrW,trainSet.ix[:,2:6],trainSet.ix[:,-1],\
cv = 5,scoring = 'precision')
print "precision:\n",np.mean(precisions)
precision: 0.0217883289288
recalls = cross_val_score(lrW,trainSet.ix[:,2:6],trainSet.ix[:,-1],\
cv = 5,scoring = 'recall')
print "recall:\n",np.mean(recalls)
recall: 0.651692307692
f1 = cross_val_score(lrW,trainSet.ix[:,2:6],trainSet.ix[:,-1],\
cv = 5,scoring = 'f1')
print 'f1 score:\n',np.mean(f1)
f1 score:
0.0421179159024
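Precision and recall compress the two error types into single numbers; to see the raw counts behind them, a confusion matrix over the training predictions helps (a sketch):
from sklearn.metrics import confusion_matrix
# rows are the true classes (0/1), columns the predicted classes
confusion_matrix(trainSet.ix[:,-1], trainLRW_y)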
Compute the test F1 score
testLRW_y = lrW.predict(test_x.ix[:,2:6])
precision_test = cross_val_score(lrW,testSet.ix[:,2:6],testSet.ix[:,-1],cv = 5,scoring = 'precision')
recall_test = cross_val_score(lrW,testSet.ix[:,2:6],testSet.ix[:,-1],cv = 5,scoring = 'recall')
f1_test = cross_val_score(lrW,testSet.ix[:,2:6],testSet.ix[:,-1],cv = 5,scoring = 'f1')
print 'f1 score:\n',np.mean(f1_test)
f1 score:
0.0447302553442
step8: Predict the user-item pairs for December 19 (from the Dec 18 counts)
predict_x = dataDay_load.ix['2014-12-18',:]
predict_x.to_csv('predict_x.csv')
predict_x.info()
predict_x.describe()
            user_id       item_id        type_1        type_2        type_3        type_4
count  2.894900e+04  2.894900e+04  28949.000000  28949.000000  28949.000000  28949.000000
mean   7.057660e+07  2.041997e+08      2.172476      0.033784      0.065080      0.026978
std    4.636984e+07  1.167057e+08      1.317706      0.181628      0.254529      0.174941
min    1.342110e+05  2.934200e+04      0.000000      0.000000      0.000000      0.000000
25%    2.902911e+07  1.037920e+08      1.000000      0.000000      0.000000      0.000000
50%    5.540191e+07  2.049606e+08      2.000000      0.000000      0.000000      0.000000
75%    1.178487e+08  3.065798e+08      3.000000      0.000000      0.000000      0.000000
max    1.424116e+08  4.045373e+08     16.000000      2.000000      3.000000      4.000000
predict_y = lrW.predict(predict_x.ix[:,2:])
predict_y.sum()
4636.0
user_item_19 = predict_x.ix[predict_y > 0.0,['user_id','item_id']]
user_item_19.all()
user_id True
item_id True
dtype: bool
user_item_19.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4636 entries, 2014-12-18 to 2014-12-18
Data columns (total 2 columns):
user_id 4636 non-null int64
item_id 4636 non-null int64
dtypes: int64(2)
memory usage: 108.7 KB
user_item_19.duplicated().sum()
0
user_item_19.to_csv('E:/python/gbdt/predict/tianchi_mobile_recommendation_predict.csv',index = False,encoding = 'utf-8')
Applying other sklearn models
GBDT
from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier(random_state = 10)
gbdt.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])
trainGBDT_y = gbdt.predict(trainSet.ix[:,2:6])
trainGBDT_y.sum()
0.0
gbdt.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99572607096710064
Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])
trainRF_y = rf.predict(trainSet.ix[:,2:6])
rf.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99575920219991387
trainRF_y.sum()
1.0
trainRF_y
array([ 0., 0., 0., ..., 0., 0., 0.])
SVM
from sklearn import svm
svc = svm.SVC()
svc.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])
trainSVC_y = svc.predict(trainSet.ix[:,2:6])
svc.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99572607096710064
trainSVC_y.sum()
0.0
Clearly the class-imbalance problem hits these models hard: GBDT, random forest, and SVM all collapse to predicting essentially no purchases. The cost-sensitive fix from step7 carries over, as sketched below.
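RandomForestClassifier accepts class_weight directly, and GradientBoostingClassifier, which has no class_weight parameter, can be handed per-row sample_weight in fit. A sketch (rfW, gbdtW, and the weight 233, roughly the negative-to-positive ratio, are illustrative):
rfW = RandomForestClassifier(class_weight = 'balanced')   # 'auto' on the old sklearn this session ran
rfW.fit(trainSet.ix[:,2:6], trainSet.ix[:,-1])

# GBDT: upweight the rare positives by hand via sample_weight
w = np.where(trainSet['labels'].values > 0.0, 233.0, 1.0)
gbdtW = GradientBoostingClassifier(random_state = 10)
gbdtW.fit(trainSet.ix[:,2:6], trainSet.ix[:,-1], sample_weight = w)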
On the class-imbalance problem and how to address it
An introduction to and comparison of classification algorithms for imbalanced data
Ensemble learning and the class-imbalance problem
On hyperparameter tuning
Notes on tuning scikit-learn gradient boosted trees (GBDT)
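As a concrete starting point for the tuning the linked post discusses, a hedged GridSearchCV sketch over a few GBDT knobs (the grid values are illustrative, not recommendations):
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [2, 3, 4],
              'learning_rate': [0.05, 0.1]}
gs = GridSearchCV(GradientBoostingClassifier(random_state = 10),
                  param_grid, scoring = 'f1', cv = 5)
gs.fit(trainSet.ix[:,2:6], trainSet.ix[:,-1])
print 'best f1:', gs.best_score_
print 'best params:', gs.best_params_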