A Complete Walkthrough of the Tianchi Mobile Recommendation Offline Competition with Python, pandas, and scikit-learn

Copyright notice: this is the author's original work; reproduction without permission is prohibited. https://blog.csdn.net/LY_ysys629/article/details/74012491

Competition overview

1) Tools: Python, pandas, scikit-learn

2) Approach: use the previous day's interaction statistics for each user–item pair to predict whether that pair produces a purchase the next day.

Note that the feature time window is one day, and each classification instance is a user–item pair (user_id, item_id). This post focuses on the overall competition workflow; it does not cover feature construction, feature processing, or feature selection in detail.

From this post you will get the full pipeline: downloading the data from the official site, cleaning it, extracting daily counts of the four interaction types for every user–item pair, assembling the training and test sets, and training, testing, and predicting with a model. It also shows how scikit-learn can handle the class-imbalance problem, so you get a genuinely hands-on feel for the Tianchi mobile recommendation competition.

step1: Inspect and process the user table

import pandas as pd
import numpy as np

%time userAll = pd.read_csv('E:/python/gbdt/fresh_comp_offline/tianchi_fresh_comp_train_user.csv',\
                      usecols = ['user_id','item_id','behavior_type','time'])
    Wall time: 14.2 s
userAll.head()  # view the first five rows of the raw data
user_id item_id behavior_type time
0 10001082 285259775 1 2014-12-08 18
1 10001082 4368907 1 2014-12-12 12
2 10001082 4368907 1 2014-12-12 12
3 10001082 53616768 1 2014-12-02 15
4 10001082 151466952 1 2014-12-12 11
userAll.info()  # summary info for the table
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 23291027 entries, 0 to 23291026
    Data columns (total 4 columns):
    user_id          int64
    item_id          int64
    behavior_type    int64
    time             object
    dtypes: int64(3), object(1)
    memory usage: 710.8+ MB
userAll.duplicated().sum()  # count duplicate rows
    11505107

(About half the rows are exact duplicates: time is truncated to the hour, so a user repeating the same action on the same item within one hour produces identical rows. They are kept here, since the repeat counts carry signal.)

step2: Download, inspect, and process the item-subset table

%time itemSub = pd.read_csv('tianchi_fresh_comp_train_item.csv',usecols = ['item_id'])
    Wall time: 428 ms
itemSub.item_id.is_unique  # check whether item_id in the subset has duplicates
    False
itemSub.item_id.value_counts().head()  # how many times each item_id appears
    25013404     8724
    311093202    5999
    228198932    5597
    238357777    5522
    313822206    4517
    Name: item_id, dtype: int64
itemSub.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 620918 entries, 0 to 620917
    Data columns (total 1 columns):
    item_id    620918 non-null int64
    dtypes: int64(1)
    memory usage: 4.7 MB
itemSub.duplicated().sum()  # count duplicate rows
    198060
itemSet = itemSub[['item_id']].drop_duplicates()  # drop duplicate rows
itemSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 422858 entries, 0 to 620917
    Data columns (total 1 columns):
    item_id    422858 non-null int64
    dtypes: int64(1)
    memory usage: 6.5 MB

step3: Intersect the user table with the item subset

Since the user–item predictions (which users buy which items) are scored on the item subset, we can restrict attention to users' interactions with that subset when predicting.

Of course, one could also use the full user table and analyze users' interactions across item categories to make the predictions.

%time userSub = pd.merge(userAll,itemSet,on = 'item_id',how = 'inner')
    Wall time: 4.4 s
userSub.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2084859 entries, 0 to 2084858
    Data columns (total 4 columns):
    user_id          int64
    item_id          int64
    behavior_type    int64
    time             object
    dtypes: int64(3), object(1)
    memory usage: 79.5+ MB
userSub.head()
user_id item_id behavior_type time
0 10001082 275221686 1 2014-12-03 01
1 10001082 275221686 1 2014-12-13 14
2 10001082 275221686 1 2014-12-08 07
3 10001082 275221686 1 2014-12-08 07
4 10001082 275221686 1 2014-12-08 00

Save this dataset to a CSV file.

%time userSub.to_csv('userSub.csv')
    Wall time: 4.53 s

step4: Process the time data

Re-read userSub. (Saving userSub and then reading it back is an indirect way of switching the index to time; besides, userSub is the main dataset for our predictions, so it needs to be saved anyway.)

%time userSub = pd.read_csv('userSub.csv',index_col = 'time',usecols = ['user_id','item_id','behavior_type','time'],parse_dates = True)
    Wall time: 14 s
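
Instead of the save-and-reload round trip, a direct alternative (a minimal sketch) is to parse the time column of the step-3 frame in place and set it as the index:

# sketch: equivalent to the save-and-reload trick above
userSub['time'] = pd.to_datetime(userSub['time'])  # parse the hour-granularity strings
userSub = userSub.set_index('time')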
%time userSub = userSub.sort_index().copy()
    Wall time: 66 ms
userSub.index
    DatetimeIndex(['2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   ...
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00'],
                  dtype='datetime64[ns]', name=u'time', length=2084859, freq=None)
userSub.head()
user_id item_id behavior_type
time
2014-11-18 129403050 52900329 1
2014-11-18 23246977 353606633 1
2014-11-18 140763800 369393023 1
2014-11-18 140763800 187769381 1
2014-11-18 32363170 134081514 1

step5: Feature processing

Feature processing has two parts:

1) one-hot (dummy) encoding of the user–item pairs' interaction types;

2) setting a time window and extracting statistics of the interactions within it (a sketch generalizing this to an N-day window appears at the end of this step).

pd.get_dummies(userSub['behavior_type'],prefix = 'type').head()
type_1 type_2 type_3 type_4
0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0
typeDummies = pd.get_dummies(userSub['behavior_type'],prefix = 'type')  # one-hot dummy encoding

userSubOneHot = pd.concat([userSub[['user_id','item_id','time']],typeDummies],axis = 1)
usertem = pd.concat([userSub[['user_id','item_id']],typeDummies,userSub[['time']]],axis = 1)  # append the dummy features to the table
usertem.head()
user_id item_id type_1 type_2 type_3 type_4 time
0 10001082 275221686 1.0 0.0 0.0 0.0 2014-12-03 01
1 10001082 275221686 1.0 0.0 0.0 0.0 2014-12-13 14
2 10001082 275221686 1.0 0.0 0.0 0.0 2014-12-08 07
3 10001082 275221686 1.0 0.0 0.0 0.0 2014-12-08 07
4 10001082 275221686 1.0 0.0 0.0 0.0 2014-12-08 00
usertem.groupby(['time','user_id','item_id'],as_index = False).sum().head()  # keys are sorted; count each user–item pair's interactions
time user_id item_id type_1 type_2 type_3 type_4
0 2014-11-18 00 1409053 58649567 2.0 0.0 0.0 0.0
1 2014-11-18 00 1446949 2432119 3.0 0.0 0.0 0.0
2 2014-11-18 00 1446949 206833072 2.0 0.0 0.0 0.0
3 2014-11-18 00 1446949 347745633 1.0 0.0 0.0 0.0
4 2014-11-18 00 2903578 395200199 2.0 0.0 0.0 0.0
userSubOneHot.head()
user_id item_id time type_1 type_2 type_3 type_4
0 10001082 275221686 2014-12-03 01 1.0 0.0 0.0 0.0
1 10001082 275221686 2014-12-13 14 1.0 0.0 0.0 0.0
2 10001082 275221686 2014-12-08 07 1.0 0.0 0.0 0.0
3 10001082 275221686 2014-12-08 07 1.0 0.0 0.0 0.0
4 10001082 275221686 2014-12-08 00 1.0 0.0 0.0 0.0
userSubOneHot.info()
userSubOneHotGroup = userSubOneHot.groupby(['time','user_id','item_id'],as_index = False).sum()  # alternatively, call .reset_index() after sum()
userSubOneHotGroup.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 968243 entries, 0 to 968242
    Data columns (total 7 columns):
    time       968243 non-null object
    user_id    968243 non-null int64
    item_id    968243 non-null int64
    type_1     968243 non-null float64
    type_2     968243 non-null float64
    type_3     968243 non-null float64
    type_4     968243 non-null float64
    dtypes: float64(4), int64(2), object(1)
    memory usage: 59.1+ MB
userSubOneHotGroup.head()
time user_id item_id type_1 type_2 type_3 type_4
0 2014-11-18 00 1409053 58649567 2.0 0.0 0.0 0.0
1 2014-11-18 00 1446949 2432119 3.0 0.0 0.0 0.0
2 2014-11-18 00 1446949 206833072 2.0 0.0 0.0 0.0
3 2014-11-18 00 1446949 347745633 1.0 0.0 0.0 0.0
4 2014-11-18 00 2903578 395200199 2.0 0.0 0.0 0.0

Split into day and hour


#time_day_Series = userSubOneHotGroup.time.map(lambda x:x.split(' ')[0])

#time_hour_Series = userSubOneHotGroup.time.map(lambda x:x.split(' ')[1])

userSubOneHotGroup['time_day'] = pd.to_datetime(userSubOneHotGroup.time.values).date

userSubOneHotGroup['time_hour'] = pd.to_datetime(userSubOneHotGroup.time.values).time

userSubOneHotGroup.head()
time user_id item_id type_1 type_2 type_3 type_4 time_day time_hour
0 2014-11-18 00 1409053 58649567 2.0 0.0 0.0 0.0 2014-11-18 00:00:00
1 2014-11-18 00 1446949 2432119 3.0 0.0 0.0 0.0 2014-11-18 00:00:00
2 2014-11-18 00 1446949 206833072 2.0 0.0 0.0 0.0 2014-11-18 00:00:00
3 2014-11-18 00 1446949 347745633 1.0 0.0 0.0 0.0 2014-11-18 00:00:00
4 2014-11-18 00 2903578 395200199 2.0 0.0 0.0 0.0 2014-11-18 00:00:00
dataHour = userSubOneHotGroup.ix[:,0:7]  # the first seven columns (time .. type_4); use .iloc in newer pandas
dataHour.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 968243 entries, 0 to 968242
    Data columns (total 7 columns):
    time       968243 non-null object
    user_id    968243 non-null int64
    item_id    968243 non-null int64
    type_1     968243 non-null float64
    type_2     968243 non-null float64
    type_3     968243 non-null float64
    type_4     968243 non-null float64
    dtypes: float64(4), int64(2), object(1)
    memory usage: 59.1+ MB
# save

dataHour.to_csv('dataHour.csv')
dataHour.duplicated().sum()  # no duplicate rows
    0
dataDay = userSubOneHotGroup.groupby(['time_day','user_id','item_id'],as_index = False).sum()
dataDay.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 904397 entries, 0 to 904396
    Data columns (total 7 columns):
    time_day    904397 non-null object
    user_id     904397 non-null int64
    item_id     904397 non-null int64
    type_1      904397 non-null float64
    type_2      904397 non-null float64
    type_3      904397 non-null float64
    type_4      904397 non-null float64
    dtypes: float64(4), int64(2), object(1)
    memory usage: 55.2+ MB
dataDay.head()
time_day user_id item_id type_1 type_2 type_3 type_4
0 2014-11-18 492 76093985 1.0 0.0 0.0 0.0
1 2014-11-18 492 110036513 2.0 0.0 0.0 0.0
2 2014-11-18 492 176404510 1.0 0.0 0.0 0.0
3 2014-11-18 492 178412255 2.0 0.0 0.0 0.0
4 2014-11-18 492 335961429 1.0 0.0 0.0 0.0
# save
dataDay.to_csv('dataDay.csv')
dataDay.duplicated().sum()  # no duplicate rows
    0
dataDay.type_4.max()
    20.0
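
The rest of the post uses a one-day window. As a minimal sketch (the helper name window_features is mine, not from the original post), the same aggregation extends to an N-day window over dataDay:

def window_features(dataDay, end_day, n_days):
    # sum each pair's type_1..type_4 counts over the n_days ending at end_day
    end = pd.Timestamp(end_day)
    start = end - pd.Timedelta(days = n_days - 1)
    days = pd.to_datetime(dataDay['time_day'])
    mask = (days >= start) & (days <= end)
    cols = ['type_1','type_2','type_3','type_4']
    return dataDay.loc[mask].groupby(['user_id','item_id'],as_index = False)[cols].sum()

#e.g. counts over 2014-12-14 .. 2014-12-16 as richer features:
#feats_3d = window_features(dataDay, '2014-12-16', 3)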

step6: Build the training and test sets

This post works with the day-level table and classifies, for each user–item pair, whether a purchase occurs: the label is 1 if the pair produces a purchase on the target day and 0 otherwise. Features come from one day, labels from the next; the helper sketched below condenses the cells that follow.
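
A minimal sketch, assuming the dataDay_load frame read below (the helper name make_day_pair is mine):

def make_day_pair(dataDay_load, feat_day, label_day):
    # features from feat_day; label = whether the pair purchases on label_day
    X = dataDay_load.loc[feat_day]
    y = dataDay_load.loc[label_day, ['user_id','item_id','type_4']]
    ds = pd.merge(X, y, on = ['user_id','item_id'], suffixes = ('_x','_y'), how = 'left').fillna(0.0)
    ds['labels'] = (ds['type_4_y'] > 0).astype(float)
    return ds

#trainSet = make_day_pair(dataDay_load, '2014-12-16', '2014-12-17')
#testSet  = make_day_pair(dataDay_load, '2014-12-17', '2014-12-18')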

dataDay_load = pd.read_csv('dataDay.csv',usecols = ['time_day','user_id','item_id','type_1',\
                                                    'type_2','type_3','type_4'], index_col = 'time_day',parse_dates = True)
dataDay_load.head()
user_id item_id type_1 type_2 type_3 type_4
time_day
2014-11-18 492 76093985 1.0 0.0 0.0 0.0
2014-11-18 492 110036513 2.0 0.0 0.0 0.0
2014-11-18 492 176404510 1.0 0.0 0.0 0.0
2014-11-18 492 178412255 2.0 0.0 0.0 0.0
2014-11-18 492 335961429 1.0 0.0 0.0 0.0
dataDay_load.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 904397 entries, 2014-11-18 to 2014-12-18
    Data columns (total 6 columns):
    user_id    904397 non-null int64
    item_id    904397 non-null int64
    type_1     904397 non-null float64
    type_2     904397 non-null float64
    type_3     904397 non-null float64
    type_4     904397 non-null float64
    dtypes: float64(4), int64(2)
    memory usage: 48.3 MB
train_x = dataDay_load.ix['2014-12-16',:]  # features from Dec 16
train_x.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 30183 entries, 2014-12-16 to 2014-12-16
    Data columns (total 6 columns):
    user_id    30183 non-null int64
    item_id    30183 non-null int64
    type_1     30183 non-null float64
    type_2     30183 non-null float64
    type_3     30183 non-null float64
    type_4     30183 non-null float64
    dtypes: float64(4), int64(2)
    memory usage: 1.6 MB
train_x.describe()
user_id item_id type_1 type_2 type_3 type_4
count 3.018300e+04 3.018300e+04 30183.000000 30183.000000 30183.000000 30183.000000
mean 7.186918e+07 2.032869e+08 2.181890 0.036776 0.058278 0.026803
std 4.595509e+07 1.172341e+08 1.352044 0.189442 0.241651 0.174335
min 5.943600e+04 1.540200e+04 0.000000 0.000000 0.000000 0.000000
25% 3.000949e+07 1.014034e+08 1.000000 0.000000 0.000000 0.000000
50% 5.858117e+07 2.036895e+08 2.000000 0.000000 0.000000 0.000000
75% 1.178801e+08 3.056362e+08 3.000000 0.000000 0.000000 0.000000
max 1.424396e+08 4.045617e+08 17.000000 2.000000 4.000000 5.000000
train_y = dataDay_load.ix['2014-12-17',['user_id','item_id','type_4']]  # Dec 17 purchases provide the classification labels
train_y.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
    Data columns (total 3 columns):
    user_id    29749 non-null int64
    item_id    29749 non-null int64
    type_4     29749 non-null float64
    dtypes: float64(1), int64(2)
    memory usage: 929.7 KB
train_y.describe()
user_id item_id type_4
count 2.974900e+04 2.974900e+04 29749.000000
mean 6.997416e+07 2.016876e+08 0.024202
std 4.685978e+07 1.170012e+08 0.165070
min 5.943600e+04 6.619000e+03 0.000000
25% 2.783149e+07 9.903570e+07 0.000000
50% 5.562218e+07 2.005868e+08 0.000000
75% 1.176616e+08 3.039699e+08 0.000000
max 1.424157e+08 4.045616e+08 4.000000
dataSet = pd.merge(train_x,train_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)  # join features and labels into the training set
dataSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30183 entries, 0 to 30182
    Data columns (total 7 columns):
    user_id     30183 non-null int64
    item_id     30183 non-null int64
    type_1      30183 non-null float64
    type_2      30183 non-null float64
    type_3      30183 non-null float64
    type_4_x    30183 non-null float64
    type_4_y    30183 non-null float64
    dtypes: float64(5), int64(2)
    memory usage: 1.8 MB
dataSet.describe()
user_id item_id type_1 type_2 type_3 type_4_x type_4_y
count 3.018300e+04 3.018300e+04 30183.000000 30183.000000 30183.000000 30183.000000 30183.000000
mean 7.186918e+07 2.032869e+08 2.181890 0.036776 0.058278 0.026803 0.004705
std 4.595509e+07 1.172341e+08 1.352044 0.189442 0.241651 0.174335 0.075343
min 5.943600e+04 1.540200e+04 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000949e+07 1.014034e+08 1.000000 0.000000 0.000000 0.000000 0.000000
50% 5.858117e+07 2.036895e+08 2.000000 0.000000 0.000000 0.000000 0.000000
75% 1.178801e+08 3.056362e+08 3.000000 0.000000 0.000000 0.000000 0.000000
max 1.424396e+08 4.045617e+08 17.000000 2.000000 4.000000 5.000000 3.000000
np.sign(dataSet.type_4_y.values).sum()
    129.0
np.sign(0.0)
    0.0
dataSet['labels'] = dataSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
dataSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30183 entries, 0 to 30182
    Data columns (total 8 columns):
    user_id     30183 non-null int64
    item_id     30183 non-null int64
    type_1      30183 non-null float64
    type_2      30183 non-null float64
    type_3      30183 non-null float64
    type_4_x    30183 non-null float64
    type_4_y    30183 non-null float64
    labels      30183 non-null float64
    dtypes: float64(6), int64(2)
    memory usage: 2.1 MB
dataSet.head()
user_id item_id type_1 type_2 type_3 type_4_x type_4_y labels
0 59436 184081436 4.0 0.0 0.0 0.0 0.0 0.0
1 61797 83261906 3.0 0.0 0.0 0.0 0.0 0.0
2 134211 6491625 2.0 0.0 0.0 0.0 0.0 0.0
3 134211 79679783 2.0 0.0 0.0 0.0 0.0 0.0
4 134211 96616269 2.0 0.0 0.0 0.0 0.0 0.0
np.sign(dataSet.type_3.values).sum()  # user–item pairs with add-to-cart interactions
    1713.0
trainSet = dataSet.copy()  # rename and save the training set

trainSet.to_csv('trainSet.csv')
test_x = dataDay_load.ix['2014-12-17',:]  # Dec 17 features serve as the test inputs
test_x.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
    Data columns (total 6 columns):
    user_id    29749 non-null int64
    item_id    29749 non-null int64
    type_1     29749 non-null float64
    type_2     29749 non-null float64
    type_3     29749 non-null float64
    type_4     29749 non-null float64
    dtypes: float64(4), int64(2)
    memory usage: 1.6 MB
test_x.head()
user_id item_id type_1 type_2 type_3 type_4
time_day
2014-12-17 59436 238861461 2.0 0.0 0.0 0.0
2014-12-17 60723 202829025 2.0 0.0 0.0 0.0
2014-12-17 60723 371933634 2.0 0.0 0.0 0.0
2014-12-17 106362 38830684 1.0 0.0 0.0 0.0
2014-12-17 106362 149517272 2.0 0.0 0.0 0.0
test_y = dataDay_load.ix['2014-12-18',['user_id','item_id','type_4']]  # Dec 18 purchases provide the test labels
testSet = pd.merge(test_x,test_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)  # build the test set
testSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 29749 entries, 0 to 29748
    Data columns (total 7 columns):
    user_id     29749 non-null int64
    item_id     29749 non-null int64
    type_1      29749 non-null float64
    type_2      29749 non-null float64
    type_3      29749 non-null float64
    type_4_x    29749 non-null float64
    type_4_y    29749 non-null float64
    dtypes: float64(5), int64(2)
    memory usage: 1.8 MB
testSet.describe()
user_id item_id type_1 type_2 type_3 type_4_x type_4_y
count 2.974900e+04 2.974900e+04 29749.000000 29749.000000 29749.000000 29749.000000 29749.000000
mean 6.997416e+07 2.016876e+08 2.168241 0.038153 0.059296 0.024202 0.004336
std 4.685978e+07 1.170012e+08 1.334966 0.192093 0.242364 0.165070 0.069681
min 5.943600e+04 6.619000e+03 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.783149e+07 9.903570e+07 1.000000 0.000000 0.000000 0.000000 0.000000
50% 5.562218e+07 2.005868e+08 2.000000 0.000000 0.000000 0.000000 0.000000
75% 1.176616e+08 3.039699e+08 3.000000 0.000000 0.000000 0.000000 0.000000
max 1.424157e+08 4.045616e+08 17.000000 2.000000 3.000000 4.000000 3.000000
testSet['labels'] = testSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
testSet.describe()
user_id item_id type_1 type_2 type_3 type_4_x type_4_y labels
count 2.974900e+04 2.974900e+04 29749.000000 29749.000000 29749.000000 29749.000000 29749.000000 29749.000000
mean 6.997416e+07 2.016876e+08 2.168241 0.038153 0.059296 0.024202 0.004336 0.004101
std 4.685978e+07 1.170012e+08 1.334966 0.192093 0.242364 0.165070 0.069681 0.063909
min 5.943600e+04 6.619000e+03 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.783149e+07 9.903570e+07 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 5.562218e+07 2.005868e+08 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1.176616e+08 3.039699e+08 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 1.424157e+08 4.045616e+08 17.000000 2.000000 3.000000 4.000000 3.000000 1.000000
testSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 29749 entries, 0 to 29748
    Data columns (total 8 columns):
    user_id     29749 non-null int64
    item_id     29749 non-null int64
    type_1      29749 non-null float64
    type_2      29749 non-null float64
    type_3      29749 non-null float64
    type_4_x    29749 non-null float64
    type_4_y    29749 non-null float64
    labels      29749 non-null float64
    dtypes: float64(6), int64(2)
    memory usage: 2.0 MB
testSet['labels'].values.sum()  # 122 purchase examples
    122.0
testSet.to_csv('testSet.csv')

step7: Train models

Logistic regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
model.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
    0.99565980850147429
train_y_est =model.predict(trainSet.ix[:,2:6])
train_y_est.sum()
    2.0

Weighted logistic regression (cost-sensitive, to counter class imbalance)

lrW = LogisticRegression(class_weight ='auto')  # weight classes to counter imbalance ('auto' was renamed 'balanced' in later scikit-learn versions)
lrW.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])

trainLRW_y = lrW.predict(trainSet.ix[:,2:6])

trainLRW_y.sum()
    4792.0
lrW.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
    0.84292482523274692
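
For reference, 'auto'/'balanced' weighting makes each class's weight inversely proportional to its frequency, following the heuristic n_samples / (n_classes * class_count); a minimal sketch of the computation:

# sketch of the 'balanced' class-weight heuristic
classes, counts = np.unique(trainSet['labels'].values, return_counts = True)
weights = trainSet['labels'].size / (classes.size * counts.astype(float))
print(dict(zip(classes, weights)))  # a small weight for class 0, a large one for class 1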

Training precision and recall

from sklearn.cross_validation import train_test_split,cross_val_score  # sklearn.model_selection in newer versions
# precision

precisions = cross_val_score(lrW,trainSet.ix[:,2:6],trainSet.ix[:,-1],\
                             cv = 5,scoring = 'precision')

print "精确度:\n",np.mean(precisions)
精确度: 0.0217883289288
# recall

recalls = cross_val_score(lrW,trainSet.ix[:,2:6],trainSet.ix[:,-1],\
                             cv = 5,scoring = 'recall')

print "召回率:\n",np.mean(recalls)
召回率: 0.651692307692
# F1, the combined metric

f1 = cross_val_score(lrW,trainSet.ix[:,2:6],trainSet.ix[:,-1],\
                             cv = 5,scoring = 'f1')
print 'f1 score:\n',np.mean(f1)
    f1 score:
    0.0421179159024

Test F1 score

testLRW_y = lrW.predict(test_x.ix[:,2:6])
precision_test = cross_val_score(lrW,testSet.ix[:,2:6],testSet.ix[:,-1],cv = 5,scoring = 'precision')

recall_test = cross_val_score(lrW,testSet.ix[:,2:6],testSet.ix[:,-1],cv = 5,scoring = 'recall')

f1_test = cross_val_score(lrW,testSet.ix[:,2:6],testSet.ix[:,-1],cv = 5,scoring = 'f1')

print 'f1 score:\n',np.mean(f1_test)
    f1 score:
    0.0447302553442
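
Note that cross_val_score above re-fits lrW on folds of the test set rather than scoring the model trained on the Dec 16 data. A more direct check, as a minimal sketch, scores the trained model's own predictions on the held-out day with sklearn.metrics:

from sklearn.metrics import precision_score, recall_score, f1_score
y_true = testSet['labels'].values
y_pred = lrW.predict(testSet.ix[:,2:6])  # type_1 .. type_4_x, the same four features used in training
print('precision: %.4f  recall: %.4f  f1: %.4f' % (precision_score(y_true, y_pred),
                                                   recall_score(y_true, y_pred),
                                                   f1_score(y_true, y_pred)))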

step8: Predict user–item purchases for Dec 19

# build the input data

predict_x = dataDay_load.ix['2014-12-18',:]

predict_x.to_csv('predict_x.csv')
predict_x.info()
predict_x.describe()
user_id item_id type_1 type_2 type_3 type_4
count 2.894900e+04 2.894900e+04 28949.000000 28949.000000 28949.000000 28949.000000
mean 7.057660e+07 2.041997e+08 2.172476 0.033784 0.065080 0.026978
std 4.636984e+07 1.167057e+08 1.317706 0.181628 0.254529 0.174941
min 1.342110e+05 2.934200e+04 0.000000 0.000000 0.000000 0.000000
25% 2.902911e+07 1.037920e+08 1.000000 0.000000 0.000000 0.000000
50% 5.540191e+07 2.049606e+08 2.000000 0.000000 0.000000 0.000000
75% 1.178487e+08 3.065798e+08 3.000000 0.000000 0.000000 0.000000
max 1.424116e+08 4.045373e+08 16.000000 2.000000 3.000000 4.000000
# predict

predict_y = lrW.predict(predict_x.ix[:,2:])
predict_y.sum()  # 4636 user–item pairs predicted to purchase
    4636.0
user_item_19 = predict_x.ix[predict_y > 0.0,['user_id','item_id']]  # keep the pairs predicted to purchase (label 1) as the final submission
user_item_19.all()
    user_id    True
    item_id    True
    dtype: bool

user_item_19.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 4636 entries, 2014-12-18 to 2014-12-18
    Data columns (total 2 columns):
    user_id    4636 non-null int64
    item_id    4636 non-null int64
    dtypes: int64(2)
    memory usage: 108.7 KB
user_item_19.duplicated().sum()  # no duplicate rows
    0
# save

user_item_19.to_csv('E:/python/gbdt/predict/tianchi_mobile_recommendation_predict.csv',index = False,encoding = 'utf-8')

Other scikit-learn models

GBDT

from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier(random_state = 10)
gbdt.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])
trainGBDT_y = gbdt.predict(trainSet.ix[:,2:6])
trainGBDT_y.sum()
0.0
gbdt.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99572607096710064
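
GradientBoostingClassifier has no class_weight parameter, but fit() accepts per-sample weights, so the weighting trick used for logistic regression can be approximated. A minimal sketch (using the negative-to-positive ratio as the weight is my assumption, a common heuristic):

# up-weight the rare positive class by the negative/positive ratio
y = trainSet.ix[:,-1].values
pos_ratio = (y == 0).sum() / float((y == 1).sum())
w = np.where(y > 0, pos_ratio, 1.0)
gbdtW = GradientBoostingClassifier(random_state = 10)
gbdtW.fit(trainSet.ix[:,2:6], y, sample_weight = w)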

Random forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])

trainRF_y = rf.predict(trainSet.ix[:,2:6])
rf.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99575920219991387
trainRF_y.sum()
1.0
trainRF_y
array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

SVM

from sklearn import svm
svc = svm.SVC()
svc.fit(trainSet.ix[:,2:6],trainSet.ix[:,-1])

trainSVC_y = svc.predict(trainSet.ix[:,2:6])
svc.score(trainSet.ix[:,2:6],trainSet.ix[:,-1])
0.99572607096710064
trainSVC_y.sum()
0.0

Clearly, class imbalance has a large impact on these models: left unweighted, they predict almost no positives.

On the class-imbalance problem in classification and ways to address it:

Introduction and comparison of classification algorithms for imbalanced data

Ensemble learning and the class-imbalance problem

On hyperparameter tuning:

Notes on tuning scikit-learn's gradient boosted trees (GBDT)
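
As a concrete starting point for the tuning notes above, a minimal grid-search sketch over a few GBDT hyperparameters, scored by F1 (the grid values are illustrative assumptions, not from the post):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
param_grid = {'n_estimators':[50,100,200], 'max_depth':[2,3,4], 'learning_rate':[0.05,0.1]}
gs = GridSearchCV(GradientBoostingClassifier(random_state = 10), param_grid, scoring = 'f1', cv = 5)
gs.fit(trainSet.ix[:,2:6], trainSet.ix[:,-1])
print(gs.best_params_)
print(gs.best_score_)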
