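Tianchi Newcomer Practice Competition: An [Offline Round] Attempt (Part 1)

I won't reproduce the problem statement (https://tianchi.aliyun.com/getStart) here. Based on material found through Baidu searches, the problem can be reduced to: does a given U-I (user-item) pair have a purchase on the observation day? (a binary classification problem)

The rest of this post breaks the process into a few steps:

I. Preliminary Analysis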

Load the two data tables, tianchi_fresh_comp_train_item and tianchi_fresh_comp_train_user, into a database.

Corresponding table names: vipfin.tianchi_fresh_comp_train_item and vipfin.tianchi_fresh_comp_train_user
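As a minimal sketch of the loading step, the following reads CSV text into an in-memory SQLite database. It assumes the user table has the columns that the queries in this post use (user_id, item_id, behavior_type, item_category, time). SQLite, the table name train_user, the sample rows and the helper load_user_table are stand-ins chosen for illustration; the post itself works against a schema named vipfin.

```python
import csv
import io
import sqlite3

# Hypothetical sample in the shape of tianchi_fresh_comp_train_user;
# only the columns used by the queries in this post are included.
SAMPLE_CSV = """user_id,item_id,behavior_type,item_category,time
u1,i1,3,c1,2014-11-18 20
u1,i1,4,c1,2014-11-19 09
u2,i2,1,c2,2014-11-18 13
"""

def load_user_table(conn, csv_text, table="train_user"):
    """Create `table` and bulk-insert the CSV rows; returns the row count."""
    conn.execute(
        f"create table {table} ("
        "user_id text, item_id text, behavior_type integer, "
        "item_category text, time text)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["user_id"], r["item_id"], int(r["behavior_type"]),
         r["item_category"], r["time"])
        for r in reader
    ]
    conn.executemany(f"insert into {table} values (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_user_table(conn, SAMPLE_CSV)

# Same date filter style as the queries in this post: compare the date prefix.
cart_rows = conn.execute(
    "select count(1) from train_user "
    "where substr(time, 1, 10) = '2014-11-18' and behavior_type = 3"
).fetchone()[0]
```

Keeping the time column as text is what makes the substr(time, 1, 10) date filter work unchanged.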

 

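Look at how strongly a user's actions on one day (browsing, favoriting, adding to cart) affect purchases on the next day.

The reference blog https://blog.csdn.net/snoopy_yuan/article/details/72850601 submitted the U-I pairs that were added to the cart on the previous day but not purchased. Let's do a quick check of how viable that is.

First, the add-to-cart-without-purchase case, using 11.18 as the example: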

-- behavior_type 3 = add to cart, 4 = purchase
select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time, 1, 10) = '2014-11-18' and behavior_type = 3) a
left join
     (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time, 1, 10) >= '2014-11-18' and behavior_type = 4) b
on a.user_id = b.user_id
and a.item_id = b.item_id
where b.user_id is null;

Result: 14998

Rows added to the cart on 11.18 that were followed by a purchase on 11.19:

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time, 1, 10) = '2014-11-18' and behavior_type = 3) a
inner join
     (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time, 1, 10) = '2014-11-19' and behavior_type = 4) b
on a.user_id = b.user_id
and a.item_id = b.item_id;

Result: 614

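Of the rows added to the cart on one day, about 4% convert into a purchase on the next day (614 / (614 + 14998) ≈ 3.9%).

So the approach mentioned in the blog, directly submitting the items added to the cart on 12.18, can be expected to have a precision of no more than about 5%.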
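The 4% figure can be reproduced from the two query results. This is a small sketch, and it assumes the two counts together approximate all 11.18 cart rows; strictly speaking, the first query also leaves out pairs that were purchased on 11.18 itself or after 11.19. The helper name conversion_rate is illustrative.

```python
def conversion_rate(converted, not_converted):
    """Share of carted rows that were followed by a purchase."""
    total = converted + not_converted
    return converted / total if total else 0.0

# Counts reported by the two queries for 2014-11-18 / 2014-11-19.
rate = conversion_rate(converted=614, not_converted=14998)
print(f"{rate:.2%}")
```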

 

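If even add-to-cart on the previous day yields a precision of only about 5%, browsing and favoriting can be expected to convert at an even lower rate.

So relying purely on the previous day's actions to predict next-day purchases does not work.

Running further statistics with SQL alone will not give much higher precision either. Rows that were added to the cart one day and purchased the next account for less than 10% of all purchase rows on that second day. So even if predictions based on the previous day's cart data were 100% precise, they would cover less than 10% of the next day's purchases.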
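To put a number on that ceiling, here is a short sketch that assumes the result is scored with an F1-style measure over U-I pairs. The metric is an assumption on my part, since this post does not state it. The 10% recall is the upper bound mentioned above, and the 4% precision is the conversion rate measured earlier.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Best case for a cart-only rule: every submitted pair is right,
# but at most 10% of the day's purchases are covered.
best_case = f1_score(precision=1.0, recall=0.10)

# Submitting all carted pairs: about 4% precise, still at most 10% recall.
cart_rule = f1_score(precision=0.04, recall=0.10)
```

Under these assumptions the cart-only rule stays below roughly 0.18 even when it is perfectly precise, and at around 0.06 with the measured precision.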

All of this makes the case for machine learning even stronger.

 

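So the plan is to extract, from the daily behavior records in tianchi_fresh_comp_train_user, features that measure user behavior, purchase behavior and item attributes, and to use them as input to a machine learning model.

 

II. Data Preprocessing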

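A few ideas:

1. The influence of a user's actions on purchasing fades over time. Analysis suggests that actions from more than a week before the observation day have very little effect on whether a purchase happens that day, so only feature data from within one week of the observation (prediction) day is considered.

2. Purchase behavior is somewhat periodic. The training, validation and prediction datasets are chosen accordingly (excluding the Double 12 data):

Dataset          Input                           Output
Training data    11.22~11.27 U-I behavior data   11.28 U-I purchase records
Validation data  11.29~12.04 U-I behavior data   12.05 U-I purchase records
Prediction data  12.13~12.18 U-I behavior data   12.19 U-I purchase records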

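Train a model on the training data, tuning parameters so that the loss is minimized and accuracy is reasonably high.

Then feed in the validation data and compare the predictions against the actual 12.05 data to check how well the model generalizes. If the validation results look good, use the prediction data directly to produce the final prediction.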

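3. For this business scenario, combine the user and item data to build features along various dimensions.

4. The problem has been framed as a classification of whether a U-I pair has a purchase (label in {0, 1}), so every feature set has to be built at the U-I level. That raises the question of which U-I pairs to score at prediction time. A Cartesian product (all users * all items) would be far too large, so the first choice here is the set of U-I pairs that showed any activity in the period before the prediction day.

(This has its own problem: the input set is too small. It could be widened to U-I pairs formed with items of the same category as an item the user interacted with,

or, more strictly, with items of the same category that also have the most activity (the items most popular among all users). This is left for later exploration.)

Following https://blog.csdn.net/snoopy_yuan/article/details/75105724, a few simple features are extracted.

5. The scope of the dataset is not fixed. Depending on the prediction target and the distribution of the training data, the data may need to be filtered.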
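The windows in item 2 and the candidate set in item 4 can be sketched as follows. The function names and the sample records are hypothetical. The six-day window length is read off the table above (11.22~11.27 for a label day of 11.28) rather than stated in the text.

```python
from datetime import date, timedelta

def feature_window(label_day, days=6):
    """Return (first_day, last_day) of the `days`-day window that ends
    on the day before `label_day`."""
    last = label_day - timedelta(days=1)
    first = label_day - timedelta(days=days)
    return first, last

def candidate_pairs(records, first, last):
    """U-I pairs with at least one behavior inside [first, last]."""
    return {(user, item) for user, item, day in records if first <= day <= last}

LABEL_DAYS = {
    "train": date(2014, 11, 28),
    "validation": date(2014, 12, 5),
    "predict": date(2014, 12, 19),
}
splits = {name: feature_window(day) for name, day in LABEL_DAYS.items()}

# Hypothetical behavior records: (user_id, item_id, day).
records = [
    ("u1", "i1", date(2014, 11, 22)),
    ("u1", "i1", date(2014, 11, 27)),
    ("u2", "i3", date(2014, 11, 21)),  # one day before the training window
    ("u3", "i2", date(2014, 11, 25)),
]
train_pairs = candidate_pairs(records, *splits["train"])
```

Restricting candidates to pairs with activity in the window keeps the sample size manageable, at the cost of never predicting a purchase for a pair with no recent interaction.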

 

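Feature name             Group  Meaning                                                      Purpose                                         Count
u_b_count                U      Total behaviors by the user in the period before the observation day   User activity                         1
u_bi_count (i=1/2/3/4)   U      Count of each behavior type by the user in that period       User activity (per behavior type)               4
u_b4_rate                U      User purchase conversion rate                                User purchasing habits                          1
i_u_count                I      Behavior count on the item in the period                     Item popularity                                 1
i_b4_rate                I      Item click-to-purchase conversion rate                       How purchase decisions play out for the item    1
c_u_count                C      Behavior count on the category in the period                 Popularity of the item_category                 1
c_b4_rate                C      Category click-to-purchase conversion rate                   How purchase decisions play out for the item_category   1
ui_b_count               UI     Total behaviors of the user-item pair in the period          Activity of the U-I pair                        1
uc_b_count               UC     Total behaviors of the user-category pair in the period      Activity of the U-C pair                        1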

 

These features can be extracted in Python with pandas (the original blog appears to have computed them in Excel) or with SQL. I use SQL here because I am more familiar with it.
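For a quick cross-check of the SQL on a small sample, the same aggregates can be sketched in plain Python. This is an illustration only: the rows are made up, build_features is a hypothetical helper, only a subset of the features in the table is computed, and the rates are taken as purchases divided by all behaviors, which is how the SQL in this post computes them.

```python
from collections import Counter

# Hypothetical behavior rows inside one feature window:
# (user_id, item_id, item_category, behavior_type), where 4 means purchase.
ROWS = [
    ("u1", "i1", "c1", 1),
    ("u1", "i1", "c1", 3),
    ("u1", "i2", "c1", 4),
    ("u2", "i1", "c1", 1),
]

def build_features(rows):
    """Return {(user_id, item_id): feature dict} for every pair seen in rows."""
    u_all, u_buy = Counter(), Counter()
    i_all, i_buy = Counter(), Counter()
    c_all, ui_all, uc_all = Counter(), Counter(), Counter()
    category_of = {}
    for user, item, cat, btype in rows:
        category_of[item] = cat
        u_all[user] += 1
        i_all[item] += 1
        c_all[cat] += 1
        ui_all[(user, item)] += 1
        uc_all[(user, cat)] += 1
        if btype == 4:
            u_buy[user] += 1
            i_buy[item] += 1
    features = {}
    for (user, item), ui_count in ui_all.items():
        cat = category_of[item]
        features[(user, item)] = {
            "u_b_count": u_all[user],
            "u_b4_rate": u_buy[user] / u_all[user],   # missing keys count as 0
            "i_u_count": i_all[item],
            "i_b4_rate": i_buy[item] / i_all[item],
            "c_u_count": c_all[cat],
            "ui_b_count": ui_count,
            "uc_b_count": uc_all[(user, cat)],
        }
    return features

FEATURES = build_features(ROWS)
```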

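SQL statements

-- Labeled U-I samples for the training window (11.22~11.27):
-- flag = 1 if the pair was purchased on 11.28, otherwise 0.
create table temp_fin.temp_tianchi_train1 as 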
select a.user_id, a.item_id,a.item_category,1  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
inner join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
union all
select a.user_id, a.item_id,a.item_category,0  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
left join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
where b.user_id is null;

-- Deduplicate: one row per (user_id, item_id, item_category, flag).
create table temp_fin.temp_tianchi_train1_dist as
select distinct * from temp_fin.temp_tianchi_train1;

-- Feature extraction
create table temp_fin.temp_tianchi_train1_u_b_count as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b1_count  as 
select  distinct  a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=1
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b2_count  as 
select distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=2
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b3_count  as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=3
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_count  as 
select   distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_rate as 
select distinct a.user_id,  d.rate u_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.user_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by user_id
)  b 
left join 
(select   user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by user_id
)  c
 on b.user_id=c.user_id
 )  d 
 on a.user_id =d.user_id;

create table temp_fin.temp_tianchi_train1_i_u_count	 as 
select  distinct a.item_id,  b.l_count   i_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_id
)  b 
on a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_i_b4_rate as 
select  distinct a.item_id,  d.rate i_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.item_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_id
)  b 
left join 
(select   item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_id
)  c
 on b.item_id=c.item_id
 )  d 
 on a.item_id =d.item_id;

create table temp_fin.temp_tianchi_train1_c_u_count	 as 
select  distinct a.item_category,  b.l_count   c_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner  join
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_category
)  b 
on a.item_category=b.item_category;

-- c_b4_rate: category purchases / all category behaviors in the window.
-- The category totals (b) are left-joined to the purchase counts (c) so that
-- categories without any purchase get a rate of 0, matching u_b4_rate and i_b4_rate.
create table temp_fin.temp_tianchi_train1_c_b4_rate as 
select distinct a.item_category, d.rate c_b4_rate from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select b.item_category, cast(COALESCE(c.l_count,0) as double)/b.l_count rate from 
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_category
)  b 
left join 
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
 group by item_category
)  c
 on b.item_category=c.item_category
)  d 
 on a.item_category =d.item_category;
 
 create table temp_fin.temp_tianchi_train1_ui_b_count	 as 
select   distinct a.user_id, a.item_id,  b.l_count   ui_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_id
)  b 
on a.user_id=b.user_id
and a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_uc_b_count  as 
select distinct  a.user_id,a.item_category  ,b.l_count   uc_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_category
)  b 
on a.user_id=b.user_id
and a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_data as 
select a.user_id, a.item_id,a.item_category
,u_b_count_table.u_b_count
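-- A user with no behavior of a given type has no row in that per-type table,
-- so the left join yields NULL; treat it as a count of 0.
,coalesce(u_b1_count.u_b_count, 0) u_b1_count
,coalesce(u_b2_count.u_b_count, 0) u_b2_count
,coalesce(u_b3_count.u_b_count, 0) u_b3_count
,coalesce(u_b4_count.u_b_count, 0) u_b4_count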
,u_b4_rate.u_b4_rate
,i_u_count.i_u_count
,i_b4_rate.i_b4_rate
,c_u_count.c_u_count
,c_b4_rate.c_b4_rate
,ui_b_count.ui_b_count
,uc_b_count.uc_b_count
,a.flag
from temp_fin.temp_tianchi_train1_dist a 
left join temp_fin.temp_tianchi_train1_u_b_count u_b_count_table
on a.user_id =u_b_count_table.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b1_count  u_b1_count 
on a.user_id =u_b1_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b2_count  u_b2_count 
on a.user_id =u_b2_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b3_count  u_b3_count 
on a.user_id =u_b3_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_count  u_b4_count 
on a.user_id =u_b4_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_rate  u_b4_rate 
on a.user_id =u_b4_rate.user_Id 
left join  temp_fin.temp_tianchi_train1_i_u_count i_u_count
on a.item_id =i_u_count.item_id
left join  temp_fin.temp_tianchi_train1_i_b4_rate  i_b4_rate 
on a.item_id =i_b4_rate.item_id
left join  temp_fin.temp_tianchi_train1_c_u_count c_u_count
on a.item_category=c_u_count.item_category
left join  temp_fin.temp_tianchi_train1_c_b4_rate c_b4_rate
on a.item_category=c_b4_rate.item_category
left join  temp_fin.temp_tianchi_train1_ui_b_count ui_b_count
on a.user_id =ui_b_count.user_Id and a.item_id=ui_b_count.item_id
left join  temp_fin.temp_tianchi_train1_uc_b_count uc_b_count
on a.user_id =uc_b_count.user_Id and a.item_category=uc_b_count.item_category;

The other two datasets are computed the same way.
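One way to avoid copying the statements by hand for each dataset is to generate the SQL from the window dates. The sketch below is an assumption-laden illustration, not the method used in this post: it runs against an in-memory SQLite table named train_user with made-up rows, the helper label_sql is hypothetical, and the union all plus distinct from above is condensed into a single left join with a case expression, which yields the same set of labeled rows. The prediction window has no labels yet, so only its feature tables would be built.

```python
import sqlite3

# (first day, last day, label day) per dataset, taken from the split table.
WINDOWS = {
    "train1": ("2014-11-22", "2014-11-27", "2014-11-28"),
    "valid1": ("2014-11-29", "2014-12-04", "2014-12-05"),
}

def label_sql(first, last, label_day, source="train_user"):
    """Labeled U-I pairs: flag = 1 if the pair was bought on label_day, else 0.
    The dates are trusted constants, so plain string formatting is used here."""
    return f"""
    select distinct a.user_id, a.item_id, a.item_category,
           case when b.user_id is null then 0 else 1 end as flag
    from (select * from {source}
          where substr(time, 1, 10) between '{first}' and '{last}') a
    left join (select distinct user_id, item_id from {source}
               where substr(time, 1, 10) = '{label_day}'
                 and behavior_type = 4) b
      on a.user_id = b.user_id and a.item_id = b.item_id
    """

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table train_user "
    "(user_id, item_id, behavior_type, item_category, time)"
)
conn.executemany("insert into train_user values (?, ?, ?, ?, ?)", [
    ("u1", "i1", 3, "c1", "2014-11-27 21"),
    ("u1", "i1", 4, "c1", "2014-11-28 10"),
    ("u2", "i2", 1, "c2", "2014-11-23 08"),
    ("u3", "i3", 1, "c3", "2014-12-01 12"),
    ("u3", "i3", 4, "c3", "2014-12-05 09"),
])

samples = {
    name: sorted(conn.execute(label_sql(*window)).fetchall())
    for name, window in WINDOWS.items()
}
```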

 

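III. Feature Processing

The processed data is still split into three datasets, each with roughly these columns:

user_id, item_id, category, feature values (u_b_count ... uc_b_count), label (whether a purchase happened on the observation day)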

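With this data in hand, feature processing is done with the pyspark.ml.feature package. It provides ways to combine multiple feature columns into one multi-dimensional vector, such as VectorAssembler, as well as ways to scale feature values and handle zero values, such as MaxAbsScaler and MinMaxScaler.

The two steps of feature processing:

multiple feature columns => one multi-dimensional vector column => scaled vector values

(Something to think about: could the first step include a notion of feature weights? With this many feature dimensions, some matter more than others. For example, user activity matters more than item activity: only an active user is likely to buy, and a best-selling item is wasted on a user who hardly does anything.)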
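To make the two steps and the weighting idea concrete, here is a plain-Python sketch of what they compute. It is not the pyspark API: assemble, min_max_scale and apply_weights are hypothetical helpers, the rows and weights are made up, and mapping a constant column to 0.0 is a choice of this sketch.

```python
def assemble(rows, columns):
    """Pack the selected columns of each row dict into one feature vector."""
    return [[float(row[c]) for c in columns] for row in rows]

def min_max_scale(vectors):
    """Rescale every dimension to [0, 1]; constant dimensions map to 0.0."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

def apply_weights(vectors, weights):
    """Multiply each dimension by a hand-chosen weight."""
    return [[x * w for x, w in zip(vec, weights)] for vec in vectors]

# Hypothetical feature rows using two column names from the feature table.
rows = [
    {"u_b_count": 10, "i_u_count": 200},
    {"u_b_count": 30, "i_u_count": 100},
    {"u_b_count": 20, "i_u_count": 100},
]
vectors = assemble(rows, ["u_b_count", "i_u_count"])
scaled = min_max_scale(vectors)
# Illustrative weights only: user activity counted twice as much as item activity.
weighted = apply_weights(scaled, [2.0, 1.0])
```

Whether manual weights help depends on the model, since many models learn per-feature weights on their own. In pyspark.ml.feature, ElementwiseProduct looks like the closest fit for this kind of per-dimension multiplication, though I have not tried it here.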

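Note: with the sklearn API, the model input is an array, so all the feature values can simply be concatenated and processed directly; details omitted.

Code for this step to be added...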

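IV. Model Building

With the feature values turned into vectors the model can consume, pick different algorithms from pyspark.ml and run them. Tune the hyperparameters based on accuracy, and use the validation data to check how reliable the model is.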
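Until the real code is added, the train-then-validate loop can be sketched with a tiny logistic regression in plain Python. This is a stand-in for whichever pyspark.ml classifier ends up being used; the post does not name one. The data, learning rate, epoch count and 0.5 threshold are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(model, x):
    """Predicted purchase probability for one feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = score((w, b), xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, X, threshold=0.5):
    return [int(score(model, x) >= threshold) for x in X]

def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical scaled one-dimensional features and labels.
X_train = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_valid = [[0.05], [0.95], [0.15], [0.85]]
y_valid = [0, 1, 0, 1]

model = train_logistic(X_train, y_train)
valid_pred = predict(model, X_valid)
valid_f1 = f1(y_valid, valid_pred)
```

On the real data the positive class is rare, so a score such as F1 on the validation set says more than plain accuracy, and the threshold is itself worth tuning.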

Code for this step to be added...

Closing: reference blog https://blog.csdn.net/snoopy_yuan


Reposted from ronaldoly.iteye.com/blog/2416029