[Offline Competition] An Attempt at the Tianchi Newcomer Competition (1)

I won't reproduce the problem statement (https://tianchi.aliyun.com/getStart) here. After some searching on Baidu, the problem can be reduced to: will a given U-I (user-item) pair show purchase behavior on the observation day? (a binary classification problem)

The following steps break down the whole process:

1. Simple analysis

Load the two data tables, tianchi_fresh_comp_train_item and tianchi_fresh_comp_train_user, into the database.

Corresponding table names: vipfin.tianchi_fresh_comp_train_item and vipfin.tianchi_fresh_comp_train_user

 

First, check how a user's actions on the previous day (browse, favorite, add to the shopping cart) affect purchase behavior on the next day.

The blog post https://blog.csdn.net/snoopy_yuan/article/details/72850601 submitted, as its prediction, the records that were added to the shopping cart on the previous day but not purchased. Let's simply verify how feasible that is.

Let's first look at records that were added to the shopping cart but not purchased, taking 11.18 as an example:

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)='2014-11-18' and behavior_type=3) a
left join
     (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)>='2014-11-18' and behavior_type=4) b
on a.user_id=b.user_id
and a.item_id=b.item_id
where b.user_id is null

Result: 14998

Next, the records that were added to the shopping cart on 11.18 and purchased on 11.19:

select count(1)
from (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)='2014-11-18' and behavior_type=3) a
inner join
     (select * from vipfin.tianchi_fresh_comp_train_user
      where substr(time,1,10)='2014-11-19' and behavior_type=4) b
on a.user_id=b.user_id
and a.item_id=b.item_id

Result: 614

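The probability that a record added to the shopping cart on one day turns into a purchase on the next day is about 4%.

So the approach mentioned in the blog, directly submitting the add-to-cart data from 12.18, has a predictable precision: it will certainly not exceed 5%.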
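The 4% figure can be reproduced from the two query results. This is only a rough check: it assumes the two counts together approximate all add-to-cart records on 11.18, which the queries do not strictly guarantee (purchases on other days fall into neither count).

```python
# Counts reported by the two queries for 2014-11-18.
cart_not_bought = 14998   # added to cart on 11.18, no matching purchase row
cart_then_bought = 614    # added to cart on 11.18, purchased on 11.19

# Assumption: the two counts together approximate all cart additions on 11.18.
cart_total = cart_not_bought + cart_then_bought

conversion_rate = cart_then_bought / cart_total
print(f"next-day conversion: {conversion_rate:.1%}")  # prints "next-day conversion: 3.9%"
```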

 

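Even the add-to-cart action on the previous day only reaches a precision of about 5%, so browsing and favoriting can be expected to convert at an even lower rate.

So predicting the next day's purchases purely from the previous day's actions will not work.

Running further statistics shows that SQL alone cannot reach a high score. Records that were added to the shopping cart on the previous day and purchased on the next day make up less than 10% of all purchase records on that next day. So even if predictions based on the previous day's cart data were 100% precise, they would still cover less than 10% of the next day's purchases.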
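The coverage argument above can be made concrete with a small precision/recall calculation. This is a sketch on made-up toy pairs, not competition data; using F1 to summarize the two numbers is an assumption here, since the post does not name the scoring metric.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, f1) for two sets of (user_id, item_id) pairs."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy data (made up): 20 purchases on the target day, and a cart-based rule
# that predicts only 2 pairs, both of which are real purchases.
actual = {("u%d" % i, "i%d" % i) for i in range(20)}
predicted = {("u0", "i0"), ("u1", "i1")}

precision, recall, f1 = evaluate(predicted, actual)
```

Here every prediction is correct, yet only a tenth of the real purchases are found, so the combined score stays low.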

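All of this makes it even clearer that machine learning is needed.

So the plan is to extract, from the daily records in tianchi_fresh_comp_train_user, features that measure user behavior, purchase behavior, and item attributes, and use them as input to a machine learning model.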

 

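2. Data preprocessing

Some ideas:

1. The influence of a user's behavior on purchases weakens over time. According to the analysis, behavior from more than a week before the observation day has very little effect on whether the user buys on that day, so only feature data from within one week before the observation day (prediction day) is considered.

2. Purchase behavior is somewhat periodic. The training, validation, and prediction datasets are chosen as follows (excluding the Double 12 data):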

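Dataset | Input | Output
Training data | U-I behavior data, 11.22~11.27 | U-I purchase records, 11.28
Validation data | U-I behavior data, 11.29~12.04 | U-I purchase records, 12.05
Prediction data | U-I behavior data, 12.13~12.18 | U-I purchase records, 12.19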

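Train the model on the training data and tune its parameters so that the loss function is minimized and accuracy is reasonably high.

Then feed in the validation data and compare the predictions against the real 12.05 data to check how well the model generalizes. If the validation result looks good, use the prediction data directly to predict.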

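3. For the current business scenario, combine the user and item data to build features along various dimensions.

4. The problem has been framed as a classification problem: does a U-I pair show purchase behavior (label in {0,1})? All features therefore have to be built at the U-I level. As for which U-I pairs to predict on, a Cartesian product (all users * all items) would produce far too much data, so the first choice is the U-I pairs that showed some action within the period before the prediction day.

(This has its own problem: the input set is too small. It could be widened to U-I pairs whose item is in the same category as an item the user acted on, or, more strictly, to pairs formed with the most frequently operated item in that category (the item most popular across all users). To be explored later.)

Following https://blog.csdn.net/snoopy_yuan/article/details/75105724, a few feature dimensions are extracted.

5. The scope of the dataset is not fixed. Depending on the prediction target and the distribution of the training data, the data may need filtering or similar operations.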
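The window scheme from point 2 can be sketched as a small helper that derives the behavior window from the label day. The function name feature_window and the days parameter are hypothetical; the dates are the ones listed in point 2.

```python
from datetime import date, timedelta

def feature_window(label_day, days=6):
    """Return (first_day, last_day) of the behavior window that ends the day before label_day."""
    last_day = label_day - timedelta(days=1)
    first_day = label_day - timedelta(days=days)
    return first_day, last_day

# Label days taken from point 2.
label_days = {
    "train": date(2014, 11, 28),
    "validation": date(2014, 12, 5),
    "predict": date(2014, 12, 19),
}
windows = {name: feature_window(day) for name, day in label_days.items()}
```

With these label days, none of the three windows contains 12.12, which matches the decision to leave out the Double 12 data.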

 

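Feature name | Group | Meaning | Purpose | Count
u_b_count | U | Total number of the user's actions in the period before the observation day | User activity | 1
u_bi_count (i=1/2/3/4) | U | Count of each behavior type for the user in the period | User activity (per action type) | 4
u_b4_rate | U | The user's purchase conversion rate | The user's buying habits | 1
i_u_count | I | Number of actions on the item in the period | Item popularity | 1
i_b4_rate | I | The item's click-to-purchase conversion rate | Reflects the purchase decision pattern for the item | 1
c_u_count | C | Number of actions on the category in the period | Reflects the popularity of the item_category | 1
c_b4_rate | C | The category's click-to-purchase conversion rate | Reflects the purchase decision pattern for the item_category | 1
ui_b_count | UI | Total number of actions for the user-item pair in the period | Reflects how active the U-I pair is | 1
uc_b_count | UC | Total number of actions for the user-category pair in the period | Reflects how active the U-C pair is | 1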

 

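The features above can be extracted in Python with pandas (the original blog seems to have done the statistics in Excel) or with SQL. I use the latter here because I am more comfortable with SQL.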
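For readers who prefer the Python route, here is a minimal sketch of a few of these counts using only the standard library (deliberately not pandas, and not the author's code). The sample rows are made up; the only behavior code assumed is 4 for purchase, as used in the SQL in this post.

```python
from collections import Counter

# Hypothetical sample rows: (user_id, item_id, item_category, behavior_type),
# already restricted to the feature window. behavior_type 4 means purchase.
rows = [
    ("u1", "i1", "c1", 1),
    ("u1", "i1", "c1", 3),
    ("u1", "i2", "c1", 4),
    ("u2", "i1", "c1", 1),
]

u_b_count = Counter(user for user, _, _, _ in rows)              # actions per user
u_bi_count = Counter((user, b) for user, _, _, b in rows)        # actions per user and type
ui_b_count = Counter((user, item) for user, item, _, _ in rows)  # actions per U-I pair
uc_b_count = Counter((user, cat) for user, _, cat, _ in rows)    # actions per U-C pair

# Purchase conversion rate: purchases divided by all actions in the window.
# A Counter returns 0 for a missing key, so users without purchases get 0.0.
u_b4_rate = {user: u_bi_count[(user, 4)] / total for user, total in u_b_count.items()}
```

The item-level and category-level features follow the same pattern with a different grouping key.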

SQL statements

create table temp_fin.temp_tianchi_train1 as 
select a.user_id, a.item_id,a.item_category,1  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
inner join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
union all
select a.user_id, a.item_id,a.item_category,0  as  flag
from 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
) a 
left join 
(select *
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) ='2014-11-28' and  behavior_type =4 ) b
on a.user_id=b.user_id
and a.item_id =b.item_id 
where b.user_id is null;

create table temp_fin.temp_tianchi_train1_dist as
select   distinct  * from  temp_fin.temp_tianchi_train1;
-- Feature extraction
create table temp_fin.temp_tianchi_train1_u_b_count as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27' 
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b1_count  as 
select  distinct  a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=1
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b2_count  as 
select distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=2
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b3_count  as 
select  distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=3
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_count  as 
select   distinct a.user_id,b.l_count u_b_count from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
 group by user_id
)  b 
on a.user_id=b.user_id;

create table temp_fin.temp_tianchi_train1_u_b4_rate as 
select distinct a.user_id,  d.rate u_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.user_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by user_id
)  b 
left join 
(select   user_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by user_id
)  c
 on b.user_id=c.user_id
 )  d 
 on a.user_id =d.user_id;

create table temp_fin.temp_tianchi_train1_i_u_count	 as 
select  distinct a.item_id,  b.l_count   i_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_id
)  b 
on a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_i_b4_rate as 
select  distinct a.item_id,  d.rate i_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
inner join 
(select   b.item_id , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_id
)  b 
left join 
(select   item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_id
)  c
 on b.item_id=c.item_id
 )  d 
 on a.item_id =d.item_id;

create table temp_fin.temp_tianchi_train1_c_u_count	 as 
select  distinct a.item_category,  b.l_count   c_u_count  from
temp_fin.temp_tianchi_train1_dist a 
inner  join
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by item_category
)  b 
on a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_c_b4_rate as 
select    distinct a.item_category,  d.rate c_b4_rate  from
temp_fin.temp_tianchi_train1_dist a 
left join 
(select   b.item_category , cast(COALESCE(c.l_count,0)  as double)/b.l_count  rate      from 
(select item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type in (1,2,3,4)
 group by item_category
)  b 
-- left join, as in u_b4_rate and i_b4_rate, so categories without purchases get a rate of 0
left join 
(select   item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  and behavior_type=4
  group by item_category
)  c
 on b.item_category=c.item_category
 )  d 
 on a.item_category =d.item_category;
 
create table temp_fin.temp_tianchi_train1_ui_b_count as 
select   distinct a.user_id, a.item_id,  b.l_count   ui_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_id,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_id
)  b 
on a.user_id=b.user_id
and a.item_id=b.item_id;

create table temp_fin.temp_tianchi_train1_uc_b_count  as 
select distinct  a.user_id,a.item_category  ,b.l_count   uc_b_count  from
temp_fin.temp_tianchi_train1_dist a 
inner join
(select user_id,item_category,count(1) l_count
from vipfin.tianchi_fresh_comp_train_user where substr( time,1,10) >='2014-11-22'  and substr( time,1,10) <='2014-11-27'  
 group by user_id,item_category
)  b 
on a.user_id=b.user_id
and a.item_category=b.item_category;

create table temp_fin.temp_tianchi_train1_data as 
select a.user_id, a.item_id,a.item_category
,u_b_count_table.u_b_count
,u_b1_count.u_b_count u_b1_count
,u_b2_count.u_b_count u_b2_count
,u_b3_count.u_b_count u_b3_count
,u_b4_count.u_b_count u_b4_count
,u_b4_rate.u_b4_rate
,i_u_count.i_u_count
,i_b4_rate.i_b4_rate
,c_u_count.c_u_count
,c_b4_rate.c_b4_rate
,ui_b_count.ui_b_count
,uc_b_count.uc_b_count
,a.flag
from temp_fin.temp_tianchi_train1_dist a 
left join temp_fin.temp_tianchi_train1_u_b_count u_b_count_table
on a.user_id =u_b_count_table.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b1_count  u_b1_count 
on a.user_id =u_b1_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b2_count  u_b2_count 
on a.user_id =u_b2_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b3_count  u_b3_count 
on a.user_id =u_b3_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_count  u_b4_count 
on a.user_id =u_b4_count.user_Id 
left join  temp_fin.temp_tianchi_train1_u_b4_rate  u_b4_rate 
on a.user_id =u_b4_rate.user_Id 
left join  temp_fin.temp_tianchi_train1_i_u_count i_u_count
on a.item_id =i_u_count.item_id
left join  temp_fin.temp_tianchi_train1_i_b4_rate  i_b4_rate 
on a.item_id =i_b4_rate.item_id
left join  temp_fin.temp_tianchi_train1_c_u_count c_u_count
on a.item_category=c_u_count.item_category
left join  temp_fin.temp_tianchi_train1_c_b4_rate c_b4_rate
on a.item_category=c_b4_rate.item_category
left join  temp_fin.temp_tianchi_train1_ui_b_count ui_b_count
on a.user_id =ui_b_count.user_Id and a.item_id=ui_b_count.item_id
left join  temp_fin.temp_tianchi_train1_uc_b_count uc_b_count
on a.user_id =uc_b_count.user_Id and a.item_category=uc_b_count.item_category;

The other two datasets are computed in the same way.
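Since the three datasets differ only in their dates, the labeling step can be written once and parameterized by window. The sketch below mirrors the logic of the first SQL statement in plain Python on made-up rows; build_samples is a hypothetical helper, not part of the author's pipeline.

```python
from datetime import date

def build_samples(rows, first_day, last_day, label_day):
    """Label each U-I pair seen in [first_day, last_day] by whether it was bought on label_day.

    rows: iterable of (user_id, item_id, behavior_type, day), with day as a datetime.date.
    Returns a dict {(user_id, item_id): 0 or 1}.
    """
    candidates = {(u, i) for u, i, _, d in rows if first_day <= d <= last_day}
    bought = {(u, i) for u, i, b, d in rows if d == label_day and b == 4}
    return {pair: int(pair in bought) for pair in candidates}

# Hypothetical toy rows.
rows = [
    ("u1", "i1", 3, date(2014, 11, 27)),
    ("u1", "i1", 4, date(2014, 11, 28)),
    ("u2", "i2", 1, date(2014, 11, 25)),
    ("u3", "i3", 4, date(2014, 11, 28)),  # bought without any action in the window
]
samples = build_samples(rows, date(2014, 11, 22), date(2014, 11, 27), date(2014, 11, 28))
```

As in the SQL, a pair that was bought on the label day but never appeared in the window (u3, i3 here) is not part of the sample set.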

 

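3. Feature processing

After processing, the data is still split into three datasets, each with roughly these columns:

user_id, item_id, category, feature values (u_b_count ... uc_b_count), label (whether a purchase happened on the observation day)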

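With this data in place, feature processing uses the pyspark.ml.feature package. It provides methods for combining multiple feature columns into one multi-dimensional vector, such as VectorAssembler, and methods for feature scaling and zero-value handling, such as MaxAbsScaler and MinMaxScaler.

The two steps of feature processing:

multiple feature columns => one multi-dimensional vector column => scaling of the vector values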
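To show what the two steps compute, here is a plain-Python stand-in. It does not use pyspark; assemble and min_max_scale are hypothetical helpers that only imitate the idea behind VectorAssembler and MinMaxScaler, and their edge-case behavior is a choice made for this sketch.

```python
def assemble(row, columns):
    """Collect the named feature columns of one row (a dict) into a single vector."""
    return [float(row[name]) for name in columns]

def min_max_scale(vectors):
    """Rescale every dimension to [0, 1]; in this sketch a constant dimension maps to 0.0."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

# Made-up feature rows using three of the feature names from this post.
columns = ["u_b_count", "i_u_count", "ui_b_count"]
rows = [
    {"u_b_count": 10, "i_u_count": 200, "ui_b_count": 1},
    {"u_b_count": 30, "i_u_count": 100, "ui_b_count": 1},
    {"u_b_count": 20, "i_u_count": 150, "ui_b_count": 1},
]
vectors = [assemble(row, columns) for row in rows]
scaled = min_max_scale(vectors)
```

After scaling, counts that originally differed by an order of magnitude all sit in the same range, which is the point of the second step.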

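(Something to think about: can the first step incorporate the notion of feature weights? With so many feature dimensions, some matter more than others. For example, user activity matters more than item activity: only an active user is likely to buy, and a best-selling item paired with a user who rarely does anything is still a lost cause.)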
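One simple way to try this idea is to multiply each scaled dimension by a hand-picked weight. The sketch below is plain Python with made-up weights; in pyspark.ml.feature, the ElementwiseProduct transformer, which multiplies a vector column by a fixed weight vector, looks like the closest built-in fit. Whether manual weights help at all is an open question, since many models learn their own per-feature weights.

```python
def apply_weights(vector, weights):
    """Multiply each dimension of a scaled feature vector by its weight."""
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    return [x * w for x, w in zip(vector, weights)]

# Hypothetical weights: user activity counts double, item activity counts half.
weights = [2.0, 0.5]                      # order: [u_b_count, i_u_count]
weighted = apply_weights([0.5, 0.8], weights)
```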

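Note: if the sklearn API is used for model training, the input features are expected as an array, so all feature values can simply be merged and processed together; the process is omitted here.

Process code to be added...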

 

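4. Model building

The feature values have now been turned into vectors the model can consume, so the next step is to try different algorithms from pyspark.ml, plug the data in, tune hyperparameters based on accuracy, and use the validation data to check the model's reliability.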
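Until the real code is filled in, the train, tune, and validate loop can be illustrated with a tiny logistic regression written in plain Python. This is a stand-in for whichever pyspark.ml classifier ends up being used, not the author's model; the data, the feature choice, and the candidate learning rates are all made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b, threshold=0.5):
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold) for xi in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical scaled features [u_b4_rate, ui_b_count] with labels.
X_train = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
y_train = [1, 1, 1, 0, 0, 0]
X_valid = [[0.85, 0.7], [0.15, 0.25]]
y_valid = [1, 0]

# Try a few learning rates and keep the one that scores best on the validation data.
best = max(
    ((lr, accuracy(y_valid, predict(X_valid, *train_logistic(X_train, y_train, lr=lr))))
     for lr in (0.1, 0.5, 1.0)),
    key=lambda pair: pair[1],
)
```

best holds the chosen learning rate and its validation accuracy; on real data the same loop would run over the 11.28 training set and be checked against 12.05.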

Process code to be added...

 

 

Closing: reference blog https://blog.csdn.net/snoopy_yuan
