Python graduation project: big data e-commerce user behavior analysis and visualization


1. Dataset description

The dataset is user behavior data from Taobao. It covers 2017-11-25 to 2017-12-03 and contains 100,150,807 records in 5 fields, about 3.5 GB in total.

Topic selection guidance, project sharing:

https://gitee.com/yaa-dc/warehouse-1/blob/master/python/README.md

2. Data processing

2.1 Data import

Load the data into Hive, then do all further processing in Hive.

-- Create the table
drop table if exists user_behavior;
create table user_behavior (
`user_id` string comment 'User ID',
`item_id` string comment 'Item ID',
`category_id` string comment 'Item category ID',
`behavior_type` string comment 'Behavior type, one of (pv, buy, cart, fav)',
`timestamp` int comment 'Behavior timestamp',
`datetime` string comment 'Behavior time')
row format delimited
fields terminated by ','
lines terminated by '\n';

-- Load the data
LOAD DATA LOCAL INPATH '/home/getway/UserBehavior.csv'
OVERWRITE INTO TABLE user_behavior;

2.2 Data cleaning

Data cleaning mainly involves removing duplicate rows, formatting timestamps, and removing outliers.

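-- Data cleaning: remove fully duplicated rows
-- (`timestamp` is a reserved word in Hive, so it is quoted with backticks)
insert overwrite table user_behavior
select user_id, item_id, category_id, behavior_type, `timestamp`, `datetime`
from user_behavior
group by user_id, item_id, category_id, behavior_type, `timestamp`, `datetime`;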

-- Data cleaning: format the timestamp into the datetime column
insert overwrite table user_behavior
select user_id, item_id, category_id, behavior_type, `timestamp`, from_unixtime(`timestamp`, 'yyyy-MM-dd HH:mm:ss')
from user_behavior;
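
As a cross-check on the timestamp formatting step, the same conversion can be sketched in Python. This is only an illustration, not part of the original pipeline: the helper names `format_timestamp` and `in_window` are hypothetical, and the UTC+8 time zone is an assumption (the rendered string depends on the time zone Hive applies in `from_unixtime`).

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are rendered in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def format_timestamp(ts: int, tz: timezone = CST) -> str:
    """Render a Unix timestamp in seconds as 'yyyy-MM-dd HH:mm:ss' in tz."""
    return datetime.fromtimestamp(ts, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


def in_window(ts: int, start: str = "2017-11-25", end: str = "2017-12-03") -> bool:
    """Return True if the timestamp's calendar day lies in [start, end]."""
    day = format_timestamp(ts)[:10]
    # ISO dates compare correctly as plain strings.
    return start <= day <= end


if __name__ == "__main__":
    print(format_timestamp(1511539200))  # first second of the window in UTC+8
    print(in_window(1511539199))         # one second earlier falls outside
```

Rows for which `in_window` is false correspond to the time outliers that the next cleaning step removes.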

-- Check whether the time column contains outliers
select date(`datetime`) as day from user_behavior group by date(`datetime`) order by day;

-- Data cleaning: remove rows whose time is outside the expected range
insert overwrite table user_behavior
select user_id, item_id, category_id, behavior_type, `timestamp`, `datetime`
from user_behavior
where cast(`datetime` as date) between '2017-11-25' and '2017-12-03';

-- Check whether behavior_type contains outliers
select behavior_type from user_behavior group by behavior_type;

3. Data Analysis and Visualization

3.1 User traffic and shopping situation

-- Total page views (PV) and total unique visitors (UV)
select sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,
       count(distinct user_id) as uv
from user_behavior;


-- PV and UV per day
select cast(datetime as date) as day,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,
       count(distinct user_id) as uv
from user_behavior
group by cast(datetime as date)
order by day;


-- Per-user shopping activity, materialized into user_behavior_count
create table user_behavior_count as
select user_id,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- purchases
from user_behavior
group by user_id;

-- Repurchase rate: users with two or more purchases as a share of all users with at least one purchase
select sum(case when buy > 1 then 1 else 0 end) / sum(case when buy > 0 then 1 else 0 end)
from user_behavior_count;
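
The repurchase-rate definition used in the query above can be checked on a small sample with a minimal Python sketch. The function name `repurchase_rate` and the `(user_id, behavior_type)` tuple layout are assumptions made for illustration only.

```python
from collections import Counter


def repurchase_rate(events):
    """Share of buyers who bought at least twice.

    events: iterable of (user_id, behavior_type) tuples.
    """
    # Count purchases per user; users without a purchase never appear.
    buys = Counter(user for user, behavior in events if behavior == "buy")
    buyers = len(buys)
    repeat_buyers = sum(1 for n in buys.values() if n > 1)
    # Guard against an empty denominator when nobody bought anything.
    return repeat_buyers / buyers if buyers else 0.0


if __name__ == "__main__":
    sample = [("u1", "buy"), ("u1", "buy"), ("u2", "buy"), ("u3", "pv")]
    print(repurchase_rate(sample))
```

In the sample, `u1` bought twice and `u2` once, while `u3` only browsed and is excluded from the denominator, which mirrors the `buy > 1` and `buy > 0` conditions in the SQL.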


  • Summary: From 2017-11-25 to 2017-12-03, total PV is 89,660,671 and total UV is 987,991. The daily trend shows a clear increase after the start of December. One possible cause is the approach of Double 12 and the traffic driven by the associated e-commerce promotions. In addition, 2017-12-02 and 2017-12-03 fall on a weekend, so the rise may simply reflect higher user activity on weekends. The overall repurchase rate is 66.01%, which suggests relatively high user loyalty.

3.2 User Behavior Conversion Rate

-- Click / (add to cart + favorite) / purchase: conversion rate at each step
select a.pv,
       a.fav,
       a.cart,
       a.fav + a.cart as `fav+cart`,
       a.buy,
       round((a.fav + a.cart) / a.pv, 4) as pv2favcart,
       round(a.buy / (a.fav + a.cart), 4) as favcart2buy,
       round(a.buy / a.pv, 4) as pv2buy
from(
select sum(pv) as pv,   -- clicks
       sum(fav) as fav,  -- favorites
       sum(cart) as cart,  -- add-to-cart actions
       sum(buy) as buy  -- purchases
from user_behavior_count
) as a;


  • Summary: From 2017-11-25 to 2017-12-03 there were 89,660,671 clicks, 2,888,258 favorites, 5,530,446 add-to-cart actions, and 2,015,807 purchases. The overall click-to-purchase conversion rate is 2.25%, which may be relatively low. Given the number of add-to-cart actions, some users may be waiting for the e-commerce festival before buying. It is therefore reasonable to infer that the conversion rate in the run-up to an e-commerce festival is lower than usual.
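
The funnel arithmetic can be reproduced from the four totals quoted in the summary. The sketch below is only a restatement of the SQL ratios in Python; the function name `funnel_rates` is hypothetical, while the dictionary keys reuse the column aliases from the query.

```python
def funnel_rates(pv: int, fav: int, cart: int, buy: int) -> dict:
    """Compute the three step ratios of the pv -> (fav + cart) -> buy funnel."""
    interest = fav + cart  # users signalling interest: favorite or add to cart
    return {
        "pv2favcart": round(interest / pv, 4),
        "favcart2buy": round(buy / interest, 4),
        "pv2buy": round(buy / pv, 4),
    }


# Totals taken from the summary above.
rates = funnel_rates(pv=89_660_671, fav=2_888_258, cart=5_530_446, buy=2_015_807)

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {value:.2%}")
```

The `pv2buy` ratio is the 2.25% overall conversion rate cited in the summary.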

3.3 User Behavior Habits

-- Distribution of activity by hour of day
select hour(datetime) as hour,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- purchases
from user_behavior
group by hour(datetime)
order by hour;


-- Distribution of user activity over one week (0 = Sunday, 1 = Monday, ..., 6 = Saturday)
select pmod(datediff(datetime, '1920-01-01') - 3, 7) as weekday,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- purchases
from user_behavior
where date(datetime) between '2017-11-27' and '2017-12-03'
group by pmod(datediff(datetime, '1920-01-01') - 3, 7)
order by weekday;


  • Summary: Users are most active between 21:00 and 22:00 and least active around 04:00. Within the week, activity is roughly flat on weekdays and rises noticeably on weekends.
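
The weekday expression `pmod(datediff(datetime, '1920-01-01') - 3, 7)` in the weekly query is not self-explanatory. It works because 1920-01-01 was a Thursday, so shifting the day count by 3 makes Sunday map to 0. The following Python sketch mirrors the expression and compares it with the standard library; the function name `hive_weekday` is hypothetical.

```python
from datetime import date, timedelta

REFERENCE = date(1920, 1, 1)  # a Thursday


def hive_weekday(day: date) -> int:
    """Mirror pmod(datediff(day, '1920-01-01') - 3, 7): 0 = Sunday ... 6 = Saturday."""
    days_since_reference = (day - REFERENCE).days
    # Python's % with a positive modulus is never negative, like Hive's pmod.
    return (days_since_reference - 3) % 7


if __name__ == "__main__":
    start = date(2017, 11, 27)
    for offset in range(7):
        day = start + timedelta(days=offset)
        # isoweekday() is 1 = Monday ... 7 = Sunday, so % 7 maps Sunday to 0.
        print(day, hive_weekday(day), day.isoweekday() % 7)
```

For the week 2017-11-27 to 2017-12-03 used in the query, Monday maps to 1 and the closing Sunday to 0, so sorting by `weekday` puts Sunday first.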

3.4 Find valuable users based on RFM model

The RFM model is a widely used tool for measuring customer value and customer profitability. It is built on three indicators:

  • R-Recency (last purchase time)
  • F-Frequency (consumption frequency)
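  • M-Monetary (consumption amount)

-- R-Recency (time since the last purchase): R here is the number of days before 2017-12-04, so a smaller R generally means a more active user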
select user_id,
       datediff('2017-12-04', max(datetime)) as R,
       dense_rank() over(order by datediff('2017-12-04', max(datetime))) as R_rank
from user_behavior
where behavior_type = 'buy'
group by user_id
limit 10;

-- F-Frequency (purchase frequency): the higher F is, the more loyal the user
select user_id,
       count(1) as F,
       dense_rank() over(order by count(1) desc) as F_rank
from user_behavior
where behavior_type = 'buy'
group by user_id
limit 10;

-- M-Monetary (consumption amount): the dataset has no amounts, so this indicator is not analyzed
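
To make the R and F definitions concrete, here is a minimal Python sketch that computes them from a list of purchase events and ranks them the way `dense_rank()` does. The function names and the `(user_id, purchase_date)` tuple layout are assumptions for illustration. The reference date 2017-12-04 is the one used in the queries above.

```python
from collections import defaultdict
from datetime import date


def recency_frequency(purchases, reference=date(2017, 12, 4)):
    """Return {user_id: (R, F)} from (user_id, purchase_date) tuples.

    R is the number of days between the user's last purchase and the
    reference date; F is the number of purchases.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    for user, day in purchases:
        frequency[user] += 1
        if user not in last_purchase or day > last_purchase[user]:
            last_purchase[user] = day
    return {u: ((reference - last_purchase[u]).days, frequency[u]) for u in last_purchase}


def dense_rank(values, descending=False):
    """Map each distinct value to its 1-based dense rank (ties share a rank, no gaps)."""
    ordered = sorted(set(values), reverse=descending)
    return {value: rank for rank, value in enumerate(ordered, start=1)}


if __name__ == "__main__":
    sample = [("a", date(2017, 12, 3)), ("a", date(2017, 11, 30)), ("b", date(2017, 11, 25))]
    rf = recency_frequency(sample)
    r_rank = dense_rank([r for r, _ in rf.values()])                   # smaller R ranks first
    f_rank = dense_rank([f for _, f in rf.values()], descending=True)  # larger F ranks first
    print(rf, r_rank, f_rank)
```

R is ranked in ascending order and F in descending order, so rank 1 is the most recent buyer and the most frequent buyer respectively, matching the `order by` clauses in the two queries.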

Users with purchases are split into 5 equal groups by rank and scored as follows:

  • top 1/5 of users: 5 points
  • 1/5 to 2/5: 4 points
  • 2/5 to 3/5: 3 points
  • 3/5 to 4/5: 2 points
  • bottom 1/5: 1 point

Apply this rule to both the recency ranking and the purchase-frequency ranking, then add the two scores to get the user's final rating.
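
The five-group scoring rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: `ntile` imitates the bucket sizes of SQL `NTILE` (earlier buckets receive the extra rows when the count is not divisible by 5), and `quintile_scores` is a hypothetical helper. Users with equal ranks are ordered arbitrarily in SQL; here they keep their input order.

```python
def ntile(n_rows: int, buckets: int = 5) -> list:
    """Return the 1-based bucket number for each sorted row position."""
    base, extra = divmod(n_rows, buckets)
    tiles = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each take one additional row.
        size = base + (1 if bucket <= extra else 0)
        tiles.extend([bucket] * size)
    return tiles


def quintile_scores(ranks: dict) -> dict:
    """Map user_id -> score, where the best fifth gets 5 and the worst fifth gets 1.

    ranks: user_id -> rank, with rank 1 being the best.
    """
    ordered = sorted(ranks, key=lambda user: ranks[user])  # stable for ties
    return {user: 6 - tile for user, tile in zip(ordered, ntile(len(ordered)))}


if __name__ == "__main__":
    r_scores = quintile_scores({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
    f_scores = quintile_scores({"a": 2, "b": 1, "c": 3, "d": 5, "e": 4})
    total = {user: r_scores[user] + f_scores[user] for user in r_scores}
    print(total)
```

The final rating is the sum of the two scores, so it ranges from 2 to 10.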

with cte as(
select user_id,
       datediff('2017-12-04', max(datetime)) as R,
       dense_rank() over(order by datediff('2017-12-04', max(datetime))) as R_rank,
       count(1) as F,
       dense_rank() over(order by count(1) desc) as F_rank
from user_behavior
where behavior_type = 'buy'
group by user_id)

select user_id, R, R_rank, R_score, F, F_rank, F_score,  R_score + F_score AS score
from(
select *,
       case ntile(5) over(order by R_rank) when 1 then 5
                                           when 2 then 4
                                           when 3 then 3
                                           when 4 then 2
                                           when 5 then 1
       end as R_score,
       case ntile(5) over(order by F_rank) when 1 then 5
                                           when 2 then 4
                                           when 3 then 3
                                           when 4 then 2
                                           when 5 then 1
       end as F_score
from cte
) as a
order by score desc
limit 20;


  • Summary: Personalized marketing recommendations can be made based on the user's value score.

3.5 Analysis of Commodity Dimensions

-- Best-selling items
select item_id ,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- purchases
from user_behavior
group by item_id
order by buy desc
limit 10;

-- Best-selling item categories
select category_id ,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- purchases
from user_behavior
group by category_id
order by buy desc
limit 10;

  • Summary: No product dimension table is available, so this analysis has limited value. With one, the analysis could be extended along the product dimension, for example conversion rates by industry and by product, or competitor analysis.



Origin blog.csdn.net/kooerr/article/details/130099445