Taobao user shopping behavior analysis

In this case study, we use Databend Cloud to analyze the Taobao user shopping behavior dataset from Tianchi Lab and look for interesting shopping patterns.

This dataset is in CSV format and contains all the behaviors (clicks, purchases, add-to-cart actions, and favorites) of about one million randomly selected users who were active between November 25, 2017, and December 3, 2017. Each row represents one user behavior and consists of the following 5 comma-separated columns:

Column name | Description
User ID | Integer; serialized user ID
Product ID | Integer; serialized product ID
Product category ID | Integer; serialized product category ID
Behavior type | String enumeration: 'pv' (product detail page view, equivalent to a click), 'buy' (purchase), 'cart' (add to shopping cart), 'fav' (add to favorites)
Timestamp | Timestamp at which the behavior occurred
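
As a rough illustration of the row layout described above, the following Python sketch parses one CSV line into a typed record. The sample row is made up, the `Behavior` and `parse_row` names are hypothetical, and the timestamp is assumed to be Unix epoch seconds interpreted as UTC (which matches the `to_timestamp($5::bigint)` call used later, but is still an assumption about the file).

```python
import csv
import io
from datetime import datetime, timezone
from typing import NamedTuple

VALID_TYPES = {"pv", "buy", "cart", "fav"}


class Behavior(NamedTuple):
    """One row of the dataset (hypothetical helper type)."""
    user_id: int
    item_id: int
    category_id: int
    behavior_type: str
    ts: datetime


def parse_row(row):
    """Convert a list of 5 CSV fields into a Behavior record."""
    user_id, item_id, category_id, behavior_type, epoch = row
    if behavior_type not in VALID_TYPES:
        raise ValueError(f"unknown behavior type: {behavior_type!r}")
    # Assumption: the last column holds Unix epoch seconds.
    ts = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return Behavior(int(user_id), int(item_id), int(category_id), behavior_type, ts)


# Made-up sample line, not taken from the real file.
sample = "1,100,200,pv,1511658000\n"
record = parse_row(next(csv.reader(io.StringIO(sample))))
print(record)
```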

Preparation

Download dataset

  1. Download the Taobao user shopping behavior data set to your local computer, and then use the following command to decompress it:
unzip UserBehavior.csv.zip
  2. Compress the decompressed data set file (UserBehavior.csv) into gzip format:
gzip UserBehavior.csv

Create an external stage

  1. Log in to Databend Cloud and create a new workspace.
  2. In the workspace, execute the following SQL statement to create an external Stage named "mycsv" on Alibaba Cloud:
CREATE STAGE mycsv URL = 's3://<YOUR_BUCKET_NAME>'
CONNECTION = (
  ACCESS_KEY_ID = '<YOUR_ACCESS_KEY_ID>',
  SECRET_ACCESS_KEY = '<YOUR_SECRET_ACCESS_KEY>',
  ENDPOINT_URL = '<YOUR_ENDPOINT_URL>',
  ENABLE_VIRTUAL_HOST_STYLE = TRUE
)
FILE_FORMAT = (
  TYPE = CSV
  COMPRESSION = AUTO
);
  3. Execute the following SQL statement to verify that Databend Cloud can access the external Stage:
LIST @mycsv;

Upload dataset to external Stage

Use BendSQL to upload the compressed data set file (UserBehavior.csv.gz) to the external Stage. To obtain the connection information for the computing cluster, refer to "Connecting to the Computing Cluster".

(base) eric@Erics-iMac ~ % bendsql --host tenantID--YOUR_WAREHOUSE.gw.aliyun-cn-beijing.default.databend.cn \
  --user=cloudapp \
  --password=<YOUR_PASSWORD> \
  --database="default" \
  --port=443 --tls
Welcome to BendSQL 0.9.3-db6b232(2023-10-26T12:36:55.578667000Z).
Connecting to tenantID--YOUR_WAREHOUSE.gw.aliyun-cn-beijing.default.databend.cn:443 as user cloudapp.
Connected to DatabendQuery v1.2.183-nightly-1ed9a826ed(rust-1.72.0-nightly-2023-10-28T22:10:15.618365223Z)

cloudapp@tenantID--YOUR_WAREHOUSE.gw.aliyun-cn-beijing.default.databend.cn:443/default> PUT fs:///Users/eric/Documents/UserBehavior.csv.gz @mycsv

PUT fs:///Users/eric/Documents/UserBehavior.csv.gz @mycsv

┌─────────────────────────────────────────────────────────────────┐
│                    file                   │  status │    size   │
│                   String                  │  String │   UInt64  │
├───────────────────────────────────────────┼─────────┼───────────┤
│ /Users/eric/Documents/UserBehavior.csv.gz │ SUCCESS │ 949805035 │
└─────────────────────────────────────────────────────────────────┘
1 file uploaded in 401.807 sec. Processed 1 file, 905.80 MiB (0.00 file/s, 2.25 MiB/s)

Data import and cleaning

Create table

In the workspace, execute the following SQL statement to create a table for the dataset:

CREATE TABLE `user_behavior` (
  `user_id` INT NOT NULL,
  `item_id` INT NOT NULL,
  `category_id` INT NOT NULL,
  `behavior_type` VARCHAR,
  `ts` TIMESTAMP,
  `day` DATE );

Clean and import data

  1. Execute the following SQL statement to import data into the table and complete cleaning at the same time:

    • Remove invalid rows that fall outside the time range
    • Deduplicate the data
    • Generate the additional derived columns (ts and day)
INSERT INTO user_behavior
SELECT $1,$2,$3,$4,to_timestamp($5::bigint) AS ts, to_date(ts) day
FROM @mycsv/UserBehavior.csv.gz WHERE day BETWEEN '2017-11-25' AND '2017-12-03'
GROUP BY $1,$2,$3,$4,ts;
  2. Execute the following SQL statement to verify that the data was imported successfully. This statement returns 10 rows from the table.
SELECT * FROM user_behavior LIMIT 10;
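
The cleaning rules applied during import (date-range filter, deduplication, derived `day` column) can be sketched in plain Python on a few made-up rows. This is only an illustration of the logic, not how Databend executes the statement; the `clean` helper is hypothetical and timestamps are assumed to be epoch seconds interpreted as UTC.

```python
from datetime import date, datetime, timezone

START, END = date(2017, 11, 25), date(2017, 12, 3)


def clean(rows):
    """Keep in-range rows, drop exact duplicates, and append a day column."""
    seen = set()
    cleaned = []
    for user_id, item_id, category_id, behavior_type, epoch in rows:
        ts = datetime.fromtimestamp(epoch, tz=timezone.utc)
        day = ts.date()
        if not (START <= day <= END):
            continue  # outside the observation window
        key = (user_id, item_id, category_id, behavior_type, ts)
        if key in seen:
            continue  # duplicate row, mirrors the GROUP BY in the SQL
        seen.add(key)
        cleaned.append(key + (day,))
    return cleaned


# Made-up rows: one duplicate pair and one row after the window (Dec 4).
rows = [
    (1, 100, 200, "pv", 1511654400),   # 2017-11-26 00:00:00 UTC
    (1, 100, 200, "pv", 1511654400),   # exact duplicate
    (2, 101, 200, "buy", 1512345600),  # 2017-12-04 00:00:00 UTC, out of range
]
cleaned = clean(rows)
print(cleaned)
```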

Data analysis

With the preparation and data import complete, we can now begin the analysis.

User traffic and shopping situation analysis

Total visits and number of users

SELECT SUM(CASE WHEN behavior_type = 'pv' THEN 1 ELSE 0 END) as pv,
COUNT(DISTINCT user_id) as uv
FROM user_behavior;

Daily visits and users

SELECT day,
       SUM(CASE WHEN behavior_type = 'pv' THEN 1 ELSE 0 END) AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM user_behavior
GROUP BY day
ORDER BY day;

You can also generate a line chart from this result using the dashboard feature.

Summarize each user's shopping behavior in a new table: user_behavior_count

create table user_behavior_count as select user_id,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- number of clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- number of favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- number of add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- number of purchases
       from user_behavior
group by user_id;

Repurchase rate: the share of purchasing users who purchased two or more times

select sum(case when buy > 1 then 1 else 0 end) / sum(case when buy > 0 then 1 else 0 end)
from user_behavior_count;
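
To make the definition concrete, here is a minimal Python sketch of the same ratio on made-up events. The `repurchase_rate` helper and the sample data are hypothetical; the sketch mirrors the logic of the query, not Databend's execution.

```python
from collections import Counter


def repurchase_rate(events):
    """events: iterable of (user_id, behavior_type) pairs."""
    buys = Counter(user for user, behavior in events if behavior == "buy")
    buyers = len(buys)                                  # users with buy > 0
    repeat = sum(1 for n in buys.values() if n > 1)     # users with buy > 1
    return repeat / buyers if buyers else None


# Made-up events: user 1 buys twice, user 2 buys once, user 3 only browses.
events = [(1, "buy"), (1, "buy"), (2, "buy"), (2, "pv"), (3, "pv")]
rate = repurchase_rate(events)
print(rate)
```

With this toy input, one of the two buyers purchased more than once, so the rate is 0.5.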

User behavior conversion rate

Click -> (add to cart + favorite) -> purchase: the conversion rate at each step

select a.pv,
       a.fav,
       a.cart,
       a.fav + a.cart as `fav+cart`,
       a.buy,
       round((a.fav + a.cart) / a.pv, 4) as pv2favcart,
       round(a.buy / (a.fav + a.cart), 4) as favcart2buy,
       round(a.buy / a.pv, 4) as pv2buy
from(
select sum(pv) as pv,   -- number of clicks
sum(fav) as fav,  -- number of favorites
sum(cart) as cart,  -- number of add-to-cart actions
sum(buy) as buy  -- number of purchases
from user_behavior_count
) as a;

Count the users who completed browse -> add to cart -> purchase within one hour

SELECT
   count_if(level>=1) as pv, count_if(level>=2) as cart, count_if(level>=3) as buy
FROM
(
    SELECT
        user_id,
        window_funnel(3600000000)(ts, behavior_type = 'pv',behavior_type = 'cart',behavior_type = 'buy') AS level
    FROM user_behavior
    GROUP BY user_id
);
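
The idea behind the funnel query can be sketched in Python as follows. This is a simplified reading of funnel semantics (each step must follow the previous one, and the whole chain must fit in a window that starts at the `pv` event); it is not Databend's implementation of `window_funnel`, and the `funnel_level` helper and sample events are made up for illustration.

```python
def funnel_level(events, window_seconds, steps=("pv", "cart", "buy")):
    """Return how many consecutive funnel steps a user reached.

    events: list of (epoch_seconds, behavior_type) for one user.
    """
    # chain_start[i] holds the start time of a chain that has reached step i.
    chain_start = [None] * len(steps)
    for ts, behavior in sorted(events):
        if behavior not in steps:
            continue
        i = steps.index(behavior)
        if i == 0:
            chain_start[0] = ts
        elif chain_start[i - 1] is not None and ts - chain_start[i - 1] <= window_seconds:
            chain_start[i] = chain_start[i - 1]
    return sum(1 for start in chain_start if start is not None)


# Made-up per-user event lists (times in seconds).
users = {
    "a": [(0, "pv"), (600, "cart"), (1200, "buy")],   # full funnel in one hour
    "b": [(0, "pv"), (600, "cart"), (4000, "buy")],   # purchase too late
    "c": [(0, "cart"), (10, "buy")],                  # never viewed first
    "d": [(0, "pv"), (10, "buy")],                    # skipped the cart step
}
levels = {u: funnel_level(ev, 3600) for u, ev in users.items()}
pv = sum(1 for lv in levels.values() if lv >= 1)
cart = sum(1 for lv in levels.values() if lv >= 2)
buy = sum(1 for lv in levels.values() if lv >= 3)
print(levels, pv, cart, buy)
```

The three counters correspond to the `count_if(level>=1)`, `count_if(level>=2)`, and `count_if(level>=3)` columns of the query.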

User behavior habits

User shopping behavior by hour of the day

select to_hour(ts) as hour,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- number of clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- number of favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- number of add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- number of purchases
from user_behavior
group by hour
order by hour;

You can also generate a line chart from this result using the dashboard feature.

User shopping behavior by day of the week

select to_day_of_week(day) as weekday,day,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- number of clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- number of favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- number of add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- number of purchases
from user_behavior
where day between '2017-11-27' and '2017-12-03'
group by weekday,day
order by weekday;

You can also generate a histogram from this result using the dashboard feature.

Find valuable users based on RFM model

The RFM model is a widely used tool for measuring customer value and profitability. It is built on three indicators (this dataset contains no purchase amounts, so only R and F are computed below):

  • R-Recency (time since the last purchase)
  • F-Frequency (purchase frequency)
  • M-Monetary (amount spent)

R-Recency (time since the last purchase): R is the number of days since the user's last purchase, so the smaller the R value, the more active the user is

select user_id,
       to_date('2017-12-04') - max(day) as R,
       dense_rank() over(order by (to_date('2017-12-04') - max(day))) as R_rank
from user_behavior
where behavior_type = 'buy'
group by user_id
limit 10;

F-Frequency (Consumption Frequency): The higher the F value, the more loyal the user is

select user_id,
       count(1) as F,
       dense_rank() over(order by count(1) desc) as F_rank
from user_behavior
where behavior_type = 'buy'
group by user_id
limit 10;

User grouping

Users who have made purchases are ranked and divided into 5 equal groups:

  • Top 1/5 of users: 5 points
  • Top 1/5 - 2/5 of users: 4 points
  • Top 2/5 - 3/5 of users: 3 points
  • Top 3/5 - 4/5 of users: 2 points
  • Remaining users: 1 point

Under this rule, each user's recency ranking and purchase-frequency ranking are scored separately, and the two scores are then added together to give the user's final score.

with cte as(
select user_id,
       to_date('2017-12-04') - max(day) as R,
       dense_rank() over(order by (to_date('2017-12-04') - max(day))) as R_rank,
       count(1) as F,
       dense_rank() over(order by count(1) desc) as F_rank
from user_behavior
where behavior_type = 'buy'
group by user_id)
select user_id, R, R_rank, R_score, F, F_rank, F_score,  R_score + F_score AS score
from(
select *,
       case ntile(5) over(order by R_rank) when 1 then 5
                                           when 2 then 4
                                           when 3 then 3
                                           when 4 then 2
                                           when 5 then 1
       end as R_score,
       case ntile(5) over(order by F_rank) when 1 then 5
                                           when 2 then 4
                                           when 3 then 3
                                           when 4 then 2
                                           when 5 then 1
       end as F_score
from cte
) as a
order by score desc
limit 20;
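
The scoring rule can be illustrated in Python on a handful of made-up users. The sketch below assumes the usual NTILE behavior (rows are split into buckets as evenly as possible, with earlier buckets taking any extra rows) and sorts directly by R and F instead of by their dense ranks, which yields the same order. The `ntile` and `rfm_scores` helpers and the data are hypothetical.

```python
def ntile(n_rows, buckets):
    """Bucket number (1-based) for each of n_rows already-sorted rows."""
    base, extra = divmod(n_rows, buckets)
    tiles = []
    for b in range(1, buckets + 1):
        size = base + (1 if b <= extra else 0)  # earlier buckets get the extras
        tiles.extend([b] * size)
    return tiles


def rfm_scores(users):
    """users: {user_id: (R_days_since_last_buy, F_purchase_count)}."""
    by_r = sorted(users, key=lambda u: users[u][0])    # most recent first
    by_f = sorted(users, key=lambda u: -users[u][1])   # most frequent first
    r_score = {u: 6 - t for u, t in zip(by_r, ntile(len(by_r), 5))}
    f_score = {u: 6 - t for u, t in zip(by_f, ntile(len(by_f), 5))}
    return {u: r_score[u] + f_score[u] for u in users}


# Made-up users: (days since last purchase, number of purchases).
users = {"u1": (1, 2), "u2": (3, 9), "u3": (2, 5), "u4": (8, 1), "u5": (5, 7)}
scores = rfm_scores(users)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

As in the SQL, users with tied ranks can land in different buckets, because NTILE only looks at row positions.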

Product dimension analysis

Top selling items

select item_id ,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- number of clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- number of favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- number of add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- number of purchases
from user_behavior
group by item_id
order by buy desc
limit 10;

Top selling product categories

select category_id ,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   -- number of clicks
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  -- number of favorites
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  -- number of add-to-cart actions
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  -- number of purchases
from user_behavior
group by category_id
order by buy desc
limit 10;

User retention analysis

Before starting, create the table "day_users" and insert data:

create table day_users(
day date,
users bitmap);

insert into day_users select day, build_bitmap(list(user_id::UInt64)) from user_behavior group by day;

Daily UV statistics

select day,bitmap_count(users) from day_users order by day;

Relative retention

Here we count the users who were active on November 25 and were still using Taobao on December 2:

select bitmap_count(bitmap_and(a.users, b.users))
from (select users from day_users where day='2017-11-25') a ,
(select users from day_users where day='2017-12-02') b;

Relative new users

Here we count the users who were active on December 2 but not on November 25:

select bitmap_count(bitmap_not(b.users, a.users)) from (select users from day_users where day='2017-11-25') a ,
(select users from day_users where day='2017-12-02') b;
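
The two bitmap queries above correspond to ordinary set operations, which the following Python sketch shows on made-up user IDs. Python sets stand in for the bitmap column here: intersection plays the role of `bitmap_and`, and set difference plays the role of `bitmap_not(b.users, a.users)`.

```python
from datetime import date

# Made-up daily user sets, standing in for the bitmap column of day_users.
day_users = {
    date(2017, 11, 25): {1, 2, 3, 4},
    date(2017, 12, 2): {3, 4, 5},
}

a = day_users[date(2017, 11, 25)]
b = day_users[date(2017, 12, 2)]

retained = len(a & b)    # active on both days
new_users = len(b - a)   # active on Dec 2 but not on Nov 25
print(retained, new_users)
```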

Origin my.oschina.net/u/5489811/blog/11045354