Practical guide | How to use a time series database for Taobao user behavior analysis

Taobao has become part of daily life for countless households. Many users open the mobile app or website every day to browse their favorite products or the products the app recommends, and then agonize over whether to give in and buy. This article shows how to use a time series database to analyze Taobao user behavior, taking DolphinDB as an example. DolphinDB is a high-performance distributed time series database with rich data analysis and distributed computing functions. This tutorial uses DolphinDB to analyze the user behavior data of the Taobao app and to explore related business questions.

Data source: "User Behavior Data from Taobao for Recommendation" data set, Alibaba Cloud Tianchi

In this tutorial, DolphinDB and the data set are packaged in a docker container. The container includes the DolphinDB distributed database dfs://user_behavior, which contains a table named user. The table stores the behavior records of nearly one million Taobao app users between November 25, 2017 and December 3, 2017. The database uses composite partitioning: the first level is partitioned by date, one partition per day, and the second level is hash partitioned by userID into 180 partitions. The queries in this tutorial use the following columns of the user table: userID, itemID, behavior, and behaveTime.

The user behavior types have the following meanings:

  • pv: view a product detail page
  • buy: purchase a product
  • cart: add a product to the shopping cart
  • fav: add a product to favorites

1. Download the docker deployment package

DolphinDB and the data used in this tutorial are packaged in a docker container. Make sure that docker is installed before you start; see https://docs.docker.com/install/ for installation instructions. Download the deployment package from http://www.dolphindb.cn/downloads/bigdata.tar.gz and run the following commands in the directory where the package is located.

Unzip the deployment package:

gunzip bigdata.tar.gz

Import the container snapshot as an image:

cat bigdata.tar | docker import - my/bigdata:v1

Get the ID of the image my/bigdata:v1:

docker images

Start the container (replace <image id> with the actual image ID):

docker run -dt -p 8888:8848 --name test <image id> /bin/bash ./dolphindb/start.sh

In the browser address bar, enter <machine IP>:8888 (for example, localhost:8888) to open DolphinDB Notebook. All of the following code is executed in DolphinDB Notebook.

The DolphinDB license in the docker container is valid until September 1, 2019. If the license file has expired, download the community edition from the official DolphinDB website and replace bigdata.tar/dolphindb/dolphindb.lic with the community edition license.

2. User Behavior Analysis

View data volume:

login("admin","123456")
user=loadTable("dfs://user_behavior","user")
select count(*) from user
98914533

There are 98,914,533 records in the user table.

Analyze the behavior of users from browsing to final purchase of goods:

PV=exec count(*) from user where behavior="pv"
88596903
UV=count(exec distinct userID from user)
987984

In these 9 days, the Taobao app had 88,596,903 page views and 987,984 unique visitors.

The exec keyword used above is specific to DolphinDB and is similar to select. The difference is that a select statement always returns a table, whereas exec returns a vector when it selects a single column, a scalar when it is used with an aggregate function, and a matrix when it is used with pivot by. This makes subsequent analysis and calculation more convenient.

Count the number of users with only one behavior record (users who left after viewing a single page):

onceUserNum=count(select count(behavior) from user group by userID having count(behavior)=1)
92
jumpRate=onceUserNum\UV*100
0.009312

Only 92 users left the app after viewing just one page, about 0.0093% of all users. This is almost negligible and suggests that Taobao is attractive enough to keep users in the app.
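For readers who want to check this calculation outside DolphinDB, the following is a minimal Python sketch of the same logic. The toy records and the helper name jump_rate are illustrative assumptions and are not part of the data set or of DolphinDB; only the figures 92 and 987,984 come from the results above.

```python
from collections import Counter

# Hypothetical toy log of (userID, behavior) records; not taken from the data set.
records = [
    (1, "pv"), (1, "cart"), (1, "buy"),
    (2, "pv"),
    (3, "pv"), (3, "fav"),
]

# Number of records per user, like "group by userID" with count(behavior).
records_per_user = Counter(user_id for user_id, _ in records)

# Users with exactly one record, like "having count(behavior)=1".
once_users = [u for u, n in records_per_user.items() if n == 1]


def jump_rate(once_user_num, uv):
    """Share of single-record users among all unique visitors, in percent."""
    return once_user_num / uv * 100


# Rate for the toy log: one single-record user out of three users.
toy_rate = jump_rate(len(once_users), len(records_per_user))

# The same formula applied to the figures reported above.
reported_rate = jump_rate(92, 987984)

print(once_users, round(toy_rate, 2), round(reported_rate, 6))
```

The last value printed should agree with the jumpRate shown by DolphinDB up to rounding.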

Count the number of occurrences of each behavior type:

behaviors=select count(*) as num from user group by behavior

Calculate the conversion rate from browsing to purchase intent:

Adding a product to the shopping cart or to favorites can be regarded as an intention to buy. Count the number of behaviors that show purchase intent:

fav_cart=exec sum(num) from behaviors where behavior="fav" or behavior="cart"
8318654
intentRate=fav_cart\PV*100
9.389328

The conversion rate from browsing to purchase intent is only 9.38%.

Calculate the conversion rates from browsing and from purchase intent to final purchase:

buy=(exec num from behaviors where behavior="buy")[0]
1998976
buyRate=buy\PV*100
2.256259
intent_buy=buy\fav_cart*100
24.030041

The conversion rate from browsing to final purchase is only 2.25%, and the conversion rate from purchase intent to final purchase is 24.03%. This suggests that users often add products they like to favorites or to the shopping cart without purchasing them immediately.
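The three conversion rates can be reproduced from the counts reported above. The following Python sketch is only an arithmetic check under that assumption; the constant names and the helper percent are illustrative and do not exist in DolphinDB.

```python
# Counts reported by the queries above.
PV = 88596903        # page views
FAV_CART = 8318654   # add-to-favorites plus add-to-cart actions
BUY = 1998976        # purchases


def percent(part, whole):
    """Return part / whole as a percentage."""
    return part / whole * 100


intent_rate = percent(FAV_CART, PV)   # browsing -> purchase intent
buy_rate = percent(BUY, PV)           # browsing -> final purchase
intent_buy = percent(BUY, FAV_CART)   # purchase intent -> final purchase

print(round(intent_rate, 2), round(buy_rate, 2), round(intent_buy, 2))
```

Note that Python's / performs floating-point division here, which plays the role of the \ operator in the DolphinDB code.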

Count the unique visitors for each behavior type:

userNums=select count(userID) as num from (select count(*) from user group by behavior,userID) group by behavior

pay_user_rate=(exec num from userNums where behavior="buy")[0]\UV*100
67.852313

In these 9 days, 67.8% of the users of the Taobao app made at least one purchase, indicating that most users shop in the app.

Count the daily number of each user behavior:

dailyUserNums=select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by date(behaveTime) as date

On Saturdays and Sundays (2017.11.25, 2017.11.26, 2017.12.02, 2017.12.03), visits to the Taobao app increased significantly.

iif is DolphinDB's conditional function. Its syntax is iif(cond, trueResult, falseResult), where cond is usually a boolean expression. It returns trueResult where cond is true and falseResult otherwise.
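The pattern sum(iif(behavior=="pv",1,0)) is conditional counting: each record contributes 1 when the condition holds and 0 otherwise. The following Python sketch imitates that pattern on a hypothetical sample; the Python function iif below is a scalar stand-in written for illustration, whereas the DolphinDB function is applied to a whole column at once in the query above.

```python
def iif(cond, true_result, false_result):
    """Scalar stand-in for the conditional function described above."""
    return true_result if cond else false_result


# Hypothetical sample of the behavior column; not taken from the data set.
behaviors = ["pv", "pv", "cart", "buy", "pv", "fav"]

# One conditional count per behavior type, like sum(iif(behavior=="pv",1,0)).
counts = {
    name: sum(iif(b == name, 1, 0) for b in behaviors)
    for name in ("pv", "fav", "cart", "buy")
}

print(counts)
```

Computing all four counts in a single pass over the data is what lets the DolphinDB query return one row per day with a column for each behavior type.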

Count the number of each user behavior in different time periods of a day. We provide two methods:

The first method computes the statistics for each time period separately and then combines the results. For example, count the user behaviors in different time periods on the weekday 2017.11.29 (Wednesday):

re1=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T00:00:00 : 2017.11.29T05:59:59

re2=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T06:00:00 : 2017.11.29T08:59:59

re3=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T09:00:00 : 2017.11.29T11:59:59

re4=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T12:00:00 : 2017.11.29T13:59:59

re5=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T14:00:00 : 2017.11.29T17:59:59

re6=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T18:00:00 : 2017.11.29T21:59:59

re7=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T22:00:00 : 2017.11.29T23:59:59

re=unionAll([re1,re2,re3,re4,re5,re6,re7],false)

This method is simple but requires a lot of repetitive code. The repeated code can be encapsulated in a function:

def calculateBehavior(startTime,endTime){
    return select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between startTime : endTime
}

With this function, you only need to specify the start and end time of each period.

The second method uses the Map-Reduce framework of DolphinDB. For example, compute the user behavior statistics for the weekday 2017.11.29 (Wednesday):

def caculate(t){
	return select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from t	
}
ds1 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.29T00:00:00 2017.11.29T06:00:00 2017.11.29T09:00:00 2017.11.29T12:00:00 2017.11.29T14:00:00 2017.11.29T18:00:00 2017.11.29T22:00:00 2017.11.29T23:59:59)
WedBehavior = mr(ds1, caculate, , unionAll{, false})

We use the repartitionDS function to repartition the user table by time range (without changing the table's original partitioning scheme) and generate multiple data sources. The mr function then processes the data sources in parallel: DolphinDB applies the caculate function to each data source and merges the results.
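The idea behind this approach (split the rows by time range, apply the same function to every part, then concatenate the partial results) can be sketched in plain Python. This is a sequential illustration only, not DolphinDB's implementation: the sample rows, the helper split_by_range, and the assumption that each range is half-open (start included, end excluded) are all illustrative.

```python
from bisect import bisect_right
from datetime import datetime

# Range boundaries matching the time periods used above.
boundaries = [datetime(2017, 11, 29, h) for h in (0, 6, 9, 12, 14, 18, 22)]
boundaries.append(datetime(2017, 11, 29, 23, 59, 59))

# Hypothetical (behaveTime, behavior) rows, sorted by time; not real data.
rows = [
    (datetime(2017, 11, 29, 1, 30), "pv"),
    (datetime(2017, 11, 29, 2, 0), "buy"),
    (datetime(2017, 11, 29, 7, 15), "pv"),
    (datetime(2017, 11, 29, 15, 0), "cart"),
    (datetime(2017, 11, 29, 15, 30), "pv"),
    (datetime(2017, 11, 29, 23, 0), "fav"),
]


def split_by_range(data, bounds):
    """Put each row into the part [bounds[i], bounds[i+1]) that contains its time."""
    parts = [[] for _ in range(len(bounds) - 1)]
    for ts, behavior in data:
        i = bisect_right(bounds, ts) - 1
        if 0 <= i < len(parts):  # rows outside all ranges are dropped
            parts[i].append((ts, behavior))
    return parts


def calculate(part):
    """Map step: one summary row per part, with the first time and the counts."""
    row = {"time": part[0][0], "pv": 0, "fav": 0, "cart": 0, "buy": 0}
    for _, behavior in part:
        row[behavior] += 1
    return row


# Map over the non-empty parts, then concatenate the rows (the reduce step).
result = [calculate(p) for p in split_by_range(rows, boundaries) if p]

for row in result:
    print(row)
```

In DolphinDB the map step runs in parallel across data sources, which is the main reason to prefer this method over issuing one query per time period.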

On weekdays, usage of the Taobao app is highest in the early morning (00:00 to 06:00), followed by the afternoon (14:00 to 16:00).

Compute the user behavior statistics for Saturday (2017.11.25) and Sunday (2017.11.26):

ds2 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.25T00:00:00 2017.11.25T06:00:00 2017.11.25T09:00:00 2017.11.25T12:00:00 2017.11.25T14:00:00 2017.11.25T18:00:00 2017.11.25T22:00:00 2017.11.25T23:59:59)
SatBehavior = mr(ds2, caculate, , unionAll{, false})

ds3 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.26T00:00:00 2017.11.26T06:00:00 2017.11.26T09:00:00 2017.11.26T12:00:00 2017.11.26T14:00:00 2017.11.26T18:00:00 2017.11.26T22:00:00 2017.11.26T23:59:59)
SunBehavior = mr(ds3, caculate, , unionAll{, false})

Usage of the Taobao app on Saturday and Sunday is higher than on weekdays. As on weekdays, usage peaks in the early morning (00:00 to 06:00).

3. Commodity analysis

allItems=select distinct(itemID) from user
4142583

In these 9 days, 4,142,583 distinct products were involved.

Count the number of purchases of each product:

itemBuyTimes=select count(userID) as times from user where behavior="buy" group by itemID order by times desc

List the top 20 best-selling products:

salesTop=select top 20 * from itemBuyTimes order by times desc

The product with ID 3122135 has the highest sales volume, with 1,408 purchases in total.

Count the number of commodities under each purchase frequency:

buyTimesItemNum=select count(itemID) as itemNums from itemBuyTimes group by times order by itemNums desc

The results show that the most common purchase frequency is one: 370,747 products were purchased only once in these 9 days, accounting for 8.94% of all products. The higher the number of purchases, the fewer products reach it.
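The two grouping steps above (purchases per product, then products per purchase count) amount to a "frequency of frequencies". The following Python sketch shows the same two steps on a hypothetical list of purchased item IDs and recomputes the share of products purchased once from the reported figures; the variable names mirror the DolphinDB ones but are otherwise illustrative.

```python
from collections import Counter

# Hypothetical itemIDs of "buy" records; not taken from the data set.
purchases = [101, 101, 101, 202, 303, 404, 404]

# Step 1: number of purchases per product (itemID -> times).
item_buy_times = Counter(purchases)

# Step 2: number of products for each purchase count (times -> itemNums).
buy_times_item_num = Counter(item_buy_times.values())

# Share of products purchased exactly once, using the figures reported above.
share_once = 370747 / 4142583 * 100

print(dict(buy_times_item_num), share_once)
```

The share is computed against all 4,142,583 products seen in the data, including those that were never purchased.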

Count the number of each user behavior for every product:

allItemsInfo=select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by itemID 

List the top 20 most-viewed products:

pvTop=select top 20 itemID,pageView from allItemsInfo order by pageView desc

The most viewed product is 812879, with 29,720 views in total, but it was purchased only 135 times and is not among the top 20 best sellers.

Count the number of user behaviors of the top 20 products sold:

select * from ej(salesTop,allItemsInfo,`itemID) order by times desc

The best-selling product, 3122135, has 1,777 views, which is not among the top 20 by views, and its conversion rate from browsing to purchase is as high as 79.2%. This product may be a necessity that users decide to buy without much browsing.
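To make the contrast concrete, the view-to-purchase conversion of the two products mentioned above can be compared directly. This Python sketch only restates the reported figures; the function name conversion is an illustrative assumption.

```python
def conversion(purchases, views):
    """Purchases per view, as a percentage."""
    return purchases / views * 100


# Figures reported above: (purchases, views) for each product.
best_seller = conversion(1408, 1777)    # product 3122135
most_viewed = conversion(135, 29720)    # product 812879

print(round(best_seller, 1), round(most_viewed, 2))
```

The most viewed product converts at well under 1%, which illustrates that a high number of views does not by itself imply high sales.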

 

Extended exercises:

(1) Calculate the hourly purchase rate of the Taobao app on 2017.11.25 (purchase rate = number of purchases / total number of behaviors * 100%)

(2) Find the user with the most purchases and the products that user purchased most often

(3) Calculate the number of purchases of product 3122135 in each time period

(4) Count the frequency of each behavior in each category

(5) Find the best-selling products in each category

 

This tutorial is for learning purposes only.

If you have any questions while following this tutorial, you are welcome to join the DolphinDB technical exchange group run by Zhinian Technology.


Origin blog.csdn.net/qq_41996852/article/details/110958803