Hands-on | How to use DolphinDB, a time-series database, for Taobao user behavior analysis

DolphinDB is a new-generation, high-performance distributed time-series database with rich data analysis and distributed computing functions. This tutorial uses DolphinDB to analyze user behavior data from Taobao APP and to answer related business questions.

Data source: User Behavior Data from Taobao for Recommendation-Data Set-Alibaba Cloud Tianchi

In this tutorial, DolphinDB and the data set are packaged together in a Docker image. The image includes the DolphinDB distributed database dfs://user_behavior, which contains one table, user. The table stores the behavior records of nearly one million Taobao APP users between November 25, 2017 and December 3, 2017. The database uses combined partitioning: the first level is partitioned by date, one partition per day, and the second level is hash partitioned by userID into 180 partitions. The structure of the user table is as follows:

[Figure: structure of the user table]
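The combined partitioning scheme described above can be pictured with a small Python sketch. It is not DolphinDB code: the function name partition_of is hypothetical, and a plain modulo stands in for DolphinDB's real hash function, so the actual bucket a given userID lands in may differ.

```python
from datetime import date, datetime

HASH_BUCKETS = 180  # second-level hash partitions, as stated in the text


def partition_of(behave_time, user_id):
    """Return (date partition, hash bucket) for one record.

    Assumption: plain modulo stands in for DolphinDB's real hash function.
    """
    return behave_time.date(), user_id % HASH_BUCKETS


# The data set covers 2017.11.25 through 2017.12.03, one date partition per day.
first_day, last_day = date(2017, 11, 25), date(2017, 12, 3)
num_days = (last_day - first_day).days + 1
total_partitions = num_days * HASH_BUCKETS

print(partition_of(datetime(2017, 11, 29, 8, 30, 0), 1000001))
print(num_days, total_partitions)
```

Under these assumptions, 9 days times 180 hash buckets gives 1,620 partitions, and a query that filters on a single date only needs to touch the 180 buckets of that day.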

The meanings of various user behavior types are as follows:

  • pv: view a product detail page
  • buy: purchase a product
  • cart: add a product to the shopping cart
  • fav: add a product to favorites

1. Download the Docker deployment package

This tutorial packages DolphinDB and the data in a Docker container. Make sure Docker is installed before you start; for installation instructions, see https://docs.docker.com/install/. Download the deployment package from http://www.dolphindb.cn/downloads/bigdata.tar.gz and run the following commands in the directory that contains it.

Unzip the deployment package:

gunzip bigdata.tar.gz

Import the container snapshot as an image:

cat bigdata.tar | docker import - my/bigdata:v1

Get the ID of the image my/bigdata:v1:

docker images

Start the container (replace <image id> with the actual image ID):

docker run -dt -p 8888:8848 --name test <image id> /bin/bash ./dolphindb/start.sh

Enter <machine IP>:8888 in the browser's address bar, for example localhost:8888, to open DolphinDB Notebook. All the code below is executed in DolphinDB Notebook.

The DolphinDB license in the Docker image is valid until September 1, 2019. If the license has expired, download the community edition from the DolphinDB official website and replace bigdata.tar/dolphindb/dolphindb.lic with the community edition license.

2. User Behavior Analysis

View data volume:

login("admin","123456")
user=loadTable("dfs://user_behavior","user")
select count(*) from user
98914533

There are 98,914,533 records in the user table.

Analyze the behavior of users from browsing to final purchase of goods:

PV=exec count(*) from user where behavior="pv"
88596903
UV=count(exec distinct userID from user)
987984

In these 9 days, Taobao APP had 88,596,903 page views and 987,984 unique visitors.

The exec keyword used above is specific to DolphinDB and is similar to select. The difference is that a select statement always returns a table, whereas exec returns a vector when it selects a single column, a scalar when used with an aggregate function, and a matrix when used with pivot by. This makes the result easier to use in subsequent calculations.

Count the number of users who only browsed the page once:

onceUserNum=count(select count(behavior) from user group by userID having count(behavior)=1)
92
jumpRate = onceUserNum \ UV * 100
0.009312

Only 92 users left the APP after a single page view. They account for 0.0093% of all users, which is almost negligible and suggests that Taobao is attractive enough to keep users in the APP.

Count the number of records for each behavior type:

behaviors=select count(*) as num from user group by behavior

[Figure: number of records for each behavior type]

Calculate the conversion rate from browsing to purchase intent:

Adding a product to the shopping cart or to favorites can be treated as a sign that the user intends to buy. Count the number of behaviors that show purchase intent:

fav_cart=exec sum(num) from behaviors where behavior="fav" or behavior="cart"
8318654
intentRate=fav_cart\PV*100
9.389328

The conversion rate from browsing to purchase intent is only about 9.39%.

buy=(exec num from behaviors where behavior="buy")[0]
1998976
buyRate=buy\PV*100
2.256259
intent_buy=buy\fav_cart*100
24.030041

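The conversion rate from browsing to final purchase is only about 2.26%, and the conversion rate from purchase intent to final purchase is 24.03%. This indicates that most users add the products they like to favorites or to the shopping cart but do not necessarily buy them right away.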
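The three funnel rates can be rechecked outside DolphinDB from the counts reported above. The following is a minimal Python sketch, not part of the original workflow; the helper name rate is hypothetical, and Python's / plays the role of DolphinDB's \ operator.

```python
# Counts reported earlier in this tutorial.
PV = 88_596_903        # "pv" records
FAV_CART = 8_318_654   # "fav" + "cart" records
BUY = 1_998_976        # "buy" records


def rate(numerator, denominator):
    """Percentage of numerator over denominator."""
    return numerator / denominator * 100


intent_rate = rate(FAV_CART, PV)   # browse -> purchase intent
buy_rate = rate(BUY, PV)           # browse -> purchase
intent_buy = rate(BUY, FAV_CART)   # purchase intent -> purchase

print(f"{intent_rate:.2f}% {buy_rate:.2f}% {intent_buy:.2f}%")
```

Note that these rates are ratios of record counts, not of users: a single user who views many pages contributes many times to the denominator.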

Count the unique visitors for each behavior type:

userNums=select count(userID) as num from (select count(*) from user group by behavior,userID) group by behavior

[Figure: number of unique visitors for each behavior type]

pay_user_rate=(exec num from userNums where behavior="buy")[0]\UV*100
67.852313

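In these 9 days, 67.85% of Taobao APP users made at least one purchase, which shows that most users shop on the APP.

Count the number of each type of user behavior per day: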

dailyUserNums=select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by date(behaveTime) as date

[Figure: daily counts of each behavior type]

Traffic to Taobao APP increases noticeably on weekends (Saturday and Sunday: 2017.11.25, 2017.11.26, 2017.12.02, and 2017.12.03).

iif is DolphinDB's conditional function. Its syntax is iif(cond, trueResult, falseResult), where cond is usually a Boolean expression. It returns trueResult where cond is satisfied and falseResult where it is not.
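The sum(iif(...)) pattern used in the query above is a conditional count: each row contributes 1 when the condition holds and 0 otherwise. A minimal Python sketch of the same idea follows. The iif defined here is a scalar stand-in written for illustration (DolphinDB's iif works on whole vectors), and the sample data is hypothetical.

```python
def iif(cond, true_result, false_result):
    """Scalar stand-in for DolphinDB's iif (the real one is vectorized)."""
    return true_result if cond else false_result


# Hypothetical sample of behavior values, for illustration only.
behaviors = ["pv", "pv", "cart", "pv", "buy", "fav", "pv"]

# One pass per column, mirroring sum(iif(behavior=="pv",1,0)) as pageView, ...
page_view = sum(iif(b == "pv", 1, 0) for b in behaviors)
favorite = sum(iif(b == "fav", 1, 0) for b in behaviors)
shopping_cart = sum(iif(b == "cart", 1, 0) for b in behaviors)
payment = sum(iif(b == "buy", 1, 0) for b in behaviors)

print(page_view, favorite, shopping_cart, payment)
```

Writing the counts this way yields one column per behavior type in a single grouped query, instead of one row per behavior type as a plain group by behavior would.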

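Next, count each type of user behavior in different time periods of a day. We provide two methods.

The first method computes the statistics for each time period separately and then merges the results. For example, count the user behaviors in different time periods of the working day 2017.11.29 (Wednesday).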

re1=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T00:00:00 : 2017.11.29T05:59:59

re2=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T06:00:00 : 2017.11.29T08:59:59

re3=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T09:00:00 : 2017.11.29T11:59:59

re4=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T12:00:00 : 2017.11.29T13:59:59

re5=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T14:00:00 : 2017.11.29T17:59:59

re6=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T18:00:00 : 2017.11.29T21:59:59

re7=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T22:00:00 : 2017.11.29T23:59:59

re=unionAll([re1,re2,re3,re4,re5,re6,re7],false)

[Figure: behavior counts by time period on 2017.11.29]

This method is simple, but it requires a lot of repeated code. The repeated code can of course be wrapped in a function.

def calculateBehavior(startTime,endTime){
    return select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between startTime : endTime
}

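With this function, only the start and end times of each period need to be specified.

The other method uses DolphinDB's Map-Reduce framework. For example, compute the user behaviors for the working day 2017.11.29 (Wednesday).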

def caculate(t){
	return select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from t	
}
ds1 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.29T00:00:00 2017.11.29T06:00:00 2017.11.29T09:00:00 2017.11.29T12:00:00 2017.11.29T14:00:00 2017.11.29T18:00:00 2017.11.29T22:00:00 2017.11.29T23:59:59)
WedBehavior = mr(ds1, caculate, , unionAll{, false})

[Figure: WedBehavior, behavior counts by time period on 2017.11.29]

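We use the repartitionDS function to repartition the user table by time range (without changing the table's original partitioning scheme) and generate multiple data sources, then use the mr function to compute on the data sources in parallel. DolphinDB applies the caculate function to each data source and then merges the results.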
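The split, map, and merge steps can be illustrated with a small Python sketch. It only mimics the idea and says nothing about how DolphinDB implements repartitionDS or mr. The function names and the sample records are hypothetical, and each range is assumed to be half-open (left boundary included, right boundary excluded).

```python
from bisect import bisect_right
from collections import Counter
from datetime import datetime

# Range boundaries as in the repartitionDS call; each range is [left, right).
day = datetime(2017, 11, 29)
boundaries = [day.replace(hour=h) for h in (0, 6, 9, 12, 14, 18, 22)]
boundaries.append(day.replace(hour=23, minute=59, second=59))

# Hypothetical records (behaveTime, behavior), for illustration only.
records = [
    (day.replace(hour=1, minute=5), "pv"),
    (day.replace(hour=1, minute=7), "cart"),
    (day.replace(hour=7, minute=30), "pv"),
    (day.replace(hour=15), "pv"),
    (day.replace(hour=15, minute=20), "buy"),
]


def split_by_range(rows):
    """Split step: group rows into one list ("data source") per time range."""
    sources = [[] for _ in range(len(boundaries) - 1)]
    for row in rows:
        idx = bisect_right(boundaries, row[0]) - 1
        if 0 <= idx < len(sources):
            sources[idx].append(row)
    return sources


def calculate(rows):
    """Map step: count each behavior type within one data source."""
    counts = Counter(behavior for _, behavior in rows)
    summary = {"time": rows[0][0]}
    summary.update({b: counts[b] for b in ("pv", "fav", "cart", "buy")})
    return summary


# Final step: collect the per-range results into one list (like unionAll).
result = [calculate(rows) for rows in split_by_range(records) if rows]
for row in result:
    print(row)
```

Because each data source is processed independently, the map step can run in parallel; only the final merge needs all the partial results.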

On the working day, Taobao APP usage is highest in the early morning (0:00 to 6:00), followed by the afternoon (14:00 to 18:00).

Compute the user behaviors for Saturday (2017.11.25) and Sunday (2017.11.26):

ds2 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.25T00:00:00 2017.11.25T06:00:00 2017.11.25T09:00:00 2017.11.25T12:00:00 2017.11.25T14:00:00 2017.11.25T18:00:00 2017.11.25T22:00:00 2017.11.25T23:59:59)
SatBehavior = mr(ds2, caculate, , unionAll{, false})

[Figure: SatBehavior, behavior counts by time period on 2017.11.25]

ds3 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.26T00:00:00 2017.11.26T06:00:00 2017.11.26T09:00:00 2017.11.26T12:00:00 2017.11.26T14:00:00 2017.11.26T18:00:00 2017.11.26T22:00:00 2017.11.26T23:59:59)
SunBehavior = mr(ds3, caculate, , unionAll{, false})

[Figure: SunBehavior, behavior counts by time period on 2017.11.26]

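On Saturday and Sunday, Taobao APP usage in every time period is higher than on the working day. As on the working day, usage on the weekend peaks in the early morning (0:00 to 6:00).

3. Product analysis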

allItems=select distinct(itemID) from user
4142583

In these 9 days, 4,142,583 distinct products appear in the data.

Count the number of purchases of each product:

itemBuyTimes=select count(userID) as times from user where behavior="buy" group by itemID order by times desc

Find the 20 best-selling products:

salesTop=select top 20 * from itemBuyTimes order by times desc

The product with ID 3122135 has the highest sales, with 1,408 purchases in total.

Count the number of products for each purchase count:

buyTimesItemNum=select count(itemID) as itemNums from itemBuyTimes group by times order by itemNums desc

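The result shows that the most common purchase count is one: 370,747 products were purchased exactly once in these 9 days, about 8.9% of all products. The higher the purchase count, the fewer products reach it.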
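This statistic is a two-level aggregation: first count purchases per product, then count products per purchase count. A minimal Python sketch with hypothetical sample data shows the shape of the computation; the variable names only echo the DolphinDB tables itemBuyTimes and buyTimesItemNum.

```python
from collections import Counter

# Hypothetical "buy" records as (userID, itemID) pairs, for illustration only.
buys = [(1, "A"), (2, "A"), (3, "B"), (1, "C"), (4, "A"), (5, "D")]

# First level: purchases per item (like itemBuyTimes).
item_buy_times = Counter(item for _, item in buys)

# Second level: number of items per purchase count (like buyTimesItemNum).
buy_times_item_num = Counter(item_buy_times.values())

# Largest groups first, mirroring "order by itemNums desc".
for times, item_nums in buy_times_item_num.most_common():
    print(times, item_nums)
```

In this sample, three items were bought once and one item was bought three times, the same long-tail shape the tutorial reports for the real data.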

Count each type of user behavior for every product:

allItemsInfo=select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by itemID

Find the 20 most viewed products:

pvTop=select top 20 itemID,pageView from allItemsInfo order by pageView desc

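The most viewed product, ID 812879, has 29,720 page views but only 135 purchases, which does not place it among the 20 best sellers.

Count each type of user behavior for the 20 best-selling products: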

select * from ej(salesTop,allItemsInfo,`itemID) order by times desc

The best-selling product, 3122135, has 1,777 page views, which does not place it among the 20 most viewed products, yet its conversion rate from browsing to purchase is as high as 79.2% (1,408 purchases out of 1,777 views). This product may be a necessity that users decide to buy without much browsing.


Extended exercise:

(1) Calculate the purchase rate of Taobao APP per hour on 2017.11.25 (purchase rate = number of purchases/total number of behaviors * 100%)

(2) Find the user with the most purchases and the products that user purchased most often

(3) Calculate the number of purchases of the product with the product ID 3122135 in each time period

(4) Count the frequency of each behavior in each category

(5) Calculate the highest-selling products in each category


This tutorial is for learning purposes only.


Origin blog.51cto.com/15022783/2606505