User behavior analysis model practice (1)-path analysis model

1. Demand background

In Internet data operation practice, there is a type of analysis that is largely unique to the Internet industry: path analysis. A path analysis application visually displays the upstream and downstream pages of a specific page and analyzes how users move through the product. For example, when users open an APP, how do they get from the [home page] to the [detail page]? What proportion of users go from the [home page] to the [detail page], the [play page], and the [download page]? And at which node do users drop off? Path analysis helps answer these questions.

In the technical design for this scenario, we split access data by session to mine the paths users frequently visit. The feature lets users instantly view the paths related to a selected node, supports user-defined start or end points for a path, and supports comparing the conversion of different target groups (for example, new users versus active users) on the same behavior path, meeting the needs of refined analysis.

1.1 Application scenarios

In scenarios that call for path analysis, users usually focus on questions such as:

  • What are the main paths users take in the APP, ranked by conversion rate from high to low?
  • After users leave the expected path, where do they actually go?
  • How do the behavior paths of users with different characteristics differ?

A concrete business scenario shows how the path analysis model answers these questions.

【Business scene】

Analyze the main behavior paths by which "active users" reach the target landing page [small video page] (the daily data volume is on the order of one billion records, and the query result is returned in about 1 s).

【User Operation】

  1. Select the start/end page and add the filter condition "User";

  2. Select the metric type "Visits"/"Sessions";

  3. Click Query; the result is produced in real time.

2. Basic concepts

Before diving into the data model and engineering architecture design, we first introduce some basic concepts to make the rest of the article easier to follow.

2.1 Path analysis

Path analysis is a commonly used data mining method. It is mainly used to analyze how users move through a product and to mine their frequently visited paths. Like the funnel feature, path analysis examines the steps users take while using a website or app, but instead of analyzing a single pre-set path, it can explore arbitrary paths.

2.2 Session and Session Time

Unlike a Session in a web application, a Session in data analysis refers to a series of interactions that occur on the website or app within a specified period of time. In this model, Session Time means that when the interval between two behaviors exceeds the Session Time, we consider the two behaviors not to belong to the same path.

2.3 Sankey Diagram

A Sankey diagram, also known as a Sankey energy split diagram or Sankey energy balance diagram, is a specific type of flowchart in which the width of each branch corresponds to the size of the data flow. As shown in Figure 4.1-1, each edge represents the traffic from the previous node to the current node. A complete Sankey diagram includes: node data and node conversion rate (red box in the figure below), and edge data and edge conversion rate (black box in the figure below). For how conversion rates are calculated, see [3.5 Conversion rate calculation].

[Figure: Sankey diagram with node and edge conversion rates]

2.4 Adjacency list

The construction of a Sankey diagram can be reduced to the problem of compressed storage of a graph. A graph usually consists of several parts:

  • Edge
  • Vertex
  • Weight
  • Degree

In this model, we use an adjacency list for storage. The adjacency list is a commonly used compressed storage structure for graphs: it uses linked lists to store the vertices and edges of the graph while ignoring edges that do not exist, thereby compressing the matrix. The structure of the adjacency list is as follows:

[Figure: adjacency list structure (a) and (b)]

In (a), the left side is the vertex node, which contains the vertex data and a pointer to its first edge; the right side is the edge node, which contains the edge weight, the in/out vertex information, and a pointer to the next edge. A complete adjacency list is similar in structure to a HashMap, as shown in (b): the left side is a sequential list storing the vertex nodes from (a), and each vertex node heads a linked list storing the edges connected to that vertex. In the page path model we modified the vertex node and edge node structures to fit the model's needs; see section [4.1] for details.
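
As a minimal illustration of this classic structure (the type and field names below are ours, not the model's actual types), the adjacency list can be sketched like this:

// Minimal sketch of the classic adjacency-list structure described above.
case class Edge(weight: Long, from: String, to: String, next: Option[Edge])

case class Vertex(data: String, firstEdge: Option[Edge])

// The "sequential list" on the left: one entry per vertex, each heading a linked list of edges.
case class AdjacencyList(vertices: Vector[Vertex]) {
  // Collect all edges leaving a given vertex by walking its linked list.
  def edgesOf(data: String): List[Edge] = {
    var cur = vertices.find(_.data == data).flatMap(_.firstEdge)
    val buf = scala.collection.mutable.ListBuffer.empty[Edge]
    while (cur.isDefined) { buf += cur.get; cur = cur.get.next }
    buf.toList
  }
}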

2.5 Tree pruning

Pruning is an important step in constructing a tree. It means deleting unimportant nodes to reduce the complexity of computation or search. In the page path model, the pruning step trims the tree built from the raw data, removing branches that do not meet the requirements so that every root-to-leaf path in the tree is complete.

2.6 PV and SV

PV stands for Page View, the number of visits; in this model it refers to the number of times a path was visited within a period of time. SV stands for Session View, the number of sessions; in this model it refers to the number of sessions in which the path appeared. For example, given path one A → B → C → D → A → B and path two A → B → D, the PV of A → B is 2 + 1 = 3 and its SV is 1 + 1 = 2.
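
To make the counting rule concrete, here is a small sketch (plain Scala, not the production code) that reproduces the PV = 3, SV = 2 result above:

// Sketch: count PV and SV of the sub-path A -> B from per-session page sequences.
val sessions = Seq(
  ("s1", Seq("A", "B", "C", "D", "A", "B")), // path one
  ("s2", Seq("A", "B", "D"))                 // path two
)

val target = Seq("A", "B")

// PV: every occurrence of the sub-path counts.
val pv = sessions.map { case (_, pages) => pages.sliding(target.length).count(_ == target) }.sum

// SV: each session counts at most once, however often the sub-path appears in it.
val sv = sessions.count { case (_, pages) => pages.sliding(target.length).contains(target) }

println(s"PV = $pv, SV = $sv") // PV = 3, SV = 2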

3. Data model design

This section introduces the design of the data model, covering the data flow, path division, pv/sv calculation, and finally the conversion rate calculation of paths in the Sankey diagram.

3.1 Overall data flow

The data comes from a unified data warehouse, is computed by Spark, and is then written into Clickhouse, with Hive serving as a cold backup. The data flow diagram is shown in Figure 3.1-1.

[Figure 3.1-1: overall data flow]

3.2 Technical selection

Clickhouse is not the focus of this article, so it is not described in detail here; we only briefly explain why we chose it.

The reason is that Clickhouse is a columnar store and is extremely fast. Take a look at the data size and query speed (as of the time of writing):

[Figure 3.2-1: data size]

The query speed over hundreds of billions of rows ends up looking like this:

[Figure 3.2-2: query speed]

3.3 Data modeling

3.3.1 Get page information and divide session

The page path model maps each event ID to a corresponding page ID and then performs path analysis on pages. The concept of Session was introduced in Section 2.2 and is not repeated here. We currently use a flexible session division so that users can query page conversion information at several session time granularities (5, 10, 15, 30, 60 minutes).

Suppose there are users a and b. User a's behavior events that day are E1, E2, E3, ..., the corresponding pages are P1, P2, P3, ..., the event times are T1, T2, T3, ..., and the selected session interval is tg. As shown in the figure, T4 - T3 > tg, so P1, P2, P3 are divided into the first session, P4, P5 into the second session, and P6 and the following pages into new sessions in the same way.

[Figure: session division example]

A simplified implementation is as follows:

import scala.collection.mutable.ArrayBuffer

def splitPageSessions(timeSeq: Seq[Long], events: Seq[String], interval: Int)
                     (implicit separator: String): Array[Array[Array[String]]] = {
  // events is the collection of events; timeSeq is the collection of the corresponding event timestamps
  if (events.contains(separator))
    throw new IllegalArgumentException("Separator shouldn't be in events.")
  if (events.length != timeSeq.length)
    throw new Exception("Events and timeSeq not in equal length.")
  val timeBuf = ArrayBuffer[String](timeSeq.head.toString) // timestamps with session separators inserted
  val eventBuf = ArrayBuffer[String](events.head) // events with session separators inserted
  if (timeSeq.length >= 2) {
    events.indices.tail.foreach { i =>
      if (timeSeq(i) - timeSeq(i - 1) > interval * 60000) { // if the gap between two events exceeds the configured interval, insert a separator to mark the session boundary
        timeBuf += separator
        eventBuf += separator
      }
      timeBuf += timeSeq(i).toString
      eventBuf += events(i)
    }
  }
  val tb = timeBuf.mkString(",").split(s",\\$separator,").map(_.split(",")) // split the timestamps into per-session groups at the separators
  val eb = eventBuf.mkString(",").split(s",\\$separator,").map(_.split(",")) // split the events into per-session groups at the separators
  tb.zip(eb).map(t => Array(t._1, t._2)) // zip each session's timestamps with its events and convert the tuples to arrays for later processing
}

3.3.2 Adjacent page de-duplication

Different events may map to the same page, so identical adjacent pages need to be filtered out. Therefore, after dividing sessions, the next step is to de-duplicate adjacent pages.
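
A minimal sketch of this step (plain collection code rather than the actual Spark job):

// Sketch: collapse runs of identical adjacent pages within one session,
// e.g. Seq("P1", "P1", "P2", "P2", "P3") -> Seq("P1", "P2", "P3").
def dedupAdjacent(pages: Seq[String]): Seq[String] =
  pages.foldLeft(Vector.empty[String]) { (acc, p) =>
    if (acc.nonEmpty && acc.last == p) acc else acc :+ p
  }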

[Figure 3.3-2]

The result after adjacent-page deduplication looks like this:

[Figure 3.3-3]

3.3.3 Get the four pages before and after each page

Next, a window function is applied to the above data to obtain the four pages before and after each page within each session. The sid is formed by concatenating the user ID and the session number. For example, the first session (session 0) of user a mentioned above generates the following 7 records; the page column is the current page, and missing pages are represented by -1.
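
In Spark this kind of neighbour lookup is typically done with lead/lag window functions; the following is only a sketch under assumed column names (a DataFrame df with sid, ts, page), not the production job:

// Sketch: per-session window to obtain the 4 pages before and after each page.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead, lit, coalesce}

val w = Window.partitionBy("sid").orderBy("ts")

val withNeighbours = (1 to 4).foldLeft(df) { (acc, i) =>
  acc
    .withColumn(s"page_id_previous$i", coalesce(lag("page", i).over(w), lit("-1")))
    .withColumn(s"page_id_next$i",     coalesce(lead("page", i).over(w), lit("-1")))
}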

[Figure 3.3-4]

Computing the rest in the same way yields 7 + 7 + 6 + 4 + 5 = 29 records in total, as follows:

[Figure: all 29 records]

3.3.4 Count the pv/sv of forward and backward paths

Taking page together with page_id_previous1..page_id_previous4 gives the backward five-level paths (path_direction = 2); taking page together with page_id_next1..page_id_next4 gives the forward five-level paths (path_direction = 1). For each path we count pv and sv (sv is de-duplicated by sid), producing the following data dfSessions:
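
Continuing the sketch above, the forward-path aggregation could look roughly like this (column names follow the figures, but the real job may differ):

// Sketch: count pv (rows) and sv (distinct sessions) per forward five-level path.
import org.apache.spark.sql.functions.{col, count, countDistinct, lit}

val dfForward = withNeighbours
  .select(col("sid"), col("page").as("page_id_lv1"),
          col("page_id_next1").as("page_id_lv2"), col("page_id_next2").as("page_id_lv3"),
          col("page_id_next3").as("page_id_lv4"), col("page_id_next4").as("page_id_lv5"))
  .withColumn("path_direction", lit(1))
  .groupBy("path_direction", "page_id_lv1", "page_id_lv2", "page_id_lv3", "page_id_lv4", "page_id_lv5")
  .agg(count(lit(1)).as("pv"), countDistinct("sid").as("sv"))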

[Figure: dfSessions data]

The data above may be hard to read on its own, so here are two examples. The first result row:

[Figure 3.3-4]

This is a forward (path_direction = 1) path result. In the figure below it is a left-to-right path; the two corresponding paths are as follows:

[Figure 3.3-5]

The second result row:

[Figure 3.3-6]

This is also a forward path result, with pv = 2; the two corresponding paths are shown below. Its sv is 1 because both paths have the same sid: they were both generated by user a in session S1.

[Figure 3.3-7]

3.3.5 Calculate the pv/sv of all levels of paths

Then, based on dfSessions, pv and sv are summed grouped by page_id_lv1 to obtain the pv and sv of the first-level paths, whose path_direction is set to 0.

[Figure: first-level path pv/sv]

The pv and sv of the second-, third-, fourth-, and fifth-level paths are calculated in the same way, and all the results are merged to obtain the following:

[Figure: pv/sv of paths at all levels]

3.4 Data writing

The result data computed by Spark needs to be written to Clickhouse to serve online queries, and written to Hive as a cold backup that can be used to restore the Clickhouse data.

The Clickhouse tables use a Distributed table structure. A distributed table stores no data itself; it acts as a transparent proxy for data shards and automatically routes data to the nodes of the cluster, so the Distributed engine must be used together with other table engines. The data of the user path analysis model is stored across the shards of the cluster, using random sharding. This brings us to how data is written to Clickhouse, explained below.

In the early stage of the model, we wrote data through the distributed table. The writing process is as follows:

  1. The client establishes a JDBC connection with node A in the cluster and writes data through an HTTP POST request;
  2. After receiving the data, shard A does two things: first, it splits the data according to the sharding rules; second, it writes the data belonging to the current shard into its own local table;
  3. Shard A writes the data belonging to remote shards into temporary bin files, partition by partition, under a directory named as follows: /database@host:port/[increase_num].bin;
  4. Shard A tries to establish connections with the remote shards;
  5. A separate set of monitoring tasks watches the temporary bin files generated above and sends the data to the remote shards, each piece of data being sent in a single thread;
  6. The remote shards receive the data and write it to their local tables;
  7. Shard A confirms that the write is complete.

From the above process it can be seen that the Distributed table is responsible for writing data to all shards, so the node holding the JDBC connection sees extreme peaks in inbound and outbound traffic, which causes the following problems:

  1. The load on that single node is too high, mainly reflected in memory usage, NIC inbound/outbound traffic, and the number of TCP connections in the waiting state; the machine's health is very poor;
  2. As the business grows, more models will be connected to Clickhouse for OLAP, which means larger data volumes. If we keep writing this way, a single machine will inevitably go down, and with no high availability in place, the downtime of a single machine makes the whole cluster unavailable;
  3. In the future we will certainly make the ck cluster highly available and use the more reliable ReplicatedMergeTree; when writing through the distributed table with that engine, data inconsistencies can occur.

To address this, the write path was changed to write to local tables via DNS round-robin (a minimal sketch follows the list below). After the change:

  • The number of TCP connections in the waiting state on the JDBC-connected machine dropped from 90 to 25, a reduction of more than 72%;
  • The peak inbound traffic of the JDBC-connected machine dropped from 645 M/s to 76 M/s, a reduction of more than 88%;
  • The outbound traffic caused by data distribution on the JDBC-connected machine, about 92 M/s, dropped to zero after the change.
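
The idea behind the change can be sketched with plain JDBC as follows; the host list, database and table names are illustrative, and the production write path goes through Spark rather than hand-written JDBC:

// Sketch: round-robin writes to each node's *local* table instead of the Distributed table.
import java.sql.DriverManager
import java.util.concurrent.atomic.AtomicLong

val hosts = Seq("ck-node-1:8123", "ck-node-2:8123", "ck-node-3:8123") // illustrative node list
val counter = new AtomicLong(0)

def nextUrl(): String = {
  val host = hosts((counter.getAndIncrement() % hosts.size).toInt)
  s"jdbc:clickhouse://$host/path_analysis" // database name is illustrative
}

def writeBatch(rows: Seq[Seq[AnyRef]]): Unit = {
  val conn = DriverManager.getConnection(nextUrl())
  try {
    val ps = conn.prepareStatement(
      "INSERT INTO path_local (node_depth, page_id_lv1, page_id_lv2, page_id_lv3, page_id_lv4, page_id_lv5) VALUES (?, ?, ?, ?, ?, ?)")
    rows.foreach { r =>
      r.zipWithIndex.foreach { case (v, i) => ps.setObject(i + 1, v) }
      ps.addBatch()
    }
    ps.executeBatch()
  } finally conn.close()
}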

In addition, when the Distributed table writes data to remote shards, there are two modes: asynchronous and synchronous. With asynchronous writes, the success message is returned as soon as the Distributed table has written to the local shard; with synchronous writes, success is returned only after all shards have finished writing. The default is asynchronous; the timeout for waiting on synchronous writes can be controlled via parameters.

3.5 Conversion rate calculation

On the front-end page, select the corresponding dimensions and the starting page:

[Figure: front-end dimension and start-page selection]

The backend then queries Clickhouse for:

  • rows where the node depth (node_depth) is 1 and the first-level page (page_id_lv1) is the selected page, to obtain the first-level page and its sv/pv;

  • rows where the node depth is 2 and the first-level page is the selected page, taking the top 10 ordered by sv/pv descending, to obtain the second-level pages and their sv/pv;

  • rows where the node depth is 3 and the first-level page is the selected page, taking the top 20 ordered by sv/pv descending, to obtain the third-level pages and their sv/pv;

  • rows where the node depth is 4 and the first-level page is the selected page, taking the top 30 ordered by sv/pv descending, to obtain the fourth-level pages and their sv/pv;

  • rows where the node depth is 5 and the first-level page is the selected page, taking the top 50 ordered by sv/pv descending, to obtain the fifth-level pages and their sv/pv.

Conversion rate calculation rules:

Page conversion rate:

Suppose there are paths ABC, ADC, and ABDC, where A, B, C, and D are four different pages.

To calculate the conversion rate of the third-level page C:

(the sum of pv/sv of all depth-3 paths whose third-level page is C) ÷ (the pv/sv of the first-level page)

Path conversion rate:

Suppose again there are paths ABC, ADC, and ABDC, where A, B, C, and D are four different pages.

To calculate the conversion rate of B → C in the path ABC:

(the pv/sv of path ABC) ÷ (the sum of pv/sv of all depth-3 paths whose second-level page is B)
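
As a worked example with made-up pv values (the extra path A → B → E and all numbers are purely illustrative, and the first-level pv of A is simply taken as the sum over the listed paths):

// depth-3 paths: A->B->C pv=4, A->D->C pv=2, A->B->E pv=1
// depth-4 path : A->B->D->C pv=3
val lv1PvOfA   = 4 + 2 + 1 + 3                     // pv of first-level page A = 10
val lv3PvOfC   = 4 + 2                             // depth-3 paths whose 3rd-level page is C
val nodeConvC  = lv3PvOfC.toDouble / lv1PvOfA      // 6 / 10 = 0.6
val pvOfABC    = 4
val lv2PvOfB   = 4 + 1                             // depth-3 paths whose 2nd-level page is B
val pathConvBC = pvOfABC.toDouble / lv2PvOfB       // 4 / 5 = 0.8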

4. Engineering architecture design

This section explains the processing architecture on the engineering side, covering the construction of the Sankey diagram, path merging, conversion rate calculation, and pruning.

4.1 The structure of Sankey diagram

[Figure: Sankey diagram prototype]

As can be seen from the prototype diagram above, we need to construct a Sankey diagram; on the engineering side this means constructing a weighted path tree.

Simplifying the figure above, the requirement becomes building an adjacency list for a weighted tree. The figure below on the left shows our adjacency list design: the sequential list on the left stores each node (Vertex), including node information such as the node name (name), node code (code), and a pointer to its list of edges (Edge); each node (Vertex) points to a linked list of edges (Edge), and each edge stores its weight, endpoint information, and a pointer to the next edge of the same node.

[Figure 4.1-2]

[Figure 4.1-3]

Figure 4.1-2 shows the adjacency list we use in the model; it makes some changes to the adjacency list described in 2.4. In our Sankey diagram, nodes with the same name but different conversion rates appear at different levels. These nodes are parts of different paths, cannot be treated as duplicates purely by name, and do not form a loop. If the entire Sankey diagram were represented by a single adjacency list, such nodes would be treated as the same node and a loop would appear in the diagram. Therefore we partition the Sankey diagram by level, and every two adjacent levels are represented by one adjacency list, as shown in Figure 4.1-2: Level 1 holds the level-1 nodes and the edges pointing to level 2, Level 2 holds the level-2 nodes and the edges pointing to level 3, and so on.

4.2 Definition of path

First, let’s review the Sankey diagram:

[Figure: Sankey diagram]

Observing the figure above, we find that four quantities need to be calculated: the pv/sv of each node, the conversion rate of each node, the pv/sv between nodes, and the conversion rate between nodes. Their definitions are:

  • Node pv/sv = the sum of pv/sv of the current node at the current level

  • Node conversion rate = (node pv/sv) / (path starting node pv/sv)

  • Pv/sv between nodes = the pv/sv from the previous node to the current node

  • Conversion rate between nodes = (pv/sv between nodes) / (pv/sv of the previous node)

Now let's look at the path data stored in Clickhouse, starting with the table structure:

(
  `node_depth` Int8 COMMENT 'node depth, 5 levels in total, enumerated values 1-2-3-4-5' CODEC(T64, LZ4HC(0)),
  `page_id_lv1` String COMMENT 'level-1 page, the starting page' CODEC(LZ4HC(0)),
  `page_id_lv2` String COMMENT 'level-2 page' CODEC(LZ4HC(0)),
  `page_id_lv3` String COMMENT 'level-3 page' CODEC(LZ4HC(0)),
  `page_id_lv4` String COMMENT 'level-4 page' CODEC(LZ4HC(0)),
  `page_id_lv5` String COMMENT 'level-5 page' CODEC(LZ4HC(0))
)

These are the more important fields in the path table, indicating the node depth and the pages at each level. The data in the table includes both complete paths and intermediate paths. A complete path runs from the starting point to an exit or from the starting point to the specified end point; paths longer than 5 levels are truncated and treated as 5-level paths. An intermediate path is intermediate data generated during the computation and cannot be used as a complete path.

Path data:

(1) Full path

[Figure: full path data examples]

(2) Incomplete path

[Figure: incomplete path data example]

Then we need to filter out the complete path from the data and organize the path data into a tree structure.

4.3 Design and Implementation

4.3.1 Overall framework

[Figure: overall back-end framework]

The overall implementation on the back end is straightforward: read the data, construct the adjacency lists, and prune. So how do we filter complete versus incomplete paths? We prune at the service layer to filter out incomplete paths. The pseudo code below describes the whole process:

// 1-1: read the raw data level by level
// 1-1-1: build the Clickhouse SQL per level
    for( int depth = 1; depth <= MAX_DEPTH; depth ++){
        sql.append(select records where node_depth = depth)
    }
// 1-1-2: read the data
    clickPool.getClient();
    records = clickPool.getResponse(sql);
// 2-1: build the parent-child and child-parent relations between nodes (bidirectional edge construction)
    findFatherAndSonRelation(records);
    findSonAndFathRelation(records);
// 3-1: pruning
// 3-1-1: remove isolated nodes
    for(int depth = 2; depth <= MAX_DEPTH; depth ++){
        while(hasNode()){
            node = getNode();
            if node does not have father in level depth-1:
                cut out node;
        }
    }
// 3-1-2: filter out incomplete paths
    for(int depth = MAX_DEPTH - 1; depth >= 1; depth --){
        cut out this path;
    }
// 3-2: build the adjacency list
    while(node.hasNext()){
        sumVal = calculate the sum of pv/sv of this node until this level;
        edgeDetails = get the details of edges connected to this node and the end point connected to the edges;
        sortEdgesByEndPoint(edgeDetails);
        path = new Path(sumVal, edgeDetails);
    }

4.3.2 Clickhouse connection pool

We introduced ClickHouse for the page path model above, so its features are not repeated here. We use a simple HTTP connection pool to connect to the ClickHouse server. The connection pool structure is as follows:
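
A bare-bones sketch of such a pool, using the JDK HTTP client (the real pool adds sizing, health checks and error handling, and the endpoint is illustrative):

// Sketch: a tiny blocking HTTP "connection" pool for ClickHouse queries.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.concurrent.ArrayBlockingQueue

class ClickPool(endpoint: String, size: Int = 4) {
  private val pool = new ArrayBlockingQueue[HttpClient](size)
  (1 to size).foreach(_ => pool.put(HttpClient.newHttpClient()))

  def query(sql: String): String = {
    val client = pool.take()                     // borrow a client (blocks if none is free)
    try {
      val req = HttpRequest.newBuilder(URI.create(endpoint))
        .POST(HttpRequest.BodyPublishers.ofString(sql))
        .build()
      client.send(req, HttpResponse.BodyHandlers.ofString()).body()
    } finally pool.put(client)                   // return it to the pool
  }
}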

[Figure: connection pool structure]

4.3.3 Data reading

As described above, we need to read the full paths from the data.

(
  `node_depth` Int8 COMMENT 'node depth, enumerated values',
  `page_id_lv1` String COMMENT 'level-1 page, the starting page',
  `page_id_lv2` String COMMENT 'level-2 page',
  `page_id_lv3` String COMMENT 'level-3 page',
  `page_id_lv4` String COMMENT 'level-4 page',
  `page_id_lv5` String COMMENT 'level-5 page',
  `val` Int64 COMMENT 'full data value'
)

From the table structure above, the paths written to the database have depth ≤ 5 and have already gone through a first round of filtering. On this basis we need to distinguish full paths from incomplete paths, determine whether a row is a full path based on node_depth and page_id_lvn, and compute the value of each node as needed.

Full path judgment condition:

  • node_depth=n, page_id_lvn=pageId (n < MAX_DEPTH)
  • node_depth=n, page_id_lvn=pageId || page_id_lvn=EXIT_NODE (n = MAX_DEPTH)
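
Expressed as a predicate, the two conditions might look like this (a sketch only; the EXIT_NODE encoding and parameter names are illustrative, and lastPage denotes page_id_lv{node_depth}):

val MAX_DEPTH = 5
val EXIT_NODE = "-1" // illustrative encoding of the exit node

// Decide whether a row read from the path table is a full path ending at the selected page.
def isFullPath(nodeDepth: Int, lastPage: String, pageId: String): Boolean =
  if (nodeDepth < MAX_DEPTH) lastPage == pageId
  else lastPage == pageId || lastPage == EXIT_NODE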

Given these conditions, there are two options for reading paths. Option one: filter directly by the conditions above to obtain complete paths; due to the limits of Clickhouse and back-end performance, the number of rows fetched must be limited. Option two: read level by level, which allows computing over the full data but does not guarantee retrieving an exact number of paths.

Observation shows that paths in the data overlap. Suppose there are two paths:

A → B → C → D → EXIT_NODE

A → B → E → D → EXIT_NODE

With these two paths, the value of each node needs to be calculated. In the actual data, a node's value can only be obtained from the intermediate (incomplete) paths, so option one is not applicable.

Option two reads level by level, as in the following pseudo code:

for(depth = 1; depth <= MAX_DEPTH; depth++){
    select
        node_depth as nodeDepth,
        ...,
        sum(sv) as val
    from
        table_name
    where
        ...
        AND (toInt16OrNull(pageId1) = 45)
        AND (node_depth = depth)
        ...
    group by
        node_depth,
        pageId1,
        pageId2,
        ...
    ORDER BY
        ...
    LIMIT
        ...
}

The data read is as follows:

[Figure: data read level by level]

Then, node1_A_val = 10+20, node2_B_val = 9+15 and so on.

4.3.4 Pruning

As described in 4.3.3, during the fetch stage we extract the raw data level by level, and that raw data contains both complete and incomplete paths. The figure below shows a tree constructed directly from the raw data (the original tree). According to our definition of a complete path (either the path depth reaches 5 and the end node is exit or any other node, or the path depth is less than 5 and the end node is exit), the red part in the figure (node4_lv1 → node3_lv2) is an incomplete path.

In addition, isolated nodes (the green node node4_lv2) can appear in the original tree. This is because during the fetch stage we sort the data within each level before taking the top rows, so relevance across levels cannot be guaranteed: node4_lv2 ranks high within level lv2, while its predecessor and successor nodes rank too low to be selected, producing an isolated node.

[Figure 4.3-3: original tree with an incomplete path and an isolated node]

Therefore, after fetching the raw data set, we still need to filter it to obtain the paths we really need.

In the model, we implement this filtering operation through pruning.

// remove isolated nodes
    for(int depth = 2; depth <= MAX_DEPTH; depth ++){
        while(hasNode()){
            node = getNode();
            if node does not have any father and son: // [1]
                cut out node;
        }
    }
// filter out incomplete paths
    for(int depth = MAX_DEPTH - 1; depth >= 1; depth --){
        cut out this path; // [2]
    }

In the previous steps we already obtained the bidirectional edge lists (parent-to-child and child-to-parent relations). Therefore, at [1] in the pseudo code above, the edge lists let us quickly find the predecessors and successors of the current node and decide whether it is isolated.

Similarly, we use the edge lists to prune incomplete paths. For incomplete paths, pruning only needs to consider paths whose depth is less than MAX_DEPTH and whose last node is not EXIT_NODE. So at [2] in the pseudo code above, we only need to check whether a node at the current level has forward edges (parent-to-child relations); if not, the node is removed.
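
A sketch of the isolated-node check at [1], using the two edge maps built earlier (all names are illustrative):

// Sketch: at each level, drop nodes that have neither a parent at level-1 nor a child at level+1.
// fatherOf(level)(node) -> parents at the previous level; sonOf(level)(node) -> children at the next level.
def pruneIsolated(levels: Map[Int, Set[String]],
                  fatherOf: Map[Int, Map[String, Set[String]]],
                  sonOf: Map[Int, Map[String, Set[String]]]): Map[Int, Set[String]] =
  levels.map { case (depth, nodes) =>
    if (depth == 1) depth -> nodes // the root level is never treated as isolated
    else depth -> nodes.filter { n =>
      val hasFather = fatherOf.getOrElse(depth, Map.empty).getOrElse(n, Set.empty).nonEmpty
      val hasSon    = sonOf.getOrElse(depth, Map.empty).getOrElse(n, Set.empty).nonEmpty
      hasFather || hasSon
    }
  }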

5. Closing remarks

Given the requirement for short query times and visualization in platform-based queries, and combined with the available storage and compute resources and the specific requirements, our implementation enumerates the path data and merges it in two passes: the first pass merges identical paths within the same day, and the second pass aggregates paths over the date range. We hope this article provides a useful reference for path analysis; when applying it, the design should be adapted to the characteristics of each business so as to serve the business better.

The Clickhouse details involved in this solution are not covered here; interested readers can learn more on their own. Discussion and feedback are welcome.

Author: vivo Internet Big Data team
