A comprehensive interpretation of execution plans in NebulaGraph, based on real cases

This article is compiled from the live-stream talk "Talk about Execution Plans" by NebulaGraph core developer Yee. The video of the talk is available on Bilibili: https://www.bilibili.com/video/BV1Cu4y1h7gn/

The life of a Query

Before we formally interpret an execution plan, let us first understand how a query statement (Query) is parsed, validated into a syntax tree, and finally converted into a logical/physical execution plan in NebulaGraph. Whether the query is written in NebulaGraph's native query language nGQL or in openCypher (supported since v2.x), it goes through this same journey from string to execution plan. As with a programming language, this is essentially a compilation process.

The life cycle of a query statement roughly has the following stages:

Similar to most databases, a Query string is first converted into an AST (Abstract Syntax Tree) by the Parser (lexer and syntax parser); different query statements are converted into different AST nodes. The AST then passes through a Validator, which verifies the validity of the statement. Validation is mainly based on the Schema information (metadata) of the vertex and edge types that have been created. Since NebulaGraph's query language was designed to be schema-based from the start, relevant deductions can be made in advance (based on the statement context), such as which Schema types a given query will touch.

Here is a knowledge point. In NebulaGraph, there are two key types to pay attention to: EdgeType (for edges) and Tag (for vertices). Each edge has an EdgeType, and the properties defined on that EdgeType have specific data types. Likewise, the properties of each vertex are grouped under a Tag and become that Tag's properties. By analogy with a relational database, a Tag is roughly equivalent to a table. Note that in NebulaGraph, although the properties of vertices and edges are constrained by the Schema, the Tag types of an edge's two endpoints are not constrained. Inserting or modifying a vertex is therefore keyed only by its ID (VID), regardless of the vertex's Tag.
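To make this concrete, here is a minimal, hypothetical schema sketch in nGQL. The names (player, follow) and properties follow the basketballplayer-style dataset used in the examples below, but the exact definitions here are illustrative, not taken from the talk:

CREATE TAG player(name string, age int);
CREATE EDGE follow(degree int);
// An edge's endpoints are not constrained to any Tag;
// inserting a vertex is keyed only by its VID:
INSERT VERTEX player(name, age) VALUES "player100":("Tim Duncan", 42);
INSERT EDGE follow(degree) VALUES "player100"->"player101":(95);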

Returning to the Query life cycle: after validation, the AST is converted into an execution plan composed of concrete operators in NebulaGraph. The plan is then rewritten into a better plan by the Optimizer and handed to the scheduler for execution. Traditional databases distinguish logical plans from physical plans, but NebulaGraph currently has only one kind of plan, the physical plan. As in the flow chart above, what the Optimizer processes is a physical plan, so the optimized plan it produces can be executed directly.

The execution plan is run by the Executors in the picture above (other databases may call them Operators). At this layer, interface interactions come into play: operators that "deal with" the storage layer call RPC interfaces, while pure computation operators on the graph layer, such as Project and Filter shown in bold above, run entirely in graphd. Execution also involves the execution model, such as multi-threading, the runtime, and so on; some of the performance problems people in the community run into are related to these.

Execution plan example

GO 3 STEPS FROM "player100" OVER follow WHERE $$.player.age > 34 YIELD DISTINCT $$.player.name AS name, $$.player.age AS age | ORDER BY $-.age ASC, $-.name DESC;

Example source: "nGQL Concise Tutorial vol.02 Execution Plan Detailed Explanation and Tuning" Website: https://discuss.nebula-graph.com.cn/t/topic/12010

This is a typical nGQL query statement: it traverses via the GO clause, filters via the WHERE clause, and finally sorts and outputs via ORDER BY.

The following figure is the execution plan generated by the above query statement:

[Figure: go_n_steps — the execution plan of the GO statement above]

As shown above, this execution plan is relatively complex, and it differs from the relational-database execution plans most people are used to. For example, the Loop in the lower-right corner and the single-dependency LeftJoin in the upper-left corner may be unlike anything you have come across before.

Structurally, unlike Neo4j's tree-shaped plans, an execution plan in NebulaGraph is not only directed but may also contain cycles. Why is NebulaGraph's execution plan structure so different?

explain format="dot" {
 $a=go 2 steps from "Tim Duncan" over like yield like.likeness as c1;
 $b=go 1 steps from "Tony Parker" over like yield like.likeness as c1;
 $c=yield $a.c1 as c1 union yield $b.c1 as c1;
}

In the example above, three variables a, b, and c are defined. a and b are unrelated; each is the output of a multi-hop traversal. c performs a union over these two variables. Let's look at the execution plans corresponding to these three statements:

In terms of execution dependency, the three statements form a sequential chain: taking the yellow circle in the figure above as the dividing line, below the yellow circle is the plan of a; above the yellow circle and below Union_11 is the plan of b; and above Union_11 is the plan of c.

The data dependencies can be seen from the arrows. The data input of Project_9 is the variable a, which is produced by Project_4, not by the PassThrough_13 adjacent to it in the execution order. Similarly, the data input of Project_10 is the variable b, output by Project_8.

What takes some effort to distinguish here is execution dependency versus data dependency. Execution dependency is the ordering indicated by the black arrows in the figure above, while data dependency is determined by where each operator's inputVar comes from. Data dependencies may follow the same order as execution dependencies, or a different one.

This situation arises because nGQL itself is very flexible. Unlike SQL, which would generate three separate execution plans for a, b, and c, in NebulaGraph a and b have no data dependency on each other while c depends on both, yet when the plan is generated they are processed as one statement with a fixed sequential order. A single execution plan can thus cover the execution of multiple statements. This is of course very flexible, but it also makes statement tuning harder.

Let’s look at an openCypher example:

match (v:player)
with v.name as name
match (v:player)--(v2) where v2.name = name
with v2.age as age
match (v:player) where v.age>age
return v.name

In this multi-MATCH example, each MATCH acts as a Pattern. The output of the first MATCH is passed down to become the filtered input of the second MATCH, and the filtered input of the third MATCH comes from the output of the second. The MATCH clauses are chained through WITH, which differs from nGQL's notion of a subquery. In openCypher, the data dependency is linear and is naturally passed on to the next Pattern through WITH: if the first WITH did not carry its result into the second MATCH, the third MATCH could not be completed.

It is precisely this linear relationship that makes Neo4j's execution plans tree-shaped, while nGQL's query flexibility leads to execution plans that are directed graphs, possibly with cycles.

Execution plan optimization

Currently, NebulaGraph performs only RBO (Rule-Based Optimization). NebulaGraph's RBO follows a popular implementation approach in the industry, with some differences: the plan is transformed (rewritten) based on rules distilled from prior experience.

NebulaGraph's RBO is implemented in a memo + bottom-up fashion.

In NebulaGraph, each plan node of a plan is placed in a group, and the plan nodes within a group are semantically equivalent. A group node connects to the next group; if the next group contains multiple plan nodes, multiple alternative plans can be derived.

The entire optimization process is iterative. The optimizer finds the bottom-most (leaf) plan nodes of the plan and matches them against the rule set (the rules can be viewed here: https://github.com/vesoft-inc/nebula/tree/master/src/graph/optimizer/rule). If a rule matches, the matched sub-plan is transformed, a new sub-plan is generated, and it is inserted into the relevant group.

For example, if the Filter operator on the left side of the figure above matches a rule together with the GetNeighbors operator, a new GetNeighbors node carrying the Filter is generated. The new GetNeighbors node is inserted into the group where the original Filter was located, and thanks to the memo structure, the new GetNeighbors can connect to the Start sub-plan that the original GetNeighbors was connected to.

RBO assumes by default that, if a rule matches, the new plan generated is better than the old one, so the new plan directly replaces the old plan. The execution plan therefore ends up like the one on the right.
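A quick way to observe this kind of rule in action is to EXPLAIN a simple GO with an edge-property filter. The statement below is an illustrative sketch against the basketballplayer-style schema used elsewhere in this article:

EXPLAIN GO FROM "player100" OVER follow WHERE follow.degree > 90 YIELD follow._dst AS dst;

If the pushdown rule fires, the standalone Filter node should no longer appear in the optimized plan; instead the predicate shows up in the filter field of the GetNeighbors operator's info, meaning it is evaluated on the storage side.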

Several execution plan tuning flags

As mentioned above, execution includes a Runtime phase. The following flags mainly affect the execution efficiency of the graph layer's Runtime phase.

  • --max_job_size: the maximum number of concurrent jobs within each operator;
  • --min_batch_size: the minimum number of rows per batch when splitting tasks for concurrent operator execution;
  • --optimize_appendvertices: a performance switch inside the AppendVertices operator; when the data has no dangling edges, turning it on reduces RPC calls.
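These are graphd gflags, so a typical way to set them is in nebula-graphd.conf. The values below are only an illustrative sketch, not tuning recommendations; check the defaults for your version:

########## Runtime tuning (illustrative values) ##########
--max_job_size=1
--min_batch_size=8192
--optimize_appendvertices=false

Restart the graphd service after editing the file for the changes to take effect.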

max_job_size mainly controls the degree of concurrency of some operators in graphd, and it is used in conjunction with min_batch_size. This relates to NebulaGraph's materialization model: after each operator executes, its result is materialized into memory, and the next iteration reads data from that memory rather than computing through a pipeline. Together, max_job_size and min_batch_size control how many rows each thread processes per batch during iteration, thereby tuning performance.

The optimize_appendvertices parameter mainly serves MATCH statements. When using MATCH for path search, we usually hope there are no dangling edges in the path. If you know your data contains no dangling edges, you can set optimize_appendvertices to true, so that the graph layer no longer needs to interact with storage to verify that the endpoints of each edge exist, which saves execution time.

What does the execution plan look like?

The sections above covered the principles; now let's walk through, hands-on, how to read an execution plan and how to optimize with it.

Add PROFILE or EXPLAIN in front of the corresponding statement to get its execution plan, like this:

profile GET SUBGRAPH 5 STEPS FROM "Yao Ming" OUT like WHERE like.likeness > 80 YIELD VERTICES AS nodes, EDGES AS relationshipis;
explain GET SUBGRAPH 5 STEPS FROM "Yao Ming" OUT like WHERE like.likeness > 80 YIELD VERTICES AS nodes, EDGES AS relationshipis;

If you don't know the data volume, you can use EXPLAIN to view the structure of the execution plan; it does not include the execution time of each operator.

One type of question often comes up in the community: when I filter with SUBGRAPH, is the edge filter applied at every hop? Through this example, you can find out for yourself.

Execution Plan (optimize time 41 us)

-----+-------------+--------------+----------------+----------------------------------
| id | name        | dependencies | profiling data | operator info                   |
-----+-------------+--------------+----------------+----------------------------------
|  2 | DataCollect | 1            |                | outputVar: {                    |
|    |             |              |                |   "colNames": [                 |
|    |             |              |                |     "nodes",                    |
|    |             |              |                |     "relationshipis"            |
|    |             |              |                |   ],                            |
|    |             |              |                |   "type": "DATASET",            |
|    |             |              |                |   "name": "__DataCollect_2"     |
|    |             |              |                | }                               |
|    |             |              |                | inputVar: [                     |
|    |             |              |                |   {                             |
|    |             |              |                |     "colNames": [],             |
|    |             |              |                |     "type": "DATASET",          |
|    |             |              |                |     "name": "__Subgraph_1"      |
|    |             |              |                |   }                             |
|    |             |              |                | ]                               |
|    |             |              |                | distinct: false                 |
|    |             |              |                | kind: SUBGRAPH                  |
-----+-------------+--------------+----------------+----------------------------------
|  1 | Subgraph    | 0            |                | outputVar: {                    |
|    |             |              |                |   "colNames": [],               |
|    |             |              |                |   "type": "DATASET",            |
|    |             |              |                |   "name": "__Subgraph_1"        |
|    |             |              |                | }                               |
|    |             |              |                | inputVar: __VAR_0               |
|    |             |              |                | src: COLUMN[0]                  |
|    |             |              |                | tag_filter:                     |
|    |             |              |                | edge_filter: (like.likeness>80) |
|    |             |              |                | filter: (like.likeness>80)      |
|    |             |              |                | vertexProps: [                  |
|    |             |              |                |   {                             |
|    |             |              |                |     "props": [                  |
|    |             |              |                |       "_tag"                    |
|    |             |              |                |     ],                          |
|    |             |              |                |     "tagId": 2                  |
|    |             |              |                |   },                            |
|    |             |              |                |   {                             |
|    |             |              |                |     "props": [                  |
|    |             |              |                |       "_tag"                    |
|    |             |              |                |     ],                          |
|    |             |              |                |     "tagId": 4                  |
|    |             |              |                |   },                            |
|    |             |              |                |   {                             |
|    |             |              |                |     "props": [                  |
|    |             |              |                |       "_tag"                    |
|    |             |              |                |     ],                          |
|    |             |              |                |     "tagId": 3                  |
|    |             |              |                |   }                             |
|    |             |              |                | ]                               |
|    |             |              |                | edgeProps: [                    |
|    |             |              |                |   {                             |
|    |             |              |                |     "props": [                  |
|    |             |              |                |       "_dst",                   |
|    |             |              |                |       "_rank",                  |
|    |             |              |                |       "_type",                  |
|    |             |              |                |       "likeness"                |
|    |             |              |                |     ],                          |
|    |             |              |                |     "type": 5                   |
|    |             |              |                |   }                             |
|    |             |              |                | ]                               |
|    |             |              |                | steps: 5                        |
-----+-------------+--------------+----------------+----------------------------------
|  0 | Start       |              |                | outputVar: {                    |
|    |             |              |                |   "colNames": [],               |
|    |             |              |                |   "type": "DATASET",            |
|    |             |              |                |   "name": "__Start_0"           |
|    |             |              |                | }                               |
-----+-------------+--------------+----------------+----------------------------------

Tue, 17 Oct 2023 17:35:12 CST

The execution plan printed in the terminal is a table; the dependencies between operators can be traced through the id and dependencies columns, which solves the loop-display problem mentioned earlier.

Let's first look at the Subgraph operator's info: edge_filter contains an expression. If edge_filter / tag_filter contains an expression, it means the filter has been pushed down to storage and takes effect during the traversal, i.e., at every hop.

The following is a more detailed execution plan, viewed through PROFILE:

The DataCollect operator has these parameters:

  • execTime: the processing time in graphd;
  • rows: the number of rows returned;
  • totalTime: the time from operator start to operator exit;
  • version: as with the LOOP mentioned in the previous chapter, a LOOP contains a loop body, which corresponds to a sub-plan. This sub-plan is executed repeatedly, and its output is written into the same variable, which therefore has multiple versions. If an operator is not inside a LOOP body, its variable has only one version, which defaults to 0.

The Subgraph operator has these parameters (parameters already explained above are omitted):

  • resp[]: the execution result of each storage node;
  • exec: the storage-side processing time, analogous to graphd's execTime above;
  • storage_detail: storage information; since SUBGRAPH needs to interact with storage, this field records the execution time of the plan node on the storage side;
  • total: the time from when the storage server receives the request from graphd to when it sends back the response, i.e., storage's own processing time plus serialization and deserialization time;
  • vertices: the vertices involved in this statement;
  • total_rpc_time: the time from when graphd calls the storage client to send the request to when it receives the response;

Specifying the execution plan output format

The format of the execution plan shown at the beginning of this article is different from what you see in actual operation. The format is specified through profile format=""; the default is the table structure, and you can specify the dot format via format="dot". Copy the dot output from the terminal into an online Graphviz renderer such as https://dreampuf.github.io/GraphvizOnline/ to render it as a graph.
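For example, an illustrative statement (reusing the basketballplayer-style schema from earlier):

profile format="dot" GO FROM "player100" OVER follow YIELD follow._dst AS dst;

The terminal then prints a digraph { ... } block of DOT text; paste that whole block into the renderer to get the plan diagram.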

The above is a comprehensive explanation of execution plans. To learn more examples and master execution plans, you can read Siwei's "nGQL Concise Tutorial vol.02: Execution Plan Detailed Explanation and Tuning".

Origin my.oschina.net/u/4169309/blog/10143302