Dameng database SQL optimization execution plan


The common operators in a Dameng database SQL execution plan were introduced in an earlier article:

Common operators in Dameng database SQL execution plan

This article introduces the execution plan itself. The goal is that, even when we cannot immediately find an optimization for a given SQL statement, we can at least map each part of the execution plan back to the original SQL. The execution plan is the single most important artifact in optimization, so here we focus on how to read a plan and what to pay attention to, laying a foundation for later optimization work.

1. How to read the execution plan

First of all, an execution plan is a tree of operators: the displayed plan is that tree laid out with indentation, and operators execute from the innermost (most indented) outward. (Reading an execution plan usually means viewing the plan text in the Dameng management tool, where the details are shown; copying the plan into a text editor such as UltraEdit or Notepad++ makes the indentation easier to see.)

The general execution plan format is:

     OP1
         OP2
              OP3
              OP4
         OP5
              OP6
                  OP7
                  OP8

The deeper the indentation, the earlier the operator executes; among operators at the same indentation level, the upper one executes before the lower one, and an entire upper subtree finishes before its lower sibling subtree begins. For the simple example above, the execution order is:

OP3 -> OP4 -> OP2 -> OP7 -> OP8 -> OP6 -> OP5 -> OP1
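The indentation rule above can be sketched as a small post-order walk over the plan text. This is a minimal illustration in Python (not a Dameng utility): children, i.e. more deeply indented operators, execute first; siblings execute top to bottom; the parent executes last.

```python
def execution_order(plan_text):
    """Return operator names in execution order from an indented plan text."""
    lines = [l for l in plan_text.splitlines() if l.strip()]
    nodes = [(len(l) - len(l.lstrip()), l.strip()) for l in lines]

    def walk(start, end):
        if start >= end:
            return []
        # operators at the shallowest indentation in [start, end) are siblings
        base = min(indent for indent, _ in nodes[start:end])
        order = []
        i = start
        while i < end:
            j = i + 1
            while j < end and nodes[j][0] > base:
                j += 1                      # nodes[i+1:j] is this node's subtree
            order.extend(walk(i + 1, j))    # children execute first
            order.append(nodes[i][1])       # then the operator itself
            i = j
        return order

    return walk(0, len(nodes))

plan = """
OP1
    OP2
        OP3
        OP4
    OP5
        OP6
            OP7
            OP8
"""
print(execution_order(plan))
# ['OP3', 'OP4', 'OP2', 'OP7', 'OP8', 'OP6', 'OP5', 'OP1']
```

Running it on the example plan reproduces the order given above.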

Here is a realistic example. We construct a SQL statement that produces an execution plan of this shape:

SQL> CREATE TABLE TEST5(ID INT);
SQL> CREATE TABLE TEST6(ID INT);
SQL> CREATE TABLE TEST7(ID INT);
SQL> CREATE TABLE TEST8(ID INT);
SQL> insert into test5 values(3);
SQL> insert into test6 values(4);
SQL> insert into test7 select level %100 from dual connect by level < 10000;
SQL> insert into test8 select level %100 from dual connect by level < 10000;
SQL> commit;
SQL> explain 
select 
       /*+no_use_cvt_var*/ 
       * 
  from (select test5.id from test5,test6 where test5.id = test6.id) a,(select id 
           from (select test7.id from test7,test8 where test7.id = test8.id) 
       group by id) b 
 where a.id = b.id;

Looking at the execution plan of this example (ignore the /*+no_use_cvt_var*/ hint), we set aside the PRJT and NSET operators for now and look only at the execution order of the SQL.

Similar to the earlier simple example, the execution order is:

6->7->5->12->13->11->9->3

The actual execution of the SQL is therefore: first perform the hash join of TEST5 and TEST6; then perform the hash join of TEST7 and TEST8 and hash-group the join result by ID; finally hash-join the two intermediate results to obtain the final result set.

The SQL in this example is simple and its meaning is clear, so it is not hard to follow the operator sequence and understand what the SQL does. Conversely, given only the execution plan, we should be able to work out roughly what the original SQL looked like. Understanding the SQL itself is the key; the execution plan is more of a reminder, telling us what the SQL has to do.

Once we can read the execution order of a plan, the next thing to look at is the detail of each plan node. Every operator in the execution plan is followed by a triplet, for example:

#CSCN2: [1, 9999, 4]

[1, 9999, 4] is that triplet. The three numbers are the operator's estimated cost, its estimated number of output rows, and the row length of the data the operator handles.

#CSCN2: [1, 9999, 4] therefore means: this is a full table scan, it is expected to output 9999 rows, the row length is 4, and the overall estimated cost is 1.

We call the second item of the triplet the estimated row count (CARD). In complex queries, the estimated row count has a major impact on the execution plan and on SQL performance.
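As an illustrative helper (not a Dameng API), the operator-line triplet can be pulled apart with a small parser; this is just a convenience for scripting over plan text:

```python
import re

def parse_triplet(line):
    """Parse '#OP: [cost, card, row_len]' from a plan line into a tuple."""
    name, cost, card, row_len = re.match(
        r'#?(\w+):\s*\[(\d+),\s*(\d+),\s*(\d+)\]', line.strip()).groups()
    return name, int(cost), int(card), int(row_len)

print(parse_triplet('#CSCN2: [1, 9999, 4]'))
# ('CSCN2', 1, 9999, 4)
```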

2. The impact of statistical information on the execution plan

Statistics can be understood simply as a statistical analysis of a column or an index (including the table's ROWID clustered index): the maximum and minimum values, how many distinct values exist, roughly how many rows there are for each value, and similar auxiliary information.

For columns without statistics, Dameng simply filters by a fixed default ratio (a default selectivity).

The INI parameters involved are:

SEL_RATE_EQU: selectivity of an equality filter, default 0.025.
SEL_RATE_SINGLE: selectivity of a general condition, default 0.05.

Consider an example:

SQL> create table test10(id1 int,id2 varchar,id3 varchar,id4 varchar);
-- For convenience, insert 10,000 rows: ID1 runs from 1 to 10000, ID2 from 0a to 4a, ID3 is always 'b', ID4 runs from 1c to 10000c
SQL> insert into test10 select level,level % 5 || 'a','b',level || 'c' from dual connect by level <= 10000;
--SEL20
SQL> explain select * from test10 where id1 = 5;

We can see that CSCN touches 10,000 rows, which is correct, but the CARD of the filter operator SLCT is estimated at 250 rows (#SLCT2: [1, 250, 156]), which does not match our expectation. Because there are no statistics, the system simply computes 10000 * 0.025 = 250.

What if there are multiple equivalence conditions?

--SEL21
-- We keep the column and value types the same here: id2 is VARCHAR, so compare it with the string '5'
SQL> explain select * from test10 where id1 = 5 and id2 = '5';

The CARD of SLCT is 6, approximately 10000 * 0.025 * 0.025 = 6.25, rounded down.

We can infer that with multiple conditions and no statistics, CARD is the product of the individual selectivities multiplied by the output row count of the operator below.

Now let's look at general conditions:

--SEL22
SQL> explain select * from test10 where id1 > 5;

The SLCT output CARD is 500, consistent with the default SEL_RATE_SINGLE value of 0.05: 10000 * 0.05 = 500.

Broadly speaking, every filter condition other than an equality condition is treated as a general condition.

Similarly, when general conditions and equality conditions are combined and there are no statistics, the final selectivity is still computed as the product of the individual selectivities.

--SEL23
SQL> explain select * from test10 where id1 > 5 and id2 = '5';

SLCT CARD = 12: 10000 * 0.05 * 0.025 = 12.5, rounded down to 12.
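The no-statistics estimates seen so far can be checked with a few lines of arithmetic. This is a quick sketch; the assumption that Dameng rounds the estimate down to an integer (with a floor of 1) is ours, inferred from the numbers above, not documented behavior.

```python
SEL_RATE_EQU = 0.025     # default equality-filter selectivity
SEL_RATE_SINGLE = 0.05   # default general-condition selectivity

def estimate_card(rows, equ_conds=0, single_conds=0):
    """No-statistics CARD estimate: product of per-condition selectivities.

    Assumption (not documented): result is truncated, with a floor of 1."""
    sel = (SEL_RATE_EQU ** equ_conds) * (SEL_RATE_SINGLE ** single_conds)
    return max(1, int(rows * sel))

print(estimate_card(10000, equ_conds=1))                  # id1 = 5              -> 250
print(estimate_card(10000, equ_conds=2))                  # id1 = 5 and id2 = '5'-> 6
print(estimate_card(10000, single_conds=1))               # id1 > 5              -> 500
print(estimate_card(10000, equ_conds=1, single_conds=1))  # id1 > 5 and id2 = '5'-> 12
```

All four values match the SLCT CARDs observed in examples SEL20 through SEL23.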

Now we collect statistics. There are two recommended ways to do so:

--Collect statistics on a single column
STAT 100 ON table(column)
--Collect statistics on the columns referenced by a SQL statement
CREATE VIEW VA AS <SQL statement>;
CALL SP_SQL_STAT_INIT('SELECT * FROM VA')
SQL> stat 100 on test10(id1);
Operation executed
Elapsed time: 26.350 ms. Execution ID: 860.
SQL> stat 100 on test10(id2);
Operation executed
 
After collection, let's look at the CARD values in the plan again:
SQL> explain select * from test10 where id1 = 5;

From the resulting execution plan we can see that the single-column estimate is now accurate: there is exactly one row with ID1 = 5.

SQL> explain select * from test10 where id2 = '5';

The resulting execution plan shows the single-column estimate is accurate: CARD takes its minimum value of 1, and in fact no row has ID2 = '5'.

SQL> explain select * from test10 where id1 = 5 and id2 = '5';

The resulting execution plan shows the multi-column estimate is accurate: no row satisfies both conditions.

SQL> explain select * from test10 where id1 > 5;

The resulting execution plan shows the single-column general-condition estimate is accurate: 9995 rows satisfy ID1 > 5.

SQL> explain select * from test10 where id1 > 5 and id2 = '5';

The resulting execution plan shows the mixed multi-column estimate is accurate: no row satisfies both conditions.

As we can see, collecting statistics corrects the estimated number of filtered rows, making the estimates far more accurate in most cases.


Origin blog.csdn.net/qq_35273918/article/details/129841341