MySQL in Action (45 Lectures) study notes: how to optimize join statements (Lecture 35)

One. Overview

In the previous article, I introduced two join algorithms: Index Nested-Loop Join (NLJ) and Block Nested-Loop Join (BNL).

We saw that the NLJ algorithm actually performs well: it is more convenient than splitting the query into multiple statements at the application layer and stitching the results back together, and its performance is no worse.

The BNL algorithm, however, performs poorly when joining large tables: the number of equality comparisons equals the product of the row counts of the two joined tables, which consumes a lot of CPU resources.

Of course, both algorithms still have room for optimization, and that is today's topic.

To make the analysis easier, I created two tables, t1 and t2, to walk through today's problems.

To make the later quantitative explanations convenient, I inserted 1000 rows into table t1, with a = 1001 - id in each row; that is, field a of table t1 is in descending order. I also inserted one million rows of data into table t2.

create table t1(id int primary key, a int, b int, index(a));
create table t2 like t1;
drop procedure idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=1000)do
    insert into t1 values(i, 1001-i, i);
    set i=i+1;
  end while;
  
  set i=1;
  while(i<=1000000)do
    insert into t2 values(i, i, i);
    set i=i+1;
  end while;

end;;
delimiter ;
call idata();

Two. Multi-Range Read optimization

1. Basic flow of going back to the table

Before introducing the join optimizations, I need to cover one piece of background knowledge: Multi-Range Read (MRR) optimization. Its main goal is to make disk reads as sequential as possible.

In the fourth article, when introducing InnoDB's index structure, I mentioned the concept of "going back to the table". Let's revisit it. Going back to the table means that after InnoDB finds a primary key id in the secondary index on a, it looks that id up in the primary key index to fetch the whole data row.

Some readers then asked in the comments: does this table lookup fetch rows one by one, or in batches? Let's look at that question. Suppose I execute this statement:

select * from t1 where a>=1 and a<=100;

The primary key index is a B+ tree, and in this tree each data row can be located by its primary key id. So going back to the table means searching the primary key index once per row.

The basic flow is shown in Figure 1.

Figure 1: basic flow of going back to the table

If the values of a are read in ascending order, the id values come back in essentially random order, so the lookups are random accesses and performance is relatively poor. The row-by-row lookup mechanism itself cannot be changed, but we can still speed things up by adjusting the order of the lookups.

2. MRR execution flow

Because most data is inserted in increasing primary key order, we can assume that reading rows in ascending primary key order makes the disk reads closer to sequential, which improves read performance.

That is the design idea behind the MRR optimization. With MRR, the execution flow of the statement becomes:

1. Locate the records satisfying the condition via index a, and put their id values into read_rnd_buffer;
2. sort the ids in read_rnd_buffer in ascending order;
3. use the sorted id array to look up records in the primary key index one by one, and return them as the result.

Here, the size of read_rnd_buffer is controlled by the read_rnd_buffer_size parameter. If read_rnd_buffer fills up during step 1, steps 2 and 3 are executed for the ids collected so far, the buffer is emptied, and the scan of index a then continues, repeating the cycle.
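The three steps above, including the cycle that runs whenever read_rnd_buffer fills up, can be sketched as follows. This is a minimal illustration in Python, not MySQL's implementation; the dicts standing in for the two indexes, the function name, and the buffer size are all made up for the example:

```python
def mrr_fetch(secondary_index, primary_index, lo, hi, buffer_size=50):
    """Fetch rows with lo <= a <= hi, reading the primary index in
    sorted-id batches (the read_rnd_buffer cycle)."""
    result, buffer = [], []
    for a in sorted(secondary_index):          # step 1: range scan on index a
        if lo <= a <= hi:
            buffer.append(secondary_index[a])  # collect primary key ids
            if len(buffer) == buffer_size:     # buffer full:
                buffer.sort()                  # step 2: sort the ids
                result += [primary_index[i] for i in buffer]  # step 3: ordered reads
                buffer = []
    if buffer:                                 # flush the final partial batch
        buffer.sort()
        result += [primary_index[i] for i in buffer]
    return result

# t1-like data: a = 1001 - id, so scanning a ascending yields ids descending.
primary = {i: (i, 1001 - i, i) for i in range(1, 1001)}  # id -> (id, a, b)
secondary = {1001 - i: i for i in range(1, 1001)}        # a -> id
rows = mrr_fetch(secondary, primary, 1, 100)
```

Each batch of ids is read in ascending order, so within a batch the primary index is accessed sequentially; when everything fits in one buffer, the whole result comes back in ascending id order.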

One more note: if you want MRR to be used consistently, you need to set optimizer_switch="mrr_cost_based=off". (The official documentation says that under the current cost-based strategy, the optimizer tends not to choose MRR; setting mrr_cost_based to off forces MRR to be used.)

The following two figures show the execution flow and the explain result when the MRR optimization is used.

Figure 2: MRR execution flow

3. Explain result with MRR

Figure 3: explain result of the MRR execution flow

From the explain result in Figure 3, we can see that the Extra field now contains Using MRR, which means the MRR optimization was used. Also, because we sorted the ids in read_rnd_buffer, the result set is ordered by ascending primary key id, i.e., the reverse of the row order in Figure 1.

Let's summarize briefly.

The core of MRR's performance gain is that the query does a range scan on index a (in other words, it is a multi-valued query) and can therefore collect enough primary key ids; after sorting them, it reads the primary key index in order, realizing the advantage of "sequential" access.

Three. Batched Key Access

1. Index Nested-Loop Join flowchart

Having understood the principle behind MRR's performance gain, we can now understand the Batched Key Access (BKA) algorithm introduced in MySQL 5.6. The BKA algorithm is in fact an optimization of the NLJ algorithm.

Let's look again at the flowchart of the NLJ algorithm from the previous article:

Figure 4: Index Nested-Loop Join flowchart

The NLJ algorithm executes this logic: take the value of a from the driving table t1 row by row, then join it against the driven table t2. That is, for table t2, only one value is matched at a time, so the advantage of MRR cannot come into play.

How, then, can we pass more values to table t2 at once? The method is to take multiple rows out of table t1 at a time and pass them to table t2 together.

2. Batched Key Access flow

To do that, we first put part of table t1's data into a piece of temporary memory. That temporary memory is none other than join_buffer.

From the previous article we know that in the BNL algorithm, join_buffer is used to stage the driving table's data, but the NLJ algorithm did not use it. So we can simply reuse join_buffer for the BKA algorithm.

Figure 5 is the flowchart of the BKA algorithm, i.e., the NLJ algorithm with the optimization above applied.

Figure 5: Batched Key Access flow

In the figure, the data I put into join_buffer is P1 ~ P100, meaning that only the columns needed by the query are taken. Of course, if join_buffer cannot hold all the data of P1 ~ P100, those 100 rows are split into multiple chunks, and the flow in the figure is executed chunk by chunk.

So how do you enable the BKA algorithm?

To use the BKA algorithm, you need to run the following before executing the SQL statement:

set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
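The batching idea can be sketched as follows. This is a minimal Python illustration of the flow in Figure 5, not MySQL's code; the in-memory tables and every name here (bka_join, buffer_size, and so on) are made up for the example:

```python
def bka_join(t1_rows, t2_index, key_col=1, buffer_size=100):
    """Join t1 rows against t2 on t1.a = t2.a, batching the join keys."""
    result = []
    for start in range(0, len(t1_rows), buffer_size):
        chunk = t1_rows[start:start + buffer_size]      # fill join_buffer
        keys = sorted(row[key_col] for row in chunk)    # MRR: sorted lookups
        matches = {k: t2_index[k] for k in keys if k in t2_index}
        for row in chunk:                               # pair each driving row
            if row[key_col] in matches:
                result.append((row, matches[row[key_col]]))
    return result

# t1: 1000 rows with a = 1001 - id; t2: indexed by a.
t1 = [(i, 1001 - i, i) for i in range(1, 1001)]         # (id, a, b)
t2_index = {i: (i, i, i) for i in range(1, 1000001)}    # a -> t2 row
joined = bka_join(t1, t2_index)
```

The key point is that each probe of the driven table handles a whole buffer of sorted keys rather than a single value, which is what lets BKA reuse the MRR machinery.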

Four. Performance issues of the BNL algorithm

Having discussed the optimization of the NLJ algorithm, let's look at the BNL algorithm.

The question I left at the end of the previous article was: with the Block Nested-Loop Join (BNL) algorithm, the driven table may be scanned multiple times. If the driven table is a large, cold table, then besides causing heavy IO pressure, what other impact does it have on the system?

1. Besides heavy IO pressure, what other impact does a large cold driven table have on the system?

In article 33, when we discussed InnoDB's LRU algorithm, we mentioned that InnoDB optimizes the LRU algorithm of the Buffer Pool: a data page read from disk into memory is first placed in the old sublist. If the page is not accessed again within one second, it is not moved to the head of the LRU list, so the impact on the Buffer Pool hit rate is small.

However, if a join statement uses the BNL algorithm and scans a cold table multiple times, and the statement takes more than one second to execute, then when the cold table is rescanned, its data pages are moved to the head of the LRU list.

This scenario applies when the cold table's data is smaller than 3/8 of the whole Buffer Pool, so that the table fits entirely in the old sublist.

If the cold table is large, another situation arises: data pages accessed by normal business queries get no chance to enter the young sublist.

Under the normal promotion mechanism, a data page enters the young sublist only if it is accessed again after an interval of one second. But because the join statement keeps reading disk pages into memory in a loop, a page that has just entered the old sublist is likely to be evicted within one second. As a result, during this period, the pages in the young sublist of this MySQL instance's Buffer Pool are not refreshed in a reasonable way.

In other words, both situations disrupt the normal operation of the Buffer Pool.

2. The impact of the BNL algorithm on the system

Although a join on a large table affects IO, that impact ends when the statement finishes executing. Still, to reduce it, you can consider increasing join_buffer_size to cut down the number of scans of the driven table.

To sum up, the BNL algorithm's impact on the system mainly covers three aspects:

1. it may scan the driven table multiple times, occupying disk IO resources;
2. evaluating the join condition requires M * N comparisons (where M and N are the row counts of the two tables); if the tables are large, this consumes a lot of CPU resources;
3. it may cause hot data pages in the Buffer Pool to be evicted, hurting the memory hit rate.
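A back-of-the-envelope calculation makes the first two points concrete: growing the join buffer reduces the number of driven-table scans but not the number of comparisons. This is an illustrative Python sketch with made-up row counts:

```python
import math

def bnl_cost(m_rows, n_rows, buffer_rows):
    """Rough BNL cost model: scans of the driven table vs. comparisons."""
    scans = math.ceil(m_rows / buffer_rows)  # driven table read once per chunk
    comparisons = m_rows * n_rows            # every row pair is still compared
    return scans, comparisons

small_buffer = bnl_cost(100_000, 1_000_000, buffer_rows=10_000)
large_buffer = bnl_cost(100_000, 1_000_000, buffer_rows=100_000)
```

With the larger buffer the driven table is scanned once instead of ten times, but the CPU cost (10^11 comparisons) is unchanged, which is why the real fix is converting BNL to BKA rather than just enlarging the buffer.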

Before executing such a statement, we should therefore use theoretical analysis plus the explain result to confirm whether the BNL algorithm would be used. If the optimizer does choose BNL, we need to optimize. The common optimization is to add an index on the driven table's join field, turning the BNL algorithm into BKA.

Next, let's look at how to do this optimization concretely.

Five. Converting BNL to BKA

In some cases, we can add an index directly on the driven table, which converts the join straight to the BKA algorithm.

1. When building an index on the driven table is not worthwhile

But sometimes you do run into cases where building an index on the driven table is not appropriate. For example, the following statement:

select * from t1 join t2 on (t1.b=t2.b) where t2.b>=1 and t2.b<=2000;

At the beginning of the article we inserted one million rows into table t2, but after the where filter only 2000 rows need to take part in the join. If this statement also happens to be a low-frequency SQL statement, then creating an index on field b of table t2 just for it would be very wasteful.

If, however, this join uses the BNL algorithm, the statement executes as follows:

1. Read all the fields of table t1 and store them in join_buffer. The table has only 1000 rows, and the default join_buffer_size is 256k, so they fit entirely.

2. Scan table t2, comparing each row against the data in join_buffer:

    • if t1.b = t2.b does not hold, skip the row;
    • if t1.b = t2.b holds, check the remaining condition, i.e., whether t2.b falls in [1, 2000];
    • if it does, return the row as part of the result set; otherwise skip it.

As I said in the previous article, deciding for each row of t2 whether the join condition is met requires traversing all the rows in join_buffer. So the number of equality checks is 1000 * 1 million = 1 billion, a huge amount of work.

Figure 6: explain result

Figure 7: statement execution time

As you can see, the Extra field of the explain result shows that the BNL algorithm is used. In my test environment, this statement takes 1 minute 11 seconds to run.

2. Using a temporary table (so the join can use an index on the driven table and trigger the BKA algorithm)

Creating an index on field b of table t2 would waste resources, but without it the statement has to evaluate the condition a billion times, which also seems wasteful. Is there an approach that gets the best of both worlds?

Yes: we can use a temporary table. The general idea is:

1. put the rows of table t2 that satisfy the condition into a temporary table temp_t;
2. to let the join use the BKA algorithm, add an index on field b of the temporary table temp_t;
3. have table t1 join temp_t.

The corresponding SQL statements are written like this:

create temporary table temp_t(id int primary key, a int, b int, index(b))engine=innodb;
insert into temp_t select * from t2 where b>=1 and b<=2000;
select * from t1 join temp_t on (t1.b=temp_t.b);

Figure 8 shows the result of executing this sequence of statements.

Figure 8: execution result using the temporary table

As you can see, the three statements together take less than one second, a huge improvement over the earlier 1 minute 11 seconds. Let's look at what this process costs:

1. The insert statement that builds temp_t and fills it does a full scan of table t2, scanning 1,000,000 rows.
2. The join statement scans table t1, another 1000 rows; during the join it performs 1000 indexed lookups. Compared with the 1 billion condition evaluations the unoptimized join needed, the gain is obvious.

Overall, whether we add an index on the original table or add one on a temporary table, the idea is the same: let the join statement use an index on the driven table, so as to trigger the BKA algorithm and improve query performance.

Six. Extension: hash join

Reading this far, you may have noticed that the 1 billion comparisons computed above look rather silly. If join_buffer maintained not an unordered array but a hash table, those 1 billion checks would become 1 million hash lookups, and the whole statement would run much faster, right?

Indeed it would.

This is also one of the reasons MySQL's optimizer and executor have been criticized: they do not support hash join. What's more, at the time, hash join had not been put on the agenda in MySQL's official roadmap.

In fact, we can implement this optimization idea ourselves on the application side. The process is as follows:

1. select * from t1; fetch all 1000 rows of table t1, and store them in a hash structure on the application side, for example a set in C++ or an array in PHP.
2. select * from t2 where b>=1 and b<=2000; fetch the 2000 rows of table t2 that satisfy the condition.
3. Take these 2000 rows to the application side row by row, and look each one up in the hash structure. A row that finds a match becomes a row of the result set.
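The three steps can be sketched in Python like this (a minimal illustration: the data is generated locally instead of fetched with the two select statements, and the variable names are made up):

```python
# Step 1: "select * from t1" -> hash all 1000 driving rows on join key b.
t1 = [(i, 1001 - i, i) for i in range(1, 1001)]   # (id, a, b)
hash_index = {row[2]: row for row in t1}          # b -> t1 row

# Step 2: "select * from t2 where b>=1 and b<=2000" -> 2000 candidate rows.
t2_filtered = [(i, i, i) for i in range(1, 2001)]

# Step 3: one hash probe per candidate row, i.e. 2000 lookups in total,
# instead of 2000 * 1000 pairwise comparisons under BNL.
result = [(hash_index[row[2]], row) for row in t2_filtered
          if row[2] in hash_index]
```

Only the t2 rows whose b falls in t1's range (1 to 1000) survive the probe, so the result has 1000 rows, each a matched (t1 row, t2 row) pair.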

In theory, this process should be somewhat faster than the temporary-table approach. If you are interested, you can verify that yourself.

Seven. Summary

Today I shared with you the optimization methods for Index Nested-Loop Join (NLJ) and Block Nested-Loop Join (BNL).

Among these optimization methods:

1. BKA optimization is built into MySQL, and I recommend using it by default;
2. the BNL algorithm is inefficient, and it is advisable to convert it to BKA whenever possible; the direction of the optimization is to add an index on the driven table's join field;
3. building a temporary table works very well for join statements that can filter the data down to a small set in advance;
4. the current version of MySQL does not yet support hash join, but you can simulate it yourself on the application side; in theory it performs better than the temporary-table approach.

Finally, I leave you with a question to think about.

The two articles on join statements have only involved joins between two tables. Now suppose we need to join three tables, with the following table structures:

CREATE TABLE `t1` (
 `id` int(11) NOT NULL,
 `a` int(11) DEFAULT NULL,
 `b` int(11) DEFAULT NULL,
 `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

create table t2 like t1;
create table t3 like t2;
insert into ... // initialize the data in the three tables

We need to implement the join logic of the following statement:

select * from t1 join t2 on(t1.a=t2.a) join t3 on (t2.b=t3.b) where t1.c>=X and t2.c>=Y and t3.c>=Z;

Now, to get the fastest execution speed, if you were asked to design the indexes on tables t1, t2, and t3 to support this join statement, which indexes would you add?

Also, if I asked you to rewrite this statement with straight_join, then given the indexes you created, how would you arrange the join order, and what are the main factors you would consider?

You can write your solution and analysis in the comments; I will discuss this question at the end of the next article. Thank you for listening, and feel free to share this post with more friends.

Eight. Answer to the previous question

The question I left at the end of the previous article has been answered in this article.

Here I will briefly summarize the discussion in the comments. Depending on the amount of data, there are two cases:

@Long Jie and @Yang mentioned the case where the amount of data is smaller than the old sublist in memory;

@Zzz read the other readers' comments very carefully and raised a deep question, giving a very careful analysis and deduction for the scenario where the driven table's data volume is large relative to the Buffer Pool.

A thumbs-up to these readers for the thoughtful discussion.


Origin www.cnblogs.com/luoahong/p/11752367.html