MySQL 45 Lectures study notes: Why does MySQL sometimes choose the wrong index? (Lecture 10)

I. Overview of this section

In earlier lectures we introduced indexes, so you already know that a table in MySQL can have multiple indexes. But when you write a SQL statement, you do not explicitly specify which index to use; that decision is made by MySQL itself.

Have you ever run into a situation where a statement that could have executed very quickly ended up running very slowly because MySQL picked the wrong index?

Let's look at an example.

First, build a simple table with two fields, a and b, and create an index on each of them:

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB;

Then insert 100,000 rows into table t, with the values incrementing by one: (1,1,1), (2,2,2), (3,3,3), ..., up to (100000,100000,100000).

I used a stored procedure to insert the data; here it is so you can reproduce the experiment:

delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=100000)do
    insert into t values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

Next, let's analyze a SQL statement:

mysql> select * from t where a between 10000 and 20000;

You might say this statement hardly needs analysis: it is very simple, there is an index on a, so surely the index on a will be used.

You're right. Figure 1 shows the execution plan of this statement as seen with the explain command.

Figure 1: Viewing the statement's execution plan with the explain command

From Figure 1, the execution plan of this query does match expectations: the key field has the value 'a', meaning the optimizer chose the index on a.

But the story does not end there. On the table we prepared with 100,000 rows of data, we perform the following operations.

Figure 2 shows the execution flow of session A and session B.

You are already familiar with session A's operation here: it simply opens a transaction. Then session B deletes all the data and calls the stored procedure idata to insert 100,000 rows again.

At this point, session B's query select * from t where a between 10000 and 20000 no longer chooses the index on a. We can inspect the actual execution through the slow query log (slow log).

To show whether the optimizer's choice is correct, I added a control: using force index(a) to make the optimizer use the index on a (I will come back to force index in the second half of this article).

The following three SQL statements make up this experiment:

set long_query_time=0;
select * from t where a between 10000 and 20000; /*Q1*/
select * from t force index(a) where a between 10000 and 20000;/*Q2*/

The first statement sets the slow query log threshold to 0, meaning every subsequent statement in this thread will be recorded in the slow query log;
the second, Q1, is session B's original query;
the third, Q2, adds force index(a) for comparison with session B's original query.

Figure 3 shows the slow query log after these three SQL statements have run.

Figure 3: Slow log results

You can see that Q1 scanned 100,000 rows, clearly a full table scan, and took 40 milliseconds; Q2 scanned 10,001 rows and took 21 milliseconds. In other words, without force index, MySQL used the wrong index, resulting in a longer execution time.

This example corresponds to a common real-world scenario: continuously deleting historical data while inserting new data. In that situation MySQL really can choose the wrong index. Isn't that strange? Today we start from this strange result and talk it through.

II. The optimizer's logic

In the first article, we mentioned that choosing an index is the optimizer's job.

The optimizer's goal in choosing an index is to find the optimal execution plan and execute the statement at the lowest possible cost. In a database, the number of rows scanned is one of the factors that affects execution cost: the fewer rows scanned, the fewer the disk accesses and the less CPU consumed.

Of course, the number of rows scanned is not the only criterion; the optimizer also weighs factors such as whether temporary tables or sorting are needed to reach an overall judgment.

Our simple query involves neither a temporary table nor sorting, so when MySQL chooses the wrong index here, something must have gone wrong in estimating the number of rows to scan.

1. The question, then, is: how is the number of rows to scan estimated?

Before actually executing a statement, MySQL cannot know exactly how many records satisfy its conditions; it can only estimate the count from statistics.

These statistics concern an index's selectivity. Obviously, the more distinct values an index contains, the better its selectivity. The number of distinct values in an index is called its cardinality: the larger the cardinality, the better the index's selectivity.

We can use the show index command to view an index's cardinality. Figure 4 shows the result of show index for table t. Although every row in this table has the same value in all three fields, in the statistics the cardinalities of the three indexes differ, and none of them is actually accurate.

Figure 4: Result of show index for table t

2. Next: how does MySQL obtain an index's cardinality?

Here I'll briefly introduce MySQL's sampling method.

Why sample? Because counting the whole table row by row, while accurate, is too expensive, so only sampled statistics are feasible.

When sampling, InnoDB by default selects N data pages, counts the distinct values on those pages to get an average, and then multiplies that average by the index's total number of pages to obtain the index's cardinality.
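The sampling scheme just described can be sketched as a toy model in Python. This is a simplified illustration, not InnoDB's actual code; the page size, the sample count, and the duplicated data are all assumptions chosen to show why the estimate comes out inaccurate:

```python
import random

def estimate_cardinality(pages, n_samples=20, seed=1):
    """Sample n_samples pages, average the number of distinct values
    per page, then scale by the total number of pages (the scheme
    described above; N corresponds to n_samples)."""
    rng = random.Random(seed)
    sampled = rng.sample(pages, min(n_samples, len(pages)))
    avg_distinct = sum(len(set(p)) for p in sampled) / len(sampled)
    return int(avg_distinct * len(pages))

# 100,000 index entries, each value repeated 3 times, split into
# pages of 500 entries: the true cardinality is 33,334.
values = [i // 3 for i in range(100000)]
pages = [values[i:i + 500] for i in range(0, len(values), 500)]
print(estimate_cardinality(pages))
```

The estimate lands slightly above the true 33,334, because a value whose copies straddle a page boundary is counted once on each page it touches. This is the kind of systematic drift that makes sampled cardinality "easy to get wrong" regardless of N.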

Since data tables are updated continuously, index statistics cannot stay fixed either. So when the number of changed rows exceeds 1/M of the total, a re-computation of the index statistics is automatically triggered.

In MySQL there are two ways to store index statistics, selected via the parameter innodb_stats_persistent:

  1. When set to on, the statistics are stored persistently. In this case, the default N is 20 and M is 10.
  2. When set to off, the statistics are stored only in memory. In this case, the default N is 8 and M is 16.

Because the statistics are sampled, the cardinality is easy to get wrong whether N is 8 or 20. But that's not the whole story.

As you can see from Figure 4, the index statistics (the Cardinality column), while not exact, are still roughly right, so the wrong index choice must have other causes.

In fact, index statistics are only one input; for a specific statement, the optimizer must also estimate how many rows executing that statement itself will scan.

Next, let's look at the optimizer's estimated row counts for these two statements.

Figure 5: Unexpected explain results

The rows field indicates the expected number of rows to scan.

Here, Q1's result matches expectations: its rows value is 104,620. But Q2's rows value is 37,116, a large deviation. And the explain in Figure 1 showed rows of only 10,001, so it is this deviation that misled the optimizer's judgment.

At this point, your first question may not be why the estimate is inaccurate, but rather: with an execution plan estimated at 37,000 rows available, why did the optimizer still choose the plan that scans 100,000 rows?

This is because, if index a is used, every value fetched from index a requires going back to the primary key index to look up the whole row, and the optimizer must account for that cost.

Whereas choosing to scan 100,000 rows means scanning directly on the primary key index, with no extra cost.

The optimizer estimates the cost of both options and, judging from the result, concludes that scanning the primary key index directly is faster. Of course, from the actual execution times we know this choice is not optimal.
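The trade-off described above can be sketched with a toy cost model. The per-row and lookup costs below are invented for illustration only (MySQL's real cost model is far more detailed); the point is that the same formula flips its verdict when fed the inflated row estimate:

```python
def cost_index_range(est_rows, row_cost=1.0, lookup_cost=2.0):
    # Each entry read from secondary index a needs an extra
    # primary-key lookup to fetch the whole row.
    return est_rows * (row_cost + lookup_cost)

def cost_full_scan(total_rows, row_cost=1.0):
    # Scanning the primary key index directly has no extra lookup.
    return total_rows * row_cost

# With an accurate estimate (10,001 rows), index a wins:
print(cost_index_range(10001) < cost_full_scan(100000))   # True
# With the inflated estimate of 37,116 rows from Figure 5,
# the full scan looks cheaper -- the optimizer's actual choice:
print(cost_index_range(37116) < cost_full_scan(100000))   # False
```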

The cost of going back to the table when using a secondary index does need to be counted; the explain in Figure 1 also considered the cost of this strategy, and the choice in Figure 1 was correct. In other words, the costing strategy itself is not the problem.

So, credit where blame is due: when MySQL picks the wrong index here, the fault lies in failing to estimate the number of scanned rows accurately. As for why the estimate goes wrong, I'll leave that as a homework question for you to analyze.
And since the statistics are wrong, let's correct them.

3. The command analyze table t re-computes the index statistics.

Let's look at the result of executing it.

Figure 6: explain results after running analyze table t to fix the statistics

This time the plan is correct.

So in practice, if you find that explain's estimated rows value deviates substantially from reality, this method can fix it.

In fact, when the only problem is inaccurate index statistics, the analyze command solves it. But as we said earlier, the optimizer does not look at row counts alone.

Still using table t, let's look at another statement:

mysql> select * from t where (a between 1 and 1000)  and (b between 50000 and 100000) order by b limit 1;

Judging from the conditions, this query matches no records, so it returns an empty set.

Before reading on, imagine: if you were choosing the index for this statement, which one would you pick?

To analyze this, let's first look at diagrams of the two indexes a and b.

 

Figure 7: Structure of indexes a and b


If index a is used for this query, the first 1,000 values of index a are scanned, their corresponding ids fetched, each row looked up on the primary key index, and the rows then filtered by field b. Clearly this scans 1,000 rows.

If index b is used, the last 50,001 values of index b are scanned; the process is otherwise the same, each value still requiring a primary key lookup before the check, so 50,001 rows must be scanned.
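The two access paths above can be simulated directly. A minimal sketch, modeling each index as a sorted list and counting how many entries a range scan touches (the lists and the scan loop are illustrative, not how InnoDB stores B+ trees):

```python
# Table t as built above: 100,000 rows with id = a = b.
rows = [(i, i, i) for i in range(1, 100001)]
by_a = sorted(rows, key=lambda r: r[1])  # secondary index on a
by_b = sorted(rows, key=lambda r: r[2])  # secondary index on b

def range_scan(index, key, lo, hi):
    """Walk an index over [lo, hi], counting entries touched.
    Every counted entry also implies one primary-key lookup to
    fetch the full row before the remaining filter is applied."""
    scanned, result = 0, []
    for r in index:
        k = key(r)
        if k < lo:
            continue  # a real B+ tree would seek straight to lo
        if k > hi:
            break
        scanned += 1
        if 1 <= r[1] <= 1000 and 50000 <= r[2] <= 100000:
            result.append(r)
    return scanned, result

scanned_a, _ = range_scan(by_a, lambda r: r[1], 1, 1000)
scanned_b, _ = range_scan(by_b, lambda r: r[2], 50000, 100000)
print(scanned_a, scanned_b)  # 1000 50001
```

Both paths return an empty result set, but the work done differs by a factor of fifty, which is exactly the gap the optimizer should be weighing.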

So you would expect that using index a makes execution much faster. Well, let's see whether that's what actually happens.

Figure 8 shows the explain result.

mysql> explain select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1;

Figure 8: Viewing the execution plan with explain

You can see that the key field in the returned result shows the optimizer chose index b, and the rows field shows the estimated number of rows to scan is 50,198. From this result you can draw two conclusions:

  • 1. The estimate of the number of rows to scan is still inaccurate;
  • 2. MySQL again chose the wrong index in this case.

III. Handling index selection anomalies

Most of the time the optimizer finds the correct index, but occasionally you will still hit cases like our two examples above: a SQL statement that could run very quickly executes much slower than you expect. What should you do then?

One method, as in our first example, is to use force index to forcibly select an index. MySQL parses the statement to determine which indexes are possible candidates, then evaluates, for each index in the candidate list, how many rows would need to be scanned. If the index specified by force index is in the candidate list, that index is chosen directly and the execution cost of the other indexes is no longer evaluated.
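The selection rule just described can be sketched in a few lines. This is a schematic of the behaviour, not MySQL's implementation, and the cost figures are hypothetical numbers shaped like the first example:

```python
def choose_index(candidates, costs, forced=None):
    """Pick an index from the candidate list. If a forced index is
    among the candidates, return it directly without comparing the
    costs of the others (mirroring the force index behaviour
    described above)."""
    if forced is not None and forced in candidates:
        return forced
    return min(candidates, key=lambda idx: costs[idx])

# Hypothetical cost estimates, shaped like the first example:
costs = {"PRIMARY": 100000, "a": 111348}
print(choose_index(["PRIMARY", "a"], costs))              # PRIMARY
print(choose_index(["PRIMARY", "a"], costs, forced="a"))  # a
```

Note that forcing an index not in the candidate list has no effect: the hint simply falls through to the normal cost comparison.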

Let's apply this to the second example. Our initial analysis concluded that index a was the better choice; now let's look at the execution results:

Figure 9: Execution times of the statement with different indexes

You can see that the original statement takes 2.23 seconds, while the version with force index(a) takes only 0.05 seconds, more than 40 times faster than the optimizer's choice.

That is, when the optimizer does not choose the right index, force index serves as a correction.

However, many programmers dislike using force index. First, it isn't elegant to write; second, if the index is renamed, the statement must be changed too, which is troublesome; and if you later migrate to another database, the syntax may not be compatible.

But the main problem with force index is actually the timeliness of the change. Because wrong index choices are rare, developers usually don't write force index from the start. Only after the problem shows up in production do you go back, modify the SQL statement, and add force index; the change then has to be tested and released. For a production system, this process is not agile enough.

So such problems are best solved inside the database itself. How, then, can the database solve them?

Since the optimizer abandoned index a, index a must not have looked attractive enough. So the second method is to modify the statement to guide MySQL toward the index we expect it to use. In this example, changing "order by b limit 1" to "order by b, a limit 1" keeps the logical semantics unchanged. Let's look at the effect of the change:

Figure 10: Effect of order by b, a limit 1


Previously, the optimizer chose index b because using it avoids sorting (b is itself an index and is already ordered, so choosing index b means no sort is needed, just a traversal), so even though more rows are scanned, it judged the cost to be lower.

Now, with order by b, a, the result must be sorted by both b and a, which means both index choices require a sort. The number of rows scanned therefore becomes the main deciding factor, so this time the optimizer chose index a, which needs to scan only 1,000 rows.

Of course, this kind of modification is not a general-purpose optimization; it only works here because of the limit 1. If there were matching records, order by b limit 1 and order by b, a limit 1 would both return a row with the smallest b, so the logic is the same and we can safely make the change.
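The equivalence claimed above can be checked with a small simulation. Note one assumption: since the article's actual query matches no rows, the sketch also uses a narrower b range (500 to 1000) that does match, so the comparison returns a real row:

```python
# Table t as built above: 100,000 rows with id = a = b.
rows = [(i, i, i) for i in range(1, 100001)]

def first_row(order_key, a_rng=(1, 1000), b_rng=(500, 1000)):
    """Filter on a and b, sort by order_key, and return at most
    one row -- i.e. '... order by <order_key> limit 1'."""
    matching = [r for r in rows
                if a_rng[0] <= r[1] <= a_rng[1]
                and b_rng[0] <= r[2] <= b_rng[1]]
    return sorted(matching, key=order_key)[:1]

# order by b limit 1  vs  order by b, a limit 1: same smallest-b row.
assert first_row(lambda r: r[2]) == first_row(lambda r: (r[2], r[1]))
# With the article's original b range the result set is simply empty:
assert first_row(lambda r: r[2], b_rng=(50000, 100000)) == []
```

In this table b is unique, so the tie-breaking column a never changes which row is returned; with duplicate b values the two forms would still agree on the value of b, though not necessarily on the rest of the row.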

If you feel that changing the semantics like this is inelegant, here is another rewrite; Figure 11 shows its effect.

mysql> select * from  (select * from t where (a between 1 and 1000)  and (b between 50000 and 100000) order by b limit 100)alias limit 1;

Figure 11: explain result of the rewritten SQL


In this example, limit 100 makes the optimizer realize that the cost of using index b is high. We are essentially nudging the optimizer using our knowledge of the data's characteristics, so this approach is not general-purpose either.

The third method is, in some scenarios, to create a better-suited index for the optimizer to choose, or to delete the index it misuses.

In this particular case, however, I found no way to change the optimizer's behavior by adding a new index. Honestly, such situations are rare; especially after a DBA has already tuned the library's indexes, finding a better index when you hit this kind of bug is usually difficult.

If I suggest deleting index b as a method, you may find it funny. But I have actually come across two such cases: after the DBA discussed it with the development team, they found that the index the optimizer was mis-selecting had no real reason to exist, so they deleted it, and the optimizer then chose the correct index again.

IV. Summary

Today we talked about the update mechanism of index statistics and the possibility of the optimizer choosing the wrong index. For problems caused by inaccurate index statistics, you can use analyze table to fix them.

For other optimizer misjudgments, you can force the index on the application side with force index, guide the optimizer by modifying the statement, or add or delete an index to sidestep the problem.

You might complain that the later examples in this article come without an explanation of their underlying principles. What I want to tell you is that for today's topic we are facing MySQL bugs, and digging into the code line by line for each one is not something we should do here.

So I've shared the solutions I have used, hoping they give you some ideas when you run into similar situations.

If you have other ways of handling optimizer bugs in MySQL, please share them in the comments. Finally, I'll leave you a question to think about. In constructing the first example earlier, it took session A's cooperation (keeping a transaction open while session B deleted the data and re-inserted it) for explain's rows field to change from 10,001 to more than 37,000.

Yet if, without session A, you simply execute delete from t, call idata(), and explain in sequence, you will see the rows field stay at around 10,000. You can verify this yourself.
What causes the difference? Analyze it for yourself.

Write your analysis and conclusions in the comments; I will discuss this question with you at the end of the next article. Thanks for listening, and feel free to share this article with more friends.

V. On the previous question

The question I left you at the end of the last article was: if a write uses the change buffer mechanism and the host later reboots, will the change buffer and its data be lost?

The answer is that they are not lost, and many readers in the comments answered correctly. Although only memory is updated, when the transaction commits, the change buffer operation is also recorded in the redo log, so during crash recovery the change buffer can be restored.

Some readers also asked in the comments whether the merge process writes the data directly to disk. This is a good question, so let me analyze it for you.

The execution flow of merge is as follows:

1. Read the data page from disk into memory (this is the old version of the page);
2. Find the change buffer records for this page (there may be several), apply them in order, and obtain the new version of the data page;
3. Write the redo log. This redo log entry covers both the data change and the change to the change buffer.

At this point the merge process is over. The data page in memory and the change buffer's corresponding disk location have not yet been modified; both are dirty pages, and flushing them back to their own physical locations is a separate process.
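The three merge steps above can be sketched as a toy model. This is only a schematic of the flow described, assuming a page is a simple key-value mapping and a buffered change is a (key, value) pair; it is not how InnoDB represents pages or redo records:

```python
def merge(disk_page, change_buffer_records):
    """Sketch of the merge flow described above:
    1) read the old data page from disk into memory,
    2) apply the buffered changes in order to get the new page,
    3) write a redo log entry covering both changes.
    The page and the change buffer are now dirty in memory;
    flushing them back to disk happens later, separately."""
    page = dict(disk_page)          # step 1: old version now in memory
    for key, value in change_buffer_records:
        page[key] = value           # step 2: apply changes in order
    redo_entry = {"page_after_merge": dict(page),
                  "buffered_changes": list(change_buffer_records)}
    return page, redo_entry         # step 3: redo written; page is dirty

page, redo = merge({1: "old"}, [(1, "new"), (2, "inserted")])
print(page)  # {1: 'new', 2: 'inserted'}
```

Note that `disk_page` is never modified in the sketch, mirroring the point that merge itself does not write the data page back to disk.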

VI. A classic comment

I don't quite understand today's closing question. Session A opens a consistent read; when session B deletes or inserts, the old records are kept in undo. The secondary index records are also written to the redo log and the change buffer, so the deletion should not affect session A's repeatable read of the index pages. My guess is that because of the consistent read, the space cannot be freed while that transaction is running, which inflates the statistics. I hope the teacher can explain the specific details.

Two more questions for the teacher today:

1. My understanding is that for b between 50000 and 100000, because the B+ tree is ordered, a binary search finds b = 50000 and the scan proceeds rightward from there, looking each row up in the table one by one, applying the a between 1 and 1000 filter in the executor, and checking whether enough rows have been collected to satisfy the limit; if so, the loop ends. Because no row here has b in (50000,100000) and also a in (1,1000), about 50,000 rows must be scanned. But even if the a range were (50001,51000), the reported scan count would not change. Is this because the optimizer's scan count estimate is the problem, or because the loop does not terminate early? Why doesn't it end early?
(It seems rows cannot reflect how limit works; data must be filtered in the executor and cannot be filtered in the index. I don't know why it is designed this way.)

2. Suppose b contains a lot of duplicate data, with multiple rows sharing b's maximum value:
select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b desc limit 1;
Here index b's tree is scanned in descending order of b; the row with the maximum b is selected, and its id is some fixed value (neither the maximum nor the minimum).
select * from t force index(a) where (a between 1 and 1000) and (b between 50000 and 100000) order by b desc limit 1;
Since index a is chosen here, the index cannot be used for the ordering; the sort picks, among the rows with the maximum b, the one with the minimum id.
These are two supposedly identical SQL statements, yet with different index choices the returned data is inconsistent.
So can order by b, a avoid this inconsistency, i.e., the inconsistency arising from the heap sort?
But with asc there is no such situation, so the inconsistency here should not be caused by the heap sort. What causes it?

Author's replies:

1. Good question, and you constructed a good control experiment. Yes, whether limit 1 reduces the number of scanned rows is something the optimizer cannot actually know without executing the statement, so the display still follows the "largest possible number of scanned rows".

2. In your example, if the scan really follows index b, it should certainly return the row with the maximum id, unless the record with the maximum id fails the condition on a. So it should be "among the rows satisfying the condition on a, the one with the largest id"; please verify that again.

And if index a is used, there is a sort with a temporary table. Temporary-table sorting has three algorithms, and there are also in-memory versus on-disk temporary tables... I won't expand on that here; the later article "How does order by work?" covers it.

Origin www.cnblogs.com/luoahong/p/11614894.html