This article takes you through cost-based optimization in MySQL

Foreword

We have said before that MySQL can execute the same query with different execution schemes, and that it chooses the scheme with the lowest cost to actually run. But what exactly is this "cost", and how is it calculated? Let's go through it in detail.

1. What is cost

We have said before that MySQL can execute a query via different execution plans, and it chooses the one with the lowest cost (the cheapest plan) to actually execute the query. However, our description of "cost" so far has been vague. In MySQL, the execution cost of a query statement actually consists of the following two parts:

  • I/O cost

    The MyISAM and InnoDB storage engines that our tables commonly use store both data and indexes on disk. To query the records in a table, we must first load the data or indexes into memory. The time spent loading from disk into memory is called the I/O cost.

  • CPU cost

    The time spent reading records, checking whether they satisfy the search conditions, sorting result sets, and so on is called the CPU cost.

For the InnoDB storage engine, the page is the basic unit of interaction between disk and memory. MySQL stipulates that the default cost of reading one page is 0.25 (in MySQL 5.7 the default was 1.0), and the default cost of reading one record and checking whether it matches the search conditions is 0.1 (in MySQL 5.7, 0.2). Numbers like 0.25 and 0.1 are called cost constants. These two are the most commonly used; we will discuss the remaining cost constants later.

Tip:
Note that the cost of reading a record is 0.1 whether or not the record actually needs to be checked against any search condition.
The MySQL version used here is 8.0.32; cost constants vary between versions, which is explained in detail later in this chapter.

2. The cost of single table query

2.1 Prepare data

For continuity, we will keep using our old demo8 table. In case you have forgotten what this table looks like, here is its definition again:

mysql> USE testdb;

mysql> create table demo8 (    
id int not null auto_increment,    
key1 varchar(100),    
key2 int,    
key3 varchar(100),    
key_part1 varchar(100),    
key_part2 varchar(100),    
key_part3 varchar(100),    
common_field varchar(100), 
primary key (id),
key idx_key1 (key1),    
unique key idx_key2 (key2),    
key idx_key3 (key3),    
key idx_key_part(key_part1, key_part2, key_part3));

A total of 1 clustered (primary key) index and 4 secondary indexes have been created for the demo8 table:

  • a clustered index on the id column;
  • a secondary index on the key1 column;
  • a unique secondary index on the key2 column;
  • a secondary index on the key3 column;
  • a composite (multi-column) secondary index on the key_part1, key_part2, and key_part3 columns.

Then we insert 20000 records into the table: random values for every column except id (which auto-increments) and key2 (which takes sequential values, keeping the unique index valid).

mysql> delimiter //
create procedure demo8data()
begin    
	declare i int;    
	set i=0;    
	while i<20000 do        
		insert into demo8(key1,key2,key3,key_part1,key_part2,key_part3,common_field) values(
		substring(md5(rand()),1,2),
		i+1,
		substring(md5(rand()),1,3),
		substring(md5(rand()),1,4),
		substring(md5(rand()),1,5),
		substring(md5(rand()),1,6),
		substring(md5(rand()),1,7)
		);        
		set i=i+1;    
	end while;
end;
//
delimiter ;

mysql> call demo8data();
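A quick sanity check that the procedure inserted all the rows:

mysql> select count(*) from demo8;
+----------+
| count(*) |
+----------+
|    20000 |
+----------+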

With the data in place, let's officially begin.

2.2 Cost-based optimization steps

Before a single-table query statement is actually executed, the MySQL query optimizer finds all the possible schemes for executing it, compares them, and settles on the one with the lowest cost. That lowest-cost scheme is the so-called execution plan, and the optimizer then calls the interfaces provided by the storage engine to actually execute the query. The process can be summarized as follows:

  • Find all possible indexes to use based on the search criteria
  • Calculate the cost of a full table scan
  • Calculate the cost of executing a query using different indexes
  • Compare the costs of various execution plans to find the one with the lowest cost

Below we use an example to analyze these steps. The single-table query statement is as follows:

mysql> select * from demo8 where
	key1 in ('aa','bb','cc') and
	key2 > 10 and key2 < 1000 and
	key3 > key2 and
	key_part1 like '%3f%' and
	common_field='1281259';

Looks complicated? Let's analyze it step by step.

Step 1: Find all possibly applicable indexes according to the search conditions

As we said before, for a B+ tree index, whenever an index column is compared with a constant using =, <=>, IN, NOT IN, IS NULL, IS NOT NULL, >, <, >=, <=, BETWEEN ... AND ..., != (not-equal can also be written <>), or LIKE, a so-called range interval can be formed (for LIKE, only when matching a string prefix). In other words, these search conditions may use an index, and MySQL calls the indexes that a query might use its possible keys.

Let's analyze the search conditions involved in the above query:

  • key1 in ('aa','bb','cc'): this search condition can use the secondary index idx_key1
  • key2 > 10 and key2 < 1000: this search condition can use the unique secondary index idx_key2
  • key3 > key2: the search column is not compared with a constant, so no index can be used
  • key_part1 like '%3f%': the LIKE pattern starts with a wildcard, so no index can be used
  • common_field = '1281259': the column has no index at all, so no index will be used

To sum up, the possible keys of the above query are only idx_key1 and idx_key2.
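You can confirm the possible keys with a plain EXPLAIN; the possible_keys column of the output should list idx_key1 and idx_key2 (output omitted here, since the remaining columns depend on your statistics):

mysql> explain select * from demo8 where
	key1 in ('aa','bb','cc') and
	key2 > 10 and key2 < 1000 and
	key3 > key2 and
	key_part1 like '%3f%' and
	common_field='1281259';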

Step 2: Calculate the cost of full table scan

For the InnoDB storage engine, a full table scan means comparing the records of the clustered index with the given search conditions one by one and adding the matching records to the result set. That requires loading the pages of the clustered index into memory and then checking each record against the search conditions. Since query cost = I/O cost + CPU cost, calculating the cost of a full table scan needs two pieces of information:

  • the number of pages occupied by the clustered index
  • the number of records in the table

Where do these two pieces of information come from? MySQL maintains a set of statistics for each table; we will explain later in this chapter how they are collected. For now, let's see how to view them. MySQL provides the show table status statement for viewing a table's statistics; to view a specific table, just append a like clause. For example, to view the statistics of demo8 we can write:

mysql> show table status like 'demo8' \G
*************************** 1. row ***************************
           Name: demo8
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 20187
 Avg_row_length: 78
    Data_length: 1589248
Max_data_length: 0
   Index_length: 2785280
      Data_free: 4194304
 Auto_increment: 20001
    Create_time: 2023-05-16 16:36:53
    Update_time: 2023-05-16 16:38:21
     Check_time: NULL
      Collation: utf8mb4_0900_ai_ci
       Checksum: NULL
 Create_options: 
        Comment: 
1 row in set (0.00 sec)


A lot of statistics show up, but for now we only care about two of them:

  • Rows: this option indicates the number of records in the table. For tables using the MyISAM storage engine this value is exact; for tables using the InnoDB storage engine it is an estimate. As the query result shows, although our demo8 table (an InnoDB table) actually holds 20,000 records, the Rows value displayed by show table status is 20187.

  • Data_length: this option indicates the number of bytes of storage space occupied. For a table using the MyISAM storage engine, this is the size of the data file. For a table using the InnoDB storage engine, it is the size of the storage space occupied by the clustered index.

    That means the value can be decomposed like this:

    Data_length = number of clustered index pages × page size

    Our demo8 table uses the default 16KB page size, so from the query result above we can calculate the number of clustered index pages:

    number of clustered index pages = 1589248 ÷ 16 ÷ 1024 = 97
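If you prefer doing this arithmetic in SQL, the same numbers are available from information_schema.TABLES, whose TABLE_ROWS and DATA_LENGTH columns carry the same estimates that SHOW TABLE STATUS reports (a small sketch, assuming the default 16KB page size):

mysql> select table_rows,
              data_length,
              data_length / 16 / 1024 as clustered_index_pages
       from information_schema.tables
       where table_schema = 'testdb' and table_name = 'demo8';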

Now that we have the estimated number of clustered index pages and the number of records in the table, the cost of the full table scan is calculated as follows:

  • I/O cost: 97 × 0.25 = 24.25
    97 is the number of pages occupied by the clustered index; 0.25 is the cost constant for loading one page
  • CPU cost: 20187 × 0.1 = 2018.7
    20187 is the number of records according to the statistics (an estimate for InnoDB); 0.1 is the cost constant for accessing one record
  • Total cost: 24.25 + 2018.7 = 2042.95

To sum up, the total cost of a full table scan on demo8 is 2042.95. Enough talk, let's verify it:

mysql> explain format=json select * from demo8 ;

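The JSON output looks roughly like this (an abridged sketch built from the numbers we just calculated; the key field is cost_info.query_cost, and your values may differ slightly with different statistics):

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "2042.95"
    },
    "table": {
      "table_name": "demo8",
      "access_type": "ALL",
      "rows_examined_per_scan": 20187,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "24.25",
        "eval_cost": "2018.70",
        "prefix_cost": "2042.95"
      }
    }
  }
}

Note that read_cost is the I/O side (24.25) and eval_cost is the CPU side (20187 × 0.1 = 2018.70), exactly the two components we computed by hand.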

Tip:
We said earlier that table records are stored in the leaf nodes of the B+ tree of the clustered index, so once we reach the leftmost leaf node from the root we can traverse every record along the doubly linked list of leaf nodes. In other words, a full table scan does not actually touch every internal node of the B+ tree. But when MySQL calculates the cost of a full table scan, it simply uses the total number of pages occupied by the clustered index as the basis for the I/O cost; it does not distinguish internal nodes from leaf nodes, which is a bit crude. Just be aware of it.

Step 3: Calculate the cost of executing the query with different indexes

From the analysis in step 1 we know the above query may use the idx_key1 and idx_key2 indexes. We need to analyze the cost of executing the query with each of these indexes alone, and finally consider whether index merge is possible. Note that the MySQL query optimizer first analyzes the cost of unique secondary indexes and then ordinary secondary indexes, so we likewise look at the cost of using idx_key2 first and then idx_key1.

Cost of queries performed using idx_key2

The search condition corresponding to idx_key2 is key2 > 10 and key2 < 1000, which corresponds to the range interval (10, 1000). A search using idx_key2 looks like this:

(figure: searching idx_key2 over the range interval (10, 1000))

For a query executed as secondary index + table lookup (回表), MySQL's cost calculation depends on two pieces of data:

  • The number of range intervals: no matter how many pages a range of the secondary index occupies, the query optimizer crudely assumes that the I/O cost of reading one range interval of an index equals the cost of reading one page. In this example there is only one range on idx_key2, namely (10, 1000), so the I/O cost of accessing this range of the secondary index is: 1 × 0.25 = 0.25

  • The number of records needing table lookups: the optimizer needs to know how many records fall within a range of the secondary index; for this example, how many idx_key2 records lie in the interval (10, 1000). The calculation goes as follows:

    • Step 1: following the condition key2 > 10, descend the B+ tree of idx_key2 and find the first record satisfying it. We call it the leftmost record of the interval. As we said before, locating a record in a B+ tree is constant-time and extremely fast, so the cost of this step is negligible.

    • Step 2: likewise, following the condition key2 < 1000, find the last record in the idx_key2 B+ tree satisfying it. We call it the rightmost record of the interval; the cost of this step is also negligible.

    • Step 3: if the leftmost and rightmost records of the interval are not too far apart (in MySQL 5.7.21, no more than 10 pages apart), the number of secondary index records satisfying key2 > 10 AND key2 < 1000 can be counted exactly. Otherwise, MySQL reads 10 pages to the right starting from the leftmost record, computes the average number of records per page, and multiplies that average by the number of pages between the leftmost and rightmost records. The question then becomes: how do we estimate how many pages lie between the two records? To answer that, we go back to the structure of the B+ tree index:

      (figure: B+ tree index structure, with leaf pages 34 through 40 under their parent directory page 42)

    • As shown in the figure, suppose the leftmost record of the interval is on page 34 and the rightmost record is on page 40. To count the pages between the leftmost and rightmost records, notice that each directory entry record corresponds to one data page, so counting the pages between page 34 and page 40 is equivalent to counting the directory entry records between their corresponding entries in the parent node (page 42). If pages 34 and 40 are very far apart (their directory entries are not in the same parent page), the counting continues recursively one level up. We said before that even a B+ tree of 4 levels is already very tall, so this statistic, carried out on parent node pages, is not very expensive.

After knowing how to count the number of records in a range of a secondary index, let's return to the real problem. By the algorithm above, there are about 989 idx_key2 records in the interval (10, 1000). The CPU cost of reading these secondary index records is: 989 × 0.1 + 0.01 = 98.91

where 989 is the number of secondary index records to read, 0.1 is the cost constant for reading one record, and 0.01 is a small fine-tuning value the optimizer adds (see the tip below).

After obtaining the records through the secondary index, two more things need to be done:

  • Use the primary key values in these records to look up the complete rows in the clustered index (the table lookup)

    Look closely here: MySQL's evaluation of the I/O cost of table lookups is quite bold. It treats each table lookup as equivalent to accessing one page; that is, however many records the secondary index range contains, that many table lookups, and thus that many page I/Os, are assumed. By the statistics above, a query via the idx_key2 secondary index is estimated to need table lookups for 989 secondary index records, so the I/O cost of the table lookups is: 989 × 0.25 = 247.25

    where 989 is the estimated number of secondary index records and 0.25 is the I/O cost constant for one page.

  • Check whether the complete records obtained from the table lookups satisfy the remaining search conditions

    The essence of the table lookup is to find the complete user record in the clustered index via the primary key value of the secondary index record, and then check the search conditions other than key2 > 10 and key2 < 1000. Since the range interval yielded 989 secondary index records, there are 989 corresponding complete user records in the clustered index, and the CPU cost of reading them and checking the remaining search conditions is: 989 × 0.1 = 98.9

    where 989 is the number of records to check and 0.1 is the cost constant for checking whether one record satisfies the given search conditions.

So the cost of executing a query using idx_key2 in this example is as follows:

  • I/O cost: 1 × 0.25 + 989 × 0.25 = 247.5 (one range interval plus the estimated 989 secondary index records, each counted as one page read)

  • CPU cost: 989 × 0.1 + 0.01 + 989 × 0.1 = 197.81 (the cost of reading the secondary index records, including the 0.01 fine-tuning value, plus the cost of reading and checking the clustered index records after the table lookups)

To sum up, the total cost of executing the query via idx_key2 is: 247.5 + 197.81 = 445.31. Enough talk, let's verify it:

mysql> explain format=json select * from demo8 where key2 > 10 and key2 < 1000 and key3 > key2 and key_part1 like '%3f%' and common_field='1281259';

(screenshot: EXPLAIN FORMAT=JSON output with query_cost ≈ 445.31)

Tip:
When an index is used, the optimizer applies a fine-tuning value to the cost of reading the secondary index records, but not to reading the clustered index records: every record in the scan interval that is looked up in the clustered index is simply counted as one page read. When no index is used, the fine-tuning value is computed differently; the details differ from the former case.

Cost of queries performed using idx_key1

The search condition corresponding to idx_key1 is key1 in ('aa','bb','cc'), which is equivalent to three single-point intervals:

  • ['aa','aa']
  • ['bb','bb']
  • ['cc','cc']

A search using idx_key1 looks like this:

(figure: three single-point interval scans on idx_key1)

As with idx_key2, we need the number of range intervals idx_key1 has to access and the number of records needing table lookups.

  • Number of range intervals: a query via idx_key1 obviously has three single-point intervals, so the I/O cost of accessing these three ranges of the secondary index is: 3 × 0.25 = 0.75
  • Number of records needing table lookups:
    • Counting the secondary index records in the single-point interval ['aa','aa'] works the same way as for a continuous range interval: locate the leftmost and rightmost records of the interval, then count the records between them. The method was covered above, so we won't repeat it. The result: the single-point interval ['aa','aa'] contains 67 secondary index records
    • the single-point interval ['bb','bb'] contains 88 secondary index records
    • the single-point interval ['cc','cc'] contains 75 secondary index records

So the total number of records needing table lookups across the three single-point intervals is: 67 + 88 + 75 = 230, and the CPU cost of reading these secondary index records is: 230 × 0.1 + 0.01 = 23.01

With the total number of table lookups in hand, we then consider:

  • using the primary key values in these records to look up the complete rows in the clustered index, at an I/O cost of: 230 × 0.25 = 57.5
  • reading the complete user records after the lookups and checking whether the other search conditions hold, at a CPU cost of: 230 × 0.1 = 23

So the cost of executing a query using idx_key1 in this example is as follows:

  • I/O cost: 0.75 + 57.5 = 58.25
  • CPU cost: 23.01 + 23 = 46.01

To sum up, the total cost of executing the query via idx_key1 is: 58.25 + 46.01 = 104.26. Enough talk, let's verify it:

mysql> explain format=json select * from demo8 where key1 in ('aa','bb','cc') and key2 > 10 and key2 < 1000 and key3 > key2 and key_part1 like '%3f%' and common_field='1281259';

(screenshot: EXPLAIN FORMAT=JSON output with query_cost ≈ 104.26)

Is it possible to use index merge?

In this example, the search conditions on key1 and key2 are connected with AND, but the idx_key2 condition is a range query, so the secondary index records it finds are not sorted by primary key value. That fails the requirements for an Intersection index merge of idx_key1 and idx_key2, so no index merge will be used.

Tip:
The algorithm the MySQL query optimizer uses to calculate the cost of index merge is also tedious, so we skip it here. Just understand how costs are calculated and know that MySQL selects indexes according to this logic.

Step 4: Compare the costs of various execution plans and find the one with the lowest cost

The possible schemes for executing this query and their corresponding costs are listed below:

  • cost of a full table scan: 2042.95
  • cost of using idx_key2: 445.31
  • cost of using idx_key1: 104.26

Obviously, using idx_key1 has the lowest cost, so idx_key1 is naturally chosen to execute the query.

2.3 Cost calculation based on index statistics

Sometimes a query executed with an index has very many single-point intervals; the IN operator makes this especially easy to produce, as in the query below (the ... stands for a long list of parameters):

select * from demo8 where key1 in ('aa', 'bb', 'cc', ... , 'ee');

Obviously, the possible index for this query is idx_key1. Since idx_key1 is not a unique secondary index, the number of records in each single-point interval is not known up front and has to be computed: find the leftmost and rightmost records of the interval in the index's B+ tree, then count the records between them (exactly when the count is small, otherwise as an estimate). MySQL calls this way of counting the index records in a range by diving directly into the index's B+ tree an index dive.

Tip:
'Dive' literally means to plunge or swoop down, which is exactly the picture: an index dive descends the B+ tree of the index directly to count how many records fall within a given range.
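If you want to watch the optimizer make this decision, the optimizer trace reports whether index dives were used for the equality ranges; look for index_dives_for_eq_ranges under range_scan_alternatives in the trace (a minimal sketch):

mysql> set optimizer_trace = 'enabled=on';
mysql> select * from demo8 where key1 in ('aa','bb','cc');
mysql> select trace from information_schema.optimizer_trace \G
mysql> set optimizer_trace = 'enabled=off';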

If there are only a handful of single-point intervals, using index dives to count their records is no problem. But some people insist on stuffing enormous lists into an IN clause. If an IN list has 20,000 parameters, the query optimizer would have to perform 20,000 index dives just to count the records in all those single-point intervals, and the cost of that estimation alone could exceed the cost of a full table scan. MySQL has of course considered this situation and provides the system variable eq_range_index_dive_limit. Let's look at its default value:

mysql> show variables like '%dive%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| eq_range_index_dive_limit | 200   |
+---------------------------+-------+
1 row in set (0.00 sec)

That is to say, if the number of parameters in our IN list is less than 200, index dives are used to count the records in each single-point interval; if it is 200 or more, index dives are abandoned and the counts are estimated from so-called index statistics. What kind of estimate? Read on.
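If your workload legitimately sends long IN lists and you still want the more precise index dives, the limit can be raised (a sketch; 1000 is an arbitrary example value, and the variable can be set per session):

mysql> set session eq_range_index_dive_limit = 1000;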

Besides the per-table statistics, MySQL also maintains statistics for each index in a table. The syntax for viewing a table's index statistics is show index from <table name>. For example, to look at the statistics of each index in demo8 we can write:

mysql> show index from demo8;
+-------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name     | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| demo8 |          0 | PRIMARY      |            1 | id          | A         |       18750 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| demo8 |          0 | idx_key2     |            1 | key2        | A         |       18565 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| demo8 |          1 | idx_key1     |            1 | key1        | A         |         256 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| demo8 |          1 | idx_key3     |            1 | key3        | A         |        4053 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| demo8 |          1 | idx_key_part |            1 | key_part1   | A         |       16122 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| demo8 |          1 | idx_key_part |            2 | key_part2   | A         |       18570 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| demo8 |          1 | idx_key_part |            3 | key_part3   | A         |       18570 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
7 rows in set (0.02 sec)

Quite a few attributes show up, but none are hard to understand. Here is a brief introduction:

  • Table: the name of the table the index belongs to
  • Non_unique: whether the index column's values can repeat; 0 for the clustered index and unique secondary indexes, 1 for ordinary secondary indexes
  • Key_name: the index name
  • Seq_in_index: the position of the column within the index, counting from 1. For the composite index idx_key_part, the columns key_part1, key_part2, and key_part3 have positions 1, 2, and 3 respectively
  • Column_name: the name of the index column
  • Collation: how the values in the index column are sorted; A means ascending, D descending, NULL not sorted
  • Cardinality: the number of distinct values in the index column. We will focus on this attribute below
  • Sub_part: for string or byte-string columns we sometimes index only the first n characters or bytes; this attribute is that n. If the complete column is indexed, the value is NULL
  • Packed: how the index column is compressed; NULL means not compressed. We haven't covered this attribute yet, so ignore it for now
  • Null: whether the index column allows NULL values
  • Index_type: the type of the index; the most common is BTREE, i.e. the B+ tree index
  • Comment: comment information about the index column
  • Index_comment: the comment given to the index when it was created

Apart from Packed, none of the attributes above should give you trouble; if any do, you must have skimmed the earlier articles. The attribute we care about most right now is Cardinality, which literally means 'the size of a set', here the number of distinct values in the index column. For a table with 10,000 records, a Cardinality of 10000 on an index column means the column contains no duplicate values, while a Cardinality of 1 means all of its values are duplicates. Note, however, that for the InnoDB storage engine the Cardinality shown by show index is an estimate, not an exact number. We will discuss how Cardinality is computed later; first let's see what it is used for.

As mentioned earlier, when the number of parameters in an IN list is >= the system variable eq_range_index_dive_limit, index dives are not used to compute the number of index records in each single-point interval; so-called index statistics are used instead. The index statistics here refer to two values:

  • the Rows value shown by the show table status statement, i.e. how many records the table holds

  • the Cardinality attribute shown by the show index statement

Combined with the Rows statistic, we can compute the average number of times each value in an index column repeats: repetitions of one value ≈ Rows ÷ Cardinality. Take the idx_key1 index of the demo8 table as an example: its Rows value is 20187 and the Cardinality of the key1 column is 256, so the average number of repetitions of a single key1 value is: 20187 ÷ 256 ≈ 79
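The same estimate can be reproduced in SQL, since information_schema exposes the two statistics used here: TABLE_ROWS matches the Rows of SHOW TABLE STATUS and CARDINALITY matches SHOW INDEX (a sketch):

mysql> select t.table_rows,
              s.cardinality,
              t.table_rows / s.cardinality as avg_repetitions
       from information_schema.tables t
       join information_schema.statistics s
         on s.table_schema = t.table_schema and s.table_name = t.table_name
       where t.table_schema = 'testdb'
         and t.table_name = 'demo8'
         and s.index_name = 'idx_key1';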

Now look at the query statement above:

select * from demo8 where key1 in ('aa', 'bb', 'cc', ... , 'ee');

Assuming the IN list has 20000 parameters, the optimizer uses the statistics directly to estimate the record count of each single-point interval: each parameter corresponds to roughly 79 records, so the total number of records needing table lookups is estimated as: 20000 × 79 = 1580000

Estimating from index statistics is much cheaper than index dives, but it has a fatal weakness: it is imprecise! A query cost computed from statistics can differ wildly from the actual cost.

Tip:
If a query uses IN but the expected index is not chosen, consider whether eq_range_index_dive_limit is set too small.

3. The cost of join queries

3.1 Prepare data

A join query needs at least two tables, and our single demo8 table is not enough. For the story to proceed smoothly, we simply create two tables, s1 and s2, with exactly the same structure as demo8:

mysql> create table s1 (    
id int not null auto_increment,    
key1 varchar(100),    
key2 int,    
key3 varchar(100),    
key_part1 varchar(100),    
key_part2 varchar(100),    
key_part3 varchar(100),    
common_field varchar(100), 
primary key (id),
key idx_key1 (key1),    
unique key idx_key2 (key2),    
key idx_key3 (key3),    
key idx_key_part(key_part1, key_part2, key_part3));
Query OK, 0 rows affected (0.04 sec)

mysql> create table s2 (    
id int not null auto_increment,    
key1 varchar(100),    
key2 int,    
key3 varchar(100),    
key_part1 varchar(100),    
key_part2 varchar(100),    
key_part3 varchar(100),    
common_field varchar(100), 
primary key (id),
key idx_key1 (key1),    
unique key idx_key2 (key2),    
key idx_key3 (key3),    
key idx_key_part(key_part1, key_part2, key_part3));
Query OK, 0 rows affected (0.04 sec)

mysql> insert into s1 select * from demo8;
Query OK, 20000 rows affected (0.83 sec)
Records: 20000  Duplicates: 0  Warnings: 0

mysql> insert into s2 select * from demo8;
Query OK, 20000 rows affected (0.89 sec)
Records: 20000  Duplicates: 0  Warnings: 0

3.2 Condition filtering introduction

As we said before, MySQL executes join queries with the nested-loop join algorithm: the driving table is accessed once, while the driven table may be accessed many times. So for a two-table join, the query cost consists of the following two parts:

  • the cost of the single query on the driving table
  • the cost of querying the driven table multiple times (exactly how many times depends on how many records the query on the driving table yields)

We call the number of records obtained by querying the driving table the driving table's fan-out (English: fanout). Obviously, the smaller the driving table's fan-out, the fewer times the driven table is queried and the lower the total cost of the join. When the optimizer calculates the cost of a whole join query, it therefore needs the driving table's fan-out value. Sometimes the fan-out is easy to compute, as in the following two queries:

Query one:

select * from s1 inner join s2;

Assuming the s1 table is the driving table, the single-table query on it can obviously only be a full table scan, and the fan-out value is equally clear: however many records the driving table has, that is the fan-out. The statistics say the s1 table has 20250 records, so the optimizer uses 20250 directly as s1's fan-out value.

Query two:

select * from s1 inner join s2 where s1.key2 > 10 and s1.key2 < 1000;

Still assuming the s1 table is the driving table, the single-table query on it can obviously use the idx_key2 index, and the fan-out value is the number of records in idx_key2's range interval (10, 1000). We calculated earlier that this range holds 989 records, so in this query the optimizer uses 989 as the fan-out value of the driving table s1.

Of course, things don't always go smoothly, or the plot would be too flat. Sometimes computing the fan-out value gets tricky, as in the following query:

Query three:

select * from s1 inner join s2 where s1.common_field > 'xyz'

This query is similar to query one, except that the driving table s1 has an extra search condition: common_field > 'xyz'. The query optimizer will not actually execute the query, so it can only guess how many of the 20250 records satisfy the common_field > 'xyz' condition.

Query four:

select * from s1 inner join s2 where s1.key2 > 10 and s1.key2 < 1000 and s1.common_field > 'xyz'

Query four is similar to query two. However, because it can use the idx_key2 index, the optimizer only needs to guess how many of the records within the secondary index range satisfy common_field > 'xyz', i.e. how many of the 989 records satisfy that condition.

Query five:

select * from s1 inner join s2 where s1.key2 > 10 and s1.key2 < 1000 and s1.key1 in('aa','bb','cc') and s1.common_field > 'xyz'

This query is similar to query two, but if the driving table s1 uses the idx_key1 index to execute the query, the optimizer needs to guess how many of the records within its single-point intervals satisfy these two conditions:

  • key2 > 10 and key2 < 1000
  • common_field > 'xyz'

That is, the optimizer needs to guess how many of the 230 records satisfy the two conditions above.

Having said all this, the point is that in the following two situations, calculating the driving table's fan-out value requires guessing:

  • if the driving table is accessed by a full table scan, the optimizer must guess how many records satisfy the search conditions when calculating its fan-out
  • if the driving table is accessed via an index, the optimizer must guess how many of the records in the index's scan intervals satisfy the search conditions other than those used by the index

MySQL calls this guessing process condition filtering. The process may use indexes or statistics, or it may be pure guesswork on MySQL's part; the whole evaluation is quite involved, so we skip the details. You can, however, see the guess surface in EXPLAIN, as shown below.
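The guess shows up in the filtered column of EXPLAIN, which gives the percentage of examined rows the optimizer expects to survive the remaining conditions (a sketch; the join condition on key1 is added only to make this a realistic join, and the exact percentage depends on your data):

mysql> explain select * from s1 inner join s2 on s1.key1 = s2.key1
       where s1.key2 > 10 and s1.key2 < 1000 and s1.common_field > 'xyz';

For the driving table s1, multiplying its rows value by its filtered percentage gives the fan-out the optimizer will use.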

3.3 Cost analysis of multi-table joins

First, consider how many join orders are possible in a multi-table join:

  • joining two tables A and B allows only two orders, AB and BA, i.e. 2 × 1 = 2 join orders
  • joining three tables A, B, and C allows six orders: ABC, ACB, BAC, BCA, CAB, and CBA, i.e. 3 × 2 × 1 = 6 join orders
  • joining four tables allows 4 × 3 × 2 × 1 = 24 join orders
  • joining n tables allows n × (n-1) × (n-2) × ··· × 1 join orders, that is, n! join orders

4. Adjusting cost constants

We introduced two cost constants earlier:

  • the default cost of reading a page: 0.25
  • the default cost of checking whether a record matches the search conditions: 0.1

In fact, MySQL supports many more cost constants besides these two. They are stored in two tables of the mysql database (a system database we introduced before):

mysql> show tables from mysql like '%cost%';
+--------------------------+
| Tables_in_mysql (%cost%) |
+--------------------------+
| engine_cost              |
| server_cost              |
+--------------------------+
2 rows in set (0.06 sec)

As we said before, the execution of a statement is actually divided into two layers:

  • server layer
  • storage engine layer

The server layer handles connection management, the query cache, syntax parsing, query optimization, and similar operations, while the actual data access is performed at the storage engine layer. In other words, the cost of operations executed at the server layer does not depend on which storage engine a table uses, so the cost constants for those operations are stored in the server_cost table, while the cost constants for operations that do depend on the storage engine are stored in the engine_cost table.

4.1 server_cost table

The server_cost table stores the cost constants for operations performed at the server layer:

mysql> select * from mysql.server_cost;
+------------------------------+------------+---------------------+---------+---------------+
| cost_name                    | cost_value | last_update         | comment | default_value |
+------------------------------+------------+---------------------+---------+---------------+
| disk_temptable_create_cost   |       NULL | 2023-04-24 19:39:12 | NULL    |            20 |
| disk_temptable_row_cost      |       NULL | 2023-04-24 19:39:12 | NULL    |           0.5 |
| key_compare_cost             |       NULL | 2023-04-24 19:39:12 | NULL    |          0.05 |
| memory_temptable_create_cost |       NULL | 2023-04-24 19:39:12 | NULL    |             1 |
| memory_temptable_row_cost    |       NULL | 2023-04-24 19:39:12 | NULL    |           0.1 |
| row_evaluate_cost            |       NULL | 2023-04-24 19:39:12 | NULL    |           0.1 |
+------------------------------+------------+---------------------+---------+---------------+
6 rows in set (0.00 sec)

Let's first look at what each column of server_cost means:

  • cost_name: the name of the cost constant
  • cost_value: the value of the cost constant; if this column is NULL, the constant uses its default value
  • last_update: when the record was last updated
  • comment: a comment
  • default_value: the default value

From the contents of server_cost we can see the cost constants for server-layer operations:

  • disk_temptable_create_cost (default 20): the cost of creating a disk-based temporary table. Increasing this value makes the optimizer create as few disk-based temporary tables as possible
  • disk_temptable_row_cost (default 0.5): the cost of writing or reading one record of a disk-based temporary table. Increasing this value likewise discourages disk-based temporary tables
  • key_compare_cost (default 0.05): the cost of comparing two records, mostly used in sorting. Increasing this value raises the cost of filesort, making the optimizer more inclined to use an index for ordering instead of filesort
  • memory_temptable_create_cost (default 1): the cost of creating a memory-based temporary table. Increasing this value makes the optimizer create as few memory-based temporary tables as possible
  • memory_temptable_row_cost (default 0.1): the cost of writing or reading one record of a memory-based temporary table. Increasing this value likewise discourages memory-based temporary tables
  • row_evaluate_cost (default 0.1): the cost of checking whether a record matches the search conditions (the constant we have been using all along). Increasing this value makes the optimizer more inclined to use indexes rather than full table scans

(The defaults listed here are the default_value column from the query above; MySQL 5.7 used larger values, such as 40 for disk_temptable_create_cost and 0.2 for row_evaluate_cost.)

Tip:
Under certain special conditions, MySQL may create an internal temporary table while executing queries such as DISTINCT queries, grouping queries, UNION queries, and some sorted queries, and use that temporary table to help complete the query (for example, a DISTINCT query can create a temporary table with a UNIQUE index and insert the records to deduplicate into it; what remains after the inserts is the result set). When the data volume is large, a disk-based temporary table may be created, using a storage engine such as MyISAM or InnoDB; when the volume is small, a memory-based temporary table using the Memory storage engine is created. As you can tell, creating a temporary table and writing to and reading from it are all expensive operations.

The initial cost_value of every constant in server_cost is NULL, which means the optimizer uses the default values to calculate costs. If we want to modify a cost constant, two steps are needed:

Step 1: Update the cost constant we are interested in

For example, if we want to raise the cost of checking whether a record matches the search conditions to 0.4, we can write an update statement like this:

update mysql.server_cost set cost_value = 0.4 where cost_name = 'row_evaluate_cost';

Step 2: Make the system reload the values of this table with the following statement:

flush optimizer_costs;

Of course, if after modifying a cost constant you want to restore its default value, simply set cost_value back to NULL, then use the flush optimizer_costs statement to make the system reload it.
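For example, restoring row_evaluate_cost to its default is a direct application of the two steps above:

update mysql.server_cost set cost_value = NULL where cost_name = 'row_evaluate_cost';

flush optimizer_costs;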

4.2 engine_cost table

The engine_cost table stores the cost constants for operations performed at the storage engine layer:

mysql> select * from mysql.engine_cost;
+-------------+-------------+------------------------+------------+---------------------+---------+---------------+
| engine_name | device_type | cost_name              | cost_value | last_update         | comment | default_value |
+-------------+-------------+------------------------+------------+---------------------+---------+---------------+
| default     |           0 | io_block_read_cost     |       NULL | 2023-04-24 19:39:12 | NULL    |             1 |
| default     |           0 | memory_block_read_cost |       NULL | 2023-04-24 19:39:12 | NULL    |          0.25 |
+-------------+-------------+------------------------+------------+---------------------+---------+---------------+
2 rows in set (0.01 sec)

Compared with server_cost, engine_cost has two extra columns:

  • engine_name: the storage engine the cost constant applies to. A value of default means the constant applies to all storage engines
  • device_type: the device type the storage engine uses, intended mainly to distinguish mechanical hard disks from solid-state drives; however, MySQL does not currently make this distinction, so the value defaults to 0

We can see from the contents of the engine_cost table that there are only two storage engine cost constants currently supported:

  • io_block_read_cost (default 1.0): the cost of reading one block from disk. Note the word block, not page: for the InnoDB storage engine a page is a block, but for MyISAM a block defaults to 4096 bytes. Increasing this value raises the I/O cost and may make the optimizer more inclined to use indexes rather than full table scans
  • memory_block_read_cost (default 0.25): similar to the previous constant, except it measures the cost of reading one block from memory

Seeing these two defaults, you may wonder: why does reading a block from memory cost differently than reading one from disk? This is mainly because, as MySQL has evolved, it has become able to predict which blocks are on disk and which are already in memory, and the two constants let it price those reads differently.

Just as with server_cost, we can change a storage-engine cost constant by updating records in the engine_cost table, and we can also add a cost constant specific to one storage engine by inserting a new record into it:

Step 1: Insert a cost constant for a specific storage engine.
For example, to raise the I/O cost of reading a page for the InnoDB storage engine, we insert a row (listing the columns explicitly, since the default_value column is generated automatically and cannot be written):

insert into mysql.engine_cost (engine_name, device_type, cost_name, cost_value, last_update, comment) values ('innodb', 0, 'io_block_read_cost', 2.0, current_timestamp, 'increase innodb i/o cost');

Step 2: Let the system reload the value of this table using the following statement:

flush optimizer_costs;
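If you later want to remove this InnoDB-specific constant and fall back to the default row, deleting the record and reloading works the same way (a sketch of the reverse operation):

delete from mysql.engine_cost
where engine_name = 'innodb' and cost_name = 'io_block_read_cost';

flush optimizer_costs;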

That's the end of today's study. I hope you become an indestructible version of yourself
~~~

You can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future. You have to trust in something - your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.

If my content helps you, please like, comment, and save. Creating is not easy, and everyone's support is the motivation that keeps me going.


Origin blog.csdn.net/liang921119/article/details/130779501