MySQL Optimization Lesson 5: MySQL Index Optimization Practice II

Paging query optimization

DROP TABLE IF EXISTS `employees`;
CREATE TABLE `employees` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(24) NOT NULL DEFAULT '' COMMENT 'name',
  `age` INT(11) NOT NULL DEFAULT '0' COMMENT 'age',
  `position` VARCHAR(20) NOT NULL DEFAULT '' COMMENT 'position',
  `hire_time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'hire time',
  PRIMARY KEY (`id`),
  KEY `idx_name_age_position` (`name`,`age`,`position`) USING BTREE
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='employee records';

In many business systems, the paging feature is implemented with SQL like the following:

select * from employees limit 10000,10;

This takes 10 rows starting from row 10001 of the employees table. It looks like only 10 records are queried, but this SQL actually reads the first 10010 records, discards the first 10000, and then returns the 10 rows we want. Paging deep into a large table is therefore very inefficient.

Common paging scene optimization techniques:

1. Paging query sorted by an auto-incrementing, continuous primary key

Let's first look at an example of a paginated query sorted by an auto-incrementing and continuous primary key:

SELECT * FROM employees LIMIT 90000,5;


This SQL queries five rows starting from row 90001. With no explicit order by, the rows come back sorted by the primary key. Looking at the employees table, because its primary key is auto-incrementing and continuous, the query can be rewritten to fetch five rows starting after id 90000:

SELECT * FROM employees WHERE id > 90000 LIMIT 5;


The query results are consistent. Let's compare the execution plan again:

explain SELECT * FROM employees LIMIT 90000,5;


EXPLAIN SELECT * FROM employees WHERE id > 90000 LIMIT 5;

Obviously, the rewritten SQL uses the index, scans far fewer rows, and executes much faster.
However, this rewrite is impractical in many scenarios: once some records are deleted, the primary key sequence has gaps, and the two queries no longer return the same results.
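A common workaround for the gap problem is "keyset" paging: instead of computing an offset, the application remembers the largest id of the previous page and asks for the rows after it. This is a sketch, not from the original article; the value 90000 stands in for whatever id the previous page actually ended at:

```sql
-- Keyset-style paging sketch: 90000 is assumed to be the last id returned
-- by the previous page. Gaps in the id sequence no longer matter because
-- we never assume that id equals row position.
SELECT * FROM employees
WHERE id > 90000        -- last id seen on the previous page
ORDER BY id
LIMIT 5;
```

The explicit ORDER BY id keeps the ordering guarantee even when ids are not continuous; the trade-off is that the client can only page forward from a known position, not jump to an arbitrary page number.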

2. Paging query sorted by non-primary key fields

Let's look at a paging query sorted by non-primary key fields. The SQL is as follows:

SELECT * FROM employees ORDER BY NAME LIMIT 90000,5;


EXPLAIN SELECT * FROM employees ORDER BY NAME LIMIT 90000,5;

The index on the name field is not used (the key column in the plan is NULL). The reason: scanning the entire index and then looking up the non-indexed columns row by row (which may mean traversing multiple index trees) costs more than scanning the whole table, so the optimizer abandons the index.
Knowing why the index is skipped, how do we optimize?
The key is to make the sort return as few fields as possible: let the sort-and-page step find only the primary keys, then fetch the full records by primary key. The SQL is rewritten as follows:

explain SELECT * FROM employees e INNER JOIN (SELECT id FROM employees ORDER BY NAME LIMIT 90000,5) ed ON e.id = ed.id;

The result is consistent with the original SQL, and the execution time drops by more than half. The original SQL sorts with filesort, while the optimized SQL sorts using the index.

Join association query optimization

CREATE TABLE `t1` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `a` INT(11) DEFAULT NULL,
  `b` INT(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

CREATE TABLE t2 LIKE t1;

-- insert some sample data
-- insert 10,000 rows into t1
DROP PROCEDURE IF EXISTS insert_t1;
DELIMITER ;;
CREATE PROCEDURE insert_t1()
BEGIN
  DECLARE i INT;
  SET i = 1;
  WHILE (i <= 10000) DO
    INSERT INTO t1(a,b) VALUES(i,i);
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
-- call the stored procedure
CALL insert_t1();

-- insert 100 rows into t2
DROP PROCEDURE IF EXISTS insert_t2;
DELIMITER ;;
CREATE PROCEDURE insert_t2()
BEGIN
  DECLARE i INT;
  SET i = 1;
  WHILE (i <= 100) DO
    INSERT INTO t2(a,b) VALUES(i,i);
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
CALL insert_t2();

There are two common table-join algorithms in MySQL:

  • Nested-Loop Join (NLJ) algorithm
  • Block Nested-Loop Join (BNL) algorithm

1. Nested-Loop Join (NLJ) algorithm

Read one row at a time from the first table (the driving table), take the join field from that row, look up the matching rows in the other table (the driven table) by that field, and merge the results of the two tables.

EXPLAIN SELECT * FROM t1 INNER JOIN t2 ON t1.a=t2.a;

This information can be seen from the execution plan:

  • The driving table is t2 and the driven table is t1. The driving table is processed first (when the ids in the execution plan are equal, rows execute from top to bottom); the optimizer usually picks the smaller table as the driving table, so with inner join the table written first is not necessarily the driving table.
  • With left join, the left table is the driving table and the right table is the driven table; with right join, it is the opposite. With inner join, MySQL takes the table with the smaller data set as the driving table and the larger one as the driven table.
  • The NLJ algorithm is used. In an ordinary join statement, if "Using join buffer" does not appear in the Extra column of the execution plan, the join algorithm is NLJ.

The general process of the above sql is as follows:

  1. Read one row from table t2 (if t2 has filter conditions, read one row from the filtered result);
  2. Take the join field a from that row and look it up in table t1;
  3. Merge the matching rows from t1 with the row from t2 and return them to the client as part of the result;
  4. Repeat steps 1–3.

The whole process reads all rows of t2 (100 rows scanned), and for each row uses the value of a to probe the index of t1 (100 index lookups on t1; each lookup can be treated as ultimately reading one full row of t1, i.e. about 100 rows of t1 are also scanned). So the whole process scans roughly 200 rows.

If the join field of the driven table has no index, the NLJ algorithm performs poorly (explained below), and MySQL will choose the Block Nested-Loop Join algorithm instead.

2. Block Nested-Loop Join (BNL) algorithm

Read the rows of the driving table into join_buffer, then scan the driven table and compare each of its rows against the data in join_buffer.

EXPLAIN SELECT * FROM t1 INNER JOIN t2 ON t1.b=t2.b;

Using join buffer (Block Nested Loop) in Extra indicates that the associated query uses the BNL algorithm.
(If you run this on MySQL 8.0, the plan may look different: 8.0 introduced a new join algorithm, Hash Join.)

The general process of the above sql is as follows:

  1. Read all the data of table t2 into join_buffer;
  2. Take each row of table t1 and compare it with the data in join_buffer;
  3. Return the rows that satisfy the join condition.

The whole process performs a full table scan on both t1 and t2, so the total number of rows scanned is 10000 (all of t1) + 100 (all of t2) = 10100. Since the data in join_buffer is unordered, each of the 10000 rows of t1 must be compared against all 100 buffered rows, i.e. 100 * 10000 = 1 million in-memory comparisons.

In this example t2 has only 100 rows. What if t2 were a large table that did not fit into join_buffer?

The size of join_buffer is set by the parameter join_buffer_size (default 256KB). If all of t2's data does not fit, the strategy is simple: load it in segments.

For example, suppose t2 had 1000 rows and join_buffer could hold only 800 at a time. The execution would be: put the first 800 rows into join_buffer, scan t1 and compare against the buffer to get part of the result; then empty join_buffer, load the remaining 200 rows of t2, and scan t1 again to compare. So t1 is scanned one extra time.
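You can inspect (and, for experiments, adjust) this buffer per session; the variable name join_buffer_size is the one mentioned above, and the 1MB value below is just an illustrative choice:

```sql
-- Inspect the current join buffer size (default 256KB, as noted above)
SHOW VARIABLES LIKE 'join_buffer_size';

-- Enlarge it for the current session only, e.g. to 1MB, so a larger
-- driving table fits in fewer segments and the other table is rescanned
-- fewer times
SET SESSION join_buffer_size = 1024 * 1024;
```

Setting it at session scope keeps the experiment from affecting other connections.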

The join field of the driven table has no index; why choose BNL instead of Nested-Loop Join?
If the second SQL above used Nested-Loop Join, each of t2's 100 rows would trigger a full scan of t1, i.e. 100 * 10000 = 1 million row reads from disk. With BNL the number of disk scans is far smaller, and BNL's in-memory comparison is much faster than disk access. Therefore, when the driven table's join field has no index, MySQL generally uses the BNL algorithm; when there is an index, it generally chooses NLJ, which outperforms BNL in that case.

Optimization for associative sql

  • Add an index to the associated field , and let MySQL choose the NLJ algorithm as much as possible when doing the join operation

  • Small tables drive large tables . When writing multi-table join SQL, if you clearly know which table is the small table, you can use the straight_join writing method to fix the connection driving method, saving the time for the MySQL optimizer to judge by itself.

Explanation of straight_join: straight_join works like join, but it forces the table on its left to drive the table on its right, overriding the optimizer's choice of join order.
For example: select * from t2 straight_join t1 on t2.a = t1.a; forces MySQL to use t2 as the driving table.

  • straight_join only applies to inner join, not to left join or right join (those already fix which table plays which role).

  • Let the optimizer decide whenever possible, because in most cases it is smarter than we are. Use straight_join with caution: a hand-picked execution order is not always more reliable than the optimizer's.

What counts as the "small table"?
When deciding which table should drive, apply each table's own filter conditions first; after filtering, compare the total volume of data each table contributes to the join. The table with the smaller volume is the "small table" and should be the driving table.
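Following the first tip above, a minimal sketch of the fix for the BNL query from earlier: index the join column b so the optimizer can switch back to NLJ (the index names idx_b are my own choice, not from the original article):

```sql
-- Index the join columns used by "on t1.b = t2.b"
ALTER TABLE t1 ADD INDEX idx_b (b);
ALTER TABLE t2 ADD INDEX idx_b (b);

-- Re-check the plan: "Using join buffer (Block Nested Loop)" should be
-- gone from Extra, and the driven table should show ref access on idx_b
EXPLAIN SELECT * FROM t1 INNER JOIN t2 ON t1.b = t2.b;
```

With the index in place, each probe of the driven table becomes an index lookup instead of a buffered full scan, which is exactly the NLJ case analyzed above.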

in and exists optimization

Principle: small tables drive large tables, i.e. a small data set drives a large data set.
in: when table B's data set is smaller than table A's, in is preferable to exists.

select * from A where id in (select id from B);
# equivalent to
for (select id from B) {
	select * from A where A.id = B.id
}

exists: when table A's data set is smaller than table B's, exists is preferable to in.
Each row of the outer query A is fed into subquery B for the condition check; whether the row is kept depends on whether the check returns true or false.

select * from A where exists (select 1 from B where B.id = A.id);
# equivalent to
for (select * from A) {
	select * from B where B.id = A.id
}
  1. EXISTS (subquery) only returns TRUE or FALSE, so SELECT * in the subquery can be replaced with SELECT 1; officially, the SELECT list is ignored during actual execution, so it makes no difference.
  2. The actual execution of an EXISTS subquery may be optimized rather than being the row-by-row comparison in our mental model.
  3. An EXISTS subquery can often be replaced with a JOIN; which is faster must be analyzed case by case.
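For point 3, a sketch of the JOIN rewrite using the same schematic tables A and B (note the caveat: if B.id is not unique, a plain join can duplicate rows from A, so DISTINCT is added as a guard):

```sql
-- EXISTS form (semi-join semantics: each row of A appears at most once)
SELECT * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.id = A.id);

-- Possible JOIN rewrite; DISTINCT guards against duplicated A rows
-- when B.id is not unique
SELECT DISTINCT A.* FROM A INNER JOIN B ON B.id = A.id;
```

When B.id is a primary or unique key the DISTINCT is unnecessary and the two forms return the same rows; comparing their EXPLAIN output is the way to decide which plan wins for a given data distribution.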

count(*) query optimization

-- temporarily disable the MySQL query cache to see the real execution time
-- of repeated runs (8.0 removed the query cache, so this is unnecessary there)
set global query_cache_size=0;
set global query_cache_type=0;
explain select count(1) from employees;
explain select count(id) from employees;
explain select count(name) from employees;
explain select count(*) from employees;

Note: among the four SQLs above, counting by a specific field does not count rows where that field is NULL.

The execution plans of the four SQLs are the same, which suggests their execution efficiency should be similar.

When the field has an index: count(*) ≈ count(1) > count(field) > count(primary key id). With an index on the field, count(field) is computed from the secondary index, which stores less data than the primary key index, so count(field) > count(primary key id).

When the field has no index: count(*) ≈ count(1) > count(primary key id) > count(field). Without an index, count(field) cannot use an index at all, while count(primary key id) can still use the primary key index, so count(primary key id) > count(field).

count(1) executes much like count(field), but count(1) does not need to fetch the field value, counting with the constant 1 instead, while count(field) must read the field; so in theory count(1) is slightly faster than count(field).

count(*) is the exception: MySQL does not fetch all the fields but applies a special optimization, accumulating by row without reading values, which is very efficient. So there is no need to replace count(*) with count(column) or count(constant).

Why does MySQL end up choosing a secondary index instead of the primary key clustered index for count(id)? Because a secondary index stores less data than the primary key index, retrieval should be faster; MySQL optimizes count this way (apparently since version 5.7).

Common optimization methods

1. Query the total number of rows maintained by mysql itself

For a table using the MyISAM storage engine, a count query without a where condition is extremely fast, because MyISAM stores the table's total row count on disk and the query needs no computation.

For a table using the InnoDB storage engine, MySQL does not store the total row count (because of the MVCC mechanism, discussed later), so the count must be computed in real time.

2. show table status

If only an estimate of the total number of rows is needed, the following query can be used, with high performance:

SHOW TABLE STATUS LIKE 'employees';


3. Maintain the total number in Redis

When inserting or deleting table rows, also maintain the table's total row count in Redis (using the incr or decr command). This approach can be inaccurate, though, because it is hard to guarantee transactional consistency between the table operation and the Redis operation.

4. Maintain a count table in the database

When inserting or deleting table rows, maintain the count table at the same time, performing both operations in the same transaction.
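A minimal sketch of such a counter table (the table and column names here are my own, not from the original article), kept in step with employees inside one transaction:

```sql
-- Hypothetical counter table: one row per counted table
CREATE TABLE row_counts (
  table_name VARCHAR(64) NOT NULL,
  cnt BIGINT NOT NULL DEFAULT 0,
  PRIMARY KEY (table_name)
) ENGINE=INNODB;

-- Seed the counter once with the current total
INSERT INTO row_counts(table_name, cnt)
SELECT 'employees', COUNT(*) FROM employees;

-- Insert a row and bump the counter atomically: either both changes
-- commit or neither does
START TRANSACTION;
INSERT INTO employees(name, age, position) VALUES ('zhangsan', 30, 'dev');
UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'employees';
COMMIT;
```

Because both statements live in one InnoDB transaction, this avoids the consistency gap of the Redis approach, at the cost of some write contention on the counter row.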

Origin blog.csdn.net/upset_poor/article/details/122972924