MySQL Advanced - Join Query and Subquery Optimization

Table of contents

1. Join query optimization

1.0 Optimization scheme

1.1 Data preparation

1.2 Left outer join: index the right table first, and keep the join column types consistent

1.3 Inner join: the driving table is chosen by data volume and indexes

1.4 How JOIN statements work

2. Subquery optimization: split the query or rewrite it as a join


 

1. Join query optimization

1.0 Optimization scheme

  • Outer join, small table drives large table: for a LEFT JOIN, choose the small table as the driving table and the large table as the driven table. This reduces the number of outer-loop iterations.
  • Inner join, the optimizer picks the driving table: for an INNER JOIN, MySQL automatically selects the table with the smaller result set as the driving table. Trust the MySQL optimizer here.
  • Index the driven table first: the JOIN column of the driven table should have an index.
  • Keep the join column types consistent: the data types of the JOIN columns in the two tables must match exactly, so that implicit type conversion does not prevent index use.
  • Joins instead of subqueries: where possible, join the tables directly rather than using a subquery (this reduces the number of queries). A subquery is a SELECT whose result is used as a condition of another SELECT statement.
  • Multiple queries instead of subqueries: subqueries are not recommended. Split the subquery SQL into several queries combined in application code, or replace the subquery with a JOIN.
  • Indexes cannot be created on derived tables.

1.1 Data preparation

# Categories
CREATE TABLE IF NOT EXISTS `type` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`card` INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
);
# Books
CREATE TABLE IF NOT EXISTS `book` (
`bookid` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`card` INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (`bookid`)
);

# Insert 20 rows into the type table
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));

# Insert 20 rows into the book table
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));

1.2 Left outer join: index the right table first, and keep the join column types consistent

Index the right table first: the left table is read in full, while the right table is searched by the join condition, so an index on the right table's condition column is the more valuable one.

Keep the join column types consistent: the card columns of the two tables must have the same type. If the types differ, an implicit type conversion takes place and the index cannot be used.

Verify:

Analyze the left outer join with EXPLAIN:

EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

Conclusion: the type column of the EXPLAIN output shows ALL, meaning a full table scan.

Optimize by creating an index on the right table:

# ALTER TABLE book ADD INDEX Y (card); # [driven table] the index avoids a full table scan
CREATE INDEX Y ON book(card);
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

The type value on the second row of the EXPLAIN output changes to ref, and the estimated rows value drops noticeably. This follows from how a left join works: the LEFT JOIN condition determines how rows are looked up in the right table, while every row of the left table is returned regardless. The right table is therefore the one that matters, and its join column should be indexed.

Now add an index on the left table:

ALTER TABLE `type` ADD INDEX X (card); # [driving table] a full table scan cannot be avoided
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

Then drop the index on book and check again:

DROP INDEX Y ON book;
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

1.3 Inner join: the driving table is chosen by data volume and indexes

An inner join returns the intersection, so the result is the same whichever table drives. The query optimizer therefore picks the driving table by query cost. The driving table is the primary (outer) table, and the driven table is the secondary (inner) table.

The driving table is chosen as follows:

  • Table without an index: when only one of the two tables has an index on the join column, the optimizer chooses the table without the index as the driving table.
  • Small table: when both tables have an index, or neither does, the table with less data becomes the driving table.

Verify:

drop index X on type;
drop index Y on book; # skip this statement if the index has already been dropped

Inner join without indexes (MySQL picks the driving table automatically):

EXPLAIN SELECT SQL_NO_CACHE * FROM type INNER JOIN book ON type.card=book.card;

Add an index to the book table; the EXPLAIN output shows book as the driven table:

ALTER TABLE book ADD INDEX Y (card);
EXPLAIN SELECT SQL_NO_CACHE * FROM type INNER JOIN book ON type.card=book.card;

 

Add an index to the type table as well; the EXPLAIN output now shows type as the driven table:

ALTER TABLE type ADD INDEX X (card);
EXPLAIN SELECT SQL_NO_CACHE * FROM type INNER JOIN book ON type.card=book.card;

For an inner join, the query optimizer decides which table drives and which is driven.

Drop the index on the type table and check again:

DROP INDEX X ON `type`;
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` INNER JOIN book ON type.card=book.card;

Then add the index back:

ALTER TABLE `type` ADD INDEX X (card);
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` INNER JOIN book ON type.card=book.card;

Then add 20 more rows to the book table:

# Insert 20 more rows into the book table
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));

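# Index Y on book still exists from the earlier step; run the next line only if it has been dropped, otherwise MySQL reports a duplicate key name
ALTER TABLE book ADD INDEX Y (card);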
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` INNER JOIN book ON `type`.card = book.card;

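With both join columns indexed, the optimizer compares the amount of data: the table with more data becomes the driven table and the smaller one drives. Note that the statements above leave book with 40 rows and type with 20, so check your own EXPLAIN output to see which table is listed first (the driving table) and which second (the driven table).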

 

1.4 How JOIN statements work

A join connects multiple tables; in essence it is a looped matching of rows between the tables. Before MySQL 5.5, MySQL supported only one way of joining tables, the nested-loop join, so joins over large tables could take a very long time. From MySQL 5.5 on, MySQL optimizes nested execution by introducing the BNLJ (Block Nested-Loop Join) algorithm.

1. Driving table and driven table

The driving table is the primary (outer) table; the driven table is the secondary (inner) table, also called the non-driving table.

  • For inner joins:
SELECT * FROM A JOIN B ON ...

Is A necessarily the driving table? Not necessarily. The optimizer decides which table to read first based on the query. The table read first is the driving table, and the other one is the driven table. You can check the choice with the EXPLAIN keyword.

  • For outer joins:
SELECT * FROM A LEFT JOIN B ON ...
# or
SELECT * FROM B RIGHT JOIN A ON ...

A is usually assumed to be the driving table and B the driven table, but that is not guaranteed. Test it as follows:

CREATE TABLE a(f1 INT, f2 INT, INDEX(f1)) ENGINE=INNODB;
CREATE TABLE b(f1 INT, f2 INT) ENGINE=INNODB;

INSERT INTO a VALUES(1,1),(2,2),(3,3),(4,4),(5,5),(6,6);
INSERT INTO b VALUES(3,3),(4,4),(5,5),(6,6),(7,7),(8,8);

SELECT * FROM b;

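# Test 1: the WHERE condition on b rejects NULL-extended rows, so the optimizer can treat the LEFT JOIN as an inner join and may pick b as the driving table
EXPLAIN SELECT * FROM a LEFT JOIN b ON(a.f1=b.f1) WHERE (a.f2=b.f2);

# Test 2: both conditions sit in the ON clause, so the outer join is kept and a is read first
EXPLAIN SELECT * FROM a LEFT JOIN b ON(a.f1=b.f1) AND (a.f2=b.f2);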
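Why the two tests can end up with different driving tables is easier to see from their result sets. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, since only the returned rows are compared, not the execution plans) and loads the same rows as tables a and b above. The helper name run is hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here; only result sets are compared, not plans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a(f1 INT, f2 INT);
    CREATE TABLE b(f1 INT, f2 INT);
    INSERT INTO a VALUES (1,1),(2,2),(3,3),(4,4),(5,5),(6,6);
    INSERT INTO b VALUES (3,3),(4,4),(5,5),(6,6),(7,7),(8,8);
""")

def run(sql):
    """Run a query and return its rows sorted by a.f1 (never NULL here)."""
    return sorted(conn.execute(sql).fetchall(), key=lambda row: row[0])

# Test 1: condition on b in WHERE discards the NULL-extended rows.
test1 = run("SELECT * FROM a LEFT JOIN b ON a.f1 = b.f1 WHERE a.f2 = b.f2")
# The same query written as an inner join.
inner = run("SELECT * FROM a JOIN b ON a.f1 = b.f1 AND a.f2 = b.f2")
# Test 2: both conditions in ON keep every row of a.
test2 = run("SELECT * FROM a LEFT JOIN b ON a.f1 = b.f1 AND a.f2 = b.f2")

print(test1 == inner)   # Test 1 behaves like an inner join
print(len(test2))       # Test 2 keeps all rows of a
```

Because Test 1 returns exactly the inner-join rows, an optimizer is free to read either table first; Test 2 must preserve the unmatched rows of a, so a has to be read first.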

2. Simple Nested-Loop Join (SNLJ)

The algorithm is simple: take one row from table A, scan table B, and put the matching rows into the result. Repeat for every row, so that each record in driving table A is compared against every record in driven table B.

This approach is very inefficient. With 100 rows in table A and 1,000 rows in table B, it performs A * B = 100,000 comparisons. The costs are as follows:

Cost item                        SNLJ
Outer table scans                1
Inner table scans                A
Records read                     A + B * A
JOIN comparisons                 B * A
Table lookups (back to table)    0
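The comparison count in the table can be reproduced with a small Python sketch. It is an illustration of the idea, not MySQL's implementation; the function name simple_nested_loop_join, the sample rows, and the use of a card column in both lists are assumptions made for the example.

```python
def simple_nested_loop_join(outer, inner, key):
    """Compare every outer row with every inner row (no index, no buffer)."""
    result = []
    comparisons = 0
    for o in outer:          # single pass over the driving table
        for i in inner:      # full scan of the driven table per outer row
            comparisons += 1
            if o[key] == i[key]:
                result.append({**o, **i})
    return result, comparisons

# Hypothetical data: 100 rows in A, 1,000 rows in B, 20 distinct card values.
table_a = [{"a_id": n, "card": n % 20} for n in range(100)]
table_b = [{"b_id": n, "card": n % 20} for n in range(1000)]

rows, comparisons = simple_nested_loop_join(table_a, table_b, "card")
print(comparisons)  # A * B comparisons
print(len(rows))    # number of joined rows
```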

Of course, MySQL does not join tables this crudely, which is why the following two optimized Nested-Loop Join algorithms exist.

3. Index Nested-Loop Join (INLJ)

Index Nested-Loop Join is mainly about reducing the number of matches against inner-table rows, so it requires an index on the driven table. The outer table's matching condition is looked up directly in the inner table's index, which avoids comparing against every record of the inner table and greatly reduces the number of matches.

Each record of the driving table is used to access the driven table through its index. Because the cost of one index lookup is roughly fixed, the MySQL optimizer tends to use the table with fewer records as the driving (outer) table.

With an index on the driven table the join is very efficient. If that index is not the primary key index, however, each match needs an extra lookup back to the table. A primary key index on the driven table is therefore more efficient.
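The effect of the index can be sketched in Python as follows. This is a simplified model under stated assumptions: a dict keyed by the join value stands in for the driven table's index (MySQL uses B+tree indexes, not hash maps), and the names build_index and index_nested_loop_join are hypothetical.

```python
from collections import defaultdict

def build_index(rows, key):
    """Stand-in for an index on the driven table: join value -> matching rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index

def index_nested_loop_join(outer, inner_index, key):
    """One index lookup per driving row instead of a full inner-table scan."""
    result = []
    lookups = 0
    for o in outer:
        lookups += 1
        for i in inner_index.get(o[key], []):
            result.append({**o, **i})
    return result, lookups

# Hypothetical data: 100 driving rows, 1,000 driven rows.
table_a = [{"a_id": n, "card": n % 20} for n in range(100)]
table_b = [{"b_id": n, "card": n % 20} for n in range(1000)]

rows, lookups = index_nested_loop_join(table_a, build_index(table_b, "card"), "card")
print(lookups)    # equals the number of driving rows
print(len(rows))
```

The number of lookups depends only on the size of the driving table, which is why a small driving table is preferred.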

4. Block Nested-Loop Join (BNLJ)

Note:

The join buffer caches not only the join columns but also the columns listed after SELECT.

A SQL statement with N joins allocates N-1 join buffers. So select as few unnecessary columns as possible, so that the join buffer can hold more rows.
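The block idea can be sketched as follows: rows of the driving table are read into the join buffer one block at a time, and the driven table is scanned once per block rather than once per row. The sketch is an illustration under assumptions, not MySQL's code; block_nested_loop_join and buffer_rows (the number of driving rows that fit in the buffer) are hypothetical names.

```python
def block_nested_loop_join(outer, inner, key, buffer_rows):
    """Scan the driven table once per buffered block of driving rows."""
    result = []
    inner_scans = 0
    comparisons = 0
    for start in range(0, len(outer), buffer_rows):
        join_buffer = outer[start:start + buffer_rows]  # fill the join buffer
        inner_scans += 1
        for i in inner:              # one scan of the driven table per block
            for o in join_buffer:    # in-memory comparisons against the buffer
                comparisons += 1
                if o[key] == i[key]:
                    result.append({**o, **i})
    return result, inner_scans, comparisons

# Hypothetical data: 100 driving rows, 1,000 driven rows, buffer holds 40 rows.
table_a = [{"a_id": n, "card": n % 20} for n in range(100)]
table_b = [{"b_id": n, "card": n % 20} for n in range(1000)]

rows, inner_scans, comparisons = block_nested_loop_join(table_a, table_b, "card", 40)
print(inner_scans)   # number of driven-table scans
print(comparisons)   # comparisons are unchanged, but happen in memory
```

The comparison count stays at A * B; what shrinks is the number of scans of the driven table, and a larger buffer (or narrower rows) shrinks it further.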

Parameter settings:

  • block_nested_loop

Check the block_nested_loop flag with show variables like '%optimizer_switch%'. It is on by default.

  • join_buffer_size

Whether the driving table's rows can be loaded in one pass depends on whether the join buffer can hold all of them. The default is join_buffer_size=256k.

mysql> show variables like '%join_buffer%';

The maximum join_buffer_size is 4GB-1 on a 32-bit operating system. On a 64-bit operating system a larger join buffer can be requested, except on 64-bit Windows, where a larger value is truncated to 4GB-1 and a warning is issued.

5. Join summary

1. Overall efficiency: INLJ > BNLJ > SNLJ

2. Always drive a large result set with a small result set (in essence, this reduces the amount of data in the outer loop). "Small" is measured as the number of rows * the size of each row.

select t1.b,t2.* from t1 straight_join t2 on (t1.b=t2.b) where t2.id<=100; # recommended
select t1.b,t2.* from t2 straight_join t1 on (t1.b=t2.b) where t2.id<=100; # not recommended

3. Add an index on the driven table's matching condition (reduces the number of loop matches against the inner table)

4. Increase join_buffer_size (the more rows cached at once, the fewer scans of the inner table)

5. Select fewer unnecessary columns from the driving table (the fewer columns, the more rows the join buffer can cache)

6. Hash Join

Hash join was added in MySQL 8.0.18. Starting from MySQL 8.0.20, BNLJ is retired and hash join is used by default in its place.

  • Nested Loop:

    Nested Loop is the better choice when the subset of data being joined is small.

  • Hash Join is the usual method for joining large data sets. The optimizer takes the smaller of the two tables and builds an in-memory hash table on the join key, then scans the larger table and probes the hash table to find the matching rows.

    • This works best when the smaller table fits entirely in memory; the total cost is then the sum of the costs of accessing the two tables.
    • If the table is too large to fit in memory, the optimizer splits it into several partitions, and the partitions that do not fit are written to temporary segments on disk. Larger temporary segments help maximize I/O performance in that case.
    • It works very well for large tables without indexes and for parallel queries, where it gives the best performance. It is often described as the heavy lifter among joins. Hash Join applies only to equi-joins (such as WHERE A.COL1 = B.COL2), which follows from how hashing works.
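The build and probe phases described above can be sketched in a few lines of Python. This is a minimal in-memory illustration that assumes the smaller input fits in memory; it does not model partitioning or spilling to disk, and the name hash_join and the sample rows are hypothetical.

```python
from collections import defaultdict

def hash_join(small, large, key):
    """Equi-join: build a hash table on the smaller input, probe with the larger."""
    # Build phase: hash the smaller table on the join key.
    hash_table = defaultdict(list)
    for s in small:
        hash_table[s[key]].append(s)
    # Probe phase: scan the larger table once and look up each join key.
    result = []
    for row in large:
        for s in hash_table.get(row[key], ()):
            result.append({**s, **row})
    return result

# Hypothetical data: 100 rows on the build side, 1,000 rows on the probe side.
table_a = [{"a_id": n, "card": n % 20} for n in range(100)]
table_b = [{"b_id": n, "card": n % 20} for n in range(1000)]

rows = hash_join(table_a, table_b, "card")
print(len(rows))
```

Each table is read exactly once, which matches the statement that the total cost is the sum of the two access costs. The dictionary lookup only works for equality, which is why the method is limited to equi-joins.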

2. Subquery optimization: split the query or rewrite it as a join

MySQL has supported subqueries since version 4.1. A subquery nests SELECT statements, so that the result of one SELECT is used as a condition of another SELECT. Subqueries can complete in one statement many SQL operations that logically take several steps.

Subqueries are an important MySQL feature that lets a single SQL statement express a more complex query. However, subqueries often do not execute efficiently. The reasons:

① When executing a subquery, MySQL needs to create a temporary table for the result of the inner query, and the outer query then reads records from that temporary table. After the query completes, the temporary table is dropped. This consumes extra CPU and I/O resources and can produce many slow queries.

② The temporary table that holds the subquery's result set, whether in memory or on disk, has no index, so query performance suffers to some degree.

③ The larger the result set returned by the subquery, the greater its impact on query performance.

In MySQL, a JOIN can be used in place of a subquery. A join query does not need to create a temporary table, so it is usually faster than the subquery, and faster still when the query can use an index.

Example 1: Query the student table for the students who are class monitors

  • Using a subquery:
# Create an index on the monitor column of the class table
CREATE INDEX idx_monitor ON class(monitor);

EXPLAIN SELECT * FROM student stu1
WHERE stu1.`stuno` IN (
SELECT monitor
FROM class c
WHERE monitor IS NOT NULL
);
  • Recommended: use a multi-table join
EXPLAIN SELECT stu1.* FROM student stu1 JOIN class c
ON stu1.`stuno` = c.`monitor`
WHERE c.`monitor` IS NOT NULL;
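When rewriting an IN subquery as a JOIN, it is worth checking that the two forms return the same rows. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption) with a simplified, hypothetical version of the student and class tables. It shows one case where the forms differ: a student who is the monitor of two classes appears twice in the plain JOIN, and DISTINCT restores the IN result.

```python
import sqlite3

# Simplified, hypothetical schema; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student(stuno INT PRIMARY KEY, name TEXT);
    CREATE TABLE class(id INT PRIMARY KEY, monitor INT);
    INSERT INTO student VALUES (1,'s1'),(2,'s2'),(3,'s3'),(4,'s4'),(5,'s5');
    -- student 2 is the monitor of two classes; class 4 has no monitor
    INSERT INTO class VALUES (1,2),(2,4),(3,2),(4,NULL);
""")

in_rows = conn.execute("""
    SELECT stuno FROM student
    WHERE stuno IN (SELECT monitor FROM class WHERE monitor IS NOT NULL)
    ORDER BY stuno
""").fetchall()

join_rows = conn.execute("""
    SELECT s.stuno FROM student s JOIN class c ON s.stuno = c.monitor
    ORDER BY s.stuno
""").fetchall()

distinct_rows = conn.execute("""
    SELECT DISTINCT s.stuno FROM student s JOIN class c ON s.stuno = c.monitor
    ORDER BY s.stuno
""").fetchall()

print(in_rows)
print(join_rows)
print(distinct_rows)
```

If monitor is unique per class in the real data, the plain JOIN and the IN form return the same rows and no DISTINCT is needed.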

Example 2: Get all students who are not monitors

  • Not recommended
EXPLAIN SELECT SQL_NO_CACHE a.*
FROM student a
WHERE a.stuno NOT IN (
    SELECT monitor FROM class b
    WHERE monitor IS NOT NULL
);

  • Recommended:
EXPLAIN SELECT SQL_NO_CACHE a.*
FROM student a LEFT OUTER JOIN class b
ON a.stuno = b.monitor
WHERE b.monitor IS NULL;

Conclusion: avoid NOT IN and NOT EXISTS where possible; use LEFT JOIN xxx ON xx WHERE xx IS NULL instead.
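The equivalence of the two forms in Example 2 can be checked on a small data set. The sketch below again uses Python's built-in sqlite3 module as a stand-in for MySQL, with a simplified, hypothetical schema. It also shows why the monitor IS NOT NULL filter in the NOT IN version matters: without it, a single NULL in the subquery makes NOT IN return no rows.

```python
import sqlite3

# Simplified, hypothetical schema; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student(stuno INT PRIMARY KEY, name TEXT);
    CREATE TABLE class(id INT PRIMARY KEY, monitor INT);
    INSERT INTO student VALUES (1,'s1'),(2,'s2'),(3,'s3'),(4,'s4'),(5,'s5');
    INSERT INTO class VALUES (1,2),(2,4),(3,NULL);
""")

not_in_rows = conn.execute("""
    SELECT a.stuno FROM student a
    WHERE a.stuno NOT IN (SELECT monitor FROM class b WHERE monitor IS NOT NULL)
    ORDER BY a.stuno
""").fetchall()

left_join_rows = conn.execute("""
    SELECT a.stuno FROM student a LEFT OUTER JOIN class b ON a.stuno = b.monitor
    WHERE b.monitor IS NULL
    ORDER BY a.stuno
""").fetchall()

# Same NOT IN query without the NULL filter in the subquery.
not_in_with_null = conn.execute("""
    SELECT a.stuno FROM student a
    WHERE a.stuno NOT IN (SELECT monitor FROM class b)
    ORDER BY a.stuno
""").fetchall()

print(not_in_rows)
print(left_join_rows)
print(not_in_with_null)
```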

Origin blog.csdn.net/qq_40991313/article/details/130787410