Join optimization, in and exists optimization, count(*) query optimization

Join query optimization

Table t1:

 CREATE TABLE `t1` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `a` int(11) DEFAULT NULL,
   `b` int(11) DEFAULT NULL,
   PRIMARY KEY (`id`),
   KEY `idx_a` (`a`)
 ) ENGINE=InnoDB AUTO_INCREMENT=10001 DEFAULT CHARSET=utf8;

Table t2:

 create table t2 like t1;

Insert 10,000 rows into table t1 and 100 rows into table t2.
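The text does not show how the rows were inserted. A minimal sketch of one way to do it, assuming the mysql command-line client (for the delimiter command) and assuming each row simply gets a = b = a running counter; the procedure name insert_t1 is hypothetical:

```sql
-- insert_t1 is a hypothetical helper procedure, not part of the original text
drop procedure if exists insert_t1;
delimiter ;;
create procedure insert_t1()
begin
  declare i int default 1;
  while i <= 10000 do
    -- assumption: a and b both hold the loop counter
    insert into t1 (a, b) values (i, i);
    set i = i + 1;
  end while;
end;;
delimiter ;
call insert_t1();

-- Copy the first 100 value pairs into t2
insert into t2 (a, b) select a, b from t1 where a <= 100;
```

The copy into t2 filters on a rather than id, because the table definition above starts AUTO_INCREMENT at 10001.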

MySQL commonly uses two algorithms for table joins:

Nested-Loop Join algorithm

Block Nested-Loop Join algorithm

1. Nested-Loop Join (NLJ) algorithm

Read rows from the first table (called the driving table) one at a time, take the join field from each row, and use it to fetch the matching rows from the other table (the driven table). The result is the combined rows of the two tables.

 EXPLAIN select * from t1 inner join t2 on t1.a = t2.a;


You can see this information from the execution plan:

  • The driving table is t2 and the driven table is t1. The driving table is executed first (when rows in the execution plan have the same id, they are executed from top to bottom); the optimizer generally prefers the smaller table as the driving table. So with inner join, the table written first is not necessarily the driving table.
  • The NLJ algorithm is used. For an ordinary join statement, if Using join buffer does not appear in the Extra column of the execution plan, the join algorithm used is NLJ.

The general process of the above sql is as follows:
1. Read a row of data from table t2;
2. Take the join field a from that row and look it up in table t1;
3. Take the matching rows from table t1, combine them with the row from t2, and return the result to the client;
4. Repeat the above 3 steps.
The whole process reads all the data in table t2 (100 rows scanned), then uses the value of a in each of those rows to look up the index on t1.a (100 index lookups in t1). Each lookup can be considered to end up reading only one complete row of t1, so 100 rows of t1 are scanned in total. The whole process therefore scans 200 rows.
If the join field of the driven table has no index, the NLJ algorithm performs poorly (explained in detail below), and mysql chooses the Block Nested-Loop Join algorithm instead.

2. Block Nested-Loop Join (BNL) algorithm

Read the data of the driving table into join_buffer, then scan the driven table and compare each of its rows with the data in join_buffer.

 EXPLAIN select * from t1 inner join t2 on t1.b = t2.b;

The Using join buffer (Block Nested Loop) in Extra indicates that the join query uses the BNL algorithm.
The general flow of the above SQL is as follows:
1. Put all the data of t2 into join_buffer
2. Take out each row in table t1 and compare it with the data in join_buffer
3. Return the data that meets the join conditions.
The whole process does a full table scan of both t1 and t2, so the total number of rows scanned is 10000 (all rows of t1) + 100 (all rows of t2) = 10100. The data in join_buffer is unordered, so each row of t1 needs 100 comparisons, and the number of in-memory comparisons is 100 * 10000 = 1 million.
The join field of the driven table is not indexed. Why choose the BNL algorithm instead of Nested-Loop Join?
If the second sql above used Nested-Loop Join, the number of rows scanned would be 100 * 10000 = 1 million, and these would be disk scans.
BNL clearly needs far fewer disk scans, and its in-memory comparisons are much faster than disk scans.
Therefore, MySQL generally uses the BNL algorithm for join queries where the join field of the driven table is not indexed. If there is an index, the NLJ algorithm is generally chosen, because with an index NLJ performs better than BNL.

Optimization of join SQL

  • Index the join fields, so that mysql can use the NLJ algorithm for the join wherever possible
  • Small tables drive large tables. When writing a multi-table join, if you know for certain which table is the small one, you can use straight_join to fix the driving table, saving the mysql optimizer the time of working it out itself.

Straight_join explanation: straight_join works like join, except that the table on the left always drives the table on the right, so it overrides the join order the optimizer would otherwise choose.
For example: select * from t2 straight_join t1 on t2.a = t1.a; means that mysql uses the t2 table as the driving table.

  • Straight_join only applies to inner join, not to left join or right join (because left join and right join already specify the execution order of the tables).
  • Let the optimizer decide wherever possible, because in most cases the mysql optimizer is smarter than people. Use straight_join with caution, because in some cases a manually specified execution order is not as good as the one the optimizer would choose.
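As a sketch of the first recommendation (index the join fields): the BNL example query joins on column b, which has no index. Assuming the t1 and t2 tables defined earlier, one could add an index on b in the large table and check the plan again. The index name idx_b is an assumption introduced for illustration, and the plan actually chosen depends on the MySQL version and table statistics.

```sql
-- idx_b is a hypothetical index name, not part of the original schema
ALTER TABLE t1 ADD INDEX idx_b (b);

-- Re-check the plan: with the join field of the large table indexed,
-- Extra would be expected to no longer show
-- "Using join buffer (Block Nested Loop)"
EXPLAIN select * from t1 inner join t2 on t1.b = t2.b;

-- For comparison, force the small table t2 to be the driving table
EXPLAIN select * from t2 straight_join t1 on t2.b = t1.b;
```

If both plans come out the same, the optimizer had already picked t2 as the driving table and straight_join adds nothing here.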

in and exists optimization

Principle: small tables drive large tables, that is, small data sets drive large data sets.
in: when the data set of table B is smaller than the data set of table A, in is better than exists.

 select * from A where id in (select id from B)

 #Equivalent to:

 for(select id from B){
   select * from A where A.id = B.id
 }

exists: when the data set of table A is smaller than the data set of table B, exists is better than in.
Each row of the main query A is put into subquery B for condition checking, and the check result (true or false) decides whether that row of the main query is kept.

 select * from A where exists (select 1 from B where B.id = A.id)

 #Equivalent to:

 for(select * from A){
   select * from B where B.id = A.id
 }

 #The id field of table A and table B should be indexed

1. EXISTS (subquery) only returns TRUE or FALSE, so SELECT * in the subquery can be replaced with SELECT 1. The official statement is that the SELECT list is ignored during actual execution, so there is no difference between the two.
2. The actual execution of an EXISTS subquery may be optimized, rather than the row-by-row comparison we picture.
3. An EXISTS subquery can often be replaced by a JOIN; which is best needs to be analyzed case by case.
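To make the three forms concrete, here is a sketch using the t1 (10,000 rows) and t2 (100 rows) tables from the join section in place of A and B. It assumes the join field a is the matching column; which form runs fastest depends on the MySQL version and data, so compare them with EXPLAIN before choosing.

```sql
-- t2 is the small set, so by the rule of thumb above IN is the natural form
select * from t1 where t1.a in (select a from t2);

-- The same rows written with EXISTS
select * from t1 where exists (select 1 from t2 where t2.a = t1.a);

-- JOIN rewrite; distinct guards against duplicate values of a in t2,
-- which would otherwise repeat rows of t1
select distinct t1.* from t1 inner join t2 on t1.a = t2.a;
```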







count(*) query optimization

# Temporarily disable the mysql query cache so that repeated executions of a sql statement show its real execution time (applies to MySQL 5.7 and earlier; the query cache was removed in MySQL 8.0)

set global query_cache_size=0;
set global query_cache_type=0;
EXPLAIN select count(1) from employees;
EXPLAIN select count(id) from employees;
EXPLAIN select count(name) from employees;
EXPLAIN select count(*) from employees;


The execution plans of the four statements are the same, so their execution efficiency should be about the same. The difference is that count(field) does not count rows in which that field is NULL.
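The NULL behaviour is easy to check on a tiny table. The sketch below uses a hypothetical table count_demo that is not part of the original text:

```sql
-- count_demo is a hypothetical table used only for this illustration
create table count_demo (
  id int not null primary key,
  name varchar(20) default null
);
insert into count_demo (id, name) values (1, 'a'), (2, null), (3, 'c');

-- count(*), count(1) and count(id) each return 3;
-- count(name) returns 2 because the row with a NULL name is skipped
select count(*), count(1), count(id), count(name) from count_demo;
```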

Why did MySQL end up choosing a secondary index instead of the primary key clustered index? Because a secondary index stores less data than the primary key index, so scanning it should be faster.

Common optimization methods

1. Query the total number of rows maintained by mysql itself

For a myisam table, a count query without a where condition is very fast, because mysql stores the total row count of a myisam table on disk and the query does not need to compute it.

For an innodb table, mysql does not store the total row count, so a count query has to compute it in real time.
2. Estimate the total number of rows

If you only need an estimate of the total number of rows in the table, you can use the following SQL query, which has high performance
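The query itself is missing from this copy of the article. Two statements that fit the description, shown here as an assumption rather than as the author's original, read the row estimate that mysql keeps in its table metadata:

```sql
-- The Rows column holds the row count; for InnoDB it is only an approximation
show table status like 'employees';

-- The same estimate is exposed through information_schema
select table_rows
from information_schema.tables
where table_schema = database() and table_name = 'employees';
```

Neither statement scans the table, which is why they return quickly even on large tables.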


3. Maintain the total number in Redis

When inserting or deleting table rows, also maintain the table's total row count in redis (using the incr or decr command). This method may be inaccurate, because it is hard to guarantee transactional consistency between the table operation and the redis operation.

4. Add a count table

When inserting or deleting table rows, maintain the count table at the same time, so that both operations happen in the same transaction.
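A minimal sketch of the count table approach, using the t1 table from the join section as the table being counted. The counter table t1_row_count and its columns are hypothetical names introduced for illustration:

```sql
-- t1_row_count is a hypothetical counter table, not part of the original text
create table t1_row_count (
  id int not null primary key,
  total bigint not null
) engine=InnoDB;

-- Initialise the counter once from the real table
insert into t1_row_count (id, total) select 1, count(*) from t1;

-- Insert a row and bump the counter in the same transaction,
-- so both changes commit or roll back together
start transaction;
insert into t1 (a, b) values (10001, 10001);
update t1_row_count set total = total + 1 where id = 1;
commit;

-- Reading the total is now a single-row primary key lookup
select total from t1_row_count where id = 1;
```

One trade-off to keep in mind: every insert or delete now updates the same counter row, so concurrent writers queue on that row's lock.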


Origin blog.csdn.net/nmjhehe/article/details/113825736