When the offset in LIMIT is large, use a subquery to optimize the SQL and reduce back-to-table operations

When the offset of LIMIT is relatively large, for example, fetching 5 rows starting at offset 30000:

select * from t where a = '11' limit 30000,5

There is a financial flow table, not split across databases or tables, with a current data volume of 9,555,695 rows. LIMIT is used for paged queries. Before optimization, the query took 16 s 938 ms (execution: 16 s 831 ms, fetching: 107 ms); after adjusting the SQL in the way shown below, it took 347 ms (execution: 163 ms, fetching: 184 ms).

Operation: put the filter conditions in a subquery that selects only the primary key id, then use the primary keys located by the subquery to fetch the remaining columns.
Principle: reduce back-to-table operations.

I had never worked out the underlying principle before, so let me add it here. In the subquery's select id, the where condition hits the secondary index, and the value stored alongside each secondary-index key is exactly the primary key (here id is assumed to be the primary key). So there is no need to go back to the table, that is, no second lookup by id; in addition, some of this data will already be kept in MySQL's underlying caches.
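As a quick way to see this covering-index behavior for yourself (an aside; `t` and `a` are taken from the opening example), MySQL's EXPLAIN reports "Using index" in the Extra column when the secondary index alone can answer the query:

```sql
-- When only the primary key is selected and the filter hits the index on a,
-- EXPLAIN shows "Using index" in Extra: the secondary index covers the query,
-- so the subquery performs no back-to-table lookups at all.
EXPLAIN SELECT id FROM t WHERE a = '11' LIMIT 30000, 5;
```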

MySQL's InnoDB tables generally use a B+Tree clustered index structure. Given the primary key (usually id), the required row data can be found directly on a leaf node, with no further back-to-table lookup, which is very fast.

Combining the two actually avoids the step of going back to the table with the primary key value stored next to the secondary-index key.

-- SQL before optimization

SELECT <various columns>
FROM `table_name`
WHERE <various conditions>
LIMIT 0,10;

-- SQL after optimization

SELECT <various columns>
FROM `table_name` main_table
RIGHT JOIN
(
    SELECT <primary key only>
    FROM `table_name`
    WHERE <various conditions>
    LIMIT 0,10
) temp_table ON temp_table.<primary key> = main_table.<primary key>;
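To make the template concrete, here is a hypothetical instance (the table `trade_flow` and columns `user_id`, `amount`, `create_time` are illustrative names, not from the measured flow table; an INNER JOIN is used, which is equivalent to the RIGHT JOIN above because every id produced by the subquery has a matching row):

```sql
-- Deferred join: the subquery walks only the secondary index on user_id,
-- then exactly 5 primary-key lookups fetch the full rows.
SELECT f.id, f.user_id, f.amount, f.create_time
FROM `trade_flow` f
INNER JOIN (
    SELECT id
    FROM `trade_flow`
    WHERE user_id = 11
    LIMIT 30000, 5
) tmp ON tmp.id = f.id;
```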

First, the MySQL version used here:

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.7.17    |
+-----------+
1 row in set (0.00 sec)

Table Structure:

mysql> desc test;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| val    | int(10) unsigned    | NO   | MUL | 0       |                |
| source | int(10) unsigned    | NO   |     | 0       |                |
+--------+---------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

id is an auto-incrementing primary key, and val is a non-unique index.

Inject a large amount of data, roughly 5 million rows in total:

mysql> select count(*) from test;
+----------+
| count(*) |
+----------+
|  5242882 |
+----------+
1 row in set (4.25 sec)

We know that when the offset in limit offset, rows is very large, there are efficiency problems:

mysql> select * from test where val=4 limit 300000,5;
+---------+-----+--------+
| id      | val | source |
+---------+-----+--------+
| 3327622 |   4 |      4 |
| 3327632 |   4 |      4 |
| 3327642 |   4 |      4 |
| 3327652 |   4 |      4 |
| 3327662 |   4 |      4 |
+---------+-----+--------+
5 rows in set (15.98 sec)

To achieve the same result, we would generally rewrite the statement as follows:

mysql> select * from test a inner join (select id from test where val=4 limit 300000,5) b on a.id=b.id;
+---------+-----+--------+---------+
| id      | val | source | id      |
+---------+-----+--------+---------+
| 3327622 |   4 |      4 | 3327622 |
| 3327632 |   4 |      4 | 3327632 |
| 3327642 |   4 |      4 | 3327642 |
| 3327652 |   4 |      4 | 3327652 |
| 3327662 |   4 |      4 | 3327662 |
+---------+-----+--------+---------+
5 rows in set (0.38 sec)

The time difference is obvious.

Why does the above result appear? Let's look at the query process of select * from test where val=4 limit 300000,5;:

Scan the secondary-index leaf nodes; for each entry, use the primary key value stored on the leaf node to fetch all the required column values from the clustered index.

(Figure omitted: each secondary-index entry read triggers a separate lookup into the clustered index.)
As above, MySQL needs to read 300,005 secondary-index entries and 300,005 clustered-index rows, filter out the first 300,000 results, and return the last 5. A large amount of random I/O is spent querying the clustered index, and 300,000 of those randomly read rows never appear in the result set.

Someone will surely ask: since the index is used from the start, why not first walk along the index leaf nodes to the last 5 entries actually needed, and only then look up the real rows in the clustered index? That would require only 5 random I/Os.

(Figure omitted: the query walks the secondary index to the target entries first, then performs only 5 clustered-index lookups.)

Confirming that select * from test where val=4 limit 300000,5 scans 300,005 index entries and 300,005 clustered-index rows

To verify this, we would need a way for MySQL to count, for a single SQL statement, how many data nodes were reached through index nodes. I tried the Handler_read_* status variables first, but unfortunately none of them fits.

So I can only confirm it indirectly:

InnoDB has a buffer pool that holds recently accessed pages, both data pages and index pages. So we can run the two SQL statements and compare how many of the table's data pages end up in the buffer pool. The prediction: after running select * from test a inner join (select id from test where val=4 limit 300000,5) b on a.id=b.id;, the number of data pages in the buffer pool will be far smaller than after select * from test where val=4 limit 300000,5;, because the former accesses data pages only 5 times while the latter accesses them 300,005 times.

select * from test where val=4 limit 300000,5

mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in ('val','primary') and TABLE_NAME like '%test%' group by index_name;
Empty set (0.04 sec)

It can be seen that there is currently no data page about the test table in the buffer pool.

mysql> select * from test where val=4 limit 300000,5;
+---------+-----+--------+
| id      | val | source |
+---------+-----+--------+
| 3327622 |   4 |      4 |
| 3327632 |   4 |      4 |
| 3327642 |   4 |      4 |
| 3327652 |   4 |      4 |
| 3327662 |   4 |      4 |
+---------+-----+--------+
5 rows in set (26.19 sec)
mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in ('val','primary') and TABLE_NAME like '%test%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| PRIMARY    |     4098 |
| val        |      208 |
+------------+----------+
2 rows in set (0.04 sec)

It can be seen that at this point there are 4,098 data pages and 208 index pages for the test table in the buffer pool.

Now for select * from test a inner join (select id from test where val=4 limit 300000,5) b on a.id=b.id;. To avoid interference from the previous test, we need to clear the buffer pool by restarting MySQL:

mysqladmin shutdown
/usr/local/bin/mysqld_safe &
mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in ('val','primary') and TABLE_NAME like '%test%' group by index_name;
Empty set (0.03 sec)

Then run the SQL:

mysql> select * from test a inner join (select id from test where val=4 limit 300000,5) b on a.id=b.id;
+---------+-----+--------+---------+
| id      | val | source | id      |
+---------+-----+--------+---------+
| 3327622 |   4 |      4 | 3327622 |
| 3327632 |   4 |      4 | 3327632 |
| 3327642 |   4 |      4 | 3327642 |
| 3327652 |   4 |      4 | 3327652 |
| 3327662 |   4 |      4 | 3327662 |
+---------+-----+--------+---------+
5 rows in set (0.09 sec)
mysql> select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in ('val','primary') and TABLE_NAME like '%test%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| PRIMARY    |        5 |
| val        |      390 |
+------------+----------+
2 rows in set (0.03 sec)

We can see the obvious difference between the two: the first SQL loaded 4,098 data pages into the buffer pool, while the second loaded only 5, in line with our prediction. It also shows why the first SQL is slow: it reads a large number of useless rows (300,000) and then discards them. This causes a further problem: loading many pages that are not hot at all pollutes the buffer pool and occupies its space.

Problems encountered

To ensure the buffer pool is empty after each restart, we need to turn off innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. These two options control, respectively, dumping the buffer pool contents to disk when the database shuts down and loading that dump back from disk when the database starts.
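As a sketch, this can go in my.cnf before restarting (note that innodb_buffer_pool_load_at_startup is read-only at runtime, so it cannot be changed with SET GLOBAL; both settings shown here assume MySQL 5.7, where they default to ON):

```ini
# my.cnf fragment: start each run with a cold buffer pool
[mysqld]
innodb_buffer_pool_dump_at_shutdown = OFF
innodb_buffer_pool_load_at_startup  = OFF
```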

Origin: blog.csdn.net/nmjhehe/article/details/113784048