MySQL advanced articles - sorting, grouping, paging optimization

Table of contents

5. Sorting optimization

5.1 Sorting Optimization Suggestions

5.2 Testing

5.2.1 Case verification

5.2.2 Exercises

5.3 Index field selection in range query

5.4 filesort algorithm

5.4.1 Two-way sorting and one-way sorting

5.4.2 Tuning filesort

6. Group optimization

7. Paging query optimization

7.1 Deep paging query optimization

7.2 Deep paging optimization with sorting


5. Sorting optimization

5.1 Sorting Optimization Suggestions

Question: We add indexes on the fields in the WHERE condition, but why add an index on the ORDER BY fields as well?

MySQL supports two sorting methods: FileSort and Index sorting.

  • Index sorting: the index already keeps the data in order, so no extra sort is needed and the query is more efficient. This is the recommended method.
  • FileSort sorting: the sort is usually done in memory and uses more CPU. If the result set to be sorted is large, MySQL spills to temporary files on disk, and the extra I/O makes it slow.

Optimization suggestions:

  • The optimizer chooses the sorting method automatically: MySQL supports both index sorting and FileSort. Index sorting relies on the order the index already maintains, performs well, and is recommended. FileSort sorts in memory and, when the data set is large, spills to temporary files on disk, which is slow and CPU-intensive. FileSort is not always the slower choice, though. For example, for left-fuzzy and "not equal to" queries that the index does not cover, a full table scan is cheaper than traversing the index and then going back to the table.
  • Satisfy the leftmost prefix: when the WHERE conditions and the ORDER BY fields share a joint index, their order must satisfy the leftmost prefix. For example, with index (a,b,c), the query where a=1 order by b,c can sort by the index.
  • Index columns to the right of a range condition cannot be used for sorting: for example, with index (a,b,c), the query where a>1 order by b,c cannot use the index to sort by b,c, so filesort is required.
  • Either all ascending or all descending: the ORDER BY columns must be all ASC or all DESC. Mixed directions prevent the index from being used for sorting, unless the index itself is defined with matching per-column directions (descending indexes, available from MySQL 8.0).
  • When there are many rows to sort, the index is not used: once the number of rows to sort exceeds roughly 10,000 (the figure seen in this article's tests; the real threshold depends on the optimizer's cost estimate), the optimizer skips the index and uses filesort. Use LIMIT and WHERE to reduce the number of rows. With a large data set, sorting by the index means going back to the table for every row, which performs very poorly and is slower than sorting in memory with FileSort. Using LIMIT does not guarantee index sorting; what matters is the number of rows. If it is too large, the optimizer still chooses FileSort.
  • Index the range field first when it filters well: when you have to choose between indexing the [range condition] field and the [group by or order by] field, put the index on the range field first if it filters out enough rows that few remain to be sorted. Even though the range condition keeps the index from being used for sorting, this is still faster than indexing only the sort field. If the range condition filters out only a small share of the rows, put the index on the sort field first.
  • Tuning FileSort: when Index sorting cannot be used, tune FileSort instead, for example by increasing sort_buffer_size (the size of the sort buffer) and max_length_for_sort_data (the maximum row length for sorted data).
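The leftmost-prefix and range rules above can be checked with EXPLAIN. The sketch below uses a hypothetical table t with columns a, b, c, which is not part of the original tests. Whether Using filesort appears in Extra should follow the comments, although the optimizer may decide otherwise on very small or very large tables.

```sql
# Hypothetical demo table, used only to illustrate the rules
CREATE TABLE t (
  id INT PRIMARY KEY AUTO_INCREMENT,
  a INT,
  b INT,
  c INT,
  KEY idx_a_b_c (a, b, c)
);

# Equality on a, then ORDER BY b,c: matches the leftmost prefix, so the index order can be reused
EXPLAIN SELECT a, b, c FROM t WHERE a = 1 ORDER BY b, c;

# Range on a: rows in the scanned range are not globally ordered by b,c, expect Using filesort
EXPLAIN SELECT a, b, c FROM t WHERE a > 1 ORDER BY b, c;

# Skipping b breaks the leftmost prefix for the sort, expect Using filesort
EXPLAIN SELECT a, b, c FROM t WHERE a = 1 ORDER BY c;
```

Selecting only a, b, c keeps the queries covered by the index, so any filesort that shows up is caused by the ORDER BY shape and not by the cost of going back to the table.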

5.2 Testing

5.2.1 Case verification

Delete the created indexes in the student table and class table.

# Method 1
DROP INDEX idx_monitor ON class;
DROP INDEX idx_cid ON student;
DROP INDEX idx_age ON student;
DROP INDEX idx_name ON student;
DROP INDEX idx_age_name_classId ON student;
DROP INDEX idx_age_classId_name ON student;

# Method 2: call the stored procedure that drops the indexes
call proc_drop_index('atguigudb2','student');

Can the following queries use an index, and can Using filesort be removed?

Process 1: Without any index, the query goes straight to filesort:

# No index is used; no LIMIT
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid;

Process 2: After adding the index, ORDER BY without LIMIT still has too many rows to sort, so the index is not used:

CREATE INDEX idx_age_classid_name ON student(age,classId,name);
# Index not used: no LIMIT
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid;

# Index used: key_len is 73
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid LIMIT 10000;

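Process 3: Wrong column order in ORDER BY, so the index is not used

# Create an index on (age, classid, stuno)
#call proc_drop_index('atguigudb2','student');
CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);
# Index not used: leftmost prefix not satisfied
EXPLAIN SELECT * FROM student ORDER BY classid LIMIT 10;
# Index not used: leftmost prefix not satisfied
EXPLAIN SELECT * FROM student ORDER BY classid,name LIMIT 10;
# Index not used: leftmost prefix not satisfied
EXPLAIN SELECT * FROM student WHERE classid=1 ORDER BY age,stuno;
# Index used: the leftmost prefix is not satisfied, but because the LIMIT is small the optimizer
# reads rows in the order of the three-column index and applies the WHERE filter until it has 10 rows.
# The optimizer judges the index to be cheaper than filesort here, so it uses the index.
EXPLAIN SELECT * FROM student WHERE classid=1 ORDER BY age,stuno LIMIT 10;
# Index used: leftmost prefix satisfied
EXPLAIN SELECT * FROM student ORDER BY age,classid,stuno LIMIT 10;
# Index used: leftmost prefix satisfied
EXPLAIN SELECT * FROM student ORDER BY age,classid LIMIT 10;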

Process 4: Inconsistent ORDER BY rules, so the index is not used (wrong column order: no index; mixed directions: no index)

The ORDER BY must match the leftmost prefix and be "all ascending or all descending".

# Index on (age, classid, stuno); skip this statement if the index already exists from Process 3
CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);
# Index not used: not "all ascending or all descending"
EXPLAIN SELECT * FROM student ORDER BY age DESC,classid ASC LIMIT 10;
# Index not used: leftmost prefix not satisfied
EXPLAIN SELECT * FROM student ORDER BY classid DESC,name DESC LIMIT 10;
# Index not used: not "all ascending or all descending"
EXPLAIN SELECT * FROM student ORDER BY age ASC,classid DESC LIMIT 10;
# Index used: leftmost prefix satisfied and "all ascending or all descending"
EXPLAIN SELECT * FROM student ORDER BY age DESC,classid DESC LIMIT 10;
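From MySQL 8.0, an index can store columns in descending order, which gives a mixed-direction ORDER BY a matching index. A minimal sketch, assuming MySQL 8.0 or later; the index name idx_age_desc_classid is made up for this example:

```sql
# Assumption: MySQL 8.0+, where DESC in an index definition is honored
CREATE INDEX idx_age_desc_classid ON student (age DESC, classid ASC);

# The ORDER BY directions now match the index definition column by column
EXPLAIN SELECT * FROM student ORDER BY age DESC, classid ASC LIMIT 10;
```

In versions before 8.0 the DESC keyword in an index definition is parsed but ignored, so this approach does not help there.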

Process 5: With a small LIMIT, the index may still be used even when the leftmost prefix is not satisfied: MySQL reads rows in index order first and then applies the WHERE filter.

# Both indexes are needed here; skip any statement whose index already exists
CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);
CREATE INDEX idx_age_classid_name ON student(age,classId,name);

# Index used
EXPLAIN SELECT * FROM student WHERE age=45 ORDER BY classid LIMIT 10;
# Index used
EXPLAIN SELECT * FROM student WHERE age=45 ORDER BY classid,name;
# Index not used: leftmost prefix not satisfied
EXPLAIN SELECT * FROM student WHERE classid=45 order by age;
# Index used: because the LIMIT is small, the optimizer reads rows in the order of the
# sort-field index and applies the WHERE filter until it has 10 rows
EXPLAIN SELECT * FROM student WHERE classid=45 order by age limit 10;

A range condition stops the index from being used for the columns after it: given the index (userDbid, addressDbid, createTime), userDbid and addressDbid use the index, but because addressDbid is matched by a range condition, the createTime part of the index is not used.

5.2.2 Exercises

INDEX a_b_c(a,b,c)
ORDER BY can use the leftmost prefix of the index
- ORDER BY a
- ORDER BY a,b
- ORDER BY a,b,c
- ORDER BY a DESC,b DESC,c DESC
If WHERE fixes the leftmost prefix of the index to constants, ORDER BY can use the index
- WHERE a = const ORDER BY b,c
- WHERE a = const AND b = const ORDER BY c
- WHERE a = const AND b > const ORDER BY b,c
Cannot use the index for sorting
- ORDER BY a ASC,b DESC,c DESC /* inconsistent sort directions */
- WHERE g = const ORDER BY b,c /* a is missing */
- WHERE a = const ORDER BY c /* b is missing */
- WHERE a = const ORDER BY a,d /* d is not part of the index */
- WHERE a in (...) ORDER BY b,c /* for sorting, multiple equality values also count as a range query */


 

5.3 Index field selection in range query

  1. MySQL chooses the best index automatically: when two candidate indexes exist, the optimizer picks the one it estimates to be cheaper (in this example, idx_age_stuno_name). As the data volume changes, the chosen index may change too.
  2. When the filter is selective, index the filter field first: when you have to choose between the [range condition] field and the [group by or order by] field, first look at how many rows the condition filters out. If it filters out enough rows that few remain to be sorted, put the index on the range field first. Otherwise, put it on the sort field.

For the ORDER BY clause, prefer Index sorting and avoid FileSort where possible.

Before executing the case, clear the index on the student, leaving only the primary key:

DROP INDEX idx_age ON student;
DROP INDEX idx_age_classid_stuno ON student;
DROP INDEX idx_age_classid_name ON student;

# or
call proc_drop_index('atguigudb2','student');

Scenario: Query the students whose age is 30 and whose student number is less than 101000, sorted by name

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME ;

The query results are as follows:

mysql> SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;
+---------+--------+--------+------+---------+
| id      | stuno  |  name  | age  | classId |
+---------+--------+--------+------+---------+
| 922     | 100923 | elTLXD | 30   | 249     |
| 3723263 | 100412 | hKcjLb | 30   | 59      |
| 3724152 | 100827 | iHLJmh | 30   | 387     |
| 3724030 | 100776 | LgxWoD | 30   | 253     |
| 30      | 100031 | LZMOIa | 30   | 97      |
| 3722887 | 100237 | QzbJdx | 30   | 440     |
| 609     | 100610 | vbRimN | 30   | 481     |
| 139     | 100140 | ZqFbuR | 30   | 351     |
+---------+--------+--------+------+---------+
8 rows in set, 1 warning (3.16 sec)

Conclusion: type is ALL, the worst case, and Extra shows Using filesort, which is also the worst case. This query must be optimized.

Solution 1: To remove filesort, create an index; the query becomes a little faster

# Create a new index
CREATE INDEX idx_age_name ON student(age,NAME);
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;

Only age uses the index for filtering, and Using filesort is optimized away.

Solution 2: Try to have both the WHERE filter conditions and the sort use the index; the query turns out to be faster still

Build a joint index on all three fields; Using filesort is still present:

DROP INDEX idx_age_name ON student;
CREATE INDEX idx_age_stuno_name ON student (age,stuno,NAME);
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;

age and stuno use the index.

Using filesort is still there, so name does not use the index, and type is range, which does not look ideal on paper. The reason is that stuno is a range filter, so the index fields after it are no longer used.

The result is as follows:

mysql> SELECT SQL_NO_CACHE * FROM student
    -> WHERE age = 30 AND stuno <101000 ORDER BY NAME ;
+-----+--------+--------+------+---------+
| id  | stuno  | name   | age  | classId |
+-----+--------+--------+------+---------+
| 167 | 100168 | AClxEF | 30   | 319     |
| 323 | 100324 | bwbTpQ | 30   | 654     |
| 651 | 100652 | DRwIac | 30   | 997     |
| 517 | 100518 | HNSYqJ | 30   | 256     |
| 344 | 100345 | JuepiX | 30   | 329     |
| 905 | 100906 | JuWALd | 30   | 892     |
| 574 | 100575 | kbyqjX | 30   | 260     |
| 703 | 100704 | KJbprS | 30   | 594     |
| 723 | 100724 | OTdJkY | 30   | 236     |
| 656 | 100657 | Pfgqmj | 30   | 600     |
| 982 | 100983 | qywLqw | 30   | 837     |
| 468 | 100469 | sLEKQW | 30   | 346     |
| 988 | 100989 | UBYqJl | 30   | 457     |
| 173 | 100174 | UltkTN | 30   | 830     |
| 332 | 100333 | YjWiZw | 30   | 824     |
+-----+--------+--------+------+---------+
15 rows in set, 1 warning (0.00 sec)

The version that still uses filesort turned out to run faster than the version whose index removed filesort, and by a wide margin: the results appear almost instantly.

Reason:

All sorting happens after the conditions have filtered the rows. If the conditions filter out most of the data, sorting the remaining few hundred or few thousand rows costs little, so even if an index avoids the sort, the actual gain is very limited. By contrast, if the condition stuno<101000 cannot use an index, tens of thousands of rows must be scanned, which is expensive. Putting the index on this field is therefore the most cost-effective choice.

Conclusion:

  1. When both indexes exist, MySQL automatically chooses the better one (in this example, idx_age_stuno_name). As the data volume changes, the chosen index may change too.
  2. When you have to choose between the [range condition] field and the [group by or order by] field, first look at how many rows the condition filters out. If it filters out enough rows that few remain to be sorted, put the index on the range field first. Otherwise, put it on the sort field.
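To see the optimizer's choice and compare the two candidates directly, one option is an index hint. This is only a sketch: it assumes both indexes from Solution 1 and Solution 2 exist on student at the same time, and the rows and Extra values you see depend on your data.

```sql
# Assumption: both candidate indexes exist; skip any statement whose index is already present
CREATE INDEX idx_age_name ON student (age, NAME);
CREATE INDEX idx_age_stuno_name ON student (age, stuno, NAME);

# Let the optimizer choose; check the key column of the output
EXPLAIN SELECT * FROM student WHERE age = 30 AND stuno < 101000 ORDER BY NAME;

# Force each candidate in turn and compare the rows and Extra columns
EXPLAIN SELECT * FROM student FORCE INDEX (idx_age_name)
WHERE age = 30 AND stuno < 101000 ORDER BY NAME;

EXPLAIN SELECT * FROM student FORCE INDEX (idx_age_stuno_name)
WHERE age = 30 AND stuno < 101000 ORDER BY NAME;
```

FORCE INDEX is used here for comparison only; in production queries it is usually better to leave the choice to the optimizer.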

Thinking: Would the following index also work here?

DROP INDEX idx_age_stuno_name ON student;

CREATE INDEX idx_age_stuno ON student(age,stuno);

Yes.

5.4 filesort algorithm

5.4.1 Two-way sorting and one-way sorting

If the sort field is not covered by an index, filesort uses one of two algorithms: two-way sorting or one-way (single-way) sorting.

Two-way sort (slow)

  • Before MySQL 4.1, two-way sorting was used. As the name suggests, it reads the disk twice to get the data: it reads the row pointers and the ORDER BY columns and sorts them, then scans the sorted list and re-reads the corresponding rows from disk according to the row pointers in the list.
  • In short: fetch the sort fields from disk, sort them in the buffer, and then fetch the other fields from disk.

Fetching one batch of data takes two disk scans, and I/O is well known to be slow, so an improved algorithm, single-way sorting, was introduced after MySQL 4.1.

Single-way sort (fast)

It reads all the columns the query needs from disk, sorts them in the buffer by the ORDER BY columns, and then scans the sorted list for output. It is more efficient because it avoids reading the data a second time and turns random I/O into sequential I/O, but it uses more space because it keeps every full row in memory.

Conclusion and open issues

  • Because single-way sorting came later, it is generally better than two-way sorting

  • But single-way sorting has a problem

    • In sort_buffer, single-way sorting takes much more space than two-way sorting, because it fetches all the fields. The fetched data may therefore exceed the capacity of sort_buffer, so that only one sort_buffer's worth of data can be read and sorted at a time (creating tmp files and merging them in multiple passes). Once that chunk is sorted, the next sort_buffer's worth is read and sorted, and so on, which results in many I/O operations.
    • Single-way sorting was meant to save one I/O pass but ends up causing many more, so the loss outweighs the gain.
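To see which algorithm a particular query ends up with, one option is the optimizer trace. The sketch below assumes a MySQL version that provides the optimizer trace (5.6 or later) and that name is not indexed, so the query needs filesort. In the trace, the filesort_summary entry has a sort_mode field: a value of <sort_key, rowid> corresponds to the two-way algorithm, while <sort_key, additional_fields> or <sort_key, packed_additional_fields> corresponds to the single-way algorithm.

```sql
# Turn on the optimizer trace for this session
SET optimizer_trace = 'enabled=on';

# Run a query that needs filesort (assumes there is no usable index on name)
SELECT * FROM student WHERE age = 30 ORDER BY name LIMIT 10;

# Look for filesort_summary and its sort_mode field in the trace text
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;

# Turn the trace off again
SET optimizer_trace = 'enabled=off';
```

The trace only holds the most recent traced statement by default, so read it right after running the query you are interested in.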

5.4.2 Tuning filesort

1. Try increasing sort_buffer_size

2. Try increasing max_length_for_sort_data

SHOW VARIABLES LIKE '%max_length_for_sort_data%';
# Default: 1024 bytes (4096 in MySQL 8.0)

Increasing this parameter raises the probability that the improved (single-way) algorithm is used. If it is set too high, however, the total data is more likely to exceed the capacity of sort_buffer_size; the obvious symptoms are high disk I/O activity and low CPU usage. If the total length of the columns to be returned is greater than max_length_for_sort_data, the two-way algorithm is used; otherwise the single-way algorithm is used. Adjust the value between 1024 and 8192 bytes.

3. Avoid select * with ORDER BY. Query only the fields you need.

  • When the total size of the queried fields is less than max_length_for_sort_data and the sort field is not of a TEXT or BLOB type, the improved single-way sorting algorithm is used; otherwise the old two-way sorting algorithm is used.
  • With either algorithm the data may exceed the capacity of sort_buffer_size. When it does, tmp files are created for merge sorting, which causes multiple I/O operations. The risk is greater with single-way sorting, so increase sort_buffer_size.
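The tuning steps above might look like this in a session. This is a sketch only: the 2 MB value is an arbitrary example, and the right size depends on available memory and on how many connections sort concurrently. Note also that the MySQL manual lists max_length_for_sort_data as deprecated from 8.0.20, so on recent versions only the sort_buffer_size and column-list advice is likely to matter.

```sql
# Current values
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'max_length_for_sort_data';

# Raise the sort buffer for the current session only (2097152 bytes = 2 MB, an example value)
SET SESSION sort_buffer_size = 2097152;

# Select only the needed columns instead of select *
SELECT id, name FROM student WHERE age = 30 ORDER BY name;

# A growing Sort_merge_passes suggests that sorts are spilling to tmp files
SHOW SESSION STATUS LIKE 'Sort_merge_passes';
```

Setting the variable at session level keeps the experiment from affecting other connections, since the sort buffer is allocated per session.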

6. Group optimization

  • Similar to sorting optimization: GROUP BY uses indexes on almost the same principles as ORDER BY, and GROUP BY can use an index directly even when the filter conditions do not use one.
  • Leftmost prefix: GROUP BY sorts first and then groups, so it follows the leftmost-prefix rule of the index.
  • Tuning FileSort: when the index columns cannot be used, increase max_length_for_sort_data and sort_buffer_size.
  • WHERE is more efficient than HAVING: any condition that can be written in WHERE should not be written in HAVING. WHERE filters before grouping; HAVING filters after grouping.
  • Avoid sorting and grouping where you can, to save CPU: reduce the use of ORDER BY, agree with the business side to skip sorting where possible, or sort on the client. ORDER BY, GROUP BY, and DISTINCT are CPU-intensive, and the database's CPU resources are precious.
  • Use LIMIT: for statements containing ORDER BY, GROUP BY, or DISTINCT, keep the result set filtered by the WHERE condition within 1000 rows; otherwise the SQL will be very slow.
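The WHERE versus HAVING point can be illustrated with the student table. The first two queries below return the same groups; the thresholds 30 and 100 are arbitrary example values.

```sql
# Filter after grouping: every age group is built first, then most are discarded
SELECT age, COUNT(*) AS cnt FROM student GROUP BY age HAVING age > 30;

# Filter before grouping: only the matching rows are grouped
SELECT age, COUNT(*) AS cnt FROM student WHERE age > 30 GROUP BY age;

# HAVING is still the right place for conditions on aggregates
SELECT age, COUNT(*) AS cnt FROM student WHERE age > 30 GROUP BY age HAVING cnt > 100;
```

With an index whose leftmost column is age, the second form can also use the index for both the filter and the grouping.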

7. Paging query optimization

7.1 Deep paging query optimization

For ordinary pagination queries, creating a covering index usually improves performance noticeably.

Current problem: when the offset is very large, MySQL has to read a large amount of useless data before it reaches the requested page, so performance is poor.

A common and troublesome case is limit 2000000,10. MySQL has to read and sort the first 2000010 records, return only records 2000001 to 2000010, and discard the rest, so the cost of the query and sort is very high. In addition, select * has to go back to the table, which takes even more time.

EXPLAIN SELECT * FROM student LIMIT 2000000,10; 

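Table with an auto-increment primary key: fetch the 10 rows after the boundary directly. The LIMIT offset can be converted into a lookup by position on the primary key (this assumes the ids are contiguous, with no gaps left by deleted rows).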

EXPLAIN SELECT * FROM student WHERE id > 2000000 LIMIT 10;

Table whose primary key is not auto-increment: page over the primary key in a subquery (sort, then take the slice), and join the result back to the table on the primary key.

EXPLAIN SELECT * FROM student t,(SELECT id FROM student ORDER BY id LIMIT 2000000,10) a WHERE t.id = a.id;

You can also write this with a subquery, which the optimizer turns into a join.
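A sketch of the subquery form follows. MySQL rejects LIMIT placed directly inside an IN subquery, so the page of ids is wrapped in a derived table first; the alias page_ids is made up for this example.

```sql
# The inner query pages over the primary key only; the outer query fetches the full rows
EXPLAIN SELECT * FROM student
WHERE id IN (
  SELECT id FROM (
    SELECT id FROM student ORDER BY id LIMIT 2000000,10
  ) AS page_ids
);
```

The deep offset is still paid, but only on the primary key index, which is much cheaper than paging over full rows.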

7.2 Deep paging optimization with sorting

Before optimization: a deep-pagination query sorted by age in descending order

EXPLAIN SELECT * FROM student order by age desc LIMIT 2000000,10; 

Optimization scheme 1: The idea is the same as before: page over the ids in a subquery and join back on id

EXPLAIN SELECT * FROM student t1,(SELECT id FROM student ORDER BY age desc LIMIT 2000000,10) t2 WHERE t1.id=t2.id;

Optimization scheme 2: If pages are turned sequentially, take the last record x of the previous page. Because the order is descending and the effective sort key is (age, id), every record on the target page has age less than or equal to x.age, and among records with the same age as x, id smaller than x.id.

EXPLAIN SELECT * FROM student WHERE age<#{x.age} OR (age=#{x.age} AND id<#{x.id}) ORDER BY age DESC, id DESC LIMIT 10;
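A concrete version of sequential paging, as a sketch. It assumes an index on age (the name idx_age is illustrative) and that the last row of the previous page had age = 30 and id = 12345; both values are made up. It uses age plus id as the sort key so that rows sharing the same age are neither skipped nor repeated.

```sql
# Assumed supporting index; InnoDB stores the primary key in every secondary index entry
CREATE INDEX idx_age ON student (age);

# Next page after the row (age = 30, id = 12345)
EXPLAIN SELECT * FROM student
WHERE age < 30 OR (age = 30 AND id < 12345)
ORDER BY age DESC, id DESC
LIMIT 10;
```

The application keeps the age and id of the last row it displayed and passes them into the next request, so the cost of each page does not grow with the page number.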

Origin blog.csdn.net/qq_40991313/article/details/130790354