Mysql sorting principle_MySQL sorting principle and case analysis

foreword

Sorting is a fundamental function in databases, and MySQL is no exception. The user can achieve the purpose of sorting the specified result set through the Order by statement. In fact, not only the Order by statement, the Group by statement, and the Distinct statement all implicitly use sorting. This article first briefly introduces how SQL uses indexes to avoid sorting costs, then introduces the internal principles of MySQL to achieve sorting, and introduces the parameters related to sorting. Finally, it will give a few "strange" sorting examples to talk about sorting consistency issues , and explain the essential cause of the phenomenon.

1. Sorting optimization and index usage

In order to optimize the sorting performance of SQL statements, it is best to avoid sorting and make reasonable use of indexes. Because the index itself is also ordered, if a suitable index is established on the field that needs to be sorted, then the sorting process can be skipped and the query speed of SQL can be improved. Below I will use some typical SQL to illustrate which SQL can use index to reduce sorting and which SQL cannot. Assume that the t1 table has indexes key1(key_part1, key_part2), key2(key2)

a. SQL that can use indexes to avoid sorting

SELECT * FROM t1 ORDER BY key_part1,key_part2;

SELECT * FROM t1 WHERE key_part1 = constant ORDER BY key_part2;

SELECT * FROM t1 WHERE key_part1 > constant ORDER BY key_part1 ASC;

SELECT * FROM t1 WHERE key_part1 = constant1 AND key_part2 > constant2 ORDER BY key_part2;

b. SQL that cannot use indexes to avoid sorting

//The sorting field is in multiple indexes, and the index sorting cannot be used

SELECT * FROM t1 ORDER BY key_part1,key_part2, key2;

//The sort key order is inconsistent with the column order in the index, and the index sort cannot be used

SELECT * FROM t1 ORDER BY key_part2, key_part1;

//The ascending and descending order are inconsistent, and the index sorting cannot be used

SELECT * FROM t1 ORDER BY key_part1 DESC, key_part2 ASC;

//key_part1 is a range query, key_part2 cannot use index sorting

SELECT * FROM t1 WHERE key_part1> constant ORDER BY key_part2;

2. Algorithm for sorting implementation

For SQL that cannot use indexes to avoid sorting, the database has to implement the sorting function itself to meet user needs. At this time, "Using filesort" will appear in the SQL execution plan. It should be noted here that filesort does not mean file sorting. In fact, There may also be memory sorting, which is mainly determined by the sort_buffer_size parameter and the size of the result set. There are three main ways to implement sorting within MySQL, conventional sorting, optimized sorting, and priority queue sorting, which mainly involve three sorting algorithms: quick sorting, merge sorting, and heap sorting. Suppose the table structure and SQL statement are as follows:

CREATE TABLE t1(id int, col1 varchar(64), col2 varchar(64), col3 varchar(64), PRIMARY KEY(id),key(col1,col2));

SELECT col1,col2,col3 FROM t1 WHERE col1>100 ORDER BY col2;

a. Conventional sorting

(1). Obtain records satisfying the WHERE condition from table t1

(2). For each record, take out the primary key + sort key (id, col2) of the record and put it into the sort buffer

(3). If the sort buffer can store all (id, col2) pairs that meet the conditions, sort them; otherwise, when the sort buffer is full, sort them and store them in a temporary file. (The sorting algorithm uses the quick sort algorithm)

(4). If a temporary file is generated during sorting, the merge sort algorithm needs to be used to ensure that the records in the temporary file are in order

(5). Execute the above process in a loop until all records that meet the conditions participate in the sorting

(6).扫描排好序的(id,col2)对,并利用id去捞取SELECT需要返回的列(col1,col2,col3)

(7).将获取的结果集返回给用户。

从上述流程来看,是否使用文件排序主要看sort buffer是否能容下需要排序的(id,col2)对,这个buffer的大小由sort_buffer_size参数控制。此外一次排序需要两次IO,一次是捞(id,col2),第二次是捞(col1,col2,col3),由于返回的结果集是按col2排序,因此id是乱序的,通过乱序的id去捞(col1,col2,col3)时会产生大量的随机IO。对于第二次MySQL本身一个优化,即在捞之前首先将id排序,并放入缓冲区,这个缓存区大小由参数read_rnd_buffer_size控制,然后有序去捞记录,将随机IO转为顺序IO。

b.优化排序

常规排序方式除了排序本身,还需要额外两次IO。优化的排序方式相对于常规排序,减少了第二次IO。主要区别在于,放入sort buffer不是(id,col2),而是(col1,col2,col3)。由于sort buffer中包含了查询需要的所有字段,因此排序完成后可以直接返回,无需二次捞数据。这种方式的代价在于,同样大小的sort buffer,能存放的(col1,col2,col3)数目要小于(id,col2),如果sort buffer不够大,可能导致需要写临时文件,造成额外的IO。当然MySQL提供了参数max_length_for_sort_data,只有当排序元组小于max_length_for_sort_data时,才能利用优化排序方式,否则只能用常规排序方式。

c.优先队列排序

为了得到最终的排序结果,无论怎样,我们都需要将所有满足条件的记录进行排序才能返回。那么相对于优化排序方式,是否还有优化空间呢?5.6版本针对Order by limit M,N语句,在空间层面做了优化,加入了一种新的排序方式--优先队列,这种方式采用堆排序实现。堆排序算法特征正好可以解limit M,N 这类排序的问题,虽然仍然需要所有元素参与排序,但是只需要M+N个元组的sort buffer空间即可,对于M,N很小的场景,基本不会因为sort buffer不够而导致需要临时文件进行归并排序的问题。对于升序,采用大顶堆,最终堆中的元素组成了最小的N个元素,对于降序,采用小顶堆,最终堆中的元素组成了最大的N的元素。

3.排序不一致问题

案例1

Mysql从5.5迁移到5.6以后,发现分页出现了重复值。

测试表与数据:

create table t1(id int primary key, c1 int, c2 varchar(128));

insert into t1 values(1,1,'a');

insert into t1 values(2,2,'b');

insert into t1 values(3,2,'c');

insert into t1 values(4,2,'d');

insert into t1 values(5,3,'e');

insert into t1 values(6,4,'f');

insert into t1 values(7,5,'g');

假设每页3条记录,第一页limit 0,3和第二页limit 3,3查询结果如下:

47480d46863f809b883701597c377709.png

我们可以看到 id为4的这条记录居然同时出现在两次查询中,这明显是不符合预期的,而且在5.5版本中没有这个问题。产生这个现象的原因就是5.6针对limit M,N的语句采用了优先队列,而优先队列采用堆实现,比如上述的例子order by c1 asc limit 0,3 需要采用大小为3的大顶堆;limit 3,3需要采用大小为6的大顶堆。由于c1为2的记录有3条,而堆排序是非稳定的(对于相同的key值,无法保证排序后与排序前的位置一致),所以导致分页重复的现象。为了避免这个问题,我们可以在排序中加上唯一值,比如主键id,这样由于id是唯一的,确保参与排序的key值不相同。将SQL写成如下:

select * from t1 order by c1,id asc limit 0,3;

select * from t1 order by c1,id asc limit 3,3;

案例2

两个类似的查询语句,除了返回列不同,其它都相同,但排序的结果不一致。

测试表与数据:

create table t2(id int primary key, status int, c1 varchar(255),c2 varchar(255),c3 varchar(255),key(c1));

insert into t2 values(7,1,'a',repeat('a',255),repeat('a',255));

insert into t2 values(6,2,'b',repeat('a',255),repeat('a',255));

insert into t2 values(5,2,'c',repeat('a',255),repeat('a',255));

insert into t2 values(4,2,'a',repeat('a',255),repeat('a',255));

insert into t2 values(3,3,'b',repeat('a',255),repeat('a',255));

insert into t2 values(2,4,'c',repeat('a',255),repeat('a',255));

insert into t2 values(1,5,'a',repeat('a',255),repeat('a',255));

分别执行SQL语句:

select id,status,c1,c2 from t2 force index(c1) where c1>='b' order by status;

select id,status from t2 force index(c1) where c1>='b' order by status;

执行结果如下:

558c510867704f60dc1a143b73b5b6fa.png

看看两者的执行计划是否相同

6f90662294e575656770b56003dba760.png

为了说明问题,我在语句中加了force index的hint,确保能走上c1列索引。语句通过c1列索引捞取id,然后去表中捞取返回的列。根据c1列值的大小,记录在c1索引中的相对位置如下:

(c1,id)===(b,6),(b,3),(5,c),(c,2),对应的status值分别为2 3 2 4。从表中捞取数据并按status排序,则相对位置变为(6,2,b),(5,2,c),(3,3,c),(2,4,c),这就是第二条语句查询返回的结果,那么为什么第一条查询语句(6,2,b),(5,2,c)是调换顺序的呢?这里要看我之前提到的a.常规排序和b.优化排序中标红的部分,就可以明白原因了。由于第一条查询返回的列的字节数超过了max_length_for_sort_data,导致排序采用的是常规排序,而在这种情况下MYSQL将rowid排序,将随机IO转为顺序IO,所以返回的是5在前,6在后;而第二条查询采用的是优化排序,没有第二次捞取数据的过程,保持了排序后记录的相对位置。对于第一条语句,若想采用优化排序,我们将max_length_for_sort_data设置调大即可,比如2048。

65c0b60ac1348342635cca29c420fb1c.png

4.参考文档

http://dev.mysql.com/doc/refman/5.6/en/order-by-optimization.html

http://mysql.taobao.org/monthly/2015/06/04/

http://ifxoxo.com/mysql_order_by.html

到此这篇关于MySQL排序原理和案例详析的文章就介绍到这了,更多相关MySQL排序原理和案例内容请搜索我们以前的文章或继续浏览下面的相关文章希望大家以后多多支持我们!

本文标题: MySQL排序原理和案例详析

本文地址: http://www.cppcns.com/shujuku/mysql/368132.html

Guess you like

Origin blog.csdn.net/weixin_35702787/article/details/113224866