MySQL: DISTINCT and GROUP BY deduplication, loose index scan & tight index scan

This article explains the differences between DISTINCT and GROUP BY in MySQL, covering usage, efficiency, and the concepts of loose index scan and tight index scan (sometimes translated as compact index scan).

DISTINCT usage

Example:

SELECT DISTINCT columns FROM table_name WHERE where_conditions;

The DISTINCT keyword applies to the selected columns (one or more) and makes the query return only unique rows for those columns.

When DISTINCT is applied to multiple columns, deduplication is based on the combination of all listed columns: two rows count as duplicates only if every listed column has the same value.

Special case: if a column contains NULL values and DISTINCT is applied to that column, MySQL keeps one NULL and discards the rest, because DISTINCT treats all NULL values as equal.
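
The two rules above can be tried with a small sketch. The table demo_user and its rows are hypothetical and exist only for this illustration; the order of the returned rows is not guaranteed without an ORDER BY clause.

```sql
# Hypothetical demo table (an assumption, not part of the article's schema)
CREATE TABLE demo_user (
    id INT NOT NULL AUTO_INCREMENT,
    username VARCHAR(32) NOT NULL,
    sex CHAR(1) NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO demo_user (username, sex) VALUES
    ('alice', 'F'),
    ('alice', 'F'),
    ('alice', NULL),
    ('bob', NULL),
    ('bob', NULL);

# Single column: returns two rows, 'F' and NULL; the three NULLs collapse into one
SELECT DISTINCT sex FROM demo_user;

# Multiple columns: a row is a duplicate only if every listed column matches,
# so this returns ('alice','F'), ('alice',NULL) and ('bob',NULL)
SELECT DISTINCT username, sex FROM demo_user;
```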

GROUP BY deduplication usage

For deduplication, GROUP BY is used in much the same way as DISTINCT.

Beyond deduplicating columns, the main purpose of GROUP BY is what its name says: grouping rows by column. It is generally combined with aggregate functions to compute per-group statistics such as sum, maximum, average, and count.

Deduplication example:

SELECT columns FROM table_name WHERE where_conditions GROUP BY columns;
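
As a minimal sketch of both uses, the following assumes a hypothetical table demo_order that is not part of the original article. The first query only deduplicates; the second computes the grouped statistics mentioned above.

```sql
# Hypothetical demo table (an assumption for illustration only)
CREATE TABLE demo_order (
    id INT NOT NULL AUTO_INCREMENT,
    customer VARCHAR(32) NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO demo_order (customer, amount) VALUES
    ('alice', 10.00),
    ('alice', 30.00),
    ('bob', 5.00);

# Deduplication only: one row per customer, the same rows as SELECT DISTINCT customer
SELECT customer FROM demo_order GROUP BY customer;

# Grouped statistics: count, sum, maximum and average per customer
SELECT customer,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount,
       MAX(amount) AS max_amount,
       AVG(amount) AS avg_amount
FROM demo_order
GROUP BY customer;
```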

The difference between DISTINCT and GROUP BY deduplication

DISTINCT is implemented in much the same way as GROUP BY. Both are based on a grouping operation; the difference is that DISTINCT only takes one record from each group after grouping.

Both DISTINCT and GROUP BY can use an index for scanning and lookup. In general, therefore, DISTINCT and GROUP BY statements with the same semantics can be optimized with the same index techniques, that is, with a loose index scan or a tight index scan.

For DISTINCT, when the operation cannot be completed using an index alone, MySQL has to fall back to a temporary table (Using temporary). MySQL uses the temporary table to "cache" the data, but it does not run a filesort on the data in the temporary table.

For GROUP BY, before MySQL 8.0, the result is implicitly sorted by the GROUP BY columns by default.

Example:

# Table structure, no secondary index
CREATE TABLE `user_copy` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `username` VARCHAR(32) NOT NULL COMMENT 'user name',
    `sex` CHAR(1) NULL DEFAULT NULL COMMENT 'sex' COLLATE 'utf8_general_ci',
    `address` VARCHAR(255) NULL DEFAULT NULL COMMENT 'address' COLLATE 'utf8_general_ci',
    PRIMARY KEY (`id`) USING BTREE
)ENGINE=InnoDB;

# SQL execution plans
explain SELECT DISTINCT `username` FROM user_copy;
explain SELECT `username` FROM user_copy GROUP BY `username`;

Result:

The execution plans show that the GROUP BY statement performs a filesort in addition to using a temporary table (Using temporary; Using filesort), whereas the DISTINCT statement only uses a temporary table.

Implicit sorting of GROUP BY

For implicit sorting, see the official MySQL documentation:

https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html

GROUP BY implicitly sorts by default (that is, in the absence of ASC or DESC designators for GROUP BY columns). However, relying on implicit GROUP BY sorting (that is, sorting in the absence of ASC or DESC designators) or explicit sorting for GROUP BY (that is, by using explicit ASC or DESC designators for GROUP BY columns) is deprecated. To produce a given sort order, provide an ORDER BY clause.

Explanation:

GROUP BY sorts implicitly by default (that is, it sorts even when the GROUP BY columns carry no ASC or DESC designator). However, relying on either implicit or explicit GROUP BY sorting is deprecated; to get a given sort order, provide an ORDER BY clause.

Therefore, before MySQL 8.0, GROUP BY sorts the result by the grouping columns (the columns listed after GROUP BY) by default. When an index can be used, GROUP BY needs no extra sort. When no index can be used, the MySQL optimizer has to implement GROUP BY by building a temporary table and then sorting it.

When the result set exceeds the temporary table size configured for the server, MySQL copies the temporary table to disk and continues there, and the statement becomes very slow. This is why MySQL deprecated this behavior (implicit sorting).

For these reasons, MySQL changed this behavior in 8.0:

https://dev.mysql.com/doc/refman/8.0/en/order-by-optimization.html

Previously (MySQL 5.7 and lower), GROUP BY sorted implicitly under certain conditions. In MySQL 8.0, that no longer occurs, so specifying ORDER BY NULL at the end to suppress implicit sorting (as was done previously) is no longer necessary. However, query results may differ from previous MySQL versions. To produce a given sort order, provide an ORDER BY clause.

Explanation:

In earlier MySQL versions, GROUP BY sorted implicitly under certain conditions. In MySQL 8.0 this behavior has been removed, so appending ORDER BY NULL to disable implicit sorting is no longer necessary. Query results may, however, come back in a different order than in earlier MySQL versions. To get results in a given order, specify the sort columns with ORDER BY.
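
A short sketch of the two idioms described in the documentation, using the user_copy table from the earlier example. Which one applies depends on the server version and on whether the result order matters.

```sql
# MySQL 5.7 and earlier: suppress the implicit GROUP BY sort when the order does not matter
SELECT `username` FROM user_copy GROUP BY `username` ORDER BY NULL;

# Any version: request the order explicitly when it does matter
SELECT `username` FROM user_copy GROUP BY `username` ORDER BY `username`;
```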

Summary of MySQL query deduplication

(1) With the same semantics and a usable index, GROUP BY and DISTINCT are equally efficient.

Both GROUP BY and DISTINCT can use an index, and an index is inherently ordered, so sorting is avoided and the two perform the same. In this case GROUP BY and DISTINCT are almost equivalent, and DISTINCT can be regarded as a special case of GROUP BY.

(2) With the same semantics and no usable index, DISTINCT is more efficient than GROUP BY before MySQL 8.0.

Both DISTINCT and GROUP BY perform a grouping operation, but before MySQL 8.0 GROUP BY also sorts implicitly, which triggers a filesort and lowers execution efficiency.

Since MySQL 8.0 removed implicit sorting, GROUP BY and DISTINCT are almost equally efficient from 8.0 onward when the semantics are the same and there is no index.

(3) GROUP BY is generally the recommended choice.

Its semantics are clearer and it is more flexible: GROUP BY can apply further processing per group, such as filtering groups with HAVING or computing values with aggregate functions.
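
As an illustration of point (3), here is a query that DISTINCT alone cannot express. It is a sketch against the user_copy table defined earlier; the alias cnt is chosen only for this example.

```sql
# User names that occur more than once, together with how often they occur
SELECT `username`, COUNT(*) AS cnt
FROM user_copy
GROUP BY `username`
HAVING COUNT(*) > 1;
```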

Loose Index Scan & Tight Index Scan

The sections above mentioned loose index scan and tight index scan (sometimes translated as compact index scan). Here is a brief introduction as background.

When a query needs grouping, as with GROUP BY or DISTINCT, an index that matches the grouping can be used to avoid scanning all data rows. To make this scan more efficient, MySQL provides the loose index scan.

How do grouping operations take advantage of indexes?

InnoDB tables are index-organized and based on B+ trees, so the columns in an index are stored in order. For a composite index, the composite key values are ordered. This property makes it possible to step from group to group within the index without scanning every index entry.

When MySQL implements GROUP BY entirely through an index scan and does not need to read every index key that satisfies the conditions to complete the grouping, the access method is called a loose index scan. It minimizes the number of rows that have to be scanned.

The schematic diagram of Loose Index Scan is shown in the figure below:

As the figure shows, the scan first reads the first index record, then jumps to the next record with a different prefix, and so on until the last one. The number of index keys read therefore equals the number of groups, and the many rows that share the same prefix in between are skipped.

When a loose index scan is used, the Extra column of the execution plan shows "Using index for group-by".

Why are loose index scans so efficient?

When there is no WHERE clause, that is, when a full index scan would otherwise be required, a loose index scan reads only as many keys as there are groups, which can be far fewer than the total number of keys in the index. When the WHERE clause contains range predicates or equality expressions, the loose index scan looks up the first key of each group that satisfies the range conditions and again reads the smallest possible number of keys.
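
The two cases just described can be sketched as follows. The table demo_sales and the index idx_region_amount are hypothetical names used only here. Both statements are candidates for a loose index scan, but the optimizer is cost-based, so the Extra column may show a different plan on a small table or on data with few rows per group.

```sql
# Hypothetical demo table with a composite index (an assumption for illustration)
CREATE TABLE demo_sales (
    id INT NOT NULL AUTO_INCREMENT,
    region VARCHAR(32) NOT NULL,
    amount INT NOT NULL,
    PRIMARY KEY (id),
    INDEX idx_region_amount (region, amount)
) ENGINE=InnoDB;

# No WHERE clause: one index lookup per region is enough to find each MIN(amount)
EXPLAIN SELECT region, MIN(amount) FROM demo_sales GROUP BY region;

# Range condition on the grouping column: only groups inside the range are visited
EXPLAIN SELECT region, MIN(amount) FROM demo_sales WHERE region < 'M' GROUP BY region;
```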

What is a tight index scan?

The main difference between a tight index scan (sometimes translated as compact index scan) and a loose index scan is that a tight index scan has to read every index key that satisfies the conditions and then performs the GROUP BY on all the data it has read; it does not skip any index keys.

For example, when the WHERE clause of a GROUP BY statement filters on an index column, an equality condition and a range condition lead to different execution plans: a loose index scan and a tight index scan, respectively.

For example:

# Table structure, composite index (`c1`, `c2`, `c3`)
CREATE TABLE `t1` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `c1` VARCHAR(255) NULL DEFAULT '' COLLATE 'utf8_general_ci',
    `c2` VARCHAR(255) NULL DEFAULT '' COLLATE 'utf8_general_ci',
    `c3` VARCHAR(255) NULL DEFAULT '' COLLATE 'utf8_general_ci',
    PRIMARY KEY (`id`) USING BTREE,
    INDEX `c` (`c1`, `c2`, `c3`) USING BTREE
)ENGINE=InnoDB;

# (1) Execution plan "Using where; Using index": tight index scan
explain SELECT DISTINCT `c1` FROM t1 WHERE `c2`>'B';

# (2) Execution plan "Using where; Using index for group-by": loose index scan
explain SELECT DISTINCT `c1` FROM t1 WHERE `c2`='B';

Explanation: in a composite index the composite key values are ordered. With an equality condition on one of the index columns, rows that do not satisfy the condition can still be skipped while grouping, so a loose index scan is used. With a range condition, every row has to be checked against the range, so a tight index scan is used.

Execution plan examples

Further examples of loose index scan, tight index scan, and temporary table with sorting are given below. The table structure is the same as table t1 above.

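# (1) Execution plan "Using where; Using index": tight index scan, range condition on an index column
explain SELECT DISTINCT `c1` FROM t1 WHERE `c2`>'B';

# (2) Execution plan "Using where; Using index for group-by": loose index scan, equality condition on an index column
explain SELECT DISTINCT `c1` FROM t1 WHERE `c2`='B';

# (3) Execution plan "Using index for group-by": loose index scan
explain SELECT MIN(c2) from t1 group by c1;

# (4) Execution plan "Using index": aggregate functions other than MIN/MAX cannot use a loose index scan, so a tight index scan is used
explain SELECT SUM(c2) from t1 group by c1;

# (5) Execution plan "Using index for group-by": loose index scan, the GROUP BY columns form a leftmost prefix of the index
explain SELECT `c1`,`c2` FROM t1 GROUP BY `c1`,`c2`;

# (6) Execution plan "Using where; Using index for group-by": loose index scan, leftmost prefix plus an equality condition on an index column
explain SELECT `c1`,`c2`,`c3` FROM t1  WHERE c3='C' GROUP BY `c1`,`c2`;

# (7) Execution plan "Using where; Using index for group-by": loose index scan; although the index prefix is not satisfied, the prefix column is a constant
explain SELECT `c1`,`c2`,`c3` FROM t1  WHERE c1='C' GROUP BY `c1`,`c2`,`c3`;

# (8) Execution plan "Using index; Using temporary; Using filesort": temporary table; the leftmost prefix is not satisfied, so the grouping cannot use the index and needs a temporary table plus a sort of the rows within the groups
explain SELECT `c2`,`c3` FROM t1 GROUP BY `c2`,`c3`;

# (9) Execution plan "Using index; Using temporary; Using filesort": temporary table; the leftmost prefix is not satisfied, so the grouping cannot use the index and needs a temporary table plus a sort of the rows within the groups
explain SELECT `c1`,`c3` FROM t1 GROUP BY `c1`,`c3`;

# (10) Execution plan "Using where; Using index": tight index scan; unlike (9), no temporary table is used, because although the leftmost prefix is not satisfied, the missing prefix column is a constant
explain SELECT `c1`,`c3` FROM t1 WHERE c2='B' GROUP BY `c1`,`c3`;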
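
Besides reading the Extra column, the difference between the two scan types can be observed through the session handler counters. The following is a hedged sketch using table t1 above; it assumes the account holds the RELOAD privilege that FLUSH STATUS requires. The interpretation is an expectation rather than a guarantee: a loose index scan typically shows mostly Handler_read_key activity, roughly in proportion to the number of groups, while a tight index scan typically shows Handler_read_next growing with the number of index entries read.

```sql
# Reset the session status counters (requires the RELOAD privilege)
FLUSH STATUS;

# Statement (2) from the list above, expected to use a loose index scan
SELECT DISTINCT `c1` FROM t1 WHERE `c2`='B';
SHOW SESSION STATUS LIKE 'Handler_read%';

# Reset again and run statement (1), expected to use a tight index scan
FLUSH STATUS;
SELECT DISTINCT `c1` FROM t1 WHERE `c2`>'B';
SHOW SESSION STATUS LIKE 'Handler_read%';
```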



Origin blog.csdn.net/minghao0508/article/details/129783846