Which is more efficient in MySQL: DISTINCT or GROUP BY?

First, the general conclusion (the full conclusion is at the end of the article):

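1. When the semantics are the same and an index is available: GROUP BY and DISTINCT can both use the index, and they are equally efficient.

2. When the semantics are the same and no index is available: DISTINCT is more efficient than GROUP BY. Both perform a grouping operation, but GROUP BY may also sort, which triggers a filesort and slows the SQL statement down.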

Based on this conclusion, you might ask:

Why do group by and distinct have the same efficiency when the semantics are the same and there are indexes?

Under what circumstances does group by perform a sort operation?

The rest of this article answers these two questions. First, let's look at the basic usage of DISTINCT and GROUP BY.

Use of DISTINCT

Basic usage

SELECT DISTINCT columns FROM table_name WHERE where_conditions;

For example:

mysql> select distinct age from student;
+------+
| age  |
+------+
|   10 |
|   12 |
|   11 |
| NULL |
+------+
4 rows in set (0.01 sec)

The DISTINCT keyword returns only unique rows. It is placed before the first column in the SELECT list and applies to every column in that list.

If a column contains NULL values and you apply DISTINCT to it, MySQL keeps one NULL and removes the others, because DISTINCT treats all NULL values as the same value.
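
The NULL handling described above can be tried out with a small sketch. It makes two assumptions that are not part of the original article: it uses Python's built-in sqlite3 module as a stand-in for a MySQL server, and the rows in the student table are made up for illustration. SQLite also treats NULLs as duplicates under DISTINCT, so the result should match what the MySQL session above shows.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (sex, age) VALUES (?, ?)",
    [("male", 10), ("female", 12), ("male", 11),
     ("male", None), ("male", None), ("female", 11), ("male", 10)],
)

# Two rows have age = NULL, yet DISTINCT returns NULL only once.
ages = [row[0] for row in conn.execute("SELECT DISTINCT age FROM student")]
null_count = sum(1 for age in ages if age is None)
print(len(ages), null_count)
```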

Multi-column deduplication with DISTINCT

When DISTINCT is applied to several columns, rows are deduplicated on the combination of all listed columns: two rows count as duplicates only if every listed column has the same value.

SELECT DISTINCT column1,column2 FROM table_name WHERE where_conditions;

mysql> select distinct sex,age from student;
+--------+------+
| sex    | age  |
+--------+------+
| male   |   10 |
| female |   12 |
| male   |   11 |
| male   | NULL |
| female |   11 |
+--------+------+
5 rows in set (0.02 sec)

Use of GROUP BY

For basic deduplication, GROUP BY is used much like DISTINCT:

Single-column deduplication

Syntax:

SELECT columns FROM table_name WHERE where_conditions GROUP BY columns;

Example:

mysql> select age from student group by age;
+------+
| age  |
+------+
|   10 |
|   12 |
|   11 |
| NULL |
+------+
4 rows in set (0.02 sec)

Multi-column deduplication

Syntax:

SELECT columns FROM table_name WHERE where_conditions GROUP BY columns;

Example:

mysql> select sex,age from student group by sex,age;
+--------+------+
| sex    | age  |
+--------+------+
| male   |   10 |
| female |   12 |
| male   |   11 |
| male   | NULL |
| female |   11 |
+--------+------+
5 rows in set (0.03 sec)
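
To check that the two forms really deduplicate the same way, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL and a made-up student table; neither comes from the original article. It compares the rows returned by DISTINCT with those returned by GROUP BY on the same columns, ignoring row order since neither query specifies one.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (sex, age) VALUES (?, ?)",
    [("male", 10), ("female", 12), ("male", 11),
     ("male", None), ("male", None), ("female", 11), ("male", 10)],
)

distinct_rows = list(conn.execute("SELECT DISTINCT sex, age FROM student"))
grouped_rows = list(conn.execute("SELECT sex, age FROM student GROUP BY sex, age"))

# Compare as sets because neither statement guarantees an output order.
same = set(distinct_rows) == set(grouped_rows)
print(same, len(distinct_rows), len(grouped_rows))
```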

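Example of the difference

The syntactic difference between the two is that GROUP BY can deduplicate on a single column while other columns still appear in the SELECT list. GROUP BY groups the rows (and, before MySQL 8.0, sorts them), then returns one row per group, deduplicating on the columns listed after GROUP BY. Note that a query like the one below is only accepted when the ONLY_FULL_GROUP_BY SQL mode is disabled (it is enabled by default since MySQL 5.7.5), and the value returned for the nongrouped column age is not guaranteed to come from any particular row of the group.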

For example:

mysql> select sex,age from student group by sex;
+--------+-----+
| sex    | age |
+--------+-----+
| male   |  10 |
| female |  12 |
+--------+-----+
2 rows in set (0.03 sec)
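
Because the value shown for a nongrouped column such as age depends on which row the server happens to pick, one way to make the result well defined is to say explicitly which value is wanted by using an aggregate function. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, a made-up student table, and MIN(age) as an arbitrary choice of aggregate.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (sex, age) VALUES (?, ?)",
    [("male", 10), ("female", 12), ("male", 11),
     ("male", None), ("male", None), ("female", 11), ("male", 10)],
)

# One row per sex; MIN(age) states which age is wanted and ignores NULLs.
youngest = list(
    conn.execute("SELECT sex, MIN(age) FROM student GROUP BY sex ORDER BY sex")
)
print(youngest)
```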

How DISTINCT and GROUP BY work

In most cases, DISTINCT can be regarded as a special case of GROUP BY. Both are implemented on top of a grouping operation, and both can be carried out with a loose index scan or a tight index scan.

Both DISTINCT and GROUP BY can use an index to scan for their results. Running EXPLAIN on the two statements below (just look at the Extra column at the end of each row) shows that both use a loose index scan, which EXPLAIN reports as Using index for group-by.

Therefore, in general, DISTINCT and GROUP BY statements with the same semantics can be optimized with the same index strategy.

mysql> explain select int1_index from test_distinct_groupby group by int1_index;
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | range | index_1       | index_1 | 5       | NULL |  955 |   100.00 | Using index for group-by |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
1 row in set (0.05 sec)

mysql> explain select distinct int1_index from test_distinct_groupby;
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | range | index_1       | index_1 | 5       | NULL |  955 |   100.00 | Using index for group-by |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
1 row in set (0.05 sec)

Before MySQL 8.0, however, GROUP BY implicitly sorted by the grouped columns by default.

As the following EXPLAIN output shows, the statement uses a temporary table and also performs a filesort.

mysql> explain select int6_bigger_random from test_distinct_groupby GROUP BY int6_bigger_random;
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
| id | select_type | table                 | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra                           |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 97402 |   100.00 | Using temporary; Using filesort |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
1 row in set (0.04 sec)

Implicit sorting

For implicit sorting, we can refer to the official MySQL documentation:

https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html

GROUP BY implicitly sorts by default (that is, in the absence of ASC or DESC designators for GROUP BY columns). However, relying on implicit GROUP BY sorting (that is, sorting in the absence of ASC or DESC designators) or explicit sorting for GROUP BY (that is, by using explicit ASC or DESC designators for GROUP BY columns) is deprecated. To produce a given sort order, provide an ORDER BY clause.

In other words:

GROUP BY sorts implicitly by default (that is, it sorts even when the GROUP BY columns carry no ASC or DESC designator). However, relying on either implicit or explicit GROUP BY sorting is deprecated. To produce a given sort order, provide an ORDER BY clause.

Therefore, before MySQL 8.0, GROUP BY sorted the result by the grouping columns (the columns listed after GROUP BY) by default. When an index can supply that order, GROUP BY needs no extra sort operation. When it cannot, the MySQL optimizer has to implement GROUP BY by building a temporary table and then sorting it.

When the result set exceeds the configured size limit for in-memory temporary tables, MySQL moves the temporary table to disk, and the statement becomes extremely slow. This is why MySQL chose to deprecate implicit sorting.

For these reasons, MySQL changed this behavior in 8.0:

https://dev.mysql.com/doc/refman/8.0/en/order-by-optimization.html

Previously (MySQL 5.7 and lower), GROUP BY sorted implicitly under certain conditions. In MySQL 8.0, that no longer occurs, so specifying ORDER BY NULL at the end to suppress implicit sorting (as was done previously) is no longer necessary. However, query results may differ from previous MySQL versions. To produce a given sort order, provide an ORDER BY clause.

In other words:

Previously (MySQL 5.7 and earlier), GROUP BY sorted implicitly under certain conditions. In MySQL 8.0 this behavior has been removed, so appending ORDER BY NULL to suppress implicit sorting is no longer necessary. However, query results may differ from previous MySQL versions. To get results in a given order, specify the columns to sort by with ORDER BY.
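
The advice to state the order explicitly can be sketched as follows. The example assumes Python's built-in sqlite3 module as a stand-in for MySQL and a made-up student table, neither of which appears in the original article. It only shows the portable pattern (GROUP BY for grouping, ORDER BY for ordering) and does not reproduce MySQL's version-specific implicit sorting.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (sex, age) VALUES (?, ?)",
    [("male", 10), ("female", 12), ("male", 11),
     ("male", None), ("male", None), ("female", 11), ("male", 10)],
)

# GROUP BY only groups; the ORDER BY clause is what guarantees the row order.
ordered_ages = [
    row[0]
    for row in conn.execute("SELECT age FROM student GROUP BY age ORDER BY age")
]
print(ordered_ages)
```

In this sketch NULL sorts before the numeric values in ascending order, which is also how MySQL orders NULL under ORDER BY ... ASC.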

This gives us our conclusion:

When the semantics are the same and an index is available:

GROUP BY and DISTINCT can both use the index and are equally efficient, because the two are almost equivalent: DISTINCT can be regarded as a special case of GROUP BY.

When the semantics are the same and no index is available:

DISTINCT is more efficient than GROUP BY. Both perform a grouping operation, but before MySQL 8.0 GROUP BY also sorts implicitly, which triggers a filesort and slows the SQL statement down.

Since MySQL 8.0, however, implicit sorting has been removed. From that version on, GROUP BY and DISTINCT perform almost identically when the semantics are the same, even without an index.

Reasons for recommending GROUP BY

GROUP BY has clearer semantics

GROUP BY can perform more complex processing on the data

Compared with DISTINCT, GROUP BY states its intent more clearly. And because the DISTINCT keyword applies to every column in the SELECT list, GROUP BY is more flexible for compound business logic. GROUP BY can process the data further on a per-group basis, for example by filtering groups with HAVING or by computing values with aggregate functions.
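
As a minimal sketch of that extra flexibility, the example below counts the students in each group and keeps only the groups with more than two members. It assumes Python's built-in sqlite3 module as a stand-in for MySQL and a made-up student table; the threshold of 2 is arbitrary. DISTINCT alone cannot express this query.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (sex, age) VALUES (?, ?)",
    [("male", 10), ("female", 12), ("male", 11),
     ("male", None), ("male", None), ("female", 11), ("male", 10)],
)

# COUNT(*) aggregates within each group; HAVING filters whole groups.
large_groups = list(
    conn.execute(
        "SELECT sex, COUNT(*) FROM student "
        "GROUP BY sex HAVING COUNT(*) > 2 ORDER BY sex"
    )
)
print(large_groups)
```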

Origin blog.csdn.net/Park33/article/details/129957544