A plain-language guide to how group by executes and how to optimize it

Foreword

Hello everyone, I'm the little boy who picks up snails.

In daily development we use group by all the time. But, dear friends, do you know how group by actually works? What is the difference between group by with where and group by with having? How can group by be optimized? What should you watch out for when using it? This article works through these questions with you, so we can conquer group by together~

  • Simple example using group by
  • How group by works
  • The difference between group by + where and group by + having
  • group by optimization ideas
  • Notes on using group by
  • How to optimize a production slow SQL

1. Simple example using group by

Group by is generally used for grouped statistics: it groups rows according to some rule. Let's start with a simple example and review it together.

Suppose we have an employee table with the following structure:

CREATE TABLE `staff` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `id_card` varchar(20) NOT NULL COMMENT 'ID card number',
  `name` varchar(64) NOT NULL COMMENT 'name',
  `age` int(4) NOT NULL COMMENT 'age',
  `city` varchar(64) NOT NULL COMMENT 'city',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';

The existing data in the table is as follows:

We now have this requirement: count the number of employees in each city. The corresponding SQL statement can be written as follows:

select city ,count(*) as num from staff group by city;

The execution result is as follows:

The logic of this SQL statement is very clear, but what is its underlying execution flow?

2. Group by principle analysis

2.1 explain analysis

Let's first use explain to view the execution plan:

explain select city ,count(*) as num from staff group by city;

  • Using temporary in the Extra field indicates that a temporary table is used to perform the grouping.
  • Using filesort in the Extra field indicates that a sort is performed.

How does group by use the temporary table and the sort? Let's take a look at the execution flow of this SQL.

2.2 Simple execution process of group by

explain select city ,count(*) as num from staff group by city;

Let's take a look at the execution process of this SQL.

  1. Create an in-memory temporary table with two fields, city and num;
  2. Scan the whole staff table and read the rows one by one; for each row, call its city value 'X':
  • If the temporary table has no row with city='X', insert a record (X,1);
  • If the temporary table already has a row with city='X', add 1 to the num value of that row;
  3. After the scan is complete, sort by the field city to get the result set and return it to the client.
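To make the steps above concrete, here is a minimal Python sketch of the same flow. It is only a toy model, not MySQL's actual implementation: a dict stands in for the in-memory temporary table, and the sample rows are made up for illustration.

```python
# Toy model of the flow above; the rows are hypothetical sample data: (id, city).
staff = [
    (1, "Shenzhen"), (2, "Guangzhou"), (3, "Shenzhen"),
    (4, "Beijing"), (5, "Guangzhou"), (6, "Shenzhen"),
]

def group_count_with_temp_table(rows):
    temp_table = {}                      # step 1: temporary table (city -> num)
    for _id, city in rows:               # step 2: full table scan, row by row
        if city not in temp_table:
            temp_table[city] = 1         # no row for this city yet: insert (city, 1)
        else:
            temp_table[city] += 1        # row already exists: add 1 to num
    return sorted(temp_table.items())    # step 3: sort by city and return

result = group_count_with_temp_table(staff)
print(result)
```

The final sorted() call plays the role of the Using filesort step, and the dict plays the role of Using temporary.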

The execution diagram of this process is as follows:

How is the temporary table sorted?

The fields to be sorted are put into the sort buffer, sorted there, and then returned. Note that there are two kinds of sorting: full-field sorting and rowid sorting.

With full-field sorting, all the fields the query needs to return are put into the sort buffer, sorted by the sort field, and returned directly from the buffer. With rowid sorting, only the sort field and the row identifier are put into the sort buffer, and after sorting the remaining fields are fetched from the table by row identifier. How is the choice between the two made? It is controlled by a database parameter, max_length_for_sort_data: if the row to be sorted is longer than this value, rowid sorting is used.
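The difference between the two sorting strategies can be sketched in Python as follows. This is a simplified illustration only: the table contents, the row_length argument, and the choose_sort helper are all hypothetical, and the real server decides based on its own row-size calculation.

```python
# Hypothetical rows keyed by row id: id -> (city, name, age)
table = {
    1: ("Shenzhen", "A", 30),
    2: ("Beijing", "B", 25),
    3: ("Guangzhou", "C", 41),
}

def full_field_sort(table):
    # The sort buffer holds every column the query returns.
    sort_buffer = [row for row in table.values()]
    sort_buffer.sort(key=lambda row: row[0])    # sort by the sort field (city)
    return sort_buffer                           # rows come straight from the buffer

def rowid_sort(table):
    # The sort buffer holds only the sort field and the row id.
    sort_buffer = [(row[0], row_id) for row_id, row in table.items()]
    sort_buffer.sort()
    # Extra step: go back to the table by row id to fetch the other columns.
    return [table[row_id] for _city, row_id in sort_buffer]

def choose_sort(table, row_length, max_length_for_sort_data):
    # Assumed decision rule: rows wider than the limit use rowid sorting.
    if row_length > max_length_for_sort_data:
        return rowid_sort(table)
    return full_field_sort(table)

print(choose_sort(table, row_length=200, max_length_for_sort_data=1024))
```

Both functions return the same rows; rowid sorting uses a smaller buffer at the cost of one more lookup per row.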

If you want to learn more about sorting, you can read my article:

  • Read it once to understand: order by detailed explanation

3. The difference between where and having

  • Execution process of group by + where
  • Execution process of group by + having
  • Execution order when where, group by, and having appear together

3.1 Execution process of group by + where

Some friends may feel that the SQL in the previous section is too simple. If a where condition is added, and the column in the where condition is indexed, what does the execution process look like?

OK, let's add a condition to it, and add an index idx_age, as follows:

select city ,count(*) as num from staff where age> 30 group by city;
-- add an index
alter table staff add index idx_age (age);

Let's analyze again:

explain select city ,count(*) as num from staff where age> 30 group by city;

From the explain output we can see that the query condition hits the idx_age index, and that a temporary table and a sort are still used.

Using index condition: indicates index condition pushdown. Rows are filtered as much as possible using the index before being returned to the server layer, which then filters on any remaining conditions. Why would index condition pushdown show up for a single-column index here? Its appearance in explain does not necessarily mean that pushdown is actually used; it only means that it can be used. If you have any ideas or questions, you can add me on WeChat to discuss.

The execution flow is as follows:

  1. Create an in-memory temporary table with two fields, city and num;
  2. Scan the index tree idx_age to find a primary key ID whose age is greater than 30;
  3. Using the primary key ID, go back to the table and read the row's city value, call it 'X':
  • If the temporary table has no row with city='X', insert a record (X,1);
  • If the temporary table already has a row with city='X', add 1 to the num value of that row;
  4. Repeat steps 2 and 3 until all rows that meet the condition have been processed;
  5. Finally, sort by the field city to get the result set and return it to the client.
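The flow above can be sketched in Python like this. It is a toy model under stated assumptions: a sorted list of (age, id) pairs stands in for the idx_age index tree, a dict stands in for the table, and the sample data is invented.

```python
import bisect

# Hypothetical table: primary key id -> (age, city)
staff = {
    1: (25, "Shenzhen"), 2: (35, "Beijing"), 3: (40, "Shenzhen"),
    4: (31, "Beijing"), 5: (30, "Guangzhou"), 6: (52, "Shenzhen"),
}

# Stand-in for idx_age: (age, primary key) entries kept in age order.
idx_age = sorted((age, pk) for pk, (age, _city) in staff.items())

def group_count_where_age_gt(min_age):
    temp_table = {}                                   # step 1: temporary table
    # Step 2: seek to the first index entry with age > min_age.
    start = bisect.bisect_right(idx_age, (min_age, float("inf")))
    for _age, pk in idx_age[start:]:
        _, city = staff[pk]                           # step 3: back to the table by primary key
        temp_table[city] = temp_table.get(city, 0) + 1
    return sorted(temp_table.items())                 # step 5: sort by city

result = group_count_where_age_gt(30)
print(result)
```

Only the index entries with age greater than 30 are visited, which is why the where condition reduces the work done before grouping.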

3.2 Execution of group by + having

If you want to count the employees in each city and keep only the cities with at least 3 employees, having solves the problem nicely. The SQL can be written like this:

select city ,count(*) as num from staff  group by city having num >= 3;

The query results are as follows:

having is a filter condition on groups: it is applied to the grouped result set.

3.3 Execution order of where, group by, and having at the same time

If a SQL contains where, group by, and having clauses at the same time, what is the execution order?

For example this SQL:

select city ,count(*) as num from staff  where age> 19 group by city having num >= 3;

  1. Execute the where clause to find the employee rows whose age is greater than 19;
  2. Apply the group by clause to those rows, grouping them by city;
  3. For each city group formed by the group by clause, run the aggregate function to count the employees in the group;
  4. Finally, use the having clause to keep the city groups whose employee count is greater than or equal to 3.
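The four steps can be mirrored one by one in a small Python sketch. The data below is hypothetical and chosen so that the order of where and having visibly matters.

```python
from collections import defaultdict

# Hypothetical rows: (age, city)
staff = [
    (18, "Shenzhen"), (25, "Shenzhen"), (33, "Shenzhen"), (41, "Shenzhen"),
    (22, "Beijing"), (19, "Beijing"), (30, "Beijing"),
    (28, "Guangzhou"), (45, "Guangzhou"), (36, "Guangzhou"),
]

def query(rows):
    filtered = [row for row in rows if row[0] > 19]          # 1. where age > 19
    groups = defaultdict(list)
    for age, city in filtered:                                # 2. group by city
        groups[city].append(age)
    counted = {city: len(ages) for city, ages in groups.items()}   # 3. count(*)
    return sorted((city, num) for city, num in counted.items() if num >= 3)  # 4. having num >= 3

result = query(staff)
print(result)
```

Beijing has 3 employees in the table, but only 2 of them pass the where filter, so the having condition removes the Beijing group. If having were applied to the unfiltered counts, Beijing would wrongly be kept.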

3.4 where + having difference summary

  • The having clause filters groups after grouping, while the where clause filters individual rows.
  • having generally appears together with group by and aggregate functions such as count(), sum(), avg(), max(), min().
  • Aggregate functions cannot be used in the where clause, but they can be used in the having clause.
  • having is evaluated after group by, while where is evaluated before group by.

4. Problems with group by

The main points to note when using group by are:

  • Does group by have to be used with aggregate functions?
  • Must the group by fields appear in select?
  • Slow SQL problem caused by group by

4.1 Does group by have to be used with aggregate functions?

group by means grouped statistics. Generally, it is used together with aggregate functions such as count(), sum(), avg(), max(), min().

  • count(): number of rows
  • sum(): sum
  • avg(): average
  • max(): maximum value
  • min(): minimum value

Can it be used without an aggregate function?

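I am using MySQL 5.7 and it works. No error is reported, and what is returned is the first row of each group. Note that this depends on the ONLY_FULL_GROUP_BY SQL mode being disabled. With that mode enabled, which is the default since MySQL 5.7.5, a query like the one below is rejected, and the MySQL manual describes the row chosen for the non-grouped columns as nondeterministic, so do not rely on it.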

For example this SQL:

select city,id_card,age from staff group by  city;

The query result is:

Comparing it with the table data, what is returned is the first row of each group.

Of course, in everyday use group by is still combined with aggregate functions, except in special scenarios such as removing duplicates, where you could also use distinct instead.

4.2 Must the fields after group by appear in select?

Not necessarily. Take the following SQL:

select max(age)  from staff group by city;

The execution result is as follows:

The grouping field city does not appear after select, and no error is reported. Of course, this may vary across databases and versions, so verify it yourself before relying on it. As the saying goes, what you learn on paper always feels shallow; to truly understand something you have to do it yourself.

4.3 Slow SQL problems caused by group by

Now for the most important point: improper use of group by can easily cause slow SQL, because by default (in MySQL 5.7 and earlier; MySQL 8.0 no longer sorts group by results implicitly) it uses both a temporary table and a sort. Sometimes a disk temporary table may be used as well.

If the size of the in-memory temporary table reaches the upper limit during execution (the parameter controlling this limit is tmp_table_size), the in-memory temporary table is converted into a disk temporary table. If the amount of data is large, the disk temporary table needed by the query is likely to take up a lot of disk space.
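Here is a rough Python sketch of that conversion, just to show the idea. The TempTable class and its size accounting (key bytes plus an 8-byte counter per row) are assumptions made for illustration; the real server measures size according to its own storage format.

```python
class TempTable:
    """Toy model: starts in memory and converts to 'disk' once over the size limit."""

    def __init__(self, tmp_table_size):
        self.tmp_table_size = tmp_table_size   # limit in bytes, modelled loosely
        self.rows = {}                         # group key -> count
        self.used = 0                          # estimated bytes used so far
        self.on_disk = False

    def add(self, key):
        if key in self.rows:
            self.rows[key] += 1                # existing group: no new space needed
            return
        self.rows[key] = 1
        # Assumed row cost: the key's bytes plus an 8-byte counter.
        self.used += len(key.encode("utf-8")) + 8
        if not self.on_disk and self.used > self.tmp_table_size:
            self.on_disk = True                # the real server would copy rows to disk here

t = TempTable(tmp_table_size=32)
for city in ["Shenzhen", "Shenzhen", "Guangzhou", "Beijing"]:
    t.add(city)
print(t.rows, t.used, t.on_disk)
```

Note that the limit is hit by the number of distinct groups, not the number of rows scanned: repeated cities only increment a counter.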

These are all x factors that lead to slow SQL. Let's discuss optimization solutions together.

5. Some optimization schemes of group by

In which directions can we optimize?

  • Direction 1: Since it sorts by default, can we skip the sort?
  • Direction 2: Since the temporary table is the X factor that affects the performance of group by, can we avoid the temporary table?

Let's think about it together: why does a group by statement need a temporary table at all? The semantics of group by are to count the occurrences of each distinct value. If those values are already in order from the start, can't we simply scan and count as we go, instead of using a temporary table to record and accumulate the results?

The concrete options are:

  • Add an index on the field after group by
  • Use order by null to skip sorting
  • Try to use only in-memory temporary tables
  • Use SQL_BIG_RESULT

5.1 Add index to the field after group by

How can we make sure the values of the group by field are in order from the start? By adding an index, of course.

Let's go back to this SQL:

select city ,count(*) as num from staff where age= 19 group by city;

Its execution plan:

Now let's add a composite index idx_age_city (age, city) to it:

alter table staff add index idx_age_city(age,city);

Looking at the execution plan again, we find that neither a sort nor a temporary table is needed.

Adding a suitable index is the easiest and most effective way to optimize group by.
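Why does the index remove both the temporary table and the sort? A small Python sketch may help. It assumes a sorted list of (age, city) pairs as a stand-in for idx_age_city, with invented sample data: within one age value the city values are already ordered, so each group is a contiguous run that can be counted in a single pass.

```python
from bisect import bisect_left
from itertools import groupby

# Stand-in for idx_age_city: (age, city) entries in index order (hypothetical data).
idx_age_city = sorted([
    (19, "Shenzhen"), (19, "Beijing"), (19, "Shenzhen"),
    (25, "Beijing"), (19, "Guangzhou"), (30, "Shenzhen"),
])

def group_count_from_index(index, age):
    # Seek to the contiguous range of entries with the requested age.
    lo = bisect_left(index, (age,))
    hi = bisect_left(index, (age + 1,))
    # Inside that range the cities are already sorted, so equal cities are adjacent.
    cities = (city for _age, city in index[lo:hi])
    # Count each run as it streams past: no temporary table, no extra sort.
    return [(city, sum(1 for _ in run)) for city, run in groupby(cities)]

result = group_count_from_index(idx_age_city, 19)
print(result)
```

This only works because the where condition is an equality on the leading index column; with a range condition on age, the city values in the scanned range would no longer be globally ordered.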

5.2 order by null without sorting

Not all scenarios are suitable for indexing. If we encounter a scenario that is not suitable for creating an index, how can we optimize it?

If your needs do not require sorting the result set, you can use order by null.

select city ,count(*) as num from staff group by city order by null;

The execution plan is as follows; Using filesort no longer appears.

5.3 Try to use only in-memory temporary tables

If group by does not have much data to count, we can try to keep the work in an in-memory temporary table, because if the data does not fit in memory during the group by, falling back to a disk temporary table is time-consuming. In that case the tmp_table_size parameter can be increased appropriately to avoid disk temporary tables.

5.4 Optimizing with SQL_BIG_RESULT

What if the amount of data really is too large? We can't increase tmp_table_size indefinitely. But it is also wasteful to let the data go into the in-memory temporary table first, and convert it to a disk temporary table only once the inserts hit the upper limit. That's not very smart.

Therefore, if the data volume is expected to be large, we can use the SQL_BIG_RESULT hint to tell the optimizer to go straight to a disk temporary table. The MySQL optimizer then sees that a disk temporary table is stored as a B+ tree, which is less storage-efficient than an array, so it stores the data directly in an array instead.

An example SQL is as follows:

select SQL_BIG_RESULT city ,count(*) as num from staff group by city;

As you can see from the Extra field of the execution plan, the execution does not use a temporary table, only a sort.

The execution flow is as follows:

  1. Initialize sort_buffer, with city as the field to be stored in it;
  2. Scan the table staff, take out the city value of each row in turn, and store it in sort_buffer;
  3. After the scan is complete, sort sort_buffer by the city field;
  4. When sorting is complete, we have an ordered array;
  5. Walk the ordered array and count the number of occurrences of each value.
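The five steps above map onto a short Python sketch. As before this is only a toy model with made-up data, where a plain list stands in for sort_buffer.

```python
# Hypothetical city values as they would be read during the table scan.
staff_cities = ["Shenzhen", "Beijing", "Shenzhen", "Guangzhou", "Beijing", "Shenzhen"]

def group_count_big_result(cities):
    sort_buffer = []                       # 1. initialize sort_buffer for the city field
    for city in cities:                    # 2. scan the table and store each city value
        sort_buffer.append(city)
    sort_buffer.sort()                     # 3-4. sort to get an ordered array
    result = []
    for city in sort_buffer:               # 5. one pass over the ordered array
        if result and result[-1][0] == city:
            result[-1][1] += 1             # same value as the previous entry: count it
        else:
            result.append([city, 1])       # a new value starts a new group
    return [tuple(row) for row in result]

result = group_count_big_result(staff_cities)
print(result)
```

Compared with the temporary table approach, every row goes into the buffer (not one entry per distinct city), but no lookup structure is needed and the counting is a simple sequential pass.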

6. How to optimize a production slow SQL

I recently ran into a slow SQL statement in production that was related to group by. Let me show it to you and see how it could be optimized.

The table structure is as follows:

CREATE TABLE `staff` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `id_card` varchar(20) NOT NULL COMMENT 'ID card number',
  `name` varchar(64) NOT NULL COMMENT 'name',
  `status` varchar(64) NOT NULL COMMENT 'Y-activated I-initialized D-deleted R-under review',
  `age` int(4) NOT NULL COMMENT 'age',
  `city` varchar(64) NOT NULL COMMENT 'city',
  `enterprise_no` varchar(64) NOT NULL COMMENT 'enterprise number',
  `legal_cert_no` varchar(64) NOT NULL COMMENT 'legal representative number',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';

The SQL for the query is this:

select * from t1 where status = #{status} group by #{legal_cert_no}

Let's set aside whether this SQL itself is reasonable. Given an SQL like this, how would you optimize it? Friends who have ideas can leave a comment to discuss, or add me on WeChat and join the group discussion. If you think something in the article is wrong, please point it out too. Let's make progress together!


Origin blog.csdn.net/wdjnb/article/details/124403974