Understand it in one read: a detailed explanation of database group by

Foreword

Hello everyone, I am programmer Xiaohui.

In daily development we often use group by. Do you know how group by works? What is the difference between combining group by with where and with having? What are the ideas for optimizing group by? What should you watch out for when using it? This article works through these questions with you.

  • Simple example of using group by

  • How group by works

  • The difference between group by + where and having

  • group by optimization ideas

  • Notes on using group by

  • How to optimize a production slow SQL

1. A simple example of using group by

group by is generally used for grouped statistics. The logic it expresses is "divide the rows into groups according to a given rule". Let's start with a simple example and review it together.

Assuming an employee table is used, the table structure is as follows:

CREATE TABLE `staff` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `id_card` varchar(20) NOT NULL COMMENT 'ID card number',
  `name` varchar(64) NOT NULL COMMENT 'name',
  `age` int(4) NOT NULL COMMENT 'age',
  `city` varchar(64) NOT NULL COMMENT 'city',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';

The existing data in the table is as follows:

[Image: existing rows of the staff table]

Suppose we have this requirement: count the number of employees in each city. The corresponding SQL statement can be written like this:

select city ,count(*) as num from staff group by city;

The execution results are as follows:

[Image: query result, number of employees per city]

The logic of this SQL statement is very clear, but what is its underlying execution process?

2. How group by works

2.1 explain analysis

Let's first look at the execution plan with explain:

explain select city ,count(*) as num from staff group by city;

[Image: explain output for the query]

  • Using temporary in the Extra field indicates that a temporary table is used when performing the grouping

  • Using filesort in the Extra field indicates that sorting is used

How does group by use the temporary table and the sort? Let's look at the execution flow of this SQL.

2.2 Simple execution process of group by

explain select city ,count(*) as num from staff group by city;

Let's take a look at the execution process of this SQL.

  1. Create an in-memory temporary table with two fields, city and num;

  2. Do a full table scan of staff and take out each record in turn; call its city value 'X':

  • Check whether the temporary table has a row with city='X'; if not, insert a record (X,1);

  • If the temporary table already has a row with city='X', add 1 to the num value of that row;

  3. After the traversal is complete, sort by the city field and return the result set to the client.

The execution diagram of this process is as follows:

[Image: execution diagram of group by using a temporary table and sorting]
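The three steps above can be sketched in a few lines of Python. This is only a rough model of the flow, not MySQL's actual implementation: the temporary table is modeled as a dict, and the sample rows are hypothetical because the real table contents are not reproduced here.

```python
# Hypothetical sample rows (id, city); not the article's real data.
STAFF = [
    (1, "Shenzhen"),
    (2, "Beijing"),
    (3, "Shenzhen"),
    (4, "Guangzhou"),
    (5, "Beijing"),
    (6, "Shenzhen"),
]


def group_by_city(rows):
    """Model of: select city, count(*) as num from staff group by city."""
    temp_table = {}  # step 1: in-memory temporary table, city -> num
    for _id, city in rows:  # step 2: full table scan, row by row
        if city not in temp_table:
            temp_table[city] = 1  # no row for this city yet: insert (city, 1)
        else:
            temp_table[city] += 1  # row exists: add 1 to its num
    # step 3: sort by city, then return the result set
    return sorted(temp_table.items())


result = group_by_city(STAFF)
```

The temporary table only ever holds one row per distinct city, which is why its size depends on the number of groups rather than on the number of rows scanned.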

How is the temporary table sorted?

The fields that need to be sorted are put into the sort buffer, and the result is returned after sorting. Note that sorting comes in two forms, full-field sorting and rowid sorting.

  • With full-field sorting, all the fields the query needs to return are put into the sort buffer, sorted by the sort fields, and returned directly

  • With rowid sorting, only the fields that need to be sorted are put into the sort buffer; after sorting, one more lookup back to the table is needed before returning

  • How is it decided whether to use full-field sorting or rowid sorting? It is controlled by a database parameter, max_length_for_sort_data
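To make the two sorting modes concrete, here is a minimal sketch in Python. It is a simplified model under stated assumptions: the table is a dict keyed by row id, the rowid variant keeps the row id next to the sort field so it can look the row up again, and the decision rule in choose_sort (rows wider than max_length_for_sort_data use rowid sorting) is an assumed reading of the parameter, not a copy of MySQL's code. All names and rows are hypothetical.

```python
# Hypothetical rows keyed by row id: id -> (city, num)
TABLE = {
    1: ("Shenzhen", 3),
    2: ("Beijing", 2),
    3: ("Guangzhou", 1),
}


def full_field_sort(table):
    # The sort buffer holds every field the query returns.
    sort_buffer = [(city, num) for city, num in table.values()]
    sort_buffer.sort(key=lambda row: row[0])  # sort by city
    return sort_buffer  # rows are returned straight from the buffer


def rowid_sort(table):
    # The sort buffer holds only the sort field plus the row id (assumption).
    sort_buffer = [(city, row_id) for row_id, (city, _num) in table.items()]
    sort_buffer.sort(key=lambda entry: entry[0])
    # One extra lookup per row to fetch the remaining fields.
    return [table[row_id] for _city, row_id in sort_buffer]


def choose_sort(table, row_length, max_length_for_sort_data):
    # Assumed rule: rows wider than the limit fall back to rowid sorting.
    if row_length > max_length_for_sort_data:
        return rowid_sort(table)
    return full_field_sort(table)
```

Both paths return the same rows; the trade-off is a larger sort buffer for full-field sorting versus an extra table lookup per row for rowid sorting.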

3. The difference between where and having

  • Execution flow of group by + where

  • Execution flow of group by + having

  • Execution order when where, group by, and having appear together

3.1 Execution process of group by + where

Some readers may find the SQL in the previous section too simple. If a where condition is added, and the column in the where condition is indexed, what does the execution process look like?

OK, let's add a condition to it and add an index idx_age, as follows:

select city ,count(*) as num from staff where age> 30 group by city;
-- add the index
alter table staff add index idx_age (age);

Let's explain again:

explain select city ,count(*) as num from staff where age> 30 group by city;

[Image: explain output for the query with where age > 30]

From the explain result, you can see that the query condition hits the idx_age index, and that a temporary table and sorting are still used.

Using index condition: indicates index condition pushdown, which filters out as much data as possible using the index before handing rows back to the server layer to be filtered by the remaining where conditions. Why would there be index pushdown with a single-column index here? Its appearance in explain does not mean that index pushdown is actually used; it only means that it can be used. If you have ideas or questions about this, you can add me on WeChat to discuss.

The execution process is as follows:

  1. Create an in-memory temporary table with two fields, city and num;

  2. Scan the index tree idx_age to find the primary key ID of a row with age greater than 30;

  3. Using the primary key ID, go back to the table and read the row's city value, call it 'X':

  • Check whether the temporary table has a row with city='X'; if not, insert a record (X,1);

  • If the temporary table already has a row with city='X', add 1 to the num value of that row;

  4. Repeat steps 2 and 3 until all rows that meet the condition have been found;

  5. Finally, sort by the city field and return the result set to the client.
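Here is a small Python sketch of this flow, under some loud assumptions: the idx_age index is modeled as a sorted list of (age, id) pairs, the table as a dict keyed by primary key, and the sample rows are hypothetical. It only illustrates the order of the steps, not how InnoDB actually stores or scans a B+ tree.

```python
from bisect import bisect_right

# Hypothetical rows: primary key id -> (age, city)
STAFF = {
    1: (25, "Shenzhen"),
    2: (35, "Beijing"),
    3: (41, "Shenzhen"),
    4: (30, "Guangzhou"),
    5: (52, "Beijing"),
    6: (33, "Shenzhen"),
}

# idx_age modeled as (age, id) pairs kept in age order.
IDX_AGE = sorted((age, row_id) for row_id, (age, _city) in STAFF.items())


def group_by_city_where_age_gt(min_age):
    temp_table = {}  # step 1: in-memory temporary table, city -> num
    # step 2: find the first index entry with age > min_age
    start = bisect_right(IDX_AGE, (min_age, float("inf")))
    for _age, row_id in IDX_AGE[start:]:  # step 4: keep walking the index
        _, city = STAFF[row_id]  # step 3: back to the table by primary key
        temp_table[city] = temp_table.get(city, 0) + 1
    return sorted(temp_table.items())  # step 5: sort by city
```

Compared with the full table scan, only the rows whose index entries satisfy the condition are looked up in the table; the temporary table and the final sort are unchanged.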

3.2 Execution of group by + having

If you want to query the number of employees in each city and keep only the cities with at least 3 employees, having solves the problem nicely. The SQL can be written like this:

select city ,count(*) as num from staff  group by city having num >= 3;

The query results are as follows:

[Image: query result, cities with at least 3 employees]

having is known as the group filter condition; it operates on the grouped result set.

3.3 Execution order of where, group by, and having

If a SQL statement contains where, group by, and having clauses at the same time, what is the order of execution?

For example this SQL:

select city ,count(*) as num from staff  where age> 19 group by city having num >= 3;

  1. Execute the where clause to find the employee rows whose age is greater than 19;

  2. Apply the group by clause to those rows, grouping them by city;

  3. For the city groups formed by the group by clause, run the aggregate function to calculate the number of employees in each group;

  4. Finally, use the having clause to keep the city groups whose number of employees is greater than or equal to 3.
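The same order can be written out as a small Python pipeline. This is just an illustration of the logical order of the clauses, with hypothetical sample rows and a hypothetical function name; it says nothing about how the database physically executes the query.

```python
# Hypothetical rows: (city, age)
STAFF = [
    ("Shenzhen", 25), ("Shenzhen", 31), ("Shenzhen", 18), ("Shenzhen", 40),
    ("Beijing", 22), ("Beijing", 19), ("Beijing", 45),
    ("Guangzhou", 28), ("Guangzhou", 33), ("Guangzhou", 36),
]


def cities_with_enough_staff(rows, min_age=19, min_num=3):
    # 1. where: keep rows with age > min_age
    filtered = [(city, age) for city, age in rows if age > min_age]
    # 2. group by: bucket the remaining rows by city
    groups = {}
    for city, age in filtered:
        groups.setdefault(city, []).append(age)
    # 3. aggregate: count(*) for each group
    counts = {city: len(ages) for city, ages in groups.items()}
    # 4. having: keep groups with num >= min_num
    return {city: num for city, num in counts.items() if num >= min_num}
```

Note that Beijing has three rows in the sample data but only two survive the where step, so it is dropped by having: the filter on rows runs before the filter on groups.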

3.4 Summary of the differences between where and having

  • The having clause filters after grouping, whereas the where clause filters individual rows

  • having is usually used together with group by and aggregate functions such as count(), sum(), avg(), max(), min()

  • Aggregate functions cannot be used in a where clause, whereas they can be used in a having clause

  • having can only be used after group by, whereas where is executed before group by
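If you want to try these rules without a MySQL instance, the following sketch uses Python's built-in sqlite3 module. Please note the assumption: SQLite is only a stand-in here, its error messages and dialect differ from MySQL, and the table is a cut-down hypothetical version of staff with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table staff (id integer primary key, age integer, city text)")
conn.executemany(
    "insert into staff (age, city) values (?, ?)",
    [(25, "Shenzhen"), (31, "Shenzhen"), (40, "Shenzhen"), (22, "Beijing")],
)

# where filters rows before grouping; having filters groups after aggregation.
having_rows = conn.execute(
    "select city, count(*) as num from staff "
    "where age > 19 group by city having count(*) >= 3"
).fetchall()

# An aggregate function inside where is rejected by the database.
try:
    conn.execute("select city from staff where count(*) >= 3 group by city")
    aggregate_in_where_ok = True
except sqlite3.Error:
    aggregate_in_where_ok = False
```

Here having_rows contains only the Shenzhen group, and aggregate_in_where_ok ends up False because the second statement fails before returning any rows.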

4. Problems to be noticed when using group by

There are several points to note when using group by:

  • Must group by be used together with aggregate functions?

  • Must the group by fields appear in the select list?

  • Slow SQL problems caused by group by

4.1 Must group by be used with aggregation functions?

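group by means grouped statistics, and it is generally used together with aggregate functions such as count(), sum(), avg(), max(), min().

  • count(): number of rows

  • sum(): sum

  • avg(): average

  • max(): maximum value

  • min(): minimum value

Can it be used without an aggregate function?

I use MySQL 5.7, where it is possible: no error is reported, and what is returned is the first row of each group. Note that this only works when ONLY_FULL_GROUP_BY is not enabled in sql_mode (it is part of the default sql_mode from MySQL 5.7.5 onward, in which case the statement is rejected), and MySQL does not guarantee which row of a group the non-aggregated values come from.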

For example this SQL:

select city,id_card,age from staff group by  city;

The query result is:

[Image: result of the query without an aggregate function]

Comparing with the full table data, what is returned is the first row of each group:

[Image: comparison with the full table data]

Of course, in normal use group by is still paired with aggregate functions, except in some special scenarios such as deduplication, for which distinct can of course also be used.
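As a rough model of the behaviour observed above, here is a Python sketch that keeps the first row it meets for each city. The "first row wins" rule is an assumption that matches the result in this test only; MySQL does not promise which row it picks. The sample rows and function names are hypothetical.

```python
# Hypothetical rows in scan order: (id, city, id_card, age)
STAFF = [
    (1, "Shenzhen", "A001", 25),
    (2, "Beijing", "A002", 31),
    (3, "Shenzhen", "A003", 40),
    (4, "Beijing", "A004", 22),
]


def first_row_per_city(rows):
    picked = {}  # city -> (city, id_card, age)
    for _id, city, id_card, age in rows:
        # Keep only the first row met for each city; later rows are ignored.
        picked.setdefault(city, (city, id_card, age))
    return list(picked.values())


def distinct_cities(rows):
    # Deduplication: the set of values that grouping by city alone yields.
    return sorted({city for _id, city, _id_card, _age in rows})
```

The second function shows the deduplication scenario: when only the grouping column is selected, the result is simply the distinct values of that column.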

4.2 Must the fields after group by appear in the select list?

Not necessarily. Take the following SQL:

select max(age)  from staff group by city;

The execution results are as follows:

[Image: query result, max(age) per city]

The grouping field city is not in the select list, and no error is reported. Of course, this may depend on the database and its version, so verify it first when you use it. As the saying goes, what you learn on paper always feels shallow; to truly understand something you must do it yourself.

4.3 Slow SQL problems caused by group by

Now comes the most important point to watch out for. Used improperly, group by can easily lead to slow SQL, because by default it uses both a temporary table and sorting. Sometimes a disk temporary table may also be used.

  • If, during execution, the in-memory temporary table reaches its size limit (the parameter that controls this limit is tmp_table_size), the in-memory temporary table is converted into a disk temporary table.

  • If the amount of data is large, the disk temporary table required by the query is likely to occupy a large amount of disk space.

These are the factors that can lead to slow SQL. Let's discuss the optimization options together.

5. Some optimization schemes for group by

In which directions can we optimize?

  • Direction 1: since it sorts by default, can we simply not sort?

  • Direction 2: since the temporary table is a factor that hurts group by performance, can we avoid using the temporary table?

Let's think about it together: why does executing a group by statement need a temporary table? The semantics of group by is to count the occurrences of each distinct value. If those values were ordered from the start, couldn't we just scan downward and count directly, with no need for a temporary table to record and accumulate the results?

This gives the following optimization options:

  • Index the fields after group by

  • Use order by null to skip sorting

  • Try to use only in-memory temporary tables

  • Use SQL_BIG_RESULT

5.1 Index the fields after group by

How can we ensure that the values of the fields after group by are ordered from the start? By adding an index, of course.

Let's go back to this SQL:

select city ,count(*) as num from staff where age= 19 group by city;

Its execution plan:

[Image: execution plan before adding the composite index]

If we add a composite index idx_age_city(age,city) to it:

alter table staff add index idx_age_city(age,city);

Looking at the execution plan again, we find that neither sorting nor a temporary table is needed:

[Image: execution plan after adding idx_age_city]

Adding a suitable index is the simplest and most effective way to optimize group by.
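Why does ordered input remove the need for a temporary table? A minimal Python sketch may help. It assumes the entries of idx_age_city for age = 19 arrive already ordered by city (which follows from the index being sorted by age and then city); the entries themselves are hypothetical.

```python
from itertools import groupby

# Hypothetical entries of idx_age_city for age = 19, already in city order
# because the index is sorted by (age, city).
INDEX_ENTRIES = [
    (19, "Beijing"),
    (19, "Beijing"),
    (19, "Guangzhou"),
    (19, "Shenzhen"),
    (19, "Shenzhen"),
]


def count_sorted(entries):
    # One pass over the ordered entries: equal cities are adjacent, so only
    # the counter of the current city is needed, and no sort afterwards.
    return [
        (city, sum(1 for _ in group))
        for city, group in groupby(entries, key=lambda entry: entry[1])
    ]
```

Because equal values sit next to each other, each group can be emitted as soon as the city changes, and the output is already in city order.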

5.2 Use order by null to avoid sorting

Not every scenario is suitable for adding an index. If we run into one that is not, how can we optimize?

If your requirement does not need the result set to be sorted, you can use order by null. (This applies to MySQL 5.7 and earlier, where group by sorts implicitly; in MySQL 8.0 the implicit sort no longer happens.)

select city ,count(*) as num from staff group by city order by null

The execution plan is as follows; there is no more filesort:

[Image: execution plan with order by null, no Using filesort]

5.3 Try to use only in-memory temporary tables

If group by does not have much data to aggregate, we can try to use only an in-memory temporary table. The time-consuming case is when the in-memory temporary table cannot hold the data during the group by and a disk temporary table has to be used. Therefore, the tmp_table_size parameter can be increased appropriately to avoid using a disk temporary table.

5.4 Optimizing with SQL_BIG_RESULT

What if the amount of data really is too large? tmp_table_size cannot be raised indefinitely. But we also should not just watch the data go into the in-memory temporary table first, only to be converted into a disk temporary table once the inserts hit the upper limit. That is not very smart.

Therefore, if the data volume is expected to be large, we can use the SQL_BIG_RESULT hint to tell the optimizer to use a disk temporary table directly. The MySQL optimizer, however, sees that a disk temporary table is stored as a B+ tree, which is less storage-efficient than an array, so it stores the data directly in an array instead.

An example SQL is as follows:

select SQL_BIG_RESULT city ,count(*) as num from staff group by city;

As you can see in the Extra field of the execution plan, the execution no longer uses a temporary table, only sorting:

[Image: execution plan with SQL_BIG_RESULT, Using filesort only]

The execution process is as follows:

  1. Initialize the sort_buffer, which will hold the city field;

  2. Scan the staff table, take out the city value of each row in turn, and store it in the sort_buffer;

  3. After the scan, sort the sort_buffer by the city field;

  4. After sorting, an ordered array is obtained;

  5. Walk the sorted array and count the number of occurrences of each value.
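These five steps can be sketched in Python as follows. It is a simplified model: the sort_buffer is a plain list, the city values are hypothetical, and the case where the buffer is too small and has to spill to disk is ignored.

```python
# Hypothetical city values in table-scan order.
CITIES = ["Shenzhen", "Beijing", "Shenzhen", "Guangzhou", "Beijing", "Shenzhen"]


def count_via_sort_buffer(cities):
    sort_buffer = []  # step 1: buffer that holds only the city field
    for city in cities:  # step 2: scan the table, store each city value
        sort_buffer.append(city)
    sort_buffer.sort()  # steps 3 and 4: sort to get an ordered array
    result = []  # step 5: count runs of equal values in the ordered array
    for city in sort_buffer:
        if result and result[-1][0] == city:
            result[-1] = (city, result[-1][1] + 1)
        else:
            result.append((city, 1))
    return result
```

Unlike the temporary-table approach, nothing is looked up or updated per row during the scan; all the counting happens in one pass after the sort.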

6. How to optimize a production slow SQL

Recently, I encountered a production slow SQL related to group by. Let me show you how to optimize it.

The table structure is as follows:

CREATE TABLE `staff` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `id_card` varchar(20) NOT NULL COMMENT 'ID card number',
  `name` varchar(64) NOT NULL COMMENT 'name',
  `status` varchar(64) NOT NULL COMMENT 'Y-activated I-initialized D-deleted R-under review',
  `age` int(4) NOT NULL COMMENT 'age',
  `city` varchar(64) NOT NULL COMMENT 'city',
  `enterprise_no` varchar(64) NOT NULL COMMENT 'enterprise number',
  `legal_cert_no` varchar(64) NOT NULL COMMENT 'legal representative number',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';

The SQL for the query is this:

select * from t1 where status = #{status} group by #{legal_cert_no}

Let's set aside whether this SQL is reasonable in the first place. Given such a SQL, how would you optimize it? If you have ideas, leave a comment to discuss, or add me on WeChat and join the group. If you think something in the article is wrong, please point it out too. Let's make progress together!


Origin blog.csdn.net/bjweimengshu/article/details/131842528