Foreword
Hello everyone, I am programmer Xiaohui.
In daily development, we often use group by. Dear friends, do you know how group by works? What is the difference between group by and having? What are your ideas for optimizing group by? What problems need attention when using group by? This article studies group by together with you and conquers it.
Simple example of using group by
How group by works
The difference between group by + where and having
group by optimization ideas
Notes on using group by
How to optimize a production slow SQL
1. A simple example of using group by
group by is generally used for grouped statistics; the logic it expresses is to group rows according to certain rules. Let's start with a simple example and review it together.
Assuming an employee table is used, the table structure is as follows:
CREATE TABLE `staff` (
`id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
`id_card` varchar(20) NOT NULL COMMENT 'ID card number',
`name` varchar(64) NOT NULL COMMENT 'name',
`age` int(4) NOT NULL COMMENT 'age',
`city` varchar(64) NOT NULL COMMENT 'city',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';
The existing data in the table is as follows:
We now have this requirement: count the number of employees in each city. The corresponding SQL statement can be written like this:
select city ,count(*) as num from staff group by city;
The execution results are as follows:
The logic of this SQL statement is very clear, but what is its underlying execution process?
2. How group by works
2.1 explain analysis
Let's first use explain to look at the execution plan:
explain select city ,count(*) as num from staff group by city;
Using temporary in the Extra field indicates that a temporary table is used when performing the grouping.
Using filesort in the Extra field indicates that sorting is used.
How does group by use the temporary table and sorting? Let's take a look at the execution flow of this SQL.
2.2 Simple execution process of group by
explain select city ,count(*) as num from staff group by city;
Let's take a look at the execution process of this SQL.
Create an in-memory temporary table with two fields, city and num;
Scan the whole staff table and take out the records one by one; call the city value of the current record 'X';
Determine whether there is a row with city='X' in the temporary table; if not, insert a record (X,1);
If there is a row with city='X' in the temporary table, add 1 to the num value of that row;
After the traversal is completed, sort by the field city and return the result set to the client.
The execution diagram of this process is as follows:
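The same steps can also be sketched in code. This is only a toy model of the flow described above, not MySQL's actual implementation: the sample rows are made up, and a Python dict stands in for the in-memory temporary table.

```python
# Toy model of the temporary-table flow; the sample rows are invented.
staff = [
    {"id": 1, "city": "Shenzhen"},
    {"id": 2, "city": "Beijing"},
    {"id": 3, "city": "Shenzhen"},
    {"id": 4, "city": "Hangzhou"},
    {"id": 5, "city": "Beijing"},
    {"id": 6, "city": "Shenzhen"},
]


def group_by_city(rows):
    temp_table = {}                    # stands in for the temporary table (city -> num)
    for row in rows:                   # full table scan
        city = row["city"]
        if city not in temp_table:
            temp_table[city] = 1       # no row for this city yet: insert (X, 1)
        else:
            temp_table[city] += 1      # row exists: add 1 to num
    return sorted(temp_table.items())  # sort by city before returning


result = group_by_city(staff)
print(result)
```

Both costs that explain reports are visible here: the dict grows with the number of distinct cities (Using temporary), and the final sorted() call is the extra sort (Using filesort).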
How is the temporary table sorted?
The fields that need to be sorted are put into the sort buffer, sorted there, and then returned. Note that sorting is divided into full-field sorting and rowid sorting.
With full-field sorting, all the fields that the query needs to return are put into the sort buffer, sorted by the sort field, and returned directly.
With rowid sorting, only the fields that need to be sorted are put into the sort buffer; after sorting, the rows are looked up in the table one more time and then returned.
How is it decided whether to use full-field sorting or rowid sorting? It is controlled by a database parameter, max_length_for_sort_data.
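To make the two strategies concrete, here is a minimal sketch under simplifying assumptions: the table is a dict keyed by primary key, "row length" is approximated by the character count of the returned values, and the threshold constant merely plays the role of max_length_for_sort_data. The function names and sample rows are hypothetical.

```python
# Hypothetical threshold playing the role of max_length_for_sort_data.
MAX_LENGTH_FOR_SORT_DATA = 32

# Made-up rows, keyed by primary key.
table = {
    1: {"city": "Shenzhen", "name": "Li"},
    2: {"city": "Beijing", "name": "Wang"},
    3: {"city": "Hangzhou", "name": "Zhao"},
}


def row_length(row, fields):
    # Simplification: approximate the row length by the length of its values.
    return sum(len(str(row[f])) for f in fields)


def sort_rows(table, sort_field, return_fields, max_length=MAX_LENGTH_FOR_SORT_DATA):
    longest = max(row_length(row, return_fields) for row in table.values())
    if longest <= max_length:
        # Full-field sort: the buffer holds every field the query returns.
        sort_buffer = [
            (row[sort_field], {f: row[f] for f in return_fields})
            for row in table.values()
        ]
        sort_buffer.sort(key=lambda entry: entry[0])
        return "full-field", [fields for _, fields in sort_buffer]
    # Rowid sort: the buffer holds only the sort key and the row id ...
    sort_buffer = [(row[sort_field], rowid) for rowid, row in table.items()]
    sort_buffer.sort()
    # ... so each row has to be fetched from the table one more time.
    return "rowid", [{f: table[rowid][f] for f in return_fields} for _, rowid in sort_buffer]


mode_a, rows_a = sort_rows(table, "city", ["city", "name"])
mode_b, rows_b = sort_rows(table, "city", ["city", "name"], max_length=8)
print(mode_a, mode_b, rows_a == rows_b)
```

Both strategies return the same rows in the same order; the difference is how much data sits in the sort buffer and whether a second table lookup is needed.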
3. The difference between where and having
Execution flow of group by + where
Execution flow of group by + having
Execution order when where, group by, and having appear together
3.1 Execution process of group by + where
Some friends think that the SQL in the previous section is too simple. If a where condition is added and the where column is indexed, what does the execution process look like?
OK, let's add a condition to it and add an idx_age index, as follows:
select city ,count(*) as num from staff where age> 30 group by city;
-- add the index
alter table staff add index idx_age (age);
Let's explain again:
explain select city ,count(*) as num from staff where age> 30 group by city;
From the explain execution plan, you can see that the query condition hits the idx_age index, and that a temporary table and sorting are used.
Using index condition: indicates index condition pushdown, which filters data as much as possible in the index before returning rows to the server layer to be filtered by where and other conditions. Why would index pushdown appear for a single-column index here? Its appearance in explain does not mean that index pushdown is actually used, only that it can be used.
The execution process is as follows:
Create an in-memory temporary table with two fields, city and num;
Scan the index tree idx_age to find the primary key IDs of rows with age greater than 30;
Through the primary key ID, go back to the table and read the row's city value; call it 'X';
Determine whether there is a row with city='X' in the temporary table; if not, insert a record (X,1);
If there is a row with city='X' in the temporary table, add 1 to the num value of that row;
Keep repeating the index scan and the table lookup until all rows that meet the condition have been processed;
Finally, sort by the field city, get the result set, and return it to the client.
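A rough sketch of this flow, again as a toy model rather than InnoDB's real data structures: a sorted list of (age, primary key) pairs stands in for the idx_age index tree, a dict stands in for the clustered index, and the sample rows are invented.

```python
from bisect import bisect_right

# Made-up rows, keyed by primary key (stands in for the clustered index).
table = {
    1: {"city": "Shenzhen", "age": 35},
    2: {"city": "Beijing", "age": 28},
    3: {"city": "Shenzhen", "age": 41},
    4: {"city": "Hangzhou", "age": 30},
    5: {"city": "Beijing", "age": 33},
    6: {"city": "Shenzhen", "age": 22},
}

# Stand-in for idx_age: entries ordered by (age, primary key).
idx_age = sorted((row["age"], pk) for pk, row in table.items())


def group_by_city_where_age_gt(min_age):
    temp_table = {}
    # Locate the first index entry with age > min_age.
    start = bisect_right(idx_age, (min_age, float("inf")))
    for _age, pk in idx_age[start:]:    # scan the index range
        city = table[pk]["city"]        # go back to the table by primary key
        temp_table[city] = temp_table.get(city, 0) + 1
    return sorted(temp_table.items())   # final sort by city


result = group_by_city_where_age_gt(30)
print(result)
```

The index only narrows down which rows are read. The temporary table and the final sort are still there, which matches the explain output for this statement.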
3.2 Execution of group by + having
If you want to query the number of employees in each city and keep only the cities with no fewer than 3 employees, having solves the problem nicely. The SQL can be written like this:
select city ,count(*) as num from staff group by city having num >= 3;
The query results are as follows:
having is called the grouping filter; it operates on the returned result set.
3.3 Execution order of where, group by, and having
If a SQL statement contains where, group by, and having clauses at the same time, what is the order of execution?
For example this SQL:
select city ,count(*) as num from staff where age> 19 group by city having num >= 3;
Execute the where clause to find the employee data whose age is greater than 19;
Apply the group by clause to that employee data, grouping it by city;
For the city groups formed by the group by clause, run the aggregate function to calculate the number of employees in each group;
Finally, use the having clause to select the city groups whose number of employees is greater than or equal to 3.
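The order of these stages can be mimicked with a short pipeline. This is only an illustration of the logical order described above, using invented sample data; it says nothing about how MySQL physically executes the statement.

```python
# Made-up (city, age) rows.
staff = [
    ("Shenzhen", 25), ("Shenzhen", 31), ("Shenzhen", 18), ("Shenzhen", 40),
    ("Beijing", 22), ("Beijing", 35), ("Beijing", 29),
    ("Hangzhou", 45), ("Hangzhou", 19),
]


def where_group_having(rows, min_age, min_num):
    # 1. where: filter individual rows first
    filtered = [(city, age) for city, age in rows if age > min_age]
    # 2. group by + 3. aggregate: count the rows of each city group
    groups = {}
    for city, _age in filtered:
        groups[city] = groups.get(city, 0) + 1
    # 4. having: filter whole groups by their aggregated value
    return {city: num for city, num in groups.items() if num >= min_num}


result = where_group_having(staff, min_age=19, min_num=3)
print(result)
```

Because where runs first, the 18-year-old in Shenzhen is dropped before counting, so Shenzhen ends up with 3 employees rather than 4.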
3.4 Summary of the differences between where and having
The having clause is used for filtering after grouping, while the where clause is used for filtering individual rows.
having is usually combined with group by and aggregate functions such as count(), sum(), avg(), max(), min().
Aggregate functions cannot be used in a where clause, whereas they can be used in a having clause.
having can only be used after group by; where executes before group by.
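These rules are easy to try out. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in, so the error behavior shown is SQLite's, not MySQL's; the simplified table and its rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table staff (id integer primary key, city text, age integer)")
conn.executemany(
    "insert into staff (city, age) values (?, ?)",
    [
        ("Shenzhen", 25), ("Shenzhen", 31), ("Shenzhen", 40),
        ("Beijing", 22), ("Beijing", 35), ("Beijing", 29),
        ("Hangzhou", 45), ("Hangzhou", 19),
    ],
)

# where filters rows before grouping; having filters groups after aggregation.
having_rows = conn.execute(
    "select city, count(*) as num from staff "
    "where age > 19 group by city having count(*) >= 3 order by city"
).fetchall()

# An aggregate function inside where is rejected.
where_error = None
try:
    conn.execute("select city from staff where count(*) >= 3 group by city")
except sqlite3.Error as exc:
    where_error = str(exc)

print(having_rows)
print(where_error)
conn.close()
```

The first query succeeds because count(*) appears in having, where the groups already exist. The second fails because where is evaluated row by row, before any group has been formed.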
4. Problems to be noticed when using group by
There are several points to note when using group by:
Must group by be used together with aggregate functions?
Must the group by field appear in the select?
Slow SQL problems caused by group by
4.1 Must group by be used with aggregation functions?
group by means grouped statistics, and it is generally used together with aggregate functions such as count(), sum(), avg(), max(), min().
count(): number of rows
sum(): sum
avg(): average
max(): maximum value
min(): minimum value
Can it be used without an aggregate function?
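I use MySQL 5.7, and there it is possible: no error is reported, and what comes back is the first row of each group. Note that this depends on sql_mode: with ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5) the statement is rejected, and when it is allowed, MySQL does not guarantee which row of each group supplies the non-grouped columns.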
For example this SQL:
select city,id_card,age from staff group by city;
The query result is
Comparing with the table data, what is returned is the first row of each group.
Of course, in everyday use group by is still paired with aggregate functions, unless there is some special scenario, such as deduplication. For that, of course, distinct also works.
4.2 Must the fields after group by appear in the select?
Not necessarily, such as the following SQL:
select max(age) from staff group by city;
The execution results are as follows:
The grouping field city is not in the select list, and no error is reported. Of course, this may depend on the database and its version, so verify it first when you use it. As the saying goes, what you learn on paper always feels shallow; to truly understand something, you have to do it yourself.
4.3 Slow SQL problems caused by group by
Now comes the most important point. Improper use of group by can easily lead to slow SQL, because by default it uses both a temporary table and sorting (on the MySQL 5.7 used in this article; MySQL 8.0 no longer sorts group by results implicitly). Sometimes a disk temporary table may be used as well.
If, during execution, the size of the in-memory temporary table reaches its upper limit (the parameter that controls this limit is tmp_table_size), the in-memory temporary table is converted into a disk temporary table. If the amount of data is large, the disk temporary table required by the query is likely to occupy a lot of disk space.
These are the factors that lead to slow SQL. Let's discuss the optimization schemes together.
5. Some optimization schemes for group by
From which directions can we optimize?
Direction 1: Since it sorts by default, we can simply not sort.
Direction 2: Since the temporary table is the factor that hurts group by performance, can we avoid using the temporary table?
Let's think about it together: why does executing a group by statement need a temporary table? The semantic logic of group by is to count the occurrences of different values. If these values are ordered from the start, can't we just scan down and count directly, with no temporary table needed to record and accumulate the results?
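A small sketch shows why ordered input removes the need for a temporary table. It assumes the city values already arrive in sorted order (for example, because they are read from an index); the function name and sample values are hypothetical.

```python
def count_sorted(cities):
    """Count runs of equal values in an already ordered sequence.

    Only the current value and its running count are kept in memory,
    so no lookup table of all groups is needed.
    """
    missing = object()            # sentinel meaning "no current group yet"
    current, num = missing, 0
    for city in cities:
        if city == current:
            num += 1              # same group: keep counting
        else:
            if current is not missing:
                yield current, num    # group finished: emit it immediately
            current, num = city, 1    # start the next group
    if current is not missing:
        yield current, num        # emit the last group


ordered = ["Beijing", "Beijing", "Hangzhou", "Shenzhen", "Shenzhen", "Shenzhen"]
result = list(count_sorted(ordered))
print(result)
```

Each group is emitted as soon as the value changes, and the output is already in city order, so neither a temporary table nor a separate sort is required.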
Add an index on the fields after group by
Use order by null to skip sorting
Try to use only in-memory temporary tables
Use SQL_BIG_RESULT
5.1 Index the fields after group by
How can we ensure that the values of the fields after group by are ordered from the start? By indexing, of course.
Let's go back to this SQL:
select city ,count(*) as num from staff where age= 19 group by city;
Its execution plan:
Let's add a composite index idx_age_city(age,city) to it:
alter table staff add index idx_age_city(age,city);
Looking at the execution plan again, we find that neither sorting nor a temporary table is needed.
Adding a suitable index is the simplest and most effective way to optimize group by.
5.2 order by null does not need to be sorted
Not all scenarios are suitable for indexing. If we encounter a scenario that is not suitable for indexing, how can we optimize it?
If you do not need the result set to be sorted, you can use order by null.
select city ,count(*) as num from staff group by city order by null
The execution plan is as follows; there is no more filesort:
5.3 Try to use only temporary memory tables
If there is not much data for group by to count, we can try to use only in-memory temporary tables. What costs time is the case where the in-memory temporary table cannot hold the data during the group by and a disk temporary table has to be used. Therefore, the tmp_table_size parameter can be increased appropriately to avoid disk temporary tables.
5.4 Optimizing with SQL_BIG_RESULT
What if the amount of data is too large? We cannot increase tmp_table_size without limit. But we also cannot just watch the data go into the in-memory temporary table first and then, as rows are inserted and the limit is reached, get converted into a disk temporary table. That is not very smart.
Therefore, if the estimated data volume is relatively large, we can use the SQL_BIG_RESULT hint to tell the optimizer to use a disk temporary table directly. The MySQL optimizer then sees that a disk temporary table is stored as a B+ tree, which is less efficient for storage than an array, so it stores the data in an array directly.
The example SQL is as follows:
select SQL_BIG_RESULT city ,count(*) as num from staff group by city;
As you can see in the Extra field of the execution plan, the execution does not use a temporary table, only sorting.
The execution process is as follows:
Initialize sort_buffer and put the city field into it;
Scan the table staff, take out the value of city in turn, and store it in sort_buffer;
After scanning, sort the city field of sort_buffer
After sorting, an ordered array is obtained.
According to the sorted array, count the number of occurrences of each value.
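The steps above can be sketched as follows. A plain Python list stands in for sort_buffer and the sample rows are invented; this is a toy model of the described flow, not MySQL's implementation.

```python
from itertools import groupby

# Made-up rows from the staff table.
staff = [
    {"id": 1, "city": "Shenzhen"},
    {"id": 2, "city": "Beijing"},
    {"id": 3, "city": "Shenzhen"},
    {"id": 4, "city": "Hangzhou"},
    {"id": 5, "city": "Beijing"},
    {"id": 6, "city": "Shenzhen"},
]


def group_by_big_result(rows):
    sort_buffer = []                      # initialize sort_buffer for the city field
    for row in rows:                      # scan the table
        sort_buffer.append(row["city"])   # store each city value in turn
    sort_buffer.sort()                    # sort to get an ordered array
    # Walk the ordered array and count how often each value occurs.
    return [(city, sum(1 for _ in run)) for city, run in groupby(sort_buffer)]


result = group_by_big_result(staff)
print(result)
```

Compared with the temporary-table flow, there is no per-row lookup and update of a table of groups: the values are collected once, sorted once, and counted in a single pass.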
6. How to optimize a production slow SQL
Recently, I encountered a production slow SQL related to group by. Let me show you how to optimize it.
The table structure is as follows:
CREATE TABLE `staff` (
`id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
`id_card` varchar(20) NOT NULL COMMENT 'ID card number',
`name` varchar(64) NOT NULL COMMENT 'name',
`status` varchar(64) NOT NULL COMMENT 'Y-activated I-initialized D-deleted R-under review',
`age` int(4) NOT NULL COMMENT 'age',
`city` varchar(64) NOT NULL COMMENT 'city',
`enterprise_no` varchar(64) NOT NULL COMMENT 'enterprise number',
`legal_cert_no` varchar(64) NOT NULL COMMENT 'legal representative number',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';
The SQL for the query is this:
select * from t1 where status = #{status} group by #{legal_cert_no}
Let's not discuss whether the = of this SQL is reasonable. If you were given a SQL statement like this, how would you optimize it? Friends who have ideas can leave a message to discuss. If you think something in the article is wrong, you can also point it out, and we can make progress together.