First, take a look at this piece of SQL. It reads like scripture and will give you a headache.
SELECT
s1.name,
s1.subject,
s1.score,
sub.avg_score AS average_score_per_subject,
(SELECT COUNT(*) + 1 FROM scores s2 WHERE s2.score > s1.score) AS score_rank
FROM scores s1
JOIN (
SELECT subject, AVG(score) AS avg_score
FROM scores
GROUP BY subject
) sub ON s1.subject = sub.subject
ORDER BY s1.score DESC;
What is this SQL for? It just computes a score ranking, plus the average score per subject. That is a lot of ceremony for a simple requirement.
Is there any way to simplify it? Yes.
The simplified version uses today's topic, window functions.
SELECT
name,
subject,
score,
AVG(score) OVER (PARTITION BY subject) AS average_score_per_subject,
RANK() OVER (ORDER BY score DESC) AS score_rank
FROM scores
ORDER BY score DESC;
Doesn't that look more concise and clear?
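To check that the two queries really agree, here is a minimal sketch that runs both against the sample rows used later in this article. It makes several assumptions that are not part of the original: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later accepts the same window syntax for these queries), the table definition is simplified, and the subquery uses COUNT(*), which is the counting rule RANK() follows.

```python
import sqlite3

# Sample rows copied from the INSERT statements in this article.
ROWS = [
    ("Student1", "化学", 75), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", ROWS)

# Subquery + join version (the "before" query).
SUBQUERY_SQL = """
SELECT s1.name, s1.subject, s1.score, sub.avg_score,
       (SELECT COUNT(*) + 1 FROM scores s2 WHERE s2.score > s1.score) AS score_rank
FROM scores s1
JOIN (SELECT subject, AVG(score) AS avg_score FROM scores GROUP BY subject) sub
  ON s1.subject = sub.subject
ORDER BY s1.score DESC
"""

# Window function version (the "after" query).
WINDOW_SQL = """
SELECT name, subject, score,
       AVG(score) OVER (PARTITION BY subject) AS avg_score,
       RANK() OVER (ORDER BY score DESC) AS score_rank
FROM scores
ORDER BY score DESC
"""


def normalize(rows):
    # Round the average so tiny floating-point differences cannot matter.
    return [(n, s, sc, round(avg, 6), rk) for n, s, sc, avg, rk in rows]


old_rows = normalize(conn.execute(SUBQUERY_SQL).fetchall())
new_rows = normalize(conn.execute(WINDOW_SQL).fetchall())
same = old_rows == new_rows
print(same)
```

On this data all scores are distinct, so both queries return the same nine rows in the same order.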
Let's see what window functions actually are.
First, create a table with three fields (name, subject, and score) to use in the demonstrations that follow.
CREATE TABLE `scores` (
`name` varchar(20) COLLATE utf8_bin NOT NULL,
`subject` varchar(20) COLLATE utf8_bin NOT NULL,
`score` int(3) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
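Then insert some sample records into the table. The subject values are Chinese: 化学 (chemistry), 生物 (biology), 物理 (physics), 数学 (math), and 英语 (English).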
INSERT INTO scores (name, subject, score) VALUES ('Student1', '化学', 75);
INSERT INTO scores (name, subject, score) VALUES ('Student2', '生物', 92);
INSERT INTO scores (name, subject, score) VALUES ('Student3', '物理', 87);
INSERT INTO scores (name, subject, score) VALUES ('Student4', '数学', 68);
INSERT INTO scores (name, subject, score) VALUES ('Student5', '英语', 91);
INSERT INTO scores (name, subject, score) VALUES ('Student6', '化学', 58);
INSERT INTO scores (name, subject, score) VALUES ('Student7', '物理', 79);
INSERT INTO scores (name, subject, score) VALUES ('Student8', '数学', 90);
INSERT INTO scores (name, subject, score) VALUES ('Student9', '数学', 45);
## What is a window function
Starting with version 8.0, MySQL provides window functions: functions that perform a calculation over a specific window (a set of rows) of the query result.
I had used window functions in Oracle and MS SQL long ago, but after switching to MySQL I found it had none. Complex statistical queries had to be built from subqueries and joins nested layer upon layer, so a seemingly simple requirement turned into a sprawling SQL statement that read like scripture to everyone else. In a word: baffling.
Window functions are mainly used for statistics and calculations such as grouping, sorting, and aggregating query results. Combining them makes fairly complex logic possible, often with better performance than the subqueries and joins that were needed before MySQL 8.0.
## OVER()
OVER() is the clause that defines the window. On its own it computes nothing, so it must be combined with a function such as a sum or an average; its only job is to specify which rows the function is calculated over and in what order.
function_name(...) OVER (
[PARTITION BY expr_list]
[ORDER BY expr_list]
[range]
)
### PARTITION BY
Specifies the partitioning columns; the function is calculated separately for each partition. A partition is the set of rows that share the same values in those columns, and one or more columns can be specified.
### ORDER BY
Sorts the rows within each partition. Once the rows are sorted, a frame can be added, as described under "Ranges and rolling windows".
### Ranges and rolling windows
The frame specifies which rows around the current row the function is calculated over. There are two kinds: range windows and rolling (row) windows.
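#### Range window
A RANGE frame is defined by the values of the ORDER BY expression rather than by row positions: it covers every row whose sort value lies within the given distance of the current row's value. UNBOUNDED PRECEDING stands for the start of the partition and UNBOUNDED FOLLOWING for its end.
For example: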
SUM(salary) OVER (ORDER BY id
RANGE BETWEEN 5 PRECEDING AND 5 FOLLOWING)
This sums salary over every row whose id lies between the current row's id minus 5 and plus 5. When the ids are consecutive, that is the current row plus the 5 rows before and the 5 rows after it.
#### Rolling window (row window)
A ROWS frame is a sliding window defined by row positions relative to the current row.
For example:
SUM(salary) OVER (ORDER BY id
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
This sums salary over the current row, the 2 rows before it, and the 2 rows after it.
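The difference between RANGE and ROWS only shows up when the ORDER BY values have gaps. The following sketch illustrates it under some assumptions that are not in the original: a hypothetical `staff` table with ids 1, 2, 3, 10, and 11, a frame of 2 instead of 5 to keep the data small, and Python's built-in sqlite3 module standing in for MySQL 8 (RANGE with a numeric offset needs SQLite 3.28 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: note the gap between id 3 and id 10.
conn.execute("CREATE TABLE staff (id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 300), (10, 400), (11, 500)],
)

SQL = """
SELECT id,
       -- RANGE: rows whose id value is within 2 of the current id
       SUM(salary) OVER (ORDER BY id RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS range_sum,
       -- ROWS: the 2 physical rows before and after the current row
       SUM(salary) OVER (ORDER BY id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS rows_sum
FROM staff
ORDER BY id
"""

frame_sums = conn.execute(SQL).fetchall()
for row in frame_sums:
    print(row)
```

For id 10, the RANGE frame only reaches ids 8 through 12 (the rows with ids 10 and 11, sum 900), while the ROWS frame reaches back across the gap to ids 2 and 3 (sum 1400).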
### Functions that can be combined with OVER()
Aggregate functions
MAX(), MIN(), COUNT(), SUM(), and so on, which produce an aggregated result for each partition.
Ranking functions
ROW_NUMBER(), RANK(), DENSE_RANK(), and so on, which produce a row number or rank within each partition.
Other window functions
LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), and so on, which return a value taken from another row in the window.
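FIRST_VALUE() and LAST_VALUE() are listed above but not demonstrated in this article, so here is a small sketch of how they might be used with the article's sample rows. The assumptions: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later), the table definition is simplified, and the `best` and `worst` aliases are illustrative names.

```python
import sqlite3

# Sample rows copied from the INSERT statements in this article.
ROWS = [
    ("Student1", "化学", 75), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", ROWS)

# For each row, fetch the best and worst score within the same subject.
# LAST_VALUE needs the explicit full-partition frame: with the default frame,
# which ends at the current row's peers, it would return the current row's own score.
SQL = """
SELECT name, subject, score,
       FIRST_VALUE(score) OVER (
           PARTITION BY subject ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS best,
       LAST_VALUE(score) OVER (
           PARTITION BY subject ORDER BY score DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS worst
FROM scores
"""

best_worst = {
    name: (best, worst)
    for name, _subject, _score, best, worst in conn.execute(SQL)
}
for name in sorted(best_worst):
    print(name, best_worst[name])
```

Every math student, for example, sees 90 as the best and 45 as the worst score in the subject.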
## With aggregate functions
1. Partition by the subject column to find the highest score in each subject
Return each score together with the highest score in its subject:
SELECT subject,score, MAX(score) OVER (PARTITION BY subject) as `此学科最高分` FROM scores;
The result is:
subject | score | Highest score in this subject |
---|---|---|
chemistry | 75 | 75 |
chemistry | 58 | 75 |
math | 68 | 90 |
math | 90 | 90 |
math | 45 | 90 |
physics | 87 | 87 |
physics | 79 | 87 |
biology | 92 | 92 |
English | 91 | 91 |
2. Get the number of students enrolled in each subject
SELECT subject,score, count(name) OVER (PARTITION BY subject) as `报名此学科人数` FROM scores;
The result obtained is:
subject | score | Number of people enrolled in this course |
---|---|---|
chemistry | 75 | 2 |
chemistry | 58 | 2 |
math | 68 | 3 |
math | 90 | 3 |
math | 45 | 3 |
physics | 87 | 2 |
physics | 79 | 2 |
biology | 92 | 1 |
English | 91 | 1 |
3. Find the total score of the subject
SELECT subject, SUM(score) OVER (PARTITION BY subject) as `此学科总分` FROM scores;
The results obtained:
subject | Total marks for this subject |
---|---|
chemistry | 133 |
chemistry | 133 |
math | 203 |
math | 203 |
math | 203 |
physics | 166 |
physics | 166 |
biology | 92 |
English | 91 |
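The three aggregate examples above can also be combined into one query, since every column may carry its own OVER clause. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL 8 (SQLite 3.25 or later) and a simplified table definition; the English aliases are illustrative.

```python
import sqlite3

# Sample rows copied from the INSERT statements in this article.
ROWS = [
    ("Student1", "化学", 75), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", ROWS)

# Unlike GROUP BY, the window versions keep one output row per input row.
SQL = """
SELECT name, subject, score,
       MAX(score)  OVER (PARTITION BY subject) AS subject_max,
       COUNT(name) OVER (PARTITION BY subject) AS subject_count,
       SUM(score)  OVER (PARTITION BY subject) AS subject_total
FROM scores
"""

result = conn.execute(SQL).fetchall()

# Collapse to one entry per subject to make the values easy to inspect.
per_subject = {subject: (mx, cnt, total) for _n, subject, _s, mx, cnt, total in result}
for subject, stats in per_subject.items():
    print(subject, stats)
```

The query still returns all nine rows; only the dictionary at the end collapses them to one entry per subject.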
4. Use ORDER BY to compute a cumulative score
SELECT name,subject,score, SUM(score) OVER (order BY score) as `累加分数` FROM scores;
The first three rows of the result:
name | subject | score | cumulative score |
---|---|---|---|
Student9 | math | 45 | 45 |
Student6 | chemistry | 58 | 103 |
Student4 | math | 68 | 171 |
Let's see how this is calculated. The OVER clause contains only ORDER BY.
The rows are first sorted by score (ascending by default). The first row has a score of 45, so its cumulative score is just itself, 45.
The second row has a score of 58; adding the first and second rows gives a cumulative score of 45+58=103.
Likewise, the third row is the sum of the first three rows, 45+58+68=171.
By analogy, the Nth row is the sum of rows 1 through N. (Rows with equal scores share the same cumulative value, because with ORDER BY and no explicit frame the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes all peers of the current row.)
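The running total can be cross-checked outside the database. The sketch below compares the window query with `itertools.accumulate` and then shows how tied scores behave under the default frame versus an explicit ROWS frame. Assumptions: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later), and the `tied` table with scores 10, 20, 20, 30 is a hypothetical example.

```python
import sqlite3
from itertools import accumulate

# Sample rows copied from the INSERT statements in this article.
ROWS = [
    ("Student1", "化学", 75), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", ROWS)

# Running total from the window function.
running = [r[0] for r in conn.execute(
    "SELECT SUM(score) OVER (ORDER BY score) FROM scores ORDER BY score")]

# The same running total computed in plain Python.
expected = list(accumulate(sorted(score for _, _, score in ROWS)))
print(running == expected)

# Hypothetical data with a tie, to show the effect of the default frame.
conn.execute("CREATE TABLE tied (score INTEGER)")
conn.executemany("INSERT INTO tied VALUES (?)", [(10,), (20,), (20,), (30,)])

# Default frame: both rows with score 20 get the total including both of them.
default_frame = sorted(r[0] for r in conn.execute(
    "SELECT SUM(score) OVER (ORDER BY score) FROM tied"))

# Explicit ROWS frame: the total grows one physical row at a time.
rows_frame = sorted(r[0] for r in conn.execute(
    "SELECT SUM(score) OVER (ORDER BY score "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM tied"))

print(default_frame, rows_frame)
```

If a strictly row-by-row running total is wanted even when scores repeat, the explicit ROWS frame is the variant to use.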
5. Use ORDER BY with a frame
The previous query sets no frame, so it accumulates the first N rows. The frame can also be restricted.
SELECT name,subject,score, SUM(score) OVER (order BY `score` ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as `累加分数` FROM scores;
Here the cumulative score is the sum of the current row, the previous row, and the next row.
The obtained result is:
name | subject | score | cumulative score |
---|---|---|---|
Student9 | math | 45 | 103 |
Student6 | chemistry | 58 | 171 |
Student4 | math | 68 | 201 |
Student1 | chemistry | 75 | 222 |
Student7 | physics | 79 | 241 |
Student3 | physics | 87 | 256 |
Student8 | math | 90 | 268 |
Student5 | English | 91 | 273 |
The first row's 103 is the current row (45) plus the next row (58); there is no previous row.
The second row's 171 is the current row (58) plus the previous row (45) plus the next row (68).
The remaining cumulative scores are calculated the same way.
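The sliding sum described above is easy to reproduce by hand, which makes a good sanity check of the frame semantics. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL 8 (SQLite 3.25 or later) and a simplified table definition:

```python
import sqlite3

# Sample rows copied from the INSERT statements in this article.
ROWS = [
    ("Student1", "化学", 75), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", ROWS)

# Window version: previous row + current row + next row.
sql_sums = [r[0] for r in conn.execute(
    "SELECT SUM(score) OVER (ORDER BY score "
    "ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM scores ORDER BY score")]

# Manual version: slice the sorted scores around each position.
# At the edges the slice is simply shorter, just like the frame.
ordered = sorted(score for _, _, score in ROWS)
manual_sums = [sum(ordered[max(i - 1, 0): i + 2]) for i in range(len(ordered))]

print(sql_sums)
print(sql_sums == manual_sums)
```

The last row (score 92) has no following row, so its sum covers only two rows, mirroring what happens at the first row.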
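## With ranking functions
### ROW_NUMBER()
The ROW_NUMBER() function assigns a unique sequential number to each row in the result set.
Below, the scores are ranked with higher scores first. If two people have the same score, one still gets first place and the other second.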
SELECT name,subject,score, ROW_NUMBER() OVER (order BY `score` desc) as `排名` FROM scores;
The query result is:
name | subject | score | rank |
---|---|---|---|
Student2 | biology | 92 | 1 |
Student5 | English | 91 | 2 |
Student8 | math | 90 | 3 |
Student3 | physics | 87 | 4 |
Student7 | physics | 79 | 5 |
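Without ROW_NUMBER(), for example in MySQL 5.7, it would look something like the query below. Note that this version counts the rows with a higher score, so it matches ROW_NUMBER() only when all scores are distinct; rows with the same score would receive the same rank.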
SELECT s1.name, s1.subject, s1.score, COUNT(s2.score) + 1 AS `排名`
FROM scores s1
LEFT JOIN scores s2 ON s1.score < s2.score
GROUP BY s1.name, s1.subject, s1.score
ORDER BY s1.score DESC;
Isn't that much more complicated than using ROW_NUMBER()?
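How the self-join version behaves when scores repeat can be checked directly. The sketch below runs it on data with a tie and compares the result with ROW_NUMBER() and RANK(). Assumptions: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later), Student1's score is set to 92 to create a tie, and `name` is added to the ROW_NUMBER() ordering only to make the output deterministic.

```python
import sqlite3

# Sample rows from this article, with Student1 set to 92 to create a tie.
TIED_ROWS = [
    ("Student1", "化学", 92), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", TIED_ROWS)

# Self-join emulation: position = number of rows with a higher score, plus 1.
emulated = dict(conn.execute("""
    SELECT s1.name, COUNT(s2.score) + 1
    FROM scores s1
    LEFT JOIN scores s2 ON s1.score < s2.score
    GROUP BY s1.name, s1.subject, s1.score
""").fetchall())

row_numbers = dict(conn.execute(
    "SELECT name, ROW_NUMBER() OVER (ORDER BY score DESC, name) FROM scores"
).fetchall())

ranks = dict(conn.execute(
    "SELECT name, RANK() OVER (ORDER BY score DESC) FROM scores"
).fetchall())

print(emulated == ranks)        # the join version behaves like RANK()
print(emulated == row_numbers)  # and differs from ROW_NUMBER() once there is a tie
```

With the tie in place, the self-join gives both top students position 1 and the next student position 3, which is the RANK() behavior rather than the ROW_NUMBER() behavior.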
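### RANK()
The RANK() function assigns a rank to each row in the result set. It also produces a ranking, but unlike ROW_NUMBER(), RANK() gives rows with equal values the same rank, like a tie.
Think of the Olympics: if two athletes achieve the same top score they may share the gold medal, but then there is no silver medal, and whoever comes next gets bronze.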
SELECT name,subject,score, RANK() OVER (order BY `score` desc) as `排名` FROM scores;
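The query result is as follows (in this and the following results, Student1's score is 92 rather than 75, which produces a tie for first place):
name | subject | score | rank |
---|---|---|---|
Student1 | chemistry | 92 | 1 |
Student2 | biology | 92 | 1 |
Student5 | English | 91 | 3 |
Student8 | math | 90 | 4 |
Student3 | physics | 87 | 5 |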
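### DENSE_RANK()
DENSE_RANK() is also used for ranking. The difference from RANK() is that it does not skip ranks after a tie: if two people share the gold medal with rank 1, the next rank is 2, not 3 as with RANK().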
SELECT name,subject,score, DENSE_RANK() OVER (order BY `score` desc) as `排名` FROM scores;
The query result is:
name | subject | score | rank |
---|---|---|---|
Student1 | chemistry | 92 | 1 |
Student2 | biology | 92 | 1 |
Student5 | English | 91 | 2 |
Student8 | math | 90 | 3 |
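Putting the three ranking functions side by side makes the difference easier to see. A minimal sketch under these assumptions: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later), Student1's score is 92 as in the result tables above, the aliases `row_num`, `rnk`, and `dense_rnk` are illustrative, and `name` is added as a tie-breaker for ROW_NUMBER() so the output is deterministic.

```python
import sqlite3

# Sample rows from this article, with Student1 at 92 as in the result tables.
TIED_ROWS = [
    ("Student1", "化学", 92), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", TIED_ROWS)

SQL = """
SELECT name, score,
       ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,   -- always unique
       RANK()       OVER (ORDER BY score DESC)       AS rnk,       -- ties share, then skip
       DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk  -- ties share, no skip
FROM scores
ORDER BY score DESC, name
"""

ranking = conn.execute(SQL).fetchall()
for row in ranking:
    print(row)
```

The third student (score 91) is where the three columns diverge: row number 3, rank 3, dense rank 2.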
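## With other window functions
### NTILE()
The NTILE() function divides the result set into a specified number of groups and assigns a number to each group. For example, sort the scores in descending order and divide them into 4 groups, which gives 4 tiers.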
SELECT name,subject,score, NTILE(4) OVER (order BY `score` desc) as `组` FROM scores;
The query result is:
name | subject | score | group |
---|---|---|---|
Student1 | chemistry | 92 | 1 |
Student2 | biology | 92 | 1 |
Student5 | English | 91 | 1 |
Student8 | math | 90 | 2 |
Student3 | physics | 87 | 2 |
Student7 | physics | 79 | 3 |
Student4 | math | 68 | 3 |
Student6 | chemistry | 58 | 4 |
Student9 | math | 45 | 4 |
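Nine rows do not divide evenly into four groups, and the table above shows how NTILE() handles that: the extra row goes to the first group. The sketch below counts the group sizes. Assumptions: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later), Student1's score is 92 as in the result table, and `name` is added as a tie-breaker for determinism.

```python
import sqlite3
from collections import Counter

# Sample rows from this article, with Student1 at 92 as in the result table.
TIED_ROWS = [
    ("Student1", "化学", 92), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", TIED_ROWS)

# Map each student to a tier number from 1 (top) to 4 (bottom).
groups = dict(conn.execute(
    "SELECT name, NTILE(4) OVER (ORDER BY score DESC, name) FROM scores"
).fetchall())

# Count how many students landed in each tier.
sizes = Counter(groups.values())
print(sorted(sizes.items()))
```

When the row count is not a multiple of the group count, the earlier groups are the ones that receive one extra row each.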
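### LAG()
The LAG() function accesses data from a row before the current row in the query result. It lets you retrieve the value of a preceding row and compare it with the current row's value or compute the difference. LAG() is very useful for time-series data or for comparing adjacent rows.
The full form of the function is LAG(column, offset, default_value), with three parameters:
column: the name of the column whose value you want to retrieve.
offset: how many rows to look back; 1 means the row immediately before the current row, 2 means two rows before.
default_value: optional; the value to return when the row at that offset does not exist.
For example, to compare the score difference between adjacent ranks: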
SELECT
name,
subject,
score,
ABS(score - LAG(score, 1,score) OVER (ORDER BY score DESC)) AS `分值差`
FROM
scores;
The result is:
name | subject | score | score difference |
---|---|---|---|
Student1 | chemistry | 92 | 0 |
Student2 | biology | 92 | 0 |
Student5 | English | 91 | 1 |
Student8 | math | 90 | 1 |
Student3 | physics | 87 | 3 |
Student7 | physics | 79 | 8 |
Student4 | math | 68 | 11 |
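### LEAD()
LEAD() works the same way as LAG(), except that its offset looks forward: it takes the value from N rows after the current row.
So the adjacent-row comparison above can also be done looking forward.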
SELECT
name,
subject,
score,
score - LEAD(score, 1,score) OVER (ORDER BY score DESC) AS `分值差`
FROM
scores;
The result is:
name | subject | score | score difference |
---|---|---|---|
Student1 | chemistry | 92 | 0 |
Student2 | biology | 92 | 1 |
Student5 | English | 91 | 1 |
Student8 | math | 90 | 3 |
Student3 | physics | 87 | 8 |
Student7 | physics | 79 | 11 |
Student4 | math | 68 | 10 |
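LAG() and LEAD() can be used in the same query, which shows the backward and forward differences next to each other. A minimal sketch under these assumptions: Python's built-in sqlite3 module stands in for MySQL 8 (SQLite 3.25 or later), Student1's score is 92 as in the result tables above, the aliases `diff_prev` and `diff_next` are illustrative, and `name` is added to the ordering as a tie-breaker.

```python
import sqlite3

# Sample rows from this article, with Student1 at 92 as in the result tables.
TIED_ROWS = [
    ("Student1", "化学", 92), ("Student2", "生物", 92), ("Student3", "物理", 87),
    ("Student4", "数学", 68), ("Student5", "英语", 91), ("Student6", "化学", 58),
    ("Student7", "物理", 79), ("Student8", "数学", 90), ("Student9", "数学", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", TIED_ROWS)

# The third argument (score) is the default used when there is no
# previous/next row, which makes the difference 0 at the two ends.
SQL = """
SELECT name, score,
       ABS(score - LAG(score, 1, score) OVER (ORDER BY score DESC, name)) AS diff_prev,
       score - LEAD(score, 1, score) OVER (ORDER BY score DESC, name)     AS diff_next
FROM scores
ORDER BY score DESC, name
"""

diffs = conn.execute(SQL).fetchall()
for row in diffs:
    print(row)
```

The two difference columns contain the same gaps shifted by one row: each row's `diff_next` is the next row's `diff_prev`.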