MySQL window functions

The basic syntax of a window function (available since MySQL 8.0) is as follows:

<window function> over (partition by <columns to group by>
                        order by <columns to sort by>)

Two kinds of function can appear in the <window function> position:

  1. Dedicated window functions
    • Sequence number functions: row_number() / rank() / dense_rank()
    • Distribution functions: percent_rank() / cume_dist()
    • Before and after functions: lag() / lead()
    • Head and tail functions: first_value() / last_value()
    • Other functions: nth_value() (returns the value of expr from the Nth row of the window; expr can be an expression or a column name) / ntile() (divides the ordered rows of the partition into n buckets and returns each row's bucket number)
  2. Aggregate functions, such as sum(), avg(), count(), max(), min(), etc.

Precautions

  • Window functions operate on the rows that remain after the where, group by, and having clauses have been applied, so they can appear only in the select list and the order by clause, not in where, group by, or having.
  • An aggregate function collapses multiple records into one, while a window function computes a value for every record: the query returns exactly as many rows as it would without the window function.
  • Ordinary aggregate functions become window functions when they are followed by an over clause.
  • Logically, window functions are evaluated after FROM, JOIN, WHERE, GROUP BY, and HAVING, and before ORDER BY, LIMIT, and SELECT DISTINCT. By that point any GROUP BY aggregation has already completed, so the window function itself never merges rows.
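
The points above can be sketched with three short queries. This is a minimal illustration that assumes the score table created in the next section (25 rows across 4 courses); the alias names avg_score, ranking, and t are made up for the example.

```sql
-- Aggregate function: GROUP BY collapses the 25 score rows
-- into one row per course (4 rows).
SELECT cid, AVG(score) AS avg_score
FROM score
GROUP BY cid;

-- The same function with OVER: all 25 rows are kept,
-- and each row carries the average of its own course.
SELECT sid, cid, score,
       AVG(score) OVER (PARTITION BY cid) AS avg_score
FROM score;

-- A window function cannot be used in WHERE, because WHERE is
-- evaluated before the window. To filter on its result, compute
-- it in a derived table and filter in the outer query.
SELECT t.sid, t.cid, t.score
FROM (SELECT sid, cid, score,
             RANK() OVER (PARTITION BY cid
                          ORDER BY score DESC) AS ranking
      FROM score) t
WHERE t.ranking = 1;
```

With the sample data the last query returns 6 rows rather than 4, because two courses have a tie for the top score and tied rows share rank 1.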

Create the tables


create table student (sid char(2), sname char(5), sclass char(2));
create table course (cid char(2), cname char(10));
create table score (sid char(2), cid char(2), score int);

insert into student values('01', '崔健', '01');
insert into student values('02', '李健', '01');
insert into student values('03', '高虎', '01');
insert into student values('04', '子健', '01');
insert into student values('05', '石璐', '01');
insert into student values('06', '亚千', '01');
insert into student values('07', '史立', '01');
insert into student values('08', '窦唯', '01');
insert into student values('09', '华东', '01');

insert into course values('01', '金属');
insert into course values('02', '迷幻');
insert into course values('03', '朋克');
insert into course values('04', '后摇');

insert into score values('01', '01', 60);
insert into score values('02', '01', 85);
insert into score values('03', '01', 57);
insert into score values('04', '01', 34);
insert into score values('05', '01', 78);
insert into score values('06', '01', 90);
insert into score values('07', '01', 76);
insert into score values('08', '01', 90);
insert into score values('09', '01', 85);
insert into score values('01', '02', 78);
insert into score values('02', '02', 59);
insert into score values('03', '02', 59);
insert into score values('04', '02', 79);
insert into score values('05', '02', 88);
insert into score values('01', '03', 65);
insert into score values('03', '03', 89);
insert into score values('05', '03', 46);
insert into score values('06', '03', 85);
insert into score values('07', '03', 89);
insert into score values('08', '03', 79);
insert into score values('03', '04', 99);
insert into score values('04', '04', 95);
insert into score values('07', '04', 68);
insert into score values('08', '04', 59);
insert into score values('09', '04', 80);

1. Dedicated window functions

1.1 Sequence number functions

  • row_number(), rank(), and dense_rank() are all sequence number functions. The following query ranks the scores within each course and shows how the three differ:
SELECT s.sname, c.cname, sc.score,
       ROW_NUMBER() OVER (PARTITION BY c.cname
                          ORDER BY sc.score DESC) AS row_num,
       RANK() OVER (PARTITION BY c.cname
                    ORDER BY sc.score DESC) AS ranking,
       DENSE_RANK() OVER (PARTITION BY c.cname
                          ORDER BY sc.score DESC) AS dense_ranking
FROM student s INNER JOIN score sc ON s.sid = sc.sid
               INNER JOIN course c ON sc.cid = c.cid;


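  • row_number(): equal scores are not tied. Every row gets its own consecutive number, and the order among equal scores is arbitrary unless the ORDER BY breaks the tie.
  • rank(): equal scores tie, and the next rank is the tied rank plus the number of tied rows, so gaps appear (1, 1, 3).
  • dense_rank(): equal scores tie, and the next rank is the tied rank plus 1, so there are no gaps (1, 1, 2).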
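
To make the three rules concrete, here is a small sketch restricted to one course, 金属 (cid '01'), with the values the sample data should produce. Adding sc.sid to the ROW_NUMBER() ordering is an assumption made only for this example, so that the order of tied rows is deterministic.

```sql
-- Rank the nine 金属 (cid '01') scores three ways.
-- sc.sid is used as a tie-breaker for ROW_NUMBER() only.
SELECT s.sname, sc.score,
       ROW_NUMBER() OVER (ORDER BY sc.score DESC, sc.sid) AS row_num,
       RANK()       OVER (ORDER BY sc.score DESC)         AS ranking,
       DENSE_RANK() OVER (ORDER BY sc.score DESC)         AS dense_ranking
FROM student s INNER JOIN score sc ON s.sid = sc.sid
WHERE sc.cid = '01'
ORDER BY sc.score DESC, sc.sid;

-- sname  score  row_num  ranking  dense_ranking
-- 亚千   90     1        1        1
-- 窦唯   90     2        1        1
-- 李健   85     3        3        2
-- 华东   85     4        3        2
-- 石璐   78     5        5        3
-- 史立   76     6        6        4
-- 崔健   60     7        7        5
-- 高虎   57     8        8        6
-- 子健   34     9        9        7
```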

1.2 Distribution functions

  • percent_rank()
    Purpose: builds on the RANK() function above. Each row is calculated with the following formula:

(rank - 1) / (rows - 1)

Here rank is the value produced by RANK() for the row, and rows is the total number of rows in the current partition. The result ranges from 0 to 1.

SELECT s.sname, c.cname, sc.score,
       RANK() OVER (PARTITION BY c.cname
                    ORDER BY sc.score DESC) AS ranking,
       PERCENT_RANK() OVER (PARTITION BY c.cname
                            ORDER BY sc.score DESC) AS percent
FROM student s INNER JOIN score sc ON s.sid = sc.sid
               INNER JOIN course c ON sc.cid = c.cid;
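
As a worked check of the formula, the sketch below limits the query to 后摇 (cid '04'), which has five scores and no ties, so the arithmetic is easy to follow. The alias pct_rank is chosen for the example.

```sql
-- PERCENT_RANK() for the five 后摇 (cid '04') scores.
SELECT s.sname, sc.score,
       RANK()         OVER (ORDER BY sc.score DESC) AS ranking,
       PERCENT_RANK() OVER (ORDER BY sc.score DESC) AS pct_rank
FROM student s INNER JOIN score sc ON s.sid = sc.sid
WHERE sc.cid = '04'
ORDER BY sc.score DESC;

-- sname  score  ranking  pct_rank
-- 高虎   99     1        (1 - 1) / (5 - 1) = 0
-- 子健   95     2        (2 - 1) / (5 - 1) = 0.25
-- 华东   80     3        (3 - 1) / (5 - 1) = 0.5
-- 史立   68     4        (4 - 1) / (5 - 1) = 0.75
-- 窦唯   59     5        (5 - 1) / (5 - 1) = 1
```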


  • cume_dist()
    Purpose: the number of rows in the partition that sort at or before the current row (here, rows whose score is greater than or equal to the current score) divided by the total number of rows in the partition. This function is used in more scenarios than percent_rank().
    Application scenario: find what fraction of the students in a course score at or above a given student, i.e. the "top x%" that student falls in.
SELECT s.sname, c.cname, sc.score,
       RANK() OVER (PARTITION BY c.cname
                    ORDER BY sc.score DESC) AS ranking,
       CUME_DIST() OVER (PARTITION BY c.cname
                         ORDER BY sc.score DESC) AS cumdist
FROM student s INNER JOIN score sc ON s.sid = sc.sid
               INNER JOIN course c ON sc.cid = c.cid;

For example, 亚千 (Yaqian) and 窦唯 (Dou Wei) tie for first place in 金属 (metal), so both have a cume_dist of 2/9, which puts them in the top 22.22% of the class.
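
The same calculation for every 金属 (cid '01') score can be sketched as follows. The fractions in the comments are worked out by hand from the nine sample rows; MySQL itself returns them as decimals.

```sql
-- CUME_DIST() for the nine 金属 (cid '01') scores.
-- Tied rows are peers, so they share the same value.
SELECT s.sname, sc.score,
       CUME_DIST() OVER (ORDER BY sc.score DESC) AS cumdist
FROM student s INNER JOIN score sc ON s.sid = sc.sid
WHERE sc.cid = '01'
ORDER BY sc.score DESC, sc.sid;

-- sname  score  cumdist
-- 亚千   90     2/9 ≈ 0.2222
-- 窦唯   90     2/9 ≈ 0.2222
-- 李健   85     4/9 ≈ 0.4444
-- 华东   85     4/9 ≈ 0.4444
-- 石璐   78     5/9 ≈ 0.5556
-- 史立   76     6/9 ≈ 0.6667
-- 崔健   60     7/9 ≈ 0.7778
-- 高虎   57     8/9 ≈ 0.8889
-- 子健   34     9/9 = 1
```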

1.3 Before and after functions

Within a single query, lag() returns the value of a column from the row N rows before the current row, and lead() returns the value from the row N rows after it, following the window's ORDER BY.
Syntax:

LAG(EXP_STR, OFFSET, DEFVAL) OVER (...)
LEAD(EXP_STR, OFFSET, DEFVAL) OVER (...)

EXP_STR: the column or expression to read
OFFSET: how many rows before (lag) or after (lead) the current row to look; 1 if omitted
DEFVAL: the value returned when no such row exists; NULL if omitted

Application scenario: find the time between two consecutive page views by the same user; find the difference between a student's scores in two consecutive exams

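-- A select-list alias cannot be referenced in the same select list,
-- so the window values are computed in a derived table first.
SELECT t.sname, t.cname, t.score, t.leadVal, t.lagVal,
       t.score - t.leadVal AS diff1,
       t.score - t.lagVal AS diff2
FROM (
    SELECT s.sname, c.cname, sc.score,
           LEAD(sc.score, 1) OVER (PARTITION BY s.sname
                                   ORDER BY sc.score DESC) AS leadVal,
           LAG(sc.score, 1) OVER (PARTITION BY s.sname
                                  ORDER BY sc.score DESC) AS lagVal
    FROM student s INNER JOIN score sc ON s.sid = sc.sid
                   INNER JOIN course c ON sc.cid = c.cid
) t;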
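
To see what lag() and lead() return at the edges of a partition, the sketch below looks at a single student, 高虎 (sid '03'), who has four scores. The third argument of 0 in lag_or_zero is an arbitrary placeholder chosen for the example, to show how the default value replaces NULL.

```sql
-- LEAD() and LAG() over the four scores of 高虎 (sid '03').
SELECT c.cname, sc.score,
       LEAD(sc.score, 1)   OVER (ORDER BY sc.score DESC) AS lead_val,
       LAG(sc.score, 1)    OVER (ORDER BY sc.score DESC) AS lag_val,
       LAG(sc.score, 1, 0) OVER (ORDER BY sc.score DESC) AS lag_or_zero
FROM score sc INNER JOIN course c ON sc.cid = c.cid
WHERE sc.sid = '03'
ORDER BY sc.score DESC;

-- cname  score  lead_val  lag_val  lag_or_zero
-- 后摇   99     89        NULL     0
-- 朋克   89     59        99       99
-- 迷幻   59     57        89       89
-- 金属   57     NULL      59       59
```

The first row has no preceding row and the last row has no following row, which is where NULL (or the supplied default) appears.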


1.4 Head and tail functions

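  • first_value()/last_value()
    Purpose: return the value of the given expression from the first/last row of the window frame. With an ORDER BY and no explicit frame, the frame ends at the current row and its peers, so last_value() returns the last value of the whole partition only when the frame is extended to UNBOUNDED FOLLOWING.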
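SELECT s.sname, c.cname, sc.score,
       FIRST_VALUE(sc.score) OVER (PARTITION BY s.sname
                                   ORDER BY sc.score DESC) AS firstVal,
       -- Extend the frame to the end of the partition; with the default
       -- frame LAST_VALUE() would just return the current row's score.
       LAST_VALUE(sc.score) OVER (PARTITION BY s.sname
                                  ORDER BY sc.score DESC
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING) AS lastVal
FROM student s INNER JOIN score sc ON s.sid = sc.sid
               INNER JOIN course c ON sc.cid = c.cid;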
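
How much the window frame matters for last_value() can be sketched with one student, 高虎 (sid '03'). The column names last_default_frame and last_full_frame are invented for this comparison.

```sql
-- FIRST_VALUE() and LAST_VALUE() over the four scores of 高虎 (sid '03'),
-- once with the default frame and once with the frame covering the
-- whole partition.
SELECT c.cname, sc.score,
       FIRST_VALUE(sc.score) OVER (ORDER BY sc.score DESC) AS first_val,
       LAST_VALUE(sc.score)  OVER (ORDER BY sc.score DESC) AS last_default_frame,
       LAST_VALUE(sc.score)  OVER (ORDER BY sc.score DESC
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                            AND UNBOUNDED FOLLOWING) AS last_full_frame
FROM score sc INNER JOIN course c ON sc.cid = c.cid
WHERE sc.sid = '03'
ORDER BY sc.score DESC;

-- cname  score  first_val  last_default_frame  last_full_frame
-- 后摇   99     99         99                  57
-- 朋克   89     99         89                  57
-- 迷幻   59     99         59                  57
-- 金属   57     99         57                  57
```

With the default frame the "last" row is always the current one, so last_default_frame simply repeats score.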


1.5 Other functions

  • nth_value()
    Purpose: return the value of expr from the Nth row of the window frame, or NULL if the frame has no Nth row.
SELECT s.sname, s.sclass, c.cname, sc.score,
       NTH_VALUE(sc.score, 1) OVER (PARTITION BY s.sname
                                    ORDER BY sc.score DESC) AS first_score,
       NTH_VALUE(sc.score, 2) OVER (PARTITION BY s.sname
                                    ORDER BY sc.score DESC) AS second_score
FROM student s INNER JOIN score sc ON s.sid = sc.sid
               INNER JOIN course c ON sc.cid = c.cid;
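
A small sketch for one student, 高虎 (sid '03'), shows when nth_value() returns NULL. It assumes the default frame, which stops at the current row, and compares it with a frame that spans the whole partition; both column names are invented for the example.

```sql
-- Second-highest score of 高虎 (sid '03') as seen from each row.
SELECT c.cname, sc.score,
       NTH_VALUE(sc.score, 2) OVER (ORDER BY sc.score DESC) AS second_default_frame,
       NTH_VALUE(sc.score, 2) OVER (ORDER BY sc.score DESC
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND UNBOUNDED FOLLOWING) AS second_full_frame
FROM score sc INNER JOIN course c ON sc.cid = c.cid
WHERE sc.sid = '03'
ORDER BY sc.score DESC;

-- cname  score  second_default_frame  second_full_frame
-- 后摇   99     NULL                  89
-- 朋克   89     89                    89
-- 迷幻   59     89                    89
-- 金属   57     89                    89
```

On the first row the default frame contains only one row, so there is no second value yet.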


  • ntile()
    • Purpose: divide the ordered rows of the partition into n buckets and return each row's bucket number.
    • This function is widely used in data analysis. For example, when a large data set has to be split evenly across N parallel processes, NTILE(N) can assign each row to a group. Because the number of rows is not necessarily divisible by N, the groups may not be exactly equal: the leftover rows are handed out one each to the first group, the second group, and so on until none remain.
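
The uneven split can be sketched with the nine 金属 (cid '01') scores and four buckets: 9 is not divisible by 4, so the first bucket receives the one leftover row. Using sc.sid as a tie-breaker is an assumption made here so that the bucket assignment of tied scores is deterministic.

```sql
-- Split the nine 金属 (cid '01') scores into 4 buckets.
SELECT s.sname, sc.score,
       NTILE(4) OVER (ORDER BY sc.score DESC, sc.sid) AS bucket
FROM student s INNER JOIN score sc ON s.sid = sc.sid
WHERE sc.cid = '01'
ORDER BY sc.score DESC, sc.sid;

-- sname  score  bucket
-- 亚千   90     1
-- 窦唯   90     1
-- 李健   85     1
-- 华东   85     2
-- 石璐   78     2
-- 史立   76     3
-- 崔健   60     3
-- 高虎   57     4
-- 子健   34     4
```

Bucket 1 holds three rows and buckets 2 to 4 hold two rows each.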


Origin blog.csdn.net/qq_42962353/article/details/109066608