MySQL 8.0 Window Functions

 

Introduction

Window functions are SQL functions that compute a value for each row from a set of related rows, called the window. Unlike an aggregate function used with GROUP BY, a window function does not collapse multiple rows into one: every row is kept, and the computed value is added to it as an extra column. The window is defined by how the rows are partitioned (grouped) and ordered.

Common window functions include ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(), LAG(), and LEAD().

For example, ROW_NUMBER() numbers the rows within each partition of the result set, which makes it easy to pick the first N rows per group. LAG() and LEAD() read values from the rows before and after the current row, so each row can be compared with its neighbors.

Because they work on partitioned and ordered rows without dropping any of them, window functions are a very useful tool for analyzing and processing the data in a result set.

MySQL official documentation: https://dev.mysql.com/doc/refman/8.0/en/window-functions.html

Note: Window functions are officially supported only in MySQL 8.0 and later.

1. What is the difference between a window function and an aggregate function?

  • Data processing range: An aggregate function works on a whole group of rows (the entire result set, or each GROUP BY group) and returns a single value per group. A window function is evaluated for each row over that row's window, and the result is shown on every row.
  • Calculation result: An aggregate function produces one result per group and is typically used for sums, averages, and maximum/minimum values. A window function produces one result per input row, which appears as an additional column in the query result set.
  • Syntax: Aggregate functions are usually used in the select list and the HAVING clause of a SELECT statement. A window function is marked by the OVER keyword that follows it, and it may appear only in the select list and the ORDER BY clause.
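
The difference is easiest to see side by side. The following is a minimal sketch: the inline `orders` rows are made-up sample data, used only to keep both queries self-contained.

```sql
-- Aggregate function with GROUP BY: one output row per user.
WITH orders (user_id, money) AS (
    SELECT 1001, 100.00 UNION ALL
    SELECT 1001, 200.00 UNION ALL
    SELECT 1002,  50.00
)
SELECT user_id, SUM(money) AS total
FROM orders
GROUP BY user_id;
-- 2 rows: (1001, 300.00) and (1002, 50.00)

-- The same aggregate as a window function: every input row is kept.
WITH orders (user_id, money) AS (
    SELECT 1001, 100.00 UNION ALL
    SELECT 1001, 200.00 UNION ALL
    SELECT 1002,  50.00
)
SELECT user_id, money,
       SUM(money) OVER (PARTITION BY user_id) AS user_total
FROM orders;
-- 3 rows: both 1001 rows carry user_total 300.00, the 1002 row carries 50.00
```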

2. The official explanation of window functions


The official wording is precise, but it can still be a little hard to follow.

3. Categories of window functions

3.1. Ranking

  • ROW_NUMBER() : Assigns a sequential number (1, 2, 3, ...) to the rows within each partition, in the order given by ORDER BY, so individual rows are easy to pick out.
  • RANK() : Ranks the rows within each partition according to ORDER BY. Rows with equal values share a rank, and the ranks after a tie are skipped (1, 2, 2, 4).
  • DENSE_RANK() : Ranks rows like RANK(), but no ranks are skipped after a tie (1, 2, 2, 3).

3.2. Distribution

  • PERCENT_RANK() : Calculates the percentage rank of each row within its partition, as (rank - 1) / (rows in partition - 1).
  • CUME_DIST() : Calculates the cumulative distribution of each row within its partition: the fraction of rows ordered at or before the current row.

3.3. Preceding and following rows

  • LAG() : Returns a value from a row that comes before the current row in the partition, so the current row can be compared with an earlier one.
  • LEAD() : Returns a value from a row that comes after the current row in the partition, so the current row can be compared with a later one.

3.4. First and last values

  • FIRST_VALUE() : Returns the value from the first row of the window frame within the ordered partition.
  • LAST_VALUE() : Returns the value from the last row of the window frame within the ordered partition.

3.5. Others

  • NTILE() : Divides the ordered rows of each partition into the specified number of buckets and returns the bucket number of each row, which is useful for splitting data into roughly equal groups.
  • NTH_VALUE() : Returns the value from the Nth row of the window frame within the ordered partition.

4. Syntax and usage

4.1. Syntax

<window function> OVER ([PARTITION BY <grouping columns>] [ORDER BY <sorting columns> [ASC|DESC]] [{ROWS|RANGE} BETWEEN <start position> AND <end position>])
  • <window function> is the function to execute: either a dedicated window function such as ROW_NUMBER() or RANK(), or an aggregate function such as SUM, AVG, MAX, MIN, or COUNT;
  • <grouping columns> are the columns that split the rows into partitions;
  • <sorting columns> are the columns to sort by within each partition; multiple sorting columns can be specified, separated by commas;
  • ROWS and RANGE select a row-level window and a range-level window respectively;
  • <start position> and <end position> are the boundaries of the window, for example UNBOUNDED PRECEDING, 1 PRECEDING, CURRENT ROW, 1 FOLLOWING, or UNBOUNDED FOLLOWING.

In MySQL 8.0, a row window (frame) is a set of contiguous rows around the current row, within the current partition, over which the window function is computed.

Row windows are specified by the following keywords:

  • ROWS: Declares a row window, whose boundaries are counted in rows.
  • BETWEEN ... AND ...: Specifies the start and end positions of the row window.
  • PRECEDING: Refers to rows before the current row (N PRECEDING, or UNBOUNDED PRECEDING for the first row of the partition).
  • FOLLOWING: Refers to rows after the current row (N FOLLOWING, or UNBOUNDED FOLLOWING for the last row of the partition).

Commonly used row window specifications:

  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: From the first row of the partition to the current row, including the current row.
  • ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING: From the current row to the last row of the partition, including the current row.
  • ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: The previous row, the current row, and the next row.

Explanation: A row window can be used with aggregate operations such as sum, average, and count to compute values like running totals and moving averages for each row of a partition.
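
The three row windows listed above can be sketched as follows. The inline `orders` rows and the column aliases are illustrative assumptions, not data from this article.

```sql
WITH orders (order_id, user_id, money) AS (
    SELECT 1, 1001, 100.00 UNION ALL
    SELECT 2, 1001, 200.00 UNION ALL
    SELECT 3, 1001, 300.00 UNION ALL
    SELECT 4, 1001, 400.00
)
SELECT
    order_id,
    money,
    -- first row of the partition up to the current row: a running total
    SUM(money) OVER (PARTITION BY user_id ORDER BY order_id
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    -- current row to the last row of the partition: what is still to come
    SUM(money) OVER (PARTITION BY user_id ORDER BY order_id
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_total,
    -- previous, current, and next row: a three-row moving average
    AVG(money) OVER (PARTITION BY user_id ORDER BY order_id
                     ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg
FROM orders
ORDER BY order_id;
-- running_total:   100.00, 300.00, 600.00, 1000.00
-- remaining_total: 1000.00, 900.00, 700.00, 400.00
-- moving_avg:      150, 200, 300, 350 (MySQL prints these with extra decimal places)
```

At the edges of the partition the frame simply contains fewer rows, which is why the first and last moving averages are taken over two rows instead of three.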

4.2. Aggregate functions as window functions

  • An aggregate function used with GROUP BY returns a single value per group. With an OVER clause, the same function is computed for each row over that row's window, and the result is shown on every row.

4.2.1. Table structure

DROP TABLE IF EXISTS `order_for_goods`;
CREATE TABLE `order_for_goods`  (
  `order_id` int(0) NOT NULL AUTO_INCREMENT,
  `user_id` int(0) NULL DEFAULT NULL,
  `money` decimal(10, 2) NULL DEFAULT NULL,
  `quantity` int(0) NULL DEFAULT NULL,
  `join_time` datetime(0) NULL DEFAULT NULL,
  PRIMARY KEY (`order_id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 12 CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

4.2.2. Table data

INSERT INTO order_for_goods (user_id, money, quantity, join_time )
VALUES
	( 1001, 1800.90, 1, '2023-06-07'),
	( 1001, 3600.89, 5, '2023-05-02'),
	( 1001, 1000.10, 6, '2023-01-08'),
	( 1002, 1100.90, 9, '2023-04-07'),
	( 1002, 4500.99, 1, '2023-03-14'),
	( 1003, 2500.10, 3, '2023-02-14'),
	( 1002, 2500.90, 1, '2023-03-14'),
	( 1003, 2500.90, 1, '2022-12-12'),
	( 1003, 2500.90, 2, '2022-09-08'),
    ( 1003, 6000.90, 8, '2023-01-10');

4.2.3. Aggregate functions as window functions

1. The statement is as follows

select 
	*,
	sum(money) over(partition by user_id order by order_id) as alias_sum,
	avg(money) over(partition by user_id order by order_id) as alias_avg,
	max(money) over(partition by user_id order by order_id) as alias_max,
	min(money) over(partition by user_id order by order_id) as alias_min,
	count(money) over(partition by user_id order by order_id) as alias_count
from order_for_goods;
  • Selects all columns from the order_for_goods table and adds, for each order, the running total, average, maximum, minimum, and count of that user's order amounts up to and including the current order.
  • The query uses sum(), avg(), max(), min(), and count() as window functions. Each is followed by an over() clause that specifies the window for the computation: here the rows are partitioned by user_id and sorted by order_id. Because the window has an ORDER BY and no explicit frame, the default frame runs from the first row of the partition to the current row, which is why the values accumulate row by row instead of covering the whole partition.
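
Since all five functions share the same window, the repeated over() clause can be factored out with a named window. This is an optional restatement of the query above, not the author's original form; it assumes the order_for_goods table from section 4.2.1, and the window name `w` is arbitrary.

```sql
-- Same result as the query above; the window is defined once and reused.
SELECT
    *,
    SUM(money)   OVER w AS alias_sum,
    AVG(money)   OVER w AS alias_avg,
    MAX(money)   OVER w AS alias_max,
    MIN(money)   OVER w AS alias_min,
    COUNT(money) OVER w AS alias_count
FROM order_for_goods
WINDOW w AS (PARTITION BY user_id ORDER BY order_id);
```

Defining the window in one place keeps the partitioning and ordering consistent when the query is edited later.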

2. The query results return the selected columns and the calculated alias columns as follows

4.3. Ranking functions

4.3.1. ROW_NUMBER() function

1. Execute the statement

select *
from (
	select *,row_number() over(partition by user_id order by money desc) as alias_row_number
	from order_for_goods) t
where alias_row_number<=3;
  • The SQL statement above uses the window function row_number() to number the rows in each partition, and the outer query keeps the three rows with the highest money in each partition.
  • The inner query selects all columns from the order_for_goods table and uses row_number() to assign sequential numbers within each partition. The data is partitioned by the user_id column and sorted in descending order by the money column. The subquery is needed because a window function cannot be used in the WHERE clause of the same query.
  • The outer query keeps the rows whose number is less than or equal to 3, that is, the three highest-value orders of each user.

2. Execution results

 3. Execute the statement

select *
from (
	select *,row_number() over(partition by user_id order by money desc) as alias_row_number
	from order_for_goods) t
where alias_row_number<=1;
  •  This query is the same as the previous one, except that alias_row_number<=3 is changed to alias_row_number<=1, so it returns only the highest-value order in each partition.

4. Execution results

Summary: This pattern generalizes to many problems. For example, you can find the top three sellers in each product category. Window functions solve many such problems and avoid a lot of hard-to-maintain, hard-to-read SQL logic.
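
One caveat for the top-N pattern: when two rows in a partition have the same money (user 1003 has two orders of 2500.90), ORDER BY money DESC alone does not determine which of them gets the lower row number. A possible fix, sketched here against the order_for_goods table from section 4.2.1, is to add a unique column as a tiebreaker; the choice of order_id is an assumption, and any unique column would do.

```sql
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY money DESC, order_id   -- order_id breaks ties on money
        ) AS alias_row_number
    FROM order_for_goods
) t
WHERE alias_row_number <= 3;
```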

4.3.2. RANK() function

1. Execute the statement

select 
	*,
    rank() over(partition by user_id order by money desc) as alias_rank 
from order_for_goods;
  •  The SQL statement above uses the window function rank() to rank each user's orders by money, returned as alias_rank.
  • Within each partition, rank() gives rows with equal money the same rank and leaves a gap after them, so the ranks are not necessarily consecutive (for example 1, 2, 2, 4).
  • Note that the statement has no WHERE clause, so it returns every row of the order_for_goods table, and select * returns every column. To query specific rows or columns, add a WHERE condition or list the column names in the select clause.

 2. Execution results

4.3.3. DENSE_RANK() function

 1. Execute the statement

select 
	*,
    dense_rank() over(partition by user_id order by money desc) as alias_dense_rank 
from order_for_goods;
  •  The SQL statement above uses the window function dense_rank() to rank each user's orders by money without gaps, returned as alias_dense_rank.
  • Within each partition, dense_rank() gives rows with equal money the same rank, and the next distinct value gets the next consecutive rank, so no ranks are skipped (for example 1, 2, 2, 3).
  • Note that the statement has no WHERE clause, so it returns every row of the order_for_goods table, and select * returns every column. To query specific rows or columns, add a WHERE condition or list the column names in the select clause.

 2. Execution results

4.3.4. Comparison of the three ranking functions

 1. Execute the statement

select 
	*,
 	row_number() over(partition by user_id order by money desc) as alias_row_number,
    rank() over(partition by user_id order by money desc) as alias_rank,
 	dense_rank() over(partition by user_id order by money desc) as alias_dense_rank
from order_for_goods;
  •  Selects all columns from the order_for_goods table and adds the row number, rank, and dense rank of each order within its user's orders.
  • This query uses the row_number(), rank(), and dense_rank() functions to compute the row number, rank, and dense rank of the rows within each partition. Each function is followed by an over() clause that specifies the window for the computation. In this example, the window is partitioned by user_id and sorted in descending order by the money column.
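
The three functions only differ when there are ties. The sketch below isolates that case using the four money values of user 1003 from the sample data; the inline `orders` CTE and the aliases rn, rnk, and dense_rnk are illustrative names.

```sql
WITH orders (user_id, money) AS (
    SELECT 1003, 6000.90 UNION ALL
    SELECT 1003, 2500.90 UNION ALL
    SELECT 1003, 2500.90 UNION ALL
    SELECT 1003, 2500.10
)
SELECT
    money,
    ROW_NUMBER() OVER w AS rn,         -- always 1, 2, 3, 4
    RANK()       OVER w AS rnk,        -- ties share a rank, then a gap
    DENSE_RANK() OVER w AS dense_rnk   -- ties share a rank, no gap
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY money DESC)
ORDER BY rn;
-- money   | rn | rnk | dense_rnk
-- 6000.90 |  1 |   1 |         1
-- 2500.90 |  2 |   2 |         2
-- 2500.90 |  3 |   2 |         2
-- 2500.10 |  4 |   4 |         3
```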

  2. Execution results

4.4. Distribution functions

4.4.1. PERCENT_RANK() function

  1. Execute the statement

select 
	*,
	percent_rank() over(partition by user_id order by money desc) as alias_percent_rank
from order_for_goods;
  •  Selects all columns from the order_for_goods table and adds the percentage rank of each order within its user's orders.
  • This query uses the percent_rank() function to calculate the percentage rank of the rows within each partition, computed as (rank - 1) / (rows in partition - 1), so the values run from 0 to 1. The function is followed by an over() clause that specifies the window for the calculation. In this example, the window is partitioned by user_id and sorted in descending order by the money column.
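
To make the formula concrete, the following sketch computes percent_rank() next to the same value worked out by hand, using the four money values of user 1003. The inline `orders` CTE and the aliases are illustrative.

```sql
WITH orders (user_id, money) AS (
    SELECT 1003, 6000.90 UNION ALL
    SELECT 1003, 2500.90 UNION ALL
    SELECT 1003, 2500.90 UNION ALL
    SELECT 1003, 2500.10
)
SELECT
    money,
    PERCENT_RANK() OVER w AS pct_rank,
    -- by hand: (rank - 1) / (rows in partition - 1)
    (RANK() OVER w - 1) / (COUNT(*) OVER (PARTITION BY user_id) - 1) AS by_hand
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY money DESC)
ORDER BY money DESC;
-- ranks are 1, 2, 2, 4 over 4 rows, so both columns give
-- 0, 1/3, 1/3, 1 (printed with different numbers of decimal places)
```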

  2. Execution results

4.4.2. CUME_DIST() function

  1. Execute the statement

select 
	*,
	cume_dist() over(partition by user_id order by money desc) as alias_cume_dist
from order_for_goods;
  •  Selects all columns from the order_for_goods table and adds the cumulative distribution value of each order within its user's orders.
  • This query uses the cume_dist() function, which returns the number of rows ordered at or before the current row (ties included) divided by the number of rows in the partition. The function is followed by an over() clause that specifies the window for the calculation. In this example, the window is partitioned by user_id and sorted in descending order by the money column.
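
As a check on that definition, this sketch compares cume_dist() with a hand-computed ratio for the four money values of user 1003. The inline `orders` CTE and the aliases are illustrative.

```sql
WITH orders (user_id, money) AS (
    SELECT 1003, 6000.90 UNION ALL
    SELECT 1003, 2500.90 UNION ALL
    SELECT 1003, 2500.90 UNION ALL
    SELECT 1003, 2500.10
)
SELECT
    money,
    CUME_DIST() OVER w AS cume,
    -- by hand: rows at or before the current value (ties included) / rows in partition
    COUNT(*) OVER w / COUNT(*) OVER (PARTITION BY user_id) AS by_hand
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY money DESC)
ORDER BY money DESC;
-- both columns: 0.25, 0.75, 0.75, 1
```

The hand-computed column relies on the default frame of an ordered window, which includes rows that tie with the current row; that is why both 2500.90 rows count three rows.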

  2. Execution results

4.5. Preceding and following row functions

4.5.1. LAG() function

1. Syntax

  • The LAG() function returns the value of an expression from the row that precedes the current row by a given number of rows within the partition.
LAG(expression, offset, default_value)
  1.  expression: the column or expression whose value is returned
  2.  offset: how many rows back from the current row to look (1 if omitted)
  3.  default_value: the value returned when no such row exists (NULL if omitted)

2. Execute the statement

select 
	*,
	lag(join_time, 1, 0) over(partition by user_id order by join_time desc) as alias_lag
from order_for_goods;
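
A typical use of LAG() is measuring the gap between consecutive events. The sketch below uses the three join_time values of user 1001 and leaves the default value as NULL; the inline `orders` CTE, the ascending sort, and the aliases are assumptions made for illustration.

```sql
WITH orders (user_id, join_time) AS (
    SELECT 1001, '2023-06-07' UNION ALL
    SELECT 1001, '2023-05-02' UNION ALL
    SELECT 1001, '2023-01-08'
)
SELECT
    join_time,
    LAG(join_time) OVER w AS prev_join_time,   -- NULL for the first order
    DATEDIFF(join_time, LAG(join_time) OVER w) AS days_since_prev
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY join_time)
ORDER BY join_time;
-- join_time  | prev_join_time | days_since_prev
-- 2023-01-08 | NULL           | NULL
-- 2023-05-02 | 2023-01-08     | 114
-- 2023-06-07 | 2023-05-02     | 36
```

Note that the direction of "previous" follows the window's ORDER BY: with order by join_time desc, as in the statement above, lag() returns the next later order instead.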

 3. Execution results

4.5.2. LEAD() function

1. Syntax

  • The LEAD() function returns the value of an expression from the row that follows the current row by a given number of rows within the partition.
LEAD(expression, offset, default_value)
  1.  expression: the column or expression whose value is returned
  2.  offset: how many rows forward from the current row to look (1 if omitted)
  3.  default_value: the value returned when no such row exists (NULL if omitted)

2. Execute the statement

select 
	*,
	lead(join_time, 1, 0) over(partition by user_id order by join_time desc) as alias_lead
from order_for_goods;
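
LEAD() works the same way in the other direction. This sketch compares each order of user 1001 with the next one in time; the inline `orders` CTE, the ascending sort, and the aliases are illustrative assumptions.

```sql
WITH orders (user_id, money, join_time) AS (
    SELECT 1001, 1800.90, '2023-06-07' UNION ALL
    SELECT 1001, 3600.89, '2023-05-02' UNION ALL
    SELECT 1001, 1000.10, '2023-01-08'
)
SELECT
    join_time,
    money,
    LEAD(money) OVER w AS next_money,            -- NULL for the last order
    LEAD(money) OVER w - money AS change_to_next
FROM orders
WINDOW w AS (PARTITION BY user_id ORDER BY join_time)
ORDER BY join_time;
-- join_time  | money   | next_money | change_to_next
-- 2023-01-08 | 1000.10 |    3600.89 |        2600.79
-- 2023-05-02 | 3600.89 |    1800.90 |       -1799.99
-- 2023-06-07 | 1800.90 |       NULL |           NULL
```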

3. Execution results

4.6. First and last value functions

4.6.1. FIRST_VALUE() function

1. Syntax

  • FIRST_VALUE: Returns the value from the first row of the window frame.
FIRST_VALUE(expression)
  1.  expression: The column or expression whose value is taken from the first row.

2. Execute the statement

select 
	*,
	first_value(money) over(partition by user_id order by join_time desc) as alias_first_value
from order_for_goods;
  • Note that FIRST_VALUE() returns NULL if the window frame of a row contains no rows.

 3. Execution results

 

4.6.2. LAST_VALUE() function

1. Syntax

  • LAST_VALUE: Returns the value from the last row of the window frame.
LAST_VALUE(expression)
  1.  expression: The column or expression whose value is taken from the last row.

2. Execute the statement

select 
	*,
	last_value(money) over(partition by user_id order by join_time) as alias_last_value
from order_for_goods;
  • Note that LAST_VALUE() returns NULL if the window frame of a row contains no rows.

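3. Execution results

4. Explanation

  1. You may notice that LAST_VALUE() does not return the last value of the partition. The window is partitioned by user_id and sorted by the join_time column, so one might expect every row in the 1001 partition to return 1800.90, the amount of that user's latest order. Instead, each row returns its own amount. Why?
  2. The reason is that when ORDER BY is present and no frame is given, the default frame is range between unbounded preceding and current row. The frame ends at the current row (plus any rows that tie with it in the ORDER BY), so the last value in the frame is the current row's own value or that of a tied row.

5. Verification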

select 
	*,
	last_value(money) over(partition by user_id order by join_time) as alias_last_value1,
	last_value(money) over(partition by user_id order by join_time rows between unbounded preceding and current row) as alias_last_value2,
	last_value(money) over(partition by user_id order by join_time rows between unbounded preceding and unbounded following) as alias_last_value3
from order_for_goods;
  •  alias_last_value2 spells out the frame as rows between unbounded preceding and current row, that is, from the first row of the partition up to the current row. It matches the default result (alias_last_value1) on every row except order 36. The default frame uses RANGE rather than ROWS, so it also includes rows that tie with the current row on join_time: orders 36 and 38 share the same join_time, so the default frame for order 36 extends to order 38 and returns 2500.90.
  •  alias_last_value3 specifies rows between unbounded preceding and unbounded following, so the frame covers the entire partition. With the partition on user_id sorted by the join_time column, every row in the 1001 partition now returns 1800.90, the amount of that user's last order.
+----------+---------+---------+----------+---------------------+-------------------+-------------------+-------------------+
| order_id | user_id | money   | quantity | join_time           | alias_last_value1 | alias_last_value2 | alias_last_value3 |
+----------+---------+---------+----------+---------------------+-------------------+-------------------+-------------------+
|       34 |    1001 | 1000.10 |        6 | 2023-01-08 00:00:00 |           1000.10 |           1000.10 |           1800.90 |
|       33 |    1001 | 3600.89 |        5 | 2023-05-02 00:00:00 |           3600.89 |           3600.89 |           1800.90 |
|       32 |    1001 | 1800.90 |        1 | 2023-06-07 00:00:00 |           1800.90 |           1800.90 |           1800.90 |
|       36 |    1002 | 4500.99 |        1 | 2023-03-14 00:00:00 |           2500.90 |           4500.99 |           1100.90 |
|       38 |    1002 | 2500.90 |        1 | 2023-03-14 00:00:00 |           2500.90 |           2500.90 |           1100.90 |
|       35 |    1002 | 1100.90 |        9 | 2023-04-07 00:00:00 |           1100.90 |           1100.90 |           1100.90 |
|       40 |    1003 | 2500.90 |        2 | 2022-09-08 00:00:00 |           2500.90 |           2500.90 |           2500.10 |
|       39 |    1003 | 2500.90 |        1 | 2022-12-12 00:00:00 |           2500.90 |           2500.90 |           2500.10 |
|       41 |    1003 | 6000.90 |        8 | 2023-01-10 00:00:00 |           6000.90 |           6000.90 |           2500.10 |
|       37 |    1003 | 2500.10 |        3 | 2023-02-14 00:00:00 |           2500.10 |           2500.10 |           2500.10 |
+----------+---------+---------+----------+---------------------+-------------------+-------------------+-------------------+
10 rows in set (0.00 sec)
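
In the output above, the first and second result columns differ on exactly one row: order 36, which has the same join_time as order 38. This comes from the difference between a RANGE frame (the default when ORDER BY is present) and a ROWS frame. A small sketch that counts the rows in each frame makes this visible; it uses the three orders of user 1002 as an inline CTE, and the aliases are illustrative.

```sql
WITH orders (order_id, user_id, money, join_time) AS (
    SELECT 35, 1002, 1100.90, '2023-04-07' UNION ALL
    SELECT 36, 1002, 4500.99, '2023-03-14' UNION ALL
    SELECT 38, 1002, 2500.90, '2023-03-14'
)
SELECT
    order_id,
    join_time,
    -- default frame (RANGE): rows tied on join_time are all inside the frame
    COUNT(*) OVER (PARTITION BY user_id ORDER BY join_time) AS rows_in_default_frame,
    -- explicit ROWS frame: stops exactly at the current row
    COUNT(*) OVER (PARTITION BY user_id ORDER BY join_time
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_in_rows_frame
FROM orders;
-- the two 2023-03-14 rows: default frame holds 2 rows for both of them,
--                          the ROWS frame holds 1 row for one and 2 for the other
-- the 2023-04-07 row:      3 rows in both frames
```

Which of the two tied rows comes first is not fixed by order by join_time alone, so adding a unique column to the ORDER BY is a reasonable way to make ROWS-based results repeatable.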

4.7. Other functions

4.7.1. NTILE() function

1. Syntax

  • NTILE() divides the ordered rows of each partition into a specified number of buckets, as evenly as possible, and returns the bucket number of each row.
NTILE(bucket_count)
  1.  bucket_count: A positive integer giving the number of buckets to divide each partition into.

2. Execute the statement

select 
	*,  
	ntile(1) over(partition by user_id order by join_time desc) as alias_ntile1,
	ntile(2) over(partition by user_id order by join_time desc) as alias_ntile2,
	ntile(3) over(partition by user_id order by join_time desc) as alias_ntile3
from order_for_goods;
  •  The query uses the window function NTILE(), which distributes the rows of each partition as evenly as possible into the specified number of buckets and returns the bucket number of each row.
  •  Take the alias "alias_ntile3" as an example. ntile(3) divides each user's rows into three buckets, partition by user_id makes each user a separate partition, and order by join_time desc sorts the rows of each partition by join_time in descending order before they are assigned to buckets.
  • Likewise, ntile(2) divides the rows into two buckets, and ntile(1) puts them all into a single bucket.

3. Execution results

 Explanation: NTILE() distributes the ordered rows as evenly as possible across the specified number of buckets and assigns a bucket number to each row. If the rows cannot be divided evenly, the lower-numbered buckets receive one extra row each, so bucket sizes differ by at most 1.
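
The uneven case can be seen with the four orders of user 1003 split into three buckets. The inline `orders` CTE and the alias bucket are illustrative.

```sql
WITH orders (user_id, join_time) AS (
    SELECT 1003, '2023-02-14' UNION ALL
    SELECT 1003, '2022-12-12' UNION ALL
    SELECT 1003, '2022-09-08' UNION ALL
    SELECT 1003, '2023-01-10'
)
SELECT
    join_time,
    NTILE(3) OVER (PARTITION BY user_id ORDER BY join_time DESC) AS bucket
FROM orders
ORDER BY join_time DESC;
-- join_time  | bucket
-- 2023-02-14 |      1
-- 2023-01-10 |      1   (bucket 1 takes the extra row)
-- 2022-12-12 |      2
-- 2022-09-08 |      3
```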

4.7.2. NTH_VALUE() function

1. Syntax

  • NTH_VALUE() is a window function that returns the value of an expression from the Nth row of the window frame.
NTH_VALUE(expression, nth_parameter)
  1.  expression: The expression to evaluate, which yields a single value per row.
  2.  nth_parameter: A positive integer giving the position (counted from 1) of the row whose value is returned.

2. Execute the statement

select 
	*,  
    nth_value(money, 2) over(partition by user_id order by join_time ) as alias_nth_value
from order_for_goods;
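  • Note that NTH_VALUE() returns NULL if the window frame contains fewer than N rows. With the default frame used here, the frame ends at the current row, so the first row of a partition returns NULL unless another row ties with it on join_time.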
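
As with LAST_VALUE(), the result depends on the frame. The sketch below compares the default frame with a frame that covers the whole partition, using the three orders of user 1001; the inline `orders` CTE and the aliases are illustrative.

```sql
WITH orders (user_id, money, join_time) AS (
    SELECT 1001, 1800.90, '2023-06-07' UNION ALL
    SELECT 1001, 3600.89, '2023-05-02' UNION ALL
    SELECT 1001, 1000.10, '2023-01-08'
)
SELECT
    join_time,
    money,
    -- default frame: only rows up to the current row are visible
    NTH_VALUE(money, 2) OVER (PARTITION BY user_id ORDER BY join_time) AS second_default_frame,
    -- whole-partition frame: the second row is visible from every row
    NTH_VALUE(money, 2) OVER (PARTITION BY user_id ORDER BY join_time
                              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS second_full_frame
FROM orders
ORDER BY join_time;
-- join_time  | money   | second_default_frame | second_full_frame
-- 2023-01-08 | 1000.10 |                 NULL |           3600.89
-- 2023-05-02 | 3600.89 |              3600.89 |           3600.89
-- 2023-06-07 | 1800.90 |              3600.89 |           3600.89
```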

3. Execution results

 


Origin blog.csdn.net/weixin_50002038/article/details/131011696