Window functions revealed! Easily calculate the cumulative proportion of data, an excellent tool for data analysis

The previous article, "How to Use Window Functions to Implement Ranking Calculation", introduced window functions in ranking scenarios. Window functions can do more than compute a value from the current row alone: for each row they can also open a calculation window of a specified size. This window, called the frame, is declared in the SQL statement itself. It can be as large as the entire partition or as small as a few rows offset from the current row (for example, the row before and the row after it). Today, the editor will introduce the application of window functions in cumulative analysis scenarios.

Note that if your database version is earlier than those listed below, you will not be able to use the window functions shown in this article.

1. MySQL (>=8.0)
2. PostgreSQL (>=11)
3. SQL Server (>=2012)
4. Oracle (>=8i)
5. SQLite (>=3.28.0)

Requirements background

As in the previous article, to make things easier to follow, I will use a factory's consumables data as the background. Suppose a factory has just finished a round of consumables processing and recorded, for each consumable, its category, the daily recording date, the daily consumption, and the supply at the beginning of the month, as shown in the following table:

Now the boss of this company wants to see:

1. The daily cumulative consumption of each consumable.

2. The daily balance of each consumable for the current month.

3. Each day's share of the monthly consumption of each consumable.

Query the daily cumulative consumption of each consumable

Execute the following SQL statement.

select cate, record_date, init_value,
       SUM(cost) over(partition by cate, MONTH(record_date) order by record_date) as cm_cost
from material_data md;

The query above returns, for each category, the running total of daily consumption within each month. The key part of the SQL is:

SUM(cost) over(partition by cate,MONTH(record_date) order by record_date )

As introduced in the previous article, partition by specifies the calculation partition, and order by determines the order in which rows are processed. But what produces the cumulative effect? A slightly modified version of the query makes this clearer.

select cate, record_date, init_value,
       SUM(cost) over(partition by cate, MONTH(record_date) order by record_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cm_cost
from material_data md;

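The modified SQL returns the same result as the original query here. (When order by is present and no frame is given, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which matches the ROWS version as long as record_date has no duplicates within a partition.) The modified SQL adds one clause after order by: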

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Let us take this clause apart. ROWS specifies that the frame is defined in row mode, that is, by counting rows. BETWEEN introduces the two boundaries of the frame. UNBOUNDED PRECEDING is a combination of two keywords: PRECEDING points in the ↑ direction (rows before the current row), and UNBOUNDED pushes that boundary to the very top of the partition. For the June partition defined by partition by, UNBOUNDED PRECEDING means that the upper bound of every row's frame is the first row in order by record_date order, that is, the record for 2023/06/01. AND CURRENT ROW then sets the ↓ boundary of the frame to the current row.

Putting it together: within each category and month, the frame of each row runs from the earliest date of that month to the current row. Combined with the SUM(cost) aggregate, this is why the query produces a cumulative value.
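The frame described above can be tried out with a minimal sketch that uses Python's built-in sqlite3 module. The sample rows are hypothetical (they are not the factory's data), and because SQLite has no MONTH() function, strftime('%m', record_date) stands in for it. The sketch runs the query once with the explicit frame and once without it, so the two result sets can be compared.

```python
import sqlite3

# Hypothetical sample rows: (cate, record_date, cost, init_value).
rows = [
    ("A", "2023-06-01", 10, 100),
    ("A", "2023-06-02", 20, 100),
    ("A", "2023-06-03", 5, 100),
    ("A", "2023-07-01", 8, 100),   # a new month starts a new partition
    ("B", "2023-06-01", 7, 50),
    ("B", "2023-06-02", 3, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE material_data "
    "(cate TEXT, record_date TEXT, cost INTEGER, init_value INTEGER)"
)
conn.executemany("INSERT INTO material_data VALUES (?, ?, ?, ?)", rows)

# Explicit frame: from the first row of the partition to the current row.
explicit_frame = conn.execute(
    """
    SELECT cate, record_date,
           SUM(cost) OVER (
               PARTITION BY cate, strftime('%m', record_date)
               ORDER BY record_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cm_cost
    FROM material_data
    ORDER BY cate, record_date
    """
).fetchall()

# No frame clause: the default frame applies.
default_frame = conn.execute(
    """
    SELECT cate, record_date,
           SUM(cost) OVER (
               PARTITION BY cate, strftime('%m', record_date)
               ORDER BY record_date
           ) AS cm_cost
    FROM material_data
    ORDER BY cate, record_date
    """
).fetchall()

for row in explicit_frame:
    print(row)
```

With one row per category per day, both queries return the same running totals, and the total restarts when the month changes.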

We can expand on this. Besides UNBOUNDED PRECEDING and CURRENT ROW, the frame boundaries can also use UNBOUNDED FOLLOWING. If UNBOUNDED PRECEDING is the top of the partition, UNBOUNDED FOLLOWING is the bottom. So ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING aggregates over the entire partition. UNBOUNDED is also optional: it can be replaced with a non-negative integer giving the number of rows offset from the current row. For example, 1 PRECEDING is the row before the current row and 1 FOLLOWING is the row after it. With ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING, each row's sum covers three rows: the previous row, the current row, and the next row. CURRENT ROW designates the current row itself, and using it as the lower boundary is the key to a cumulative sum.
Similarly, the same rules apply to other aggregate functions such as MAX() and AVG(): we can calculate the maximum, the average, and other aggregates within each row's frame.
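The frame variants above can be compared side by side in a small sketch, again using Python's sqlite3 module with hypothetical sample rows for a single category. The column aliases (sum_3_rows, max_2_rows, avg_all) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE material_data (cate TEXT, record_date TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO material_data VALUES (?, ?, ?)",
    [
        # Hypothetical daily costs for one category.
        ("A", "2023-06-01", 10),
        ("A", "2023-06-02", 20),
        ("A", "2023-06-03", 5),
        ("A", "2023-06-04", 15),
    ],
)

frames = conn.execute(
    """
    SELECT record_date,
           -- previous row + current row + next row
           SUM(cost) OVER (PARTITION BY cate ORDER BY record_date
                           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_3_rows,
           -- previous row and current row only
           MAX(cost) OVER (PARTITION BY cate ORDER BY record_date
                           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS max_2_rows,
           -- the whole partition
           AVG(cost) OVER (PARTITION BY cate ORDER BY record_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg_all
    FROM material_data
    ORDER BY record_date
    """
).fetchall()

for row in frames:
    print(row)
```

At the edges of the partition the frame simply shrinks: the first row has no previous row, so its three-row sum covers only two rows.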

Query the daily balance of each consumable for the current month

Query SQL:

select
  cate,
  record_date,
  init_value,
  init_value - SUM(cost) over(partition by cate, MONTH(record_date) order by record_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as material_num
from material_data md;

It can also be abbreviated as:

select
  cate,
  record_date,
  init_value,
  init_value - SUM(cost) over(partition by cate, MONTH(record_date) order by record_date) as material_num
from material_data md;
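As a rough check of the balance query, the sketch below runs it against hypothetical sample rows in an in-memory SQLite database through Python's sqlite3 module. It assumes init_value repeats the start-of-month supply on every row, and it uses strftime('%m', record_date) because SQLite has no MONTH() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE material_data "
    "(cate TEXT, record_date TEXT, cost INTEGER, init_value INTEGER)"
)
conn.executemany(
    "INSERT INTO material_data VALUES (?, ?, ?, ?)",
    [
        # Hypothetical rows: category A starts the month with 100, B with 50.
        ("A", "2023-06-01", 10, 100),
        ("A", "2023-06-02", 20, 100),
        ("A", "2023-06-03", 5, 100),
        ("B", "2023-06-01", 7, 50),
        ("B", "2023-06-02", 3, 50),
    ],
)

# Balance = start-of-month supply minus the running total of consumption.
balances = conn.execute(
    """
    SELECT cate, record_date,
           init_value - SUM(cost) OVER (
               PARTITION BY cate, strftime('%m', record_date)
               ORDER BY record_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS material_num
    FROM material_data
    ORDER BY cate, record_date
    """
).fetchall()

for row in balances:
    print(row)
```

Each row's balance already includes that day's consumption, because the frame ends at the current row.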

Query each day's share of the monthly consumption of each consumable

select
  md.cate,
  record_date,
  init_value,
  cost / sum(cost) over(partition by cate, MONTH(record_date) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cm_cost
from material_data md;

Similarly, it can be abbreviated as:

select
  md.cate,
  record_date,
  init_value,
  cost / sum(cost) over(partition by cate, MONTH(record_date)) as cm_cost
from material_data md;

With the daily consumption ratio, you can then dig into actual business scenarios and track down abnormal consumption accordingly.
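The ratio query can be sketched the same way with Python's sqlite3 module and hypothetical sample rows. One caveat when cost is stored as an integer: in SQLite (and in some other databases) dividing one integer by another truncates the result, so this sketch multiplies by 1.0 first. The truncated column is included only to show the difference. As before, strftime('%m', record_date) replaces MONTH(), which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE material_data (cate TEXT, record_date TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO material_data VALUES (?, ?, ?)",
    [
        # Hypothetical rows; each category consumes 40 units in June.
        ("A", "2023-06-01", 10),
        ("A", "2023-06-02", 20),
        ("A", "2023-06-03", 10),
        ("B", "2023-06-01", 30),
        ("B", "2023-06-02", 10),
    ],
)

shares = conn.execute(
    """
    SELECT cate, record_date,
           -- no ORDER BY in the window: the frame is the whole partition
           cost * 1.0 / SUM(cost) OVER (
               PARTITION BY cate, strftime('%m', record_date)
           ) AS share,
           -- integer / integer truncates in SQLite
           cost / SUM(cost) OVER (
               PARTITION BY cate, strftime('%m', record_date)
           ) AS truncated
    FROM material_data
    ORDER BY cate, record_date
    """
).fetchall()

for row in shares:
    print(row)
```

Within each category and month the share column adds up to 1, which makes an unusually large daily share easy to spot.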


Summary

Cumulative calculation is one of the most frequently used applications of window functions in business, for example cumulative sales rankings, daily consumption of equipment, and daily balance alerts. I hope this article is helpful to you. The frame can be adjusted flexibly in many more ways; offset calculation scenarios will be introduced in the follow-up (Part 3).

