This article walks through a business-analysis case to illustrate SQL window functions. The five requirements analyzed below show how powerful window functions are: they make the SQL logic clearer and, to a large extent, simplify development.
Data preparation
The analysis involves a single order table, orders. All operations run in Hive; the data is prepared as follows:
-- create the table
CREATE TABLE orders(
order_id int,
customer_id string,
city string,
add_time string,
amount decimal(10,2));
-- load the data
INSERT INTO orders VALUES
(1,"A","Shanghai","2020-01-01 00:00:00.000000",200),
(2,"B","Shanghai","2020-01-05 00:00:00.000000",250),
(3,"C","Beijing","2020-01-12 00:00:00.000000",200),
(4,"A","Shanghai","2020-02-04 00:00:00.000000",400),
(5,"D","Shanghai","2020-02-05 00:00:00.000000",250),
(5,"D","Shanghai","2020-02-05 12:00:00.000000",300),
(6,"C","Beijing","2020-02-19 00:00:00.000000",300),
(7,"A","Shanghai","2020-03-01 00:00:00.000000",150),
(8,"E","Beijing","2020-03-05 00:00:00.000000",500),
(9,"F","Shanghai","2020-03-09 00:00:00.000000",250),
(10,"B","Shanghai","2020-03-21 00:00:00.000000",600);
Requirement 1: Revenue growth
In business terms, revenue growth for a month is calculated as 100 * (m1 - m0) / m0, where m1 is the revenue of the given month and m0 is the revenue of the previous month. Technically, then, we need the revenue of each month, plus some way of relating each month's revenue to the previous month's, in order to perform this calculation. The SQL is as follows:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders
GROUP BY 1
)
,prev_month_revenue as (
SELECT
month,
revenue,
lag(revenue) over (order by month) as prev_month_revenue -- previous month's revenue
FROM monthly_revenue
)
SELECT
month,
revenue,
prev_month_revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
Result output
month | revenue | prev_month_revenue | revenue_growth |
---|---|---|---|
2020-01-01 | 650 | NULL | NULL |
2020-02-01 | 1250 | 650 | 92.3 |
2020-03-01 | 1500 | 1250 | 20 |
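If you want to experiment with lag() outside of Hive, the same logic can be sketched in Python against SQLite, whose window functions (available since SQLite 3.25) behave the same way here. The table name and the pre-aggregated monthly figures below simply mirror the example above; this is a local sketch, not the production query:

```python
import sqlite3

# In-memory SQLite database standing in for the Hive monthly_revenue CTE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue(month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2020-01-01", 650), ("2020-02-01", 1250), ("2020-03-01", 1500)],
)

# lag(revenue) pulls the previous row's revenue within the ORDER BY month window,
# exactly as in the Hive query; the first month has no predecessor, so it is NULL.
rows = conn.execute("""
    SELECT month,
           revenue,
           lag(revenue) OVER (ORDER BY month) AS prev_month_revenue,
           round(100.0 * (revenue - lag(revenue) OVER (ORDER BY month))
                 / lag(revenue) OVER (ORDER BY month), 1) AS revenue_growth
    FROM monthly_revenue
    ORDER BY month
""").fetchall()
for r in rows:
    print(r)
```

Running this reproduces the result table above: NULL growth for January, 92.3 for February, 20.0 for March.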
We can also group by city to view each city's month-over-month revenue growth:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
city,
sum(amount) as revenue
FROM orders
GROUP BY 1,2
)
,prev_month_revenue as (
SELECT
month,
city,
revenue,
lag(revenue) over (partition by city order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month,
city,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2,1
Result output
month | city | revenue | revenue_growth |
---|---|---|---|
2020-01-01 | Shanghai | 450 | NULL |
2020-02-01 | Shanghai | 950 | 111.1 |
2020-03-01 | Shanghai | 1000 | 5.3 |
2020-01-01 | Beijing | 200 | NULL |
2020-02-01 | Beijing | 300 | 50 |
2020-03-01 | Beijing | 500 | 66.7 |
Requirement 2: Cumulative sum
A cumulative (running) sum adds the current row's value to the values of all preceding rows, as in the following SQL:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1
Result output
month | revenue | running_total |
---|---|---|
2020-01-01 | 650 | 650 |
2020-02-01 | 1250 | 1900 |
2020-03-01 | 1500 | 3400 |
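The frame clause "rows between unbounded preceding and current row" is what turns a plain sum() into a running total. The sketch below demonstrates it in Python with SQLite (3.25+), reusing the assumed monthly figures from above:

```python
import sqlite3

# In-memory stand-in for the monthly_revenue CTE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue(month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2020-01-01", 650), ("2020-02-01", 1250), ("2020-03-01", 1500)],
)

# The frame extends from the first row of the window to the current row,
# so each output row sums everything up to and including itself.
rows = conn.execute("""
    SELECT month,
           revenue,
           sum(revenue) OVER (ORDER BY month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM monthly_revenue
    ORDER BY month
""").fetchall()
print(rows)  # running_total: 650.0, 1900.0, 3400.0
```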
We can also combine several window definitions in a single query, as in the following SQL:
SELECT
order_id,
customer_id,
city,
add_time,
amount,
sum(amount) over () as amount_total, -- sum over all rows
sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- running sum
sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer,
avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg -- trailing average over the last 6 rows
FROM orders
ORDER BY 1
Result output
order_id | customer_id | city | add_time | amount | amount_total | running_sum | running_sum_by_customer | trailing_avg |
---|---|---|---|---|---|---|---|---|
1 | A | Shanghai | 2020-01-01 00:00:00.000000 | 200 | 3400 | 200 | 200 | 200 |
2 | B | Shanghai | 2020-01-05 00:00:00.000000 | 250 | 3400 | 450 | 250 | 225 |
3 | C | Beijing | 2020-01-12 00:00:00.000000 | 200 | 3400 | 650 | 200 | 216.666667 |
4 | A | Shanghai | 2020-02-04 00:00:00.000000 | 400 | 3400 | 1050 | 600 | 262.5 |
5 | D | Shanghai | 2020-02-05 00:00:00.000000 | 250 | 3400 | 1300 | 250 | 260 |
5 | D | Shanghai | 2020-02-05 12:00:00.000000 | 300 | 3400 | 1600 | 550 | 266.666667 |
6 | C | Beijing | 2020-02-19 00:00:00.000000 | 300 | 3400 | 1900 | 500 | 283.333333 |
7 | A | Shanghai | 2020-03-01 00:00:00.000000 | 150 | 3400 | 2050 | 750 | 266.666667 |
8 | E | Beijing | 2020-03-05 00:00:00.000000 | 500 | 3400 | 2550 | 500 | 316.666667 |
9 | F | Shanghai | 2020-03-09 00:00:00.000000 | 250 | 3400 | 2800 | 250 | 291.666667 |
10 | B | Shanghai | 2020-03-21 00:00:00.000000 | 600 | 3400 | 3400 | 850 | 350 |
Requirement 3: Handling duplicate data
As the data above shows, order_id 5 appears twice: **(5,"D","Shanghai","2020-02-05 00:00:00.000000",250)** and **(5,"D","Shanghai","2020-02-05 12:00:00.000000",300)**. These duplicates need to be cleaned up, keeping only the latest row.
We first rank the rows within each order_id group, then keep the most recent one:
SELECT *
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
) t
WHERE rank=1
Result output
t.order_id | t.customer_id | t.city | t.add_time | t.amount | t.rank |
---|---|---|---|---|---|
1 | A | Shanghai | 2020-01-01 00:00:00.000000 | 200 | 1 |
2 | B | Shanghai | 2020-01-05 00:00:00.000000 | 250 | 1 |
3 | C | Beijing | 2020-01-12 00:00:00.000000 | 200 | 1 |
4 | A | Shanghai | 2020-02-04 00:00:00.000000 | 400 | 1 |
5 | D | Shanghai | 2020-02-05 12:00:00.000000 | 300 | 1 |
6 | C | Beijing | 2020-02-19 00:00:00.000000 | 300 | 1 |
7 | A | Shanghai | 2020-03-01 00:00:00.000000 | 150 | 1 |
8 | E | Beijing | 2020-03-05 00:00:00.000000 | 500 | 1 |
9 | F | Shanghai | 2020-03-09 00:00:00.000000 | 250 | 1 |
10 | B | Shanghai | 2020-03-21 00:00:00.000000 | 600 | 1 |
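This row_number() deduplication pattern can also be tried locally. The sketch below uses Python with SQLite (3.25+) and an assumed, minimal reproduction of the data: just the duplicated order 5 and one clean order. Note that the alias is `rn` here rather than `rank`, since `rank` is also a window-function name:

```python
import sqlite3

# Minimal stand-in for the orders table: order 5 is duplicated, order 6 is clean.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders(order_id INT, customer_id TEXT, add_time TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(5, "D", "2020-02-05 00:00:00", 250),
     (5, "D", "2020-02-05 12:00:00", 300),
     (6, "C", "2020-02-19 00:00:00", 300)],
)

# Rank rows within each order_id, newest first, then keep only rank 1.
rows = conn.execute("""
    SELECT order_id, amount
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY order_id ORDER BY add_time DESC) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()
print(rows)  # [(5, 300.0), (6, 300.0)] -- only the newer row of order 5 survives
```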
After this cleaning step the data is deduplicated. Recomputing requirement 1 on the cleaned data, the correct SQL script is:
WITH
orders_cleaned as (
SELECT *
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
)
,monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders_cleaned
GROUP BY 1
)
,prev_month_revenue as (
SELECT
month,
revenue,
lag(revenue) over (order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
Result output
month | revenue | revenue_growth |
---|---|---|
2020-01-01 | 650 | NULL |
2020-02-01 | 1000 | 53.8 |
2020-03-01 | 1500 | 50 |
Create a view of the cleaned data for later use:
CREATE VIEW orders_cleaned AS
SELECT
order_id,
customer_id,
city,
add_time,
amount
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
Requirement 4: Top N per group
Taking the top N within each group is one of the most common window-function use cases. The following SQL computes the top 2 orders by amount for each month:
WITH orders_ranked as (
SELECT
trunc(add_time,'MM') as month,
*,
row_number() over (partition by trunc(add_time,'MM') order by amount desc, add_time) as rank
FROM orders_cleaned
)
SELECT
month,
order_id,
customer_id,
city,
add_time,
amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1
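The top-N-per-group pattern, like the queries above, can be sketched in Python against SQLite (3.25+). The data here is an assumed toy set with a pre-computed month column, and ties are broken by order_id instead of add_time, purely to keep the sketch small:

```python
import sqlite3

# Toy stand-in for orders_cleaned, with the month already extracted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_cleaned(order_id INT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders_cleaned VALUES (?, ?, ?)",
    [(1, "2020-01", 200), (2, "2020-01", 250), (3, "2020-01", 200),
     (4, "2020-02", 400), (5, "2020-02", 300), (6, "2020-02", 300)],
)

# Rank orders within each month by amount (highest first), then keep the top 2.
rows = conn.execute("""
    SELECT month, order_id, amount
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY month
                                  ORDER BY amount DESC, order_id) AS rn
        FROM orders_cleaned
    )
    WHERE rn <= 2
    ORDER BY month, rn
""").fetchall()
print(rows)
```

Each month contributes exactly two rows: its highest and second-highest order amounts.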
Requirement 5: Repeat purchase behavior
The following SQL computes the repeat purchase rate, defined as repeat purchasers / total customers * 100%, along with the typical change between a customer's first and second order amounts: avg(second order amount / first order amount).
WITH customer_orders as (
SELECT *,
row_number() over (partition by customer_id order by add_time) as customer_order_n,
lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
FROM orders_cleaned
)
SELECT
round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases, -- repeat purchase rate
avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- typical ratio of the second order amount to the first
FROM customer_orders
The intermediate CTE customer_orders produces:
orders_cleaned.order_id | orders_cleaned.customer_id | orders_cleaned.city | orders_cleaned.add_time | orders_cleaned.amount | customer_order_n | prev_order_amount |
---|---|---|---|---|---|---|
1 | A | Shanghai | 2020-01-01 00:00:00.000000 | 200 | 1 | NULL |
4 | A | Shanghai | 2020-02-04 00:00:00.000000 | 400 | 2 | 200 |
7 | A | Shanghai | 2020-03-01 00:00:00.000000 | 150 | 3 | 400 |
2 | B | Shanghai | 2020-01-05 00:00:00.000000 | 250 | 1 | NULL |
10 | B | Shanghai | 2020-03-21 00:00:00.000000 | 600 | 2 | 250 |
3 | C | Beijing | 2020-01-12 00:00:00.000000 | 200 | 1 | NULL |
6 | C | Beijing | 2020-02-19 00:00:00.000000 | 300 | 2 | 200 |
5 | D | Shanghai | 2020-02-05 12:00:00.000000 | 300 | 1 | NULL |
8 | E | Beijing | 2020-03-05 00:00:00.000000 | 500 | 1 | NULL |
9 | F | Shanghai | 2020-03-09 00:00:00.000000 | 250 | 1 | NULL |
Final output:
repeat_purchases | revenue_expansion |
---|---|
50 | 1.9666666666666668 |
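The same two metrics can be checked in Python with SQLite (3.25+). The data below is the assumed cleaned order set (customer, order time, amount only), and the query mirrors the Hive script: row_number() numbers each customer's orders chronologically, lag() fetches the previous order amount:

```python
import sqlite3

# Stand-in for orders_cleaned: one row per order, duplicates already removed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oc(customer_id TEXT, add_time TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO oc VALUES (?, ?, ?)",
    [("A", "2020-01-01", 200), ("A", "2020-02-04", 400), ("A", "2020-03-01", 150),
     ("B", "2020-01-05", 250), ("B", "2020-03-21", 600),
     ("C", "2020-01-12", 200), ("C", "2020-02-19", 300),
     ("D", "2020-02-05", 300), ("E", "2020-03-05", 500), ("F", "2020-03-09", 250)],
)

row = conn.execute("""
    WITH customer_orders AS (
        SELECT *,
               row_number() OVER (PARTITION BY customer_id ORDER BY add_time) AS n,
               lag(amount)  OVER (PARTITION BY customer_id ORDER BY add_time) AS prev_amount
        FROM oc
    )
    SELECT
        -- customers with a 2nd order, as a share of all customers
        round(100.0 * sum(CASE WHEN n = 2 THEN 1 END)
              / count(DISTINCT customer_id), 1) AS repeat_purchases,
        -- average ratio of the 2nd order amount to the 1st
        avg(CASE WHEN n = 2 THEN 1.0 * amount / prev_amount END) AS revenue_expansion
    FROM customer_orders
""").fetchone()
print(row)  # 3 of 6 customers reorder -> 50.0; avg(2.0, 2.4, 1.5) ~ 1.967
```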
Summary
This article covered the basic usage patterns and typical scenarios of SQL window functions through concrete analysis cases, which should deepen your understanding of how window functions work.