Use SQL window functions for business data analysis

This article uses a business analysis case to illustrate SQL window functions. Working through the five requirements below shows how powerful window functions are: they make the SQL logic clearer and, to a certain extent, simplify development.

Data preparation

The analysis in this article uses a single order table, orders, and all operations are performed in Hive. The table definition and sample data are as follows:

-- Create the orders table
CREATE TABLE orders(
    order_id int,
    customer_id string,
    city string,
    add_time string,
    amount decimal(10,2));

-- Insert sample data (order_id 5 intentionally appears twice; this duplicate is cleaned up in Requirement 3)
INSERT INTO orders VALUES
(1,"A","上海","2020-01-01 00:00:00.000000",200),
(2,"B","上海","2020-01-05 00:00:00.000000",250),
(3,"C","北京","2020-01-12 00:00:00.000000",200),
(4,"A","上海","2020-02-04 00:00:00.000000",400),
(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),
(6,"C","北京","2020-02-19 00:00:00.000000",300),
(7,"A","上海","2020-03-01 00:00:00.000000",150),
(8,"E","北京","2020-03-05 00:00:00.000000",500),
(9,"F","上海","2020-03-09 00:00:00.000000",250),
(10,"B","上海","2020-03-21 00:00:00.000000",600);

Requirement 1: Revenue growth

In business terms, revenue growth for month m1 is calculated as: 100 * (m1 - m0) / m0

Here m1 is the revenue of the given month and m0 is the revenue of the previous month (for example, with January revenue of 650 and February revenue of 1250, growth is 100 * (1250 - 650) / 650 ≈ 92.3%). Technically, then, we need the revenue of each month and a way to relate it to the previous month's revenue, which is exactly what the lag() window function provides. The calculation is as follows:

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT 
    month,
    revenue,
    lag(revenue) over (order by month) as prev_month_revenue -- previous month's revenue
    FROM monthly_revenue
)
SELECT 
  month,
  revenue,
  prev_month_revenue,
  round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1

Result output:

month revenue prev_month_revenue revenue_growth
2020-01-01 650 NULL NULL
2020-02-01 1250 650 92.3
2020-03-01 1500 1250 20
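
lag() looks backward over the ordered rows; its counterpart lead() looks forward. As a small variation of my own on the query above (not part of the original requirement), the next month's revenue can be pulled alongside the previous month's:

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    lag(revenue)  over (order by month) as prev_month_revenue, -- previous month's revenue
    lead(revenue) over (order by month) as next_month_revenue  -- next month's revenue
FROM monthly_revenue
ORDER BY 1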

We can also group by city to see each city's month-over-month revenue growth:

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    city,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1,2
)
,prev_month_revenue as (
    SELECT 
    month,
    city,
    revenue,
    lag(revenue) over (partition by city order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT 
month,
city,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2,1

Result output:

month city revenue revenue_growth
2020-01-01 Shanghai 450 NULL
2020-02-01 Shanghai 950 111.1
2020-03-01 Shanghai 1000 5.3
2020-01-01 Beijing 200 NULL
2020-02-01 Beijing 300 50
2020-03-01 Beijing 500 66.7

Requirement 2: Cumulative sum

A cumulative (running) sum adds the current row's value to the values of all preceding rows, as in the following SQL:

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT 
month,
revenue,
sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1

Result output:

month revenue running_total
2020-01-01 650 650
2020-02-01 1250 1900
2020-03-01 1500 3400
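
As an aside, when an ORDER BY is present and no frame is specified, Hive (following the SQL standard) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the shorter form below should produce the same running total here (each month appears once, so RANGE and ROWS behave the same); spelling out the ROWS frame, as above, simply makes the intent explicit. This is a variation of mine, not from the original article:

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    sum(revenue) over (order by month) as running_total -- default frame: RANGE UNBOUNDED PRECEDING to CURRENT ROW
FROM monthly_revenue
ORDER BY 1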

We can also combine several window functions in a single query, as follows:

SELECT
   order_id,
   customer_id,
   city,
   add_time,
   amount,
   sum(amount) over () as amount_total, -- sum over all rows
   sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- running sum over all orders
   sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer, -- running sum per customer
   avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg -- trailing average over the current and up to 5 preceding rows
FROM orders
ORDER BY 1

Result output:

order_id customer_id city add_time amount amount_total running_sum running_sum_by_customer trailing_avg
1 A Shanghai 2020-01-01 00:00:00.000000 200 3400 200 200 200
2 B Shanghai 2020-01-05 00:00:00.000000 250 3400 450 250 225
3 C Beijing 2020-01-12 00:00:00.000000 200 3400 650 200 216.666667
4 A Shanghai 2020-02-04 00:00:00.000000 400 3400 1050 600 262.5
5 D Shanghai 2020-02-05 00:00:00.000000 250 3400 1300 250 260
5 D Shanghai 2020-02-05 12:00:00.000000 300 3400 1600 550 266.666667
6 C Beijing 2020-02-19 00:00:00.000000 300 3400 1900 500 283.333333
7 A Shanghai 2020-03-01 00:00:00.000000 150 3400 2050 750 266.666667
8 E Beijing 2020-03-05 00:00:00.000000 500 3400 2550 500 316.666667
9 F Shanghai 2020-03-09 00:00:00.000000 250 3400 2800 250 291.666667
10 B Shanghai 2020-03-21 00:00:00.000000 600 3400 3400 850 350

Requirement 3: Handling duplicate data

As can be seen from the data above, order_id 5 appears twice: (5,"D","Shanghai","2020-02-05 00:00:00.000000",250) and (5,"D","Shanghai","2020-02-05 12:00:00.000000",300). Clearly the data needs to be cleaned and deduplicated, keeping only the latest record for each order.

We first rank the rows within each order_id group by add_time in descending order, then keep only the top-ranked (latest) row:

SELECT *
FROM (
    SELECT *,
    row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank=1

Result output:

t.order_id t.customer_id t.city t.add_time t.amount t.rank
1 A Shanghai 2020-01-01 00:00:00.000000 200 1
2 B Shanghai 2020-01-05 00:00:00.000000 250 1
3 C Beijing 2020-01-12 00:00:00.000000 200 1
4 A Shanghai 2020-02-04 00:00:00.000000 400 1
5 D Shanghai 2020-02-05 12:00:00.000000 300 1
6 C Beijing 2020-02-19 00:00:00.000000 300 1
7 A Shanghai 2020-03-01 00:00:00.000000 150 1
8 E Beijing 2020-03-05 00:00:00.000000 500 1
9 F Shanghai 2020-03-09 00:00:00.000000 250 1
10 B Shanghai 2020-03-21 00:00:00.000000 600 1

After this cleaning step the data is deduplicated. Recalculating Requirement 1 on the cleaned data, the corrected SQL script is:

WITH
orders_cleaned as (
    SELECT *
    FROM (
        SELECT *,
        row_number() over (partition by order_id order by add_time desc) as rank
        FROM orders
    )t
    WHERE rank=1
)
,monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders_cleaned
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT 
    month,
    revenue,
    lag(revenue) over (order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT 
month,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1

Result output:

month revenue revenue_growth
2020-01-01 650 NULL
2020-02-01 1000 53.8
2020-03-01 1500 50

Create a view of the cleaned data for later use:

CREATE VIEW orders_cleaned AS
SELECT
    order_id, 
    customer_id, 
    city, 
    add_time, 
    amount
FROM (
    SELECT *,
    row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
)t
WHERE rank=1
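
A quick optional check (mine, not from the original article) that the view behaves as intended: it should return 10 for both counts, confirming one row per order_id, with order 5 keeping only its 12:00 record.

-- Expect 10 and 10: one row per order, the duplicate removed
SELECT
    count(*) as row_cnt,
    count(distinct order_id) as distinct_orders
FROM orders_cleaned;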

Requirement 4: Top N per group

Taking the top N within each group is one of the most common use cases for SQL window functions. The following SQL finds the top 2 orders by amount for each month:

WITH orders_ranked as (
    SELECT
    trunc(add_time,'MM') as month,
    *,
    row_number() over (partition by trunc(add_time,'MM') order by amount desc, add_time) as rank
    FROM orders_cleaned
)
SELECT 
    month,
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1
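
One design note: row_number() together with the add_time tiebreaker returns exactly two rows per month even when amounts tie (in January, orders 1 and 3 both have amount 200, and the earlier add_time wins). If tied amounts should all be kept instead, rank() or dense_rank() can be swapped in. The sketch below is my own variation rather than part of the original article, and shows how the three functions differ:

SELECT
    trunc(add_time,'MM') as month,
    order_id,
    amount,
    row_number() over (partition by trunc(add_time,'MM') order by amount desc) as row_num,   -- unique sequence; ties broken arbitrarily without a tiebreaker
    rank()       over (partition by trunc(add_time,'MM') order by amount desc) as rnk,       -- ties share a rank, next rank skips
    dense_rank() over (partition by trunc(add_time,'MM') order by amount desc) as dense_rnk  -- ties share a rank, no gaps
FROM orders_cleaned
ORDER BY 1, amount DESC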

Requirement 5: Repeat purchase behavior

The following SQL computes the repeat purchase rate, defined as (number of customers who purchased again / total number of customers) * 100%, as well as the typical change between the first and second order amounts, defined as avg(second order amount / first order amount):

WITH customer_orders as (
    SELECT *,
    row_number() over (partition by customer_id order by add_time) as customer_order_n,
    lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
    FROM orders_cleaned
)
SELECT
round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases, -- repeat purchase rate
avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- typical ratio of the second order amount to the first
FROM customer_orders

Result output

Intermediate output of the customer_orders CTE:

orders_cleaned.order_id orders_cleaned.customer_id orders_cleaned.city orders_cleaned.add_time orders_cleaned.amount customer_order_n prev_order_amount
1 A Shanghai 2020-01-01 00:00:00.000000 200 1 NULL
4 A Shanghai 2020-02-04 00:00:00.000000 400 2 200
7 A Shanghai 2020-03-01 00:00:00.000000 150 3 400
2 B Shanghai 2020-01-05 00:00:00.000000 250 1 NULL
10 B Shanghai 2020-03-21 00:00:00.000000 600 2 250
3 C Beijing 2020-01-12 00:00:00.000000 200 1 NULL
6 C Beijing 2020-02-19 00:00:00.000000 300 2 200
5 D Shanghai 2020-02-05 12:00:00.000000 300 1 NULL
8 E Beijing 2020-03-05 00:00:00.000000 500 1 NULL
9 F Shanghai 2020-03-09 00:00:00.000000 250 1 NULL

Final result output:

repeat_purchases revenue_expansion
50 1.9666666666666668
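
To see where these numbers come from: three of the six customers (A, B, and C) placed a second order, so the repeat purchase rate is 100 * 3 / 6 = 50. The second-to-first order ratios are 400/200 = 2.0 for A, 600/250 = 2.4 for B, and 300/200 = 1.5 for C, and their average (2.0 + 2.4 + 1.5) / 3 ≈ 1.97 matches revenue_expansion.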

Summary

This article has covered the basic usage of SQL window functions and the scenarios in which they apply, illustrated with concrete analysis cases. Working through these cases should deepen your understanding of SQL window functions.

Origin: blog.csdn.net/jmx_bigdata/article/details/108433576