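A scenario that comes up often in practice is the per-group top-N query, for example finding the 10 highest earners in each department.
This is usually solved with a window function, and PostgreSQL supports window functions as well.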
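As a minimal sketch of the department example, assuming a hypothetical table emp(dept_id, name, salary) that is not part of the test below, the window-function form could look like this:

```sql
-- Hypothetical schema, for illustration only:
--   emp(dept_id int, name text, salary numeric)
SELECT dept_id, name, salary
FROM (
    SELECT e.*,
           -- Number the rows of each department from highest to lowest salary.
           row_number() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
    FROM emp e
) ranked
WHERE rn <= 10;
```

Note that row_number() breaks ties arbitrarily; rank() would keep every row tied for 10th place.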
Example:
Create a test table and load 10 million rows spread across 10,000 groups.
bill=# create table tbl(c1 int, c2 int, c3 int);
CREATE TABLE
bill=# create index idx1 on tbl(c1,c2);
CREATE INDEX
bill=# insert into tbl select mod(trunc(random()*10000)::int, 10000), trunc(random()*10000000) from generate_series(1,10000000);
INSERT 0 10000000
Query with a window function:
bill=# select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
rn | c1 | c2 | c3
----+------+--------+----
1 | 0 | 9119 |
2 | 0 | 13019 |
3 | 0 | 44353 |
4 | 0 | 49534 |
5 | 0 | 54353 |
6 | 0 | 61538 |
7 | 0 | 95604 |
8 | 0 | 95902 |
9 | 0 | 102053 |
......
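Check the execution plan of the window query:
It is not efficient. The query takes about 22.8 seconds, because it reads all 10,000,000 rows through the index and then filters out 9,900,000 of them.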
bill=# explain (analyze,verbose,costs,timing,buffers) select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.43..554809.35 rows=3333349 width=20) (actual time=0.058..22763.040 rows=100000 loops=1)
Output: t.rn, t.c1, t.c2, t.c3
Filter: (t.rn <= 10)
Rows Removed by Filter: 9900000
Buffers: shared hit=10006507 read=29522
-> WindowAgg (cost=0.43..429808.75 rows=10000048 width=20) (actual time=0.055..21942.035 rows=10000000 loops=1)
Output: row_number() OVER (?), tbl.c1, tbl.c2, tbl.c3
Buffers: shared hit=10006507 read=29522
-> Index Scan using idx1 on public.tbl (cost=0.43..254807.91 rows=10000048 width=12) (actual time=0.032..16257.536 rows=10000000 loops=1)
Output: tbl.c1, tbl.c2, tbl.c3
Buffers: shared hit=10006507 read=29522
Planning Time: 0.240 ms
Execution Time: 22771.087 ms
(13 rows)
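Optimization:
This kind of query can be optimized with PostgreSQL's recursive CTE (WITH RECURSIVE). The query below lists the distinct c1 values: it starts from min(c1) and repeatedly looks up the smallest c1 greater than the current one, until none is left.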
bill=# with recursive t1 as (
bill(# (select min(c1) c1 from tbl )
bill(# union all
bill(# (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
bill(# )
bill-# select * from t1 ;
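The recursive query above can be read as a loop over the index. The commented version below is a reading aid; the final filter and the cross-check query are additions for illustration and were not part of the original test.

```sql
WITH RECURSIVE t1 AS (
    -- Anchor: the smallest c1. With the index on (c1, c2) this is
    -- typically answered by a single index probe.
    (SELECT min(c1) AS c1 FROM tbl)
    UNION ALL
    -- Step: for the current c1, look up the next larger c1.
    -- When there is none, min() yields NULL; that NULL row is emitted once,
    -- and the IS NOT NULL filter then ends the recursion.
    (SELECT (SELECT min(tbl.c1) FROM tbl WHERE tbl.c1 > t.c1) AS c1
     FROM t1 t
     WHERE t.c1 IS NOT NULL)
)
-- Drop the trailing NULL row so only real group ids remain.
SELECT c1 FROM t1 WHERE c1 IS NOT NULL;

-- Cross-check: should return the same group ids, but it has to read
-- every row or index entry, so expect it to be slower.
SELECT DISTINCT c1 FROM tbl ORDER BY c1;
```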
This can be wrapped in a set-returning function (SRF) that fetches the first 10 rows of each group:
bill=# create or replace function f_rec() returns setof tbl as $$
bill$# declare
bill$# v int;
bill$# begin
bill$# for v in with recursive t1 as (
bill$# (select min(c1) c1 from tbl )
bill$# union all
bill$# (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
bill$# )
bill$# select * from t1
bill$# LOOP
bill$# return query select * from tbl where c1=v order by c2 limit 10;
bill$# END LOOP;
bill$# return;
bill$#
bill$# end;
bill$# $$ language plpgsql strict;
CREATE FUNCTION
Query:
bill=# select * from f_rec();
c1 | c2 | c3
------+--------+----
0 | 9119 |
0 | 13019 |
0 | 44353 |
0 | 49534 |
0 | 54353 |
0 | 61538 |
0 | 95604 |
0 | 95902 |
......
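If the group size should not be hard-coded, the same function can take the limit as a parameter. The following is a sketch under that assumption; the name f_rec_n and the parameter n are hypothetical and do not appear in the original test.

```sql
-- Hypothetical variant of f_rec(): n is the number of rows to return per group.
CREATE OR REPLACE FUNCTION f_rec_n(n int) RETURNS SETOF tbl AS $$
DECLARE
    v int;
BEGIN
    -- Outer loop: enumerate the distinct c1 values with the recursive CTE.
    FOR v IN
        WITH RECURSIVE t1 AS (
            (SELECT min(c1) AS c1 FROM tbl)
            UNION ALL
            (SELECT (SELECT min(tbl.c1) FROM tbl WHERE tbl.c1 > t.c1) AS c1
             FROM t1 t
             WHERE t.c1 IS NOT NULL)
        )
        SELECT c1 FROM t1 WHERE c1 IS NOT NULL
    LOOP
        -- Inner query: the first n rows of this group, ordered by c2.
        RETURN QUERY SELECT * FROM tbl WHERE c1 = v ORDER BY c2 LIMIT n;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql STRICT;

-- Usage: the 3 smallest c2 values in every group.
SELECT * FROM f_rec_n(3);
```

Because the function is declared STRICT, calling it with a NULL argument returns no rows.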
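Execution plan after optimization:
After optimization the query takes only about 500 ms, roughly 45 times faster than the window query.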
bill=# explain (analyze,verbose,timing,costs,buffers) select * from f_rec();
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Function Scan on public.f_rec (cost=0.25..10.25 rows=1000 width=12) (actual time=487.235..493.814 rows=100000 loops=1)
Output: c1, c2, c3
Function Call: f_rec()
Buffers: shared hit=171949
Planning Time: 0.072 ms
Execution Time: 500.658 ms
(6 rows)
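Summary:
The conventional window query has to process every row in the table (10 million here) before it can discard the rows with rn > 10, so it is slow. The recursive approach uses the index to jump from one group to the next and fetches only 10 rows per group, so it touches far fewer pages (171,949 buffer hits versus about 10 million) and runs much faster.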
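One way to check this reasoning is to look at the plan of the per-group query on its own. This is a sketch; the group id 42 is an arbitrary value chosen for illustration.

```sql
-- 42 is an arbitrary group id, used only as an example.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl WHERE c1 = 42 ORDER BY c2 LIMIT 10;
```

If the planner chooses idx1, the plan should show a Limit node over an Index Scan with no separate Sort step, because an index on (c1, c2) already returns the rows of one group in c2 order.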
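Note:
PostgreSQL does not allow the recursive reference to the CTE to appear inside a subquery, and it does not support ORDER BY in the recursive term, so the top 10 rows of each group cannot be fetched inside the recursive term itself. That is why the per-group query is wrapped in a function, as in the example above.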
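A possible alternative that avoids the function is to keep the recursive CTE for enumerating groups and fetch the per-group rows outside it with a LATERAL join. This is a sketch only; it was not measured in the test above, so its timing is not claimed to match that of f_rec().

```sql
WITH RECURSIVE t1 AS (
    (SELECT min(c1) AS c1 FROM tbl)
    UNION ALL
    (SELECT (SELECT min(tbl.c1) FROM tbl WHERE tbl.c1 > t.c1) AS c1
     FROM t1 t
     WHERE t.c1 IS NOT NULL)
)
SELECT x.*
FROM t1
-- The ORDER BY ... LIMIT lives in the outer query, not in the recursive term,
-- so the restrictions on recursive CTEs do not apply to it.
CROSS JOIN LATERAL (
    SELECT * FROM tbl WHERE tbl.c1 = t1.c1 ORDER BY tbl.c2 LIMIT 10
) AS x;
```

The trailing NULL row produced by the CTE matches nothing in the lateral subquery, so it adds no rows to the result.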
Reference:
https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS