PostgreSQL Window Query Optimization

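In practice we often need the top N rows per group, for example the 10 highest earners in each department.
Window functions are the usual tool for this kind of problem, and PostgreSQL supports them.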

Example:
Create a test table and load 10 million rows spread across 10,000 groups.

bill=#  create table tbl(c1 int, c2 int, c3 int);  
CREATE TABLE  
bill=# create index idx1 on tbl(c1,c2);  
CREATE INDEX  
bill=# insert into tbl select mod(trunc(random()*10000)::int, 10000), trunc(random()*10000000) from generate_series(1,10000000);  
INSERT 0 10000000 

Using a window query:

bill=# select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;  
 rn |  c1  |   c2   | c3 
----+------+--------+----
  1 |    0 |   9119 |   
  2 |    0 |  13019 |   
  3 |    0 |  44353 |   
  4 |    0 |  49534 |   
  5 |    0 |  54353 |   
  6 |    0 |  61538 |   
  7 |    0 |  95604 |   
  8 |    0 |  95902 |   
  9 |    0 | 102053 |   
......

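The execution plan of the window query:
It is not efficient. The query takes about 22.8 seconds, because all 10 million rows are read and numbered before 9.9 million of them are removed by the filter.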

bill=# explain (analyze,verbose,costs,timing,buffers) select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;  
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.43..554809.35 rows=3333349 width=20) (actual time=0.058..22763.040 rows=100000 loops=1)
   Output: t.rn, t.c1, t.c2, t.c3
   Filter: (t.rn <= 10)
   Rows Removed by Filter: 9900000
   Buffers: shared hit=10006507 read=29522
   ->  WindowAgg  (cost=0.43..429808.75 rows=10000048 width=20) (actual time=0.055..21942.035 rows=10000000 loops=1)
         Output: row_number() OVER (?), tbl.c1, tbl.c2, tbl.c3
         Buffers: shared hit=10006507 read=29522
         ->  Index Scan using idx1 on public.tbl  (cost=0.43..254807.91 rows=10000048 width=12) (actual time=0.032..16257.536 rows=10000000 loops=1)
               Output: tbl.c1, tbl.c2, tbl.c3
               Buffers: shared hit=10006507 read=29522
 Planning Time: 0.240 ms
 Execution Time: 22771.087 ms
(13 rows)

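Optimization:
This kind of query can be optimized with a recursive CTE (WITH RECURSIVE). The recursive query below enumerates the distinct c1 values, looking up each next value through the index instead of reading every row.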

bill=# with recursive t1 as (                                                                             
bill(#      (select min(c1) c1 from tbl )                                                                     
bill(#       union all                                                                                        
bill(#      (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)   
bill(#     )                                                                                                  
bill-#     select * from t1 ;
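
This pattern is often described as emulating a loose index scan (skip scan). The annotated version below is a sketch of the same query with comments added. The final filter on NULL is an addition for illustration and is not in the original.

```sql
-- Enumerate the distinct values of c1 with one index probe per value,
-- instead of visiting every row as SELECT DISTINCT c1 FROM tbl would.
WITH RECURSIVE t1 AS (
    -- Anchor term: the smallest c1, found via the leading column of idx1.
    (SELECT min(c1) AS c1 FROM tbl)
    UNION ALL
    -- Recursive term: the next c1 strictly greater than the previous one.
    -- The scalar subquery yields NULL once the largest value has been
    -- passed, and the IS NOT NULL condition then stops the recursion.
    (SELECT (SELECT min(tbl.c1) FROM tbl WHERE tbl.c1 > t.c1) AS c1
       FROM t1 t
      WHERE t.c1 IS NOT NULL)
)
SELECT c1
FROM t1
WHERE c1 IS NOT NULL;  -- assumption: drop the trailing NULL row
```

With 10,000 groups this performs on the order of 10,000 small index lookups rather than a pass over 10 million rows.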

It can be wrapped in a set-returning function (SRF):

bill=# create or replace function f_rec() returns setof tbl as $$  
bill$#   declare  
bill$#     v int;  
bill$#   begin  
bill$#     for v in with recursive t1 as (                                                                             
bill$#      (select min(c1) c1 from tbl )                                                                     
bill$#       union all                                                                                        
bill$#      (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)   
bill$#     )                                                                                                  
bill$#     select * from t1  
bill$#     LOOP  
bill$#       return query select * from tbl where c1=v order by c2 limit 10;  
bill$#     END LOOP;  
bill$#   return;  
bill$#     
bill$#   end;  
bill$#   $$ language plpgsql strict;  
CREATE FUNCTION

Query:

bill=# select * from f_rec();
  c1  |   c2   | c3 
------+--------+----
    0 |   9119 |   
    0 |  13019 |   
    0 |  44353 |   
    0 |  49534 |   
    0 |  54353 |   
    0 |  61538 |   
    0 |  95604 |   
    0 |  95902 |   
......

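The execution plan after optimization:
The optimized query takes only about 500 ms, roughly 45 times faster than the window query.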

bill=# explain (analyze,verbose,timing,costs,buffers) select * from f_rec();
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Function Scan on public.f_rec  (cost=0.25..10.25 rows=1000 width=12) (actual time=487.235..493.814 rows=100000 loops=1)
   Output: c1, c2, c3
   Function Call: f_rec()
   Buffers: shared hit=171949
 Planning Time: 0.072 ms
 Execution Time: 500.658 ms
(6 rows)
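
Before relying on the rewrite, it may be worth confirming that both forms return the same rows. The check below is a minimal sketch, assuming the tbl table and f_rec() function defined above. It is not part of the original test, and the column alias only_in_window is made up for illustration.

```sql
-- Count rows produced by the window query that f_rec() does not return.
-- EXCEPT ALL keeps duplicates, so repeated (c1, c2, c3) rows are compared
-- by multiplicity rather than collapsed.
SELECT count(*) AS only_in_window
FROM (
    SELECT c1, c2, c3
    FROM (SELECT row_number() OVER (PARTITION BY c1 ORDER BY c2) AS rn, *
          FROM tbl) t
    WHERE t.rn <= 10
    EXCEPT ALL
    SELECT c1, c2, c3 FROM f_rec()
) d;
```

A result of 0 means every row from the window query also comes back from the function. One caveat: if several rows tie on c2 at the 10th position within a group and differ in c3, the two queries may legitimately pick different rows, so a nonzero count does not by itself indicate a bug.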

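Summary:
Although the window query scans idx1, it still has to read and number all 10 million rows before discarding those with rn > 10, which is why it is slow. The recursive approach uses the index to jump from one c1 value to the next and fetches only the first 10 rows of each group, so it touches far fewer buffers (171,949 versus about 10 million) and performs much better.
Note:
PostgreSQL's recursive query syntax does not allow the recursive reference to appear inside a subquery, and does not support ORDER BY in a recursive query. The recursion therefore cannot produce the final result directly, so it is wrapped in a function as in the example above.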
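
As a possible alternative to the function wrapper, the per-group lookup can be moved outside the recursive CTE with a LATERAL join (available since PostgreSQL 9.3). This is a sketch only. It was not timed in the original test, and the alias x is made up for illustration.

```sql
WITH RECURSIVE t1 AS (
    -- Same distinct-value enumeration as in the example above.
    (SELECT min(c1) AS c1 FROM tbl)
    UNION ALL
    (SELECT (SELECT min(tbl.c1) FROM tbl WHERE tbl.c1 > t.c1) AS c1
       FROM t1 t
      WHERE t.c1 IS NOT NULL)
)
SELECT x.*
FROM t1
CROSS JOIN LATERAL (
    -- ORDER BY and LIMIT are allowed here because this subquery is
    -- outside the recursive term; it runs once per c1 value.
    SELECT *
    FROM tbl
    WHERE tbl.c1 = t1.c1
    ORDER BY tbl.c2
    LIMIT 10
) AS x
WHERE t1.c1 IS NOT NULL;  -- skip the trailing NULL row from the recursion
```

Each lateral subquery can be served by idx1 on (c1, c2), so the access pattern should resemble that of f_rec(). Running EXPLAIN (ANALYZE, BUFFERS) on the actual data is the way to confirm this.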

Reference:
https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

Reposted from blog.csdn.net/weixin_39540651/article/details/104945058