Optimizing pagination in PostgreSQL with hash grouping

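When we paginate with a database query, the query gets slower as offset skips more and more rows, because even to return a single row the database still has to walk through every row that offset skips.
This kind of pagination can usually be optimized by keeping a "bookmark" (the sort-key value of the last row already read) and seeking from it, but in some cases there is no suitable bookmark on the table.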
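
For reference, bookmark pagination could look like the sketch below. The table msg, its columns, and the literal 12345 are hypothetical and exist only for illustration; the sketch assumes a unique, stable sort key, which is exactly what the case discussed in this post lacks.

```sql
-- Hypothetical table with a unique, stable sort key (the primary key).
create table msg(id int8 primary key, body text);

-- First page: no bookmark yet.
select * from msg order by id desc limit 100;

-- Next page: the client passes back the smallest id it saw on the
-- previous page (12345 is a made-up value) and the primary-key index
-- seeks straight to it, so no rows are skipped with offset.
select * from msg where id < 12345 order by id desc limit 100;
```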

As an example of a table where a bookmark does not fit:

bill=# create table t(id int8, info float4, crt_time timestamp);    
CREATE TABLE
bill=# insert into t select generate_series(1,10000000), random(), clock_timestamp();    
INSERT 0 10000000
bill=# create index idx_t_1 on t(info);    
CREATE INDEX

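The info column of table t holds a large number of random values, which simulates data that changes constantly. In this scenario a bookmark is probably not a good fit.

Now consider the following query: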

select * from t order by info desc offset 200000 limit 100;

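Here offset skips 200000 rows. In many scenarios the rows skipped by offset are rows we do not need (for example, rows the client has already read). As the offset grows, a large share of the rows scanned in descending info order are already-read rows, and walking past them is what slows the query down.

Baseline: about 219 ms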

bill=# explain analyze select * from t order by info desc offset 200000 limit 100;    
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5369.60..5372.29 rows=100 width=20) (actual time=219.111..219.224 rows=100 loops=1)
   ->  Index Scan Backward using idx_t_1 on t  (cost=0.43..268462.02 rows=10000115 width=20) (actual time=0.037..208.732 rows=200100 loops=1)
 Planning Time: 0.197 ms
 Execution Time: 219.247 ms
(4 rows)

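Optimization idea:
Since many of the rows skipped by offset are useless, we can use hash grouping to reduce how much useless data has to be scanned.

For example, create 20 partial indexes, one per hash bucket, and pick one bucket at random when querying: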

bill=# do language plpgsql $$  
bill$# declare  
bill$#  sql text;  
bill$# begin  
bill$#   for i in 0..19 loop  
bill$#     sql := format($_$  
bill$#       create index idx_t_p_%s on t(info) where mod(abs(hashint8(id)),20)=%s;  
bill$#     $_$, i, i);  
bill$#     execute sql;  
bill$#   end loop;  
bill$# end;  
bill$# $$; 
DO
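
Before relying on the buckets, it may be worth checking that the hash spreads the rows evenly. The query below is a minimal sketch for that check; it uses the same expression as the index predicates, and with 10000000 rows each of the 20 buckets should hold roughly 500000 rows.

```sql
-- Count the rows in each hash bucket, using the same expression as the
-- partial-index predicates. This scans the whole table once.
select mod(abs(hashint8(id)),20) as bucket, count(*) as row_count
from t
group by 1
order by 1;
```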

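Each query now scans only about one twentieth of the table, so the offset needed drops by a factor of 20 (from 200000 to 10000) and performance improves markedly. Note that the query returns rows from a single bucket only, so this suits cases where reading a random subset of the data is acceptable.
Time taken: about 19 ms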

bill=# explain analyze select * from t   
bill-# where mod(abs(hashint8(id)),20) = 0   
bill-# order by info desc offset 10000 limit 100;  
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9435.98..9530.34 rows=100 width=20) (actual time=19.322..19.532 rows=100 loops=1)
   ->  Index Scan Backward using idx_t_p_0 on t  (cost=0.38..47179.35 rows=50001 width=20) (actual time=0.050..18.741 rows=10100 loops=1)
 Planning Time: 0.726 ms
 Execution Time: 19.563 ms
(4 rows)
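
One way to pick the bucket at random is to wrap the query in a function, as in the sketch below. The function name page_bucket and its parameters are hypothetical and not part of the original setup. The bucket number is written into the query text as a literal because the planner can only match a partial index when the query's condition is provably implied by the index predicate, which a generic plan with a bind parameter does not guarantee.

```sql
-- Hypothetical helper: returns one page of rows from a single hash bucket.
create or replace function page_bucket(bucket int, off int, lim int)
returns setof t
language plpgsql as $$
begin
  -- Build the statement with the bucket as a literal so that the where
  -- clause matches the predicate of idx_t_p_<bucket>. All three
  -- arguments are integers, so %s cannot inject arbitrary SQL.
  return query execute format(
    'select * from t where mod(abs(hashint8(id)),20) = %s
     order by info desc offset %s limit %s',
    bucket, off, lim);
end;
$$;

-- random() returns a value in [0, 1), so the bucket falls in 0..19.
select * from page_bucket(floor(random()*20)::int, 10000, 100);
```

Running explain analyze on the generated statement for a given bucket is the simplest way to confirm that the matching partial index is actually used.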
