PostgreSQL Custom (Extended) Statistics

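Like Oracle, PostgreSQL uses a cost-based optimizer (CBO). A key part of cost estimation is predicting how many rows each plan node returns. In a hash join, for example, the planner generally builds the hash table from the input with fewer rows.
PostgreSQL's selectivity estimates for a single column are fairly accurate, but estimates for conditions on several columns can be off, because by default the planner assumes the columns are independent and simply multiplies the per-column selectivities.
Starting with PostgreSQL 10, users can define their own statistics objects to cover these multi-column cases. PostgreSQL 10 offers two kinds: functional dependencies between columns and multi-column distinct counts (ndistinct). PostgreSQL 12 added a third kind, multi-column MCV lists, which is why the stxkind output further down shows {d,f,m}.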

Syntax:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name

Example:
1. Create the table

bill=# create table tbl(id int, c1 int, c2 text, c3 int, c4 int, c5 int, c6 int, c7 int, c8 int, c9 int, c10 int);    
CREATE TABLE

2. Insert test data

bill=# insert into tbl select 
bill-# id,random()*100, substring(md5(random()::text), 1, 4), random()*900, random()*10000, random()*10000000,     
bill-# random()*100000, random()*100, random()*200000, random()*40000, random()*90000   
bill-# from generate_series(1,1000000) t(id);
INSERT 0 1000000

3. Analyze the table

bill=# analyze tbl ;
ANALYZE

The planner's row-count estimate for the table (reltuples) is 1e+06:

bill=# select reltuples from pg_class where relname='tbl'; 
 reltuples 
-----------
     1e+06
(1 row)

4. Query examples

bill=#  explain (analyze) select * from tbl where c1=1; 
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16654.33 rows=11000 width=45) (actual time=0.798..53.768 rows=10002 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tbl  (cost=0.00..14554.33 rows=4583 width=45) (actual time=0.043..45.096 rows=3334 loops=3)
         Filter: (c1 = 1)
         Rows Removed by Filter: 329999
 Planning Time: 0.212 ms
 Execution Time: 54.445 ms
(8 rows)

From this plan, the estimated selectivity of c1=1 is 11000/1e+06.

bill=#  explain (analyze) select * from tbl where c2='abc'; 
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..15555.93 rows=16 width=45) (actual time=43.558..45.115 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tbl  (cost=0.00..14554.33 rows=7 width=45) (actual time=40.613..40.613 rows=0 loops=3)
         Filter: (c2 = 'abc'::text)
         Rows Removed by Filter: 333333
 Planning Time: 0.089 ms
 Execution Time: 45.144 ms
(8 rows)

From this plan, the estimated selectivity of c2='abc' is 16/1e+06.

Querying on both columns together:

bill=# explain (analyze) select * from tbl where c1=1 and c2='abc';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16596.10 rows=1 width=45) (actual time=52.635..54.289 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tbl  (cost=0.00..15596.00 rows=1 width=45) (actual time=49.547..49.547 rows=0 loops=3)
         Filter: ((c1 = 1) AND (c2 = 'abc'::text))
         Rows Removed by Filter: 333333
 Planning Time: 0.120 ms
 Execution Time: 54.333 ms
(8 rows)

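The plan reports rows=1. How does the planner arrive at that value? Without extended statistics it treats the two conditions as completely unrelated and multiplies their selectivities:
(11000/1e+06) * (16/1e+06) * 1e+06 = 0.176
The planner never reports an estimate below one row, so 0.176 is shown as rows=1.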
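
The independence calculation can be reproduced with a few lines of Python. This is a minimal sketch, not the planner's code: the helper names are made up for illustration, and the clamping function is only modeled on my understanding of how PostgreSQL rounds row estimates and keeps them at one row or more.

```python
def independent_rows(total_rows, *selectivities):
    """Estimate rows for AND-ed conditions, assuming the columns are independent."""
    combined = 1.0
    for sel in selectivities:
        combined *= sel  # independence assumption: selectivities simply multiply
    return total_rows * combined


def clamp_row_est(nrows):
    """Hypothetical stand-in for the planner's clamping: at least 1 row, else rounded."""
    return 1.0 if nrows <= 1.0 else float(round(nrows))


reltuples = 1e6                # pg_class.reltuples for tbl
sel_c1 = 11000 / reltuples     # from the plan for c1 = 1
sel_c2 = 16 / reltuples        # from the plan for c2 = 'abc'

raw_estimate = independent_rows(reltuples, sel_c1, sel_c2)
shown_estimate = clamp_row_est(raw_estimate)

print(raw_estimate)    # roughly 0.176
print(shown_estimate)  # the rows value the plan displays
```

The raw product is well below one row, which is why the combined plan shows rows=1 for both the Gather node and the scan beneath it.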

Distinct values of a single column:

bill=#  explain (analyze) select c1,count(*) from tbl group by c1;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=16600.40..16625.99 rows=101 width=12) (actual time=135.975..136.197 rows=101 loops=1)
   Group Key: c1
   ->  Gather Merge  (cost=16600.40..16623.97 rows=202 width=12) (actual time=135.964..137.830 rows=303 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=15600.37..15600.63 rows=101 width=12) (actual time=132.322..132.334 rows=101 loops=3)
               Sort Key: c1
               Sort Method: quicksort  Memory: 29kB
               Worker 0:  Sort Method: quicksort  Memory: 29kB
               Worker 1:  Sort Method: quicksort  Memory: 29kB
               ->  Partial HashAggregate  (cost=15596.00..15597.01 rows=101 width=12) (actual time=132.214..132.241 rows=101 loops=3)
                     Group Key: c1
                     ->  Parallel Seq Scan on tbl  (cost=0.00..13512.67 rows=416667 width=4) (actual time=0.013..49.033 rows=333333 loops=3)
 Planning Time: 0.081 ms
 Execution Time: 138.010 ms
(15 rows)

The rows=101 estimate comes from pg_stats.n_distinct for column tbl.c1.

Distinct values across multiple columns:

bill=#  explain (analyze) select c1,c2,count(*) from tbl group by c1,c2;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=57780.75..88532.39 rows=100000 width=17) (actual time=399.335..1312.320 rows=927877 loops=1)
   Group Key: c1, c2
   ->  Gather Merge  (cost=57780.75..86032.39 rows=200000 width=17) (actual time=399.321..925.561 rows=974991 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=56780.73..61947.40 rows=100000 width=17) (actual time=383.192..627.076 rows=324997 loops=3)
               Group Key: c1, c2
               ->  Sort  (cost=56780.73..57822.40 rows=416667 width=9) (actual time=383.178..477.162 rows=333333 loops=3)
                     Sort Key: c1, c2
                     Sort Method: external merge  Disk: 6296kB
                     Worker 0:  Sort Method: external merge  Disk: 6296kB
                     Worker 1:  Sort Method: external merge  Disk: 6072kB
                     ->  Parallel Seq Scan on tbl  (cost=0.00..13512.67 rows=416667 width=9) (actual time=0.012..65.770 rows=333333 loops=3)
 Planning Time: 2.664 ms
 Execution Time: 1361.581 ms
(15 rows)
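
The plan above estimates rows=100000 for the two-column grouping while the query actually returns 927877 groups. One plausible explanation for that round number is sketched below. It rests on two assumptions not stated in this post: that the planner, lacking multi-column statistics, multiplies the per-column distinct counts and then caps the result at about 10% of the table's rows when more than one column is involved, and that c2 has at most 16^4 = 65536 distinct values (four hex characters). The function name is hypothetical.

```python
def estimate_groups_without_ext_stats(reltuples, ndistincts):
    """Rough model of a GROUP BY estimate built only from per-column n_distinct."""
    product = 1.0
    for nd in ndistincts:
        product *= nd  # treat the columns as independent

    cap = reltuples
    if len(ndistincts) > 1:
        # Assumed heuristic: with several grouping columns, cap at 10% of the
        # rows, but never below the largest single-column distinct count.
        cap = max(reltuples * 0.1, max(ndistincts))
        cap = min(cap, reltuples)
    return min(product, cap)


reltuples = 1e6
n_distinct_c1 = 101       # matches rows=101 in the single-column plan
n_distinct_c2 = 16 ** 4   # assumption: upper bound for 4 hex characters

single = estimate_groups_without_ext_stats(reltuples, [n_distinct_c1])
combined = estimate_groups_without_ext_stats(reltuples, [n_distinct_c1, n_distinct_c2])

print(single)    # single-column grouping
print(combined)  # two-column grouping
```

Under these assumptions the sketch lands on the same 100000 that the plan shows, which suggests the estimate is a cap rather than a measurement of how c1 and c2 actually combine.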

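5. Extended statistics example
Create an extended statistics object by naming the columns it covers and, optionally, the kinds of statistics to collect, such as dependencies and ndistinct. If no kind is given, all supported kinds are built.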

bill=# create statistics s1 on c1,c2,c3 from tbl;  
CREATE STATISTICS

The statistics are only built once the table has been analyzed again:

bill=# analyze tbl ;
ANALYZE
bill=# select * from pg_statistic_ext where stxname='s1';  
  oid   | stxrelid | stxname | stxnamespace | stxowner | stxkeys | stxkind 
--------+----------+---------+--------------+----------+---------+---------
 130877 |   130871 | s1      |        51039 |    16384 | 2 3 4   | {d,f,m}
(1 row)

With the extended statistics in place, rerun the queries from above:

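–AND condition on two columns
Both conditions are on columns covered by the multi-column statistics, so the planner can use the dependency between them to estimate the selectivity of the combined AND condition.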

bill=# explain (analyze) select * from tbl where c1=1 and c2='abc'; 
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16597.00 rows=10 width=45) (actual time=49.590..51.071 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on tbl  (cost=0.00..15596.00 rows=4 width=45) (actual time=46.791..46.791 rows=0 loops=3)
         Filter: ((c1 = 1) AND (c2 = 'abc'::text))
         Rows Removed by Filter: 333333
 Planning Time: 0.207 ms
 Execution Time: 51.098 ms
(8 rows)

–Distinct values across multiple columns:

bill=# explain (analyze) select c1,c2,count(*) from tbl group by c1,c2;  
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=129502.29..148187.51 rows=868522 width=17) (actual time=1007.860..1819.762 rows=927877 loops=1)
   Group Key: c1, c2
   ->  Sort  (cost=129502.29..132002.29 rows=1000000 width=9) (actual time=1007.844..1383.949 rows=1000000 loops=1)
         Sort Key: c1, c2
         Sort Method: external merge  Disk: 18640kB
         ->  Seq Scan on tbl  (cost=0.00..19346.00 rows=1000000 width=9) (actual time=0.011..162.332 rows=1000000 loops=1)
 Planning Time: 0.116 ms
 Execution Time: 1869.341 ms
(8 rows)

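Summary:
1. PostgreSQL 10 and later support user-defined multi-column statistics: combined distinct counts (ndistinct) and column-to-column dependencies. The {d,f,m} in the stxkind output above also includes multi-column MCV lists, added in PostgreSQL 12.
2. Multi-column distinct counts help estimate GROUP BY, count(distinct), and similar operations.
3. Column dependencies help estimate the selectivity of AND conditions across several columns. Roughly, the algorithm is:

Selectivity of a = ? and b = ?:
min( "selectivity(a) * (a=>b)" , "selectivity(b) * (b=>a)" )

4. Because the number of possible column combinations is large, the database by default collects statistics such as histograms for single columns only. When you know that certain columns will be queried together, create extended multi-column statistics for them.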
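
The min() rule in item 3 is a simplification. As best I recall, PostgreSQL's source notes on functional dependencies describe a blend of the form P(a,b) = P(a) * (f + (1 - f) * P(b)), where f is the degree to which a determines b. The sketch below implements that form; treat the formula as an assumption to check against your server version. The function name and the f values are hypothetical, and only the two selectivities come from the plans in this post.

```python
def dependency_selectivity(sel_a, sel_b, degree):
    """Combine two selectivities using a functional-dependency degree.

    degree = 0.0 means a tells us nothing about b (plain independence);
    degree = 1.0 means a fully determines b (the b condition adds nothing).
    """
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be between 0 and 1")
    return sel_a * (degree + (1.0 - degree) * sel_b)


sel_c1 = 11000 / 1e6   # selectivity of c1 = 1, from the plan
sel_c2 = 16 / 1e6      # selectivity of c2 = 'abc', from the plan

for degree in (0.0, 0.5, 1.0):   # hypothetical dependency degrees
    sel = dependency_selectivity(sel_c1, sel_c2, degree)
    print(degree, sel, sel * 1e6)  # degree, selectivity, estimated rows
```

At degree 0.0 the result equals the plain product of the two selectivities, and at degree 1.0 it equals the selectivity of the first condition alone, so the dependency degree moves the estimate between those two extremes.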

foucus、
