PostgreSQL - How the Planner Uses Statistics

Environment:
CentOS 7
PG 10.4

Setup:
Create a table and populate it with data.
mytest=# create sequence seq_test0629_id;
CREATE SEQUENCE
mytest=# create table test0629 (id int not null default nextval('seq_test0629_id'::regclass), col1 varchar(128), col2 varchar(128), primary key (id) );
CREATE TABLE
mytest=# create or replace function insert_test0629(x int) returns void as $$
mytest$# declare i int;
mytest$# begin
mytest$# -- the FOR loop manages its own counter; no manual initialization or
mytest$# -- increment of i is needed (such assignments have no effect anyway)
mytest$# for i in 1..x loop
mytest$# insert into test0629(col1,col2) values( repeat('Abcdefg',2), repeat('Hijklmn',2) );
mytest$# end loop;
mytest$# end;
mytest$# $$ language plpgsql;
CREATE FUNCTION
mytest=# select insert_test0629(1000);
insert_test0629
-----------------
(1 row)
mytest=# select count(*) from test0629 ;
count
-------
1000
(1 row)

Getting started:
First, look at the simplest possible query plan.
mytest=# explain select * from test0629 ;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on test0629 (cost=0.00..19.00 rows=1000 width=34)
(1 row)

How does the planner determine the cardinality of the table?
mytest=# SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname like 'test0629%';
relname | relkind | reltuples | relpages
---------------+---------+-----------+----------
test0629 | r | 1000 | 9
test0629_pkey | i | 1000 | 5
(2 rows)
These numbers are current as of the last VACUUM or ANALYZE on the table. The planner then fetches the actual current number of pages in the table (a cheap operation that does not require a table scan). If that differs from relpages, reltuples is scaled proportionally to arrive at a current row estimate.
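These two columns are enough to reproduce the Seq Scan estimates in the first plan. A minimal sketch, assuming the default cost parameters seq_page_cost = 1.0 and cpu_tuple_cost = 0.01:

-- Seq Scan cost = pages read * seq_page_cost + rows scanned * cpu_tuple_cost
--               = 9 * 1.0 + 1000 * 0.01 = 19.00
SELECT relpages * current_setting('seq_page_cost')::float
     + reltuples * current_setting('cpu_tuple_cost')::float AS seq_scan_cost
FROM pg_class WHERE relname = 'test0629';

-- Row estimate: scale reltuples by the table's current physical page count,
-- which the planner obtains from the file size without scanning the table.
SELECT reltuples / relpages
     * (pg_relation_size(oid) / current_setting('block_size')::int) AS row_estimate
FROM pg_class WHERE relname = 'test0629';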

mytest=# EXPLAIN SELECT * FROM test0629 WHERE id < 200;
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using test0629_pkey on test0629 (cost=0.28..12.78 rows=200 width=34)
Index Cond: (id < 200)
(2 rows)

The planner examines the WHERE clause and looks up the restriction selectivity function for the < operator in pg_operator. That function is recorded in pg_operator.oprrest; for this operator it is scalarltsel, which retrieves the histogram for the id column from pg_statistic.
mytest=# select * from pg_operator where oprname = '<';
oprname | oprnamespace | oprowner | oprkind | oprcanmerge | oprcanhash | oprleft | oprright | oprresult | oprcom | oprnegate | oprcode | oprrest | oprjoin
---------+--------------+----------+---------+-------------+------------+---------+----------+-----------+--------+-----------+--------------------------+-------------+----------------
< | 11 | 10 | b | f | f | 23 | 20 | 16 | 419 | 82 | int48lt | scalarltsel | scalarltjoinsel
.......
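The catalog holds one row per operand-type pairing of <. To narrow this down to the operator the condition id < 200 actually uses (int4 < int4), one can filter on the operand types; a minimal sketch:

SELECT oprname, oprleft::regtype, oprright::regtype, oprrest
FROM pg_operator
WHERE oprname = '<'
  AND oprleft = 'int4'::regtype
  AND oprright = 'int4'::regtype;
-- oprrest for this row is scalarltsel, the restriction estimator for <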

mytest=# SELECT histogram_bounds FROM pg_stats
mytest-# WHERE tablename='test0629' AND attname='id';
histogram_bounds
-------------------------------------------------------------------------------------------------------------------------
{1,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590,600,610,620,630,640,650,660,670,680,690,700,710,720,730,740,750,760,770,780,790,800,810,820,830,840,850,860,870,880,890,900,910,920,930,940,950,960,970,980,990,1000}

The planner then works out what fraction of the histogram the condition < 200 occupies; that fraction is the selectivity. The histogram divides the column's values into buckets of roughly equal population, so the task is to locate the bucket the boundary value falls into, count the fraction of that bucket below the value, and add all the buckets that come before it. The 101 bounds above define 100 buckets, and values below 200 fill the first 19 buckets entirely plus all of the 20th (spanning 190..200). Assuming a linear distribution of values inside each bucket, the selectivity is:

selectivity = (19 + (200 - bucket[20].min)/(bucket[20].max - bucket[20].min))/num_buckets
            = (19 + (200 - 190)/(200 - 190))/100
            = 0.2

That is, 19 whole buckets plus the linear fraction of the 20th (a full 1.0 here, since 200 is exactly that bucket's upper bound), divided by the total number of buckets. The estimated row count is the product of the selectivity and the cardinality of test0629:

rows = rel_cardinality * selectivity = 1000 * 0.2 = 200 (rounded off)
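As a cross-check, the same figure can be reproduced directly from pg_stats. A minimal sketch, counting histogram bounds (it works here without intra-bucket interpolation because 200 lands exactly on a bucket boundary):

SELECT (count(*) FILTER (WHERE b <= 200) - 1)::float
       / (count(*) - 1) AS selectivity     -- (21 - 1) / (101 - 1) = 0.2
FROM (SELECT unnest(histogram_bounds::text::int[]) AS b
      FROM pg_stats
      WHERE tablename = 'test0629' AND attname = 'id') s;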


What if the condition is an equality match instead?
mytest=# explain select * from test0629 where id = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using test0629_pkey on test0629 (cost=0.28..8.29 rows=1 width=34)
Index Cond: (id = 1)
(2 rows)

Now look at a different column. (The plan below uses test0629_col1_idx, so an index must have been created on col1 beforehand, e.g. with create index on test0629 (col1).)
mytest=# explain select * from test0629 where col1 = 'Abcdefg';
QUERY PLAN
-----------------------------------------------------------------------------------
Index Scan using test0629_col1_idx on test0629 (cost=0.28..4.29 rows=1 width=34)
Index Cond: ((col1)::text = 'Abcdefg'::text)
(2 rows)

mytest=# SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
mytest-# WHERE tablename='test0629' AND attname='col1';
null_frac | n_distinct | most_common_vals | most_common_freqs
-----------+------------+------------------+-------------------
0 | 1 | {AbcdefgAbcdefg} | {1}
(1 row)

mytest=# update test0629 set col1 = 'ABCDEFGABCDEFG' where mod(id,5) = 0;
UPDATE 200
mytest=# update test0629 set col1 = 'ABCDEFGabcdefg' where mod(id,5) = 1;
UPDATE 200
mytest=# update test0629 set col1 = 'abcdefgABCDEFG' where mod(id,5) = 2;
UPDATE 200
mytest=# SELECT attname, null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='test0629' ;
attname | null_frac | n_distinct | most_common_vals | most_common_freqs
---------+-----------+------------+---------------------------------------------------------------+-------------------
id | 0 | -1 | |
col1 | 0 | 4 | {AbcdefgAbcdefg,abcdefgABCDEFG,ABCDEFGabcdefg,ABCDEFGABCDEFG} | {0.4,0.2,0.2,0.2}
col2 | 0 | 1 | {HijklmnHijklmn} | {1}
(3 rows)

(For the statistics above to reflect the updates, an ANALYZE must have run in between, presumably autovacuum's background autoanalyze.) The MCV list shows 'ABCDEFGABCDEFG' with a frequency of 0.2, so for a condition col1 = 'ABCDEFGABCDEFG':
selectivity = mcf[4] = 0.2
rows = 1000 * 0.2 = 200
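A minimal sketch of pulling that frequency straight from pg_stats:

SELECT f.val, f.freq
FROM pg_stats s,
     unnest(s.most_common_vals::text::text[],
            s.most_common_freqs) AS f(val, freq)
WHERE s.tablename = 'test0629' AND s.attname = 'col1'
  AND f.val = 'ABCDEFGABCDEFG';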

How is the selectivity estimated when a value does not appear in the MCV list?
mytest=# update test0629 set col1 = 'xxx' where id =666;
UPDATE 1
mytest=# update test0629 set col2 = 'yyy' where id = 888;
UPDATE 1
mytest=# SELECT attname, null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='test0629' ;
attname | null_frac | n_distinct | most_common_vals | most_common_freqs
---------+-----------+------------+---------------------------------------------------------------+-------------------
id | 0 | -1 | |
col1 | 0 | 4 | {AbcdefgAbcdefg,abcdefgABCDEFG,ABCDEFGabcdefg,ABCDEFGABCDEFG} | {0.4,0.2,0.2,0.2}
col2 | 0 | 1 | {HijklmnHijklmn} | {1}
(3 rows)

mytest=# explain select * from test0629 where col2 = 'yyy';
QUERY PLAN
----------------------------------------------------------
Seq Scan on test0629 (cost=0.00..23.50 rows=1 width=34)
Filter: ((col2)::text = 'yyy'::text)
(2 rows)
That is: add up all the frequencies in the MCV list, subtract the sum from one, and divide by the number of the remaining distinct values. This amounts to assuming that the fraction of the column not covered by the MCV list is spread evenly across all the other distinct values:
selectivity = (1 - sum(mcf))/(num_distinct - num_mcv)
For col2 the statistics still claim a single distinct value covering all rows (the updated row has not been re-analyzed), so the formula degenerates to 0/0; the planner clamps its row estimate to a minimum of one row, which is why the plan shows rows=1.
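A minimal sketch of that arithmetic against pg_stats (NULLIF guards the degenerate 0/0 case just described, returning NULL instead of erroring):

SELECT (1 - (SELECT sum(f) FROM unnest(s.most_common_freqs) f))
       / NULLIF(s.n_distinct - array_length(s.most_common_freqs, 1), 0)
       AS selectivity
FROM pg_stats s
WHERE s.tablename = 'test0629' AND s.attname = 'col2';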

Next, consider a WHERE clause with more than one condition:
mytest=# explain select * from test0629 where id < 100 and col1 = 'abcdefgABCDEFG';
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on test0629 (cost=5.03..17.53 rows=20 width=34)
Recheck Cond: (id < 100)
Filter: ((col1)::text = 'abcdefgABCDEFG'::text)
-> Bitmap Index Scan on test0629_pkey (cost=0.00..5.03 rows=100 width=0)
Index Cond: (id < 100)
(5 rows)
selectivity = selectivity(id < 100) * selectivity(col1 = 'abcdefgABCDEFG') = 0.1 * 0.2 = 0.02
rows = 1000 * 0.02 = 20
The planner assumes the two conditions are independent and simply multiplies their selectivities, which here matches the plan's estimate of 20 rows.

A join between two tables:
mytest=# create table test062902 as select * from test0629 ;
SELECT 1000
mytest=# alter table test062902 add primary key (id);
ALTER TABLE
mytest=# create index on test062902 (col1);
CREATE INDEX
mytest=# explain select * from test0629 t1, test062902 t2 where t1.id = t2.id and t1.id < 20;
QUERY PLAN
-----------------------------------------------------------------------------------------
Hash Join (cost=15.93..47.55 rows=20 width=66)
Hash Cond: (t2.id = t1.id)
-> Seq Scan on test062902 t2 (cost=0.00..19.00 rows=1000 width=32)
-> Hash (cost=15.68..15.68 rows=20 width=34)
-> Bitmap Heap Scan on test0629 t1 (cost=4.43..15.68 rows=20 width=34)
Recheck Cond: (id < 20)
-> Bitmap Index Scan on test0629_pkey (cost=0.00..4.43 rows=20 width=0)
Index Cond: (id < 20)
(8 rows)

mytest=# SELECT tablename, null_frac,n_distinct, most_common_vals,most_common_freqs FROM pg_stats
WHERE tablename IN ('test0629','test062902') AND attname='col1';
tablename | null_frac | n_distinct | most_common_vals | most_common_freqs
------------+-----------+------------+---------------------------------------------------------------+---------------------
test0629 | 0 | 4 | {AbcdefgAbcdefg,abcdefgABCDEFG,ABCDEFGabcdefg,ABCDEFGABCDEFG} | {0.4,0.2,0.2,0.2}
test062902 | 0 | 5 | {AbcdefgAbcdefg,abcdefgABCDEFG,ABCDEFGABCDEFG,ABCDEFGabcdefg} | {0.4,0.2,0.2,0.199}
(2 rows)

Lacking anything better, the equality-join estimator combines the per-column statistics of the join columns:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
The join here is on id, which is unique in both tables (n_distinct = -1, i.e. 1000 distinct values each), so the selectivity is 1/1000 and the estimated join size is 20 * 1000 * (1/1000) = 20 rows, matching the plan.


Functional dependencies between columns
mytest=# CREATE TABLE t (a INT, b INT);
CREATE TABLE
mytest=# INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
INSERT 0 10000
mytest=# ANALYZE t;
ANALYZE
mytest=# SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
relpages | reltuples
----------+-----------
45 | 10000
(1 row)

mytest=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 9900
Planning time: 0.252 ms
Execution time: 2.644 ms
(5 rows)

mytest=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-----------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.155 ms
Execution time: 2.688 ms
(5 rows)
The planner estimates the selectivity of each condition individually, arriving at the same 1% estimate as above (100 of 10000 rows). It then assumes the conditions are independent, and so multiplies their selectivities: 0.01 * 0.01 = 0.0001, i.e. an estimate of just one row. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher.

This problem can be fixed by creating a statistics object that directs ANALYZE to collect functional-dependency statistics on the two columns:
mytest=# CREATE STATISTICS stts (dependencies) ON a, b FROM t;
CREATE STATISTICS
mytest=# ANALYZE t;
ANALYZE
mytest=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.241 ms
Execution time: 2.601 ms
(5 rows)

It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when the column values are correlated. Regular statistics, because of their per-individual-column nature, cannot capture any knowledge about cross-column correlation. However, PostgreSQL has the ability to compute multivariate statistics, which can capture such information.

Because the number of possible column combinations is very large, it is impractical to compute multivariate statistics automatically. Instead, extended statistics objects (more often called just statistics objects) can be created to instruct the server to obtain statistics across interesting sets of columns. Statistics objects are created with the CREATE STATISTICS command; creating such an object merely creates a catalog entry expressing interest in the statistics. The actual data collection is performed by ANALYZE (either a manual command or background auto-analyze), and the collected values can be examined in the pg_statistic_ext catalog.

ANALYZE computes extended statistics based on the same sample of table rows it takes for regular single-column statistics. Since the sample size is increased by raising the statistics target for the table or any of its columns, a larger statistics target normally results in more accurate extended statistics, as well as more time spent calculating them.
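A minimal sketch of inspecting what ANALYZE collected for the stts object above (in PG 10 the computed values live directly in pg_statistic_ext; later major versions moved them into a separate pg_statistic_ext_data catalog):

SELECT stxname, stxkeys, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 'stts';
-- With this data set both dependency degrees (a => b and b => a) should
-- come out at or near 1.0, since a and b are always equal.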


Multivariate n-distinct counts
A similar problem occurs with estimating the cardinality of sets of multiple columns. When GROUP BY lists a single column, the n-distinct estimate (visible as the estimated row count of the HashAggregate node) is very accurate:
mytest=# EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
QUERY PLAN
-----------------------------------------------------------------------------------------
HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
Planning time: 0.154 ms
Execution time: 6.890 ms
(5 rows)
But without multivariate statistics, the estimated number of groups in a query that groups by two columns is off by an order of magnitude:

mytest=# EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
-----------------------------------------------------------------------------------------
HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
Planning time: 0.137 ms
Execution time: 7.857 ms
(5 rows)

By redefining the statistics object to also include n-distinct counts on the two columns, the estimate improves greatly:
mytest=# DROP STATISTICS stts;
DROP STATISTICS
mytest=# CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
CREATE STATISTICS
mytest=# ANALYZE t;
ANALYZE
mytest=# EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
-----------------------------------------------------------------------------------------
HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
Planning time: 0.291 ms
Execution time: 7.663 ms
(5 rows)
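As before, the collected multivariate n-distinct coefficients can be inspected in the catalog; a minimal sketch:

SELECT stxname, stxkeys, stxndistinct
FROM pg_statistic_ext
WHERE stxname = 'stts';
-- With this data set the {a,b} group count should come out at (or close to)
-- 100, the true number of distinct (a,b) combinations.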


Source: blog.csdn.net/chuckchen1222/article/details/80860567