os: ubuntu 16.04
db: postgresql 9.6.8
bloom: 1.0
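bloom provides an index access method based on Bloom filters.
A Bloom filter is a space-efficient data structure used to test whether an element is a member of a set.
As an index access method, it quickly excludes non-matching tuples by means of signatures whose size is fixed when the index is created. Because a signature is a lossy summary of the indexed values, the index can report false positives, which are removed by rechecking the table rows.
This type of index is most useful when a table has many attributes and queries may test arbitrary combinations of them.
A traditional btree index is faster than a bloom index, but many btree indexes may be needed to support all possible queries, whereas a single bloom index is enough.
Note, however, that bloom indexes only support equality queries, whereas btree indexes can also perform inequality and range searches.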
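To make the signature idea concrete, here is a minimal Python sketch of a per-row signature filter. It is an illustration only, not the extension's actual code: the 80-bit signature and 2 bits per column mirror the defaults documented for the bloom module, while the hash (SHA-256) and the helper names value_bits, row_signature, and may_match are assumptions made up for this example.

```python
import hashlib

SIGNATURE_BITS = 80   # assumed signature length (documented default of the bloom module)
BITS_PER_COLUMN = 2   # assumed bits set per column value (documented default)


def value_bits(column: int, value: int) -> int:
    """Return a bit mask with up to BITS_PER_COLUMN bits set for one column value."""
    mask = 0
    for k in range(BITS_PER_COLUMN):
        # Hypothetical hash choice; the real extension uses its own hashing.
        digest = hashlib.sha256(f"{column}:{value}:{k}".encode()).digest()
        position = int.from_bytes(digest[:8], "big") % SIGNATURE_BITS
        mask |= 1 << position
    return mask


def row_signature(row: tuple) -> int:
    """OR together the bit masks of every column value in the row."""
    signature = 0
    for column, value in enumerate(row):
        signature |= value_bits(column, value)
    return signature


def may_match(signature: int, conditions: dict) -> bool:
    """Check equality conditions {column: value} against a row signature.

    False means the row definitely does not match. True may be a false
    positive, so the real row must still be rechecked.
    """
    wanted = 0
    for column, value in conditions.items():
        wanted |= value_bits(column, value)
    return (signature & wanted) == wanted


rows = [(1, 2, 3, 4, 5, 6), (10, 20, 30, 40, 50, 60)]
signatures = [row_signature(r) for r in rows]

# Equality on an arbitrary subset of columns (zero-based columns 1 and 4).
query = {1: 20, 4: 50}
candidates = [r for r, s in zip(rows, signatures) if may_match(s, query)]
# Recheck against the actual rows to drop any false positives.
matches = [r for r in candidates if all(r[c] == v for c, v in query.items())]
print(matches)  # [(10, 20, 30, 40, 50, 60)]
```

One signature per row covers every indexed column, which is why a single index of this kind can serve equality conditions on any combination of columns, and also why it cannot answer range conditions.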
Version
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial
$ psql
psql (9.6.8)
Type "help" for help.
postgres=# select version();
version
----------------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.8 on x86_64-pc-linux-gnu (Ubuntu 9.6.8-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)
postgres=#
Installation
bloom ships as an extension in the postgresql-contrib package. Once that package is installed, the related files are present:
$ dpkg -l |grep -i postgresql |grep -i contrib
ii postgresql-contrib-9.6 9.6.8-1.pgdg16.04+1 amd64 additional facilities for PostgreSQL
$ ls -l /usr/lib/postgresql/9.6/lib |grep -i bloom
-rw-r--r-- 1 root root 26464 Feb 27 2018 bloom.so
$ ls -l /usr/share/postgresql/9.6/extension |grep -i bloom
-rw-r--r-- 1 root root 666 Feb 27 2018 bloom--1.0.sql
-rw-r--r-- 1 root root 156 Feb 27 2018 bloom.control
Experiment
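Note: bloom itself does not need to be preloaded; it is loaded on demand once CREATE EXTENSION bloom has been run. The shared_preload_libraries line below simply records the setting used on this test machine.
$ vi /etc/postgresql/9.6/main/postgresql.conf
shared_preload_libraries = 'bloom,pg_stat_statements'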
$ sudo /etc/init.d/postgresql restart
$ psql
psql (9.6.8)
Type "help" for help.
postgres=#
postgres=# select *
from pg_available_extensions e
where 1=1
and e.name like '%bloom%'
;
postgres=# select *
from pg_extension pe
where 1=1
and pe.extname like '%bloom%'
;
postgres=# create extension bloom;
postgres=# \dx
                            List of installed extensions
   Name   | Version |   Schema   |                   Description
----------+---------+------------+--------------------------------------------------
 bloom    | 1.0     | public     | bloom access method - signature file based index
 plpgsql  | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)
postgres=# select * from pg_am;
 amname |  amhandler  | amtype
--------+-------------+--------
 btree  | bthandler   | i
 hash   | hashhandler | i
 gist   | gisthandler | i
 gin    | ginhandler  | i
 spgist | spghandler  | i
 brin   | brinhandler | i
 bloom  | blhandler   | i
(7 rows)
pg_am now lists the bloom access method as well.
postgres=# CREATE TABLE tbloom0 AS
SELECT
(random() * 1000000)::int as i1,
(random() * 1000000)::int as i2,
(random() * 1000000)::int as i3,
(random() * 1000000)::int as i4,
(random() * 1000000)::int as i5,
(random() * 1000000)::int as i6
FROM
generate_series(1,10000000);
postgres=# CREATE TABLE tbloom1 AS
SELECT
(random() * 1000000)::int as i1,
(random() * 1000000)::int as i2,
(random() * 1000000)::int as i3,
(random() * 1000000)::int as i4,
(random() * 1000000)::int as i5,
(random() * 1000000)::int as i6
FROM
generate_series(1,10000000);
postgres=# CREATE TABLE tbloom2 AS
SELECT
(random() * 1000000)::int as i1,
(random() * 1000000)::int as i2,
(random() * 1000000)::int as i3,
(random() * 1000000)::int as i4,
(random() * 1000000)::int as i5,
(random() * 1000000)::int as i6
FROM
generate_series(1,10000000);
postgres=# \d+
                                  List of relations
 Schema |         Name         |     Type      |  Owner   |    Size    | Description
--------+----------------------+---------------+----------+------------+-------------
 public | tbloom0              | table         | postgres | 498 MB     |
 public | tbloom1              | table         | postgres | 498 MB     |
 public | tbloom2              | table         | postgres | 498 MB     |
(3 rows)
postgres=# CREATE INDEX bloomidx ON tbloom1 USING bloom (i1, i2, i3, i4, i5, i6);
postgres=# CREATE index btreeidx ON tbloom2 (i1, i2, i3, i4, i5, i6);
postgres=# \di+
                                     List of relations
 Schema |         Name         | Type  |  Owner   |      Table      |  Size   | Description
--------+----------------------+-------+----------+-----------------+---------+-------------
 public | bloomidx             | index | postgres | tbloom1         | 153 MB  |
 public | btreeidx             | index | postgres | tbloom2         | 387 MB  |
(2 rows)
The btree index is much larger than the bloom index (387 MB versus 153 MB).
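As a rough sanity check on the bloom index size, the following Python calculation estimates it from the row count. It assumes the index was built with the default 80-bit signature and that each index entry stores the signature plus a 6-byte pointer to the heap row; page headers and other overhead are ignored, so treat it as an estimate rather than an exact accounting.

```python
ROWS = 10_000_000        # rows in tbloom1
SIGNATURE_BITS = 80      # assumed: default signature length of the bloom module
HEAP_POINTER_BYTES = 6   # assumed: size of the row pointer stored with each entry

entry_bytes = SIGNATURE_BITS // 8 + HEAP_POINTER_BYTES
estimated_mib = ROWS * entry_bytes / (1024 * 1024)
print(f"{entry_bytes} bytes per entry, about {estimated_mib:.0f} MiB in total")
# 16 bytes per entry, about 153 MiB in total
```

The estimate lands close to the 153 MB reported for bloomidx. A btree entry, by contrast, has to store all six 4-byte key values, which is a large part of why btreeidx comes out bigger.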
Next, look at the execution plans of the same query against each table.
Full table scan (no index)
postgres=# EXPLAIN ANALYZE VERBOSE SELECT * FROM tbloom0 WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tbloom0  (cost=0.00..213695.00 rows=1 width=24) (actual time=874.166..874.166 rows=0 loops=1)
   Output: i1, i2, i3, i4, i5, i6
   Filter: ((tbloom0.i2 = 898732) AND (tbloom0.i5 = 123451))
   Rows Removed by Filter: 10000000
 Planning time: 0.554 ms
 Execution time: 874.234 ms
(6 rows)
Time: 901.610 ms
bloom index
postgres=# EXPLAIN ANALYZE VERBOSE SELECT * FROM tbloom1 WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbloom1  (cost=178436.00..178440.02 rows=1 width=24) (actual time=75.664..75.664 rows=0 loops=1)
   Output: i1, i2, i3, i4, i5, i6
   Recheck Cond: ((tbloom1.i2 = 898732) AND (tbloom1.i5 = 123451))
   Rows Removed by Index Recheck: 2362
   Heap Blocks: exact=2327
   ->  Bitmap Index Scan on bloomidx  (cost=0.00..178436.00 rows=1 width=0) (actual time=56.799..56.799 rows=2362 loops=1)
         Index Cond: ((tbloom1.i2 = 898732) AND (tbloom1.i5 = 123451))
 Planning time: 0.342 ms
 Execution time: 75.736 ms
(9 rows)
Time: 76.552 ms
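The "Rows Removed by Index Recheck: 2362" line is the bloom filter's false positives showing up. A rough Python estimate of how many to expect is sketched below. It assumes the default settings (80-bit signature, 2 bits per column) and treats every set bit as an independent, uniformly random position, which is a simplification of what the extension actually does, so only the order of magnitude is meaningful.

```python
ROWS = 10_000_000
SIGNATURE_BITS = 80     # assumed default signature length
BITS_PER_COLUMN = 2     # assumed default bits per column
INDEXED_COLUMNS = 6     # i1 .. i6
QUERIED_COLUMNS = 2     # i2 and i5

# Probability that one particular bit is set in a row's signature,
# treating each of the row's bits as an independent uniform draw.
bits_set_per_row = INDEXED_COLUMNS * BITS_PER_COLUMN
p_bit_set = 1 - (1 - 1 / SIGNATURE_BITS) ** bits_set_per_row

# A non-matching row passes the index only if all query bits happen to be set.
query_bits = QUERIED_COLUMNS * BITS_PER_COLUMN
p_false_positive = p_bit_set ** query_bits
expected_false_positives = ROWS * p_false_positive

print(f"p(bit set) = {p_bit_set:.3f}")
print(f"expected false positives = {expected_false_positives:.0f}")
```

Under these assumptions the model predicts a few thousand false positives out of ten million rows, the same order of magnitude as the 2362 rows the recheck removed here.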
btree index
Surprisingly, with the default settings the planner did not use the btree index at all:
postgres=# EXPLAIN ANALYZE VERBOSE SELECT * FROM tbloom2 WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tbloom2  (cost=0.00..213695.00 rows=1 width=24) (actual time=909.765..909.765 rows=0 loops=1)
   Output: i1, i2, i3, i4, i5, i6
   Filter: ((tbloom2.i2 = 898732) AND (tbloom2.i5 = 123451))
   Rows Removed by Filter: 10000000
 Planning time: 0.095 ms
 Execution time: 909.785 ms
(6 rows)
Time: 910.325 ms
postgres=# set random_page_cost = 2;
SET
Time: 0.229 ms
postgres=# EXPLAIN ANALYZE VERBOSE SELECT * FROM tbloom2 WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using btreeidx on public.tbloom2  (cost=0.56..199158.57 rows=1 width=24) (actual time=290.799..290.799 rows=0 loops=1)
   Output: i1, i2, i3, i4, i5, i6
   Index Cond: ((tbloom2.i2 = 898732) AND (tbloom2.i5 = 123451))
   Heap Fetches: 0
 Planning time: 0.081 ms
 Execution time: 290.822 ms
(6 rows)
Time: 291.322 ms
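As the plans show, the lookup through the bloom index is much faster: about 76 ms, versus about 291 ms for the btree index-only scan and roughly 870 to 910 ms for a sequential scan.
The main problem with the btree search is that btree is inefficient when the search conditions do not constrain the leading index columns. Here the query filters on i2 and i5 but not on i1, so the whole index has to be scanned.
A better strategy for btree is to create a separate index on each column. The planner can then choose a plan that combines several indexes, but creating and maintaining that many indexes is expensive.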
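The leading-column problem can be illustrated with a small Python sketch. A sorted list of tuples stands in for a composite btree index here; this is a hypothetical model for illustration, not how PostgreSQL implements btree, and the names lookup_leading and lookup_second are made up for the example.

```python
import bisect
import random

random.seed(0)
# Hypothetical stand-in for a composite btree index: tuples kept in sorted order.
index = sorted(
    tuple(random.randrange(1000) for _ in range(3)) for _ in range(10_000)
)


def lookup_leading(index, first_value):
    """Equality on the leading column: binary search finds one contiguous range."""
    lo = bisect.bisect_left(index, (first_value,))
    hi = bisect.bisect_left(index, (first_value + 1,))
    # Only the entries inside the range are read after the binary search.
    return index[lo:hi], hi - lo


def lookup_second(index, second_value):
    """Equality on the second column only: matches are scattered, so scan everything."""
    hits = [entry for entry in index if entry[1] == second_value]
    return hits, len(index)  # every entry is visited


hits, visited = lookup_leading(index, 42)
print(f"leading column: {len(hits)} hits, {visited} entries read")
hits, visited = lookup_second(index, 42)
print(f"second column:  {len(hits)} hits, {visited} entries read")
```

Sorting by (first, second, third) groups equal first values together but leaves equal second values spread across the whole list, which matches the behaviour of the index-only scan on btreeidx above.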
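The value of random_page_cost strongly influences whether the planner chooses an index. The default is 4; on storage with fast random I/O (such as SSD), it can be lowered to 2.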
References:
http://postgres.cn/docs/9.6/bloom.html
https://en.wikipedia.org/wiki/Bloom_filter