os: ubuntu 16.04
postgres-xl: 10alpha3
ip 规划:
gtm 192.168.56.99 gtm
gtms 192.168.56.100 gtm slave
node1 192.168.56.101 gtm-proxy,coordinator,datanode
node2 192.168.56.102 gtm-proxy,coordinator,datanode
node3 192.168.56.103 gtm-proxy,coordinator,datanode
node4 192.168.56.104 gtm-proxy,coordinator,datanode
node1、node2 互为主从,node3、node4 互为主从
端口使用情况
gtm 20001 gtm slave 20001
gtm proxy 20001 gtm proxy 20001
coordinator 20004 coordinator 20005
coordinator slave 20005 coordinator slave 20004
coordinator pool 20010 coordinator pool 20011
20011 20010
datanode 20008 datanode 20009
datanode slave 20009 datanode slave 20008
datanode pool 20012 datanode pool 20013
20013 20012
模拟业务数据,做一些常规的增删改查操作。
连接 pgxl coordinator
需要修改 node1、node2、node3、node4节点所有实例的pg_hba.conf
$ vi pg_hba.conf
host all all 192.168.0.101/32 trust
host all all 192.168.0.102/32 trust
host all all 192.168.0.103/32 trust
host all all 192.168.0.104/32 trust
host all all 192.168.0.0/24 md5
app只能连接 coordinator,不能直连 datanode 节点,架构图参考 https://www.postgres-xl.org/overview/
在 node1 节点使用postgres登录 coordinator,创建app连接的用户。
$ psql -p 20004
psql (PGXL 10alpha2, based on PG 10beta3 (Postgres-XL 10alpha2))
Type "help" for help.
postgres=# create user peiyb with password 'rootroot';
CREATE ROLE
postgres=# create database peiybdb with owner=peiyb;
CREATE DATABASE
postgres=#
postgres=#
peiybdb=> \du
List of roles
Role name | Attributes | Member of
-----------+------------------------------------------------------------+-----------
peiyb | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
peiybdb=> \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
peiybdb | peiyb | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(4 rows)
postgres=# \q
在 node1 节点使用peiyb登录 coordinator
$ psql -p 20004 -U peiyb peiybdb
psql (PGXL 10alpha2, based on PG 10beta3 (Postgres-XL 10alpha2))
Type "help" for help.
peiybdb=>
peiybdb=>
创建机构表、销量表
peiybdb=> create table dm_org(id varchar(100),name varchar(100));
CREATE TABLE
peiybdb=> create table fact_sales(day_id varchar(10),time_id varchar(10),org_id varchar(100),product_id varchar(100),qty int8,amount numeric(10,2));
CREATE TABLE
peiybdb=> \d
List of relations
Schema | Name | Type | Owner
--------+------------+-------+-------
public | dm_org | table | peiyb
public | fact_sales | table | peiyb
(2 rows)
peiybdb=> \d+ dm_org
Table "public.dm_org"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
id | character varying(100) | | | | extended | |
name | character varying(100) | | | | extended | |
Distribute By: HASH(id)
Location Nodes: ALL DATANODES
peiybdb=> \d+ fact_sales
Table "public.fact_sales"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------+------------------------+-----------+----------+---------+----------+--------------+-------------
day_id | character varying(10) | | | | extended | |
time_id | character varying(10) | | | | extended | |
org_id | character varying(100) | | | | extended | |
product_id | character varying(100) | | | | extended | |
qty | bigint | | | | plain | |
amount | numeric(10,2) | | | | main | |
Distribute By: HASH(day_id)
Location Nodes: ALL DATANODES
可以看到,多了Distribute By 和 Location Nodes两行。
peiybdb=> \h create table
[
DISTRIBUTE BY { REPLICATION | ROUNDROBIN | { [HASH | MODULO ] ( column_name ) } } |
DISTRIBUTED { { BY ( column_name ) } | { RANDOMLY } |
DISTSTYLE { EVEN | KEY | ALL } DISTKEY ( column_name )
]
[ TO { GROUP groupname | NODE ( nodename [, ... ] ) } ]
如果DISTRIBUTE BY 后面有如下选项:
REPLICATION,则是复制模式,其余则是分片模式
ROUNDROBIN 指的是按照轮询的方式分布数据
HASH 指的是按照指定列的哈希值分布数据
MODULO 指的是按照指定列的取摩运算分布数据
TO GROUP|NODE 指定了数据分布的节点范围,如果没有指定则默认所有数据节点参与数据分布。
如果没有指定分布模式,即使用普通的CREATE TABLE语句,PGXL会默认采用分片模式将数据分布到所有数据节点。
初始化机构表、销量表数据
peiybdb=> select * from dm_org;
id | name
-------+---------
10001 | A经营部
10003 | C经营部
10004 | D经营部
10002 | B经营部
10005 | E经营部
10006 | F经营部
(6 rows)
peiybdb=> insert into dm_org(id,name) values('10001','A经营部'),('10002','B经营部'),('10003','C经营部'),('10004','D经营部'),('10005','E经营部'),('10006','F经营部');
peiybdb=> insert into fact_sales(day_id,time_id,org_id,product_id,qty,amount)
select t_day.day_id,
'' as time_id,
t_org.id,
((random()*100000)::int8)::varchar as product_id,
t_sale.qty,
t_sale.amount
from (select to_char(generate_series('2018-10-01'::date,'2018-10-25','1 day'),'yyyymmdd') as day_id) t_day,
(SELECT (random()*10000)::int8 as qty, (random()*10000)::numeric(10,2) as amount FROM generate_series(1,10000) ) as t_sale,
(select id,name from dm_org) t_org
where 1=1
;
查询一:B经营部 在 20181010 的总的销售量和销售额。
peiybdb=> explain select sale.day_id,
org.name,
sum(qty),
sum(amount)
from dm_org org,
fact_sales sale
where 1=1
and org.name='B经营部'
and org.id=sale.org_id
and sale.day_id='20181010'
group by sale.day_id,
org.name
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=100.00..123.95 rows=1 width=320)
Group Key: sale.day_id, org.name
-> Remote Subquery Scan on all (node1,node2,node3,node4) (cost=100.00..123.92 rows=1 width=320)
-> Partial GroupAggregate (cost=0.00..23.91 rows=1 width=320)
Group Key: sale.day_id, org.name
-> Nested Loop (cost=0.00..23.89 rows=1 width=280)
Join Filter: ((org.id)::text = (sale.org_id)::text)
-> Remote Subquery Scan on all (node1,node2,node3,node4) (cost=100.00..112.57 rows=1 width=436)
Distribute results by H: id
-> Seq Scan on dm_org org (cost=0.00..12.12 rows=1 width=436)
Filter: ((name)::text = 'B经营部'::text)
-> Materialize (cost=100.00..112.04 rows=1 width=280)
-> Remote Subquery Scan on all (node1,node2,node3,node4) (cost=100.00..112.03 rows=1 width=280)
Distribute results by H: org_id
-> Seq Scan on fact_sales sale (cost=0.00..11.75 rows=1 width=280)
peiybdb=> select sale.day_id,
org.name,
sum(qty),
sum(amount)
from dm_org org,
fact_sales sale
where 1=1
and org.name='B经营部'
and org.id=sale.org_id
and sale.day_id='20181010'
group by sale.day_id,
org.name
day_id | name | sum | sum
----------+---------+----------+-------------
20181010 | B经营部 | 49843700 | 49806627.37
(1 row)
执行计划中出现的 Remote Subquery Scan on all (node1,node2,node3,node4),表明在sql在所有节点都执行了。
登录任何一个 coordinator ,都视为登录同一台 postgresql database cluster。
另外:
维度表一般都是小表,可以在每个节点均保存完整数据,在查询时效率更高。
比如 create table dm_org1(id varchar(100),name varchar(100)) DISTRIBUTE BY REPLICATION ;
查询一 中使用 dm_org、dm_org1对比,前者需要 Time: 3416.255 ms (00:03.416),后者只需要 Time: 485.845 ms,差异还是比较明显。
事实表从一开始是就应该设计成分片,确保在每个节点的数据量尽可能相同。
参考:
https://www.postgres-xl.org/documentation/
https://www.postgres-xl.org/documentation/sql-createtable.html