pgxl 系列之六: ubuntu 16.04 的业务数据模拟

版权声明:本文为博主原创文章,转载请标明出处。 https://blog.csdn.net/ctypyb2002/article/details/83380507

os: ubuntu 16.04
postgres-xl: 10alpha3

ip 规划:

gtm 192.168.56.99 gtm
gtms 192.168.56.100 gtm slave

node1 192.168.56.101 gtm-proxy,coordinator,datanode
node2 192.168.56.102 gtm-proxy,coordinator,datanode
node3 192.168.56.103 gtm-proxy,coordinator,datanode
node4 192.168.56.104 gtm-proxy,coordinator,datanode

node1、node2 互为主从,node3、node4 互为主从
端口使用情况

gtm               20001        gtm slave         20001

gtm proxy         20001        gtm proxy         20001
coordinator       20004        coordinator       20005
coordinator slave 20005        coordinator slave 20004
coordinator pool  20010        coordinator pool  20011
                  20011                          20010
datanode          20008        datanode          20009
datanode slave    20009	       datanode slave    20008
datanode pool     20012        datanode pool     20013
                  20013			                 20012

模拟业务数据,做一些常规的增删改查操作。

连接 pgxl coordinator

需要修改 node1、node2、node3、node4节点所有实例的pg_hba.conf

$ vi pg_hba.conf

host    all             all             192.168.0.101/32        trust
host    all             all             192.168.0.102/32        trust
host    all             all             192.168.0.103/32        trust
host    all             all             192.168.0.104/32        trust
host    all             all             192.168.0.0/24          md5

app只能连接 coordinator,不能直连 datanode 节点,架构图参考 https://www.postgres-xl.org/overview/
在 node1 节点使用postgres登录 coordinator,创建app连接的用户。

$ psql -p 20004
psql (PGXL 10alpha2, based on PG 10beta3 (Postgres-XL 10alpha2))
Type "help" for help.

postgres=# create user peiyb with password 'rootroot';
CREATE ROLE
postgres=# create database peiybdb with owner=peiyb;
CREATE DATABASE
postgres=# 
postgres=# 
peiybdb=> \du
                                   List of roles
 Role name |                         Attributes                         | Member of 
-----------+------------------------------------------------------------+-----------
 peiyb     |                                                            | {}
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

peiybdb=> \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 peiybdb   | peiyb    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)


 
postgres=# \q

在 node1 节点使用peiyb登录 coordinator

$ psql -p 20004 -U peiyb peiybdb
psql (PGXL 10alpha2, based on PG 10beta3 (Postgres-XL 10alpha2))
Type "help" for help.

peiybdb=> 
peiybdb=> 

创建机构表、销量表

peiybdb=> create table dm_org(id varchar(100),name varchar(100));
CREATE TABLE
peiybdb=> create table fact_sales(day_id varchar(10),time_id varchar(10),org_id varchar(100),product_id varchar(100),qty int8,amount numeric(10,2));
CREATE TABLE
peiybdb=> \d
          List of relations
 Schema |    Name    | Type  | Owner 
--------+------------+-------+-------
 public | dm_org     | table | peiyb
 public | fact_sales | table | peiyb
(2 rows)

peiybdb=> \d+ dm_org
                                          Table "public.dm_org"
 Column |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id     | character varying(100) |           |          |         | extended |              | 
 name   | character varying(100) |           |          |         | extended |              | 
Distribute By: HASH(id)
Location Nodes: ALL DATANODES

peiybdb=> \d+ fact_sales
                                          Table "public.fact_sales"
   Column   |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 day_id     | character varying(10)  |           |          |         | extended |              | 
 time_id    | character varying(10)  |           |          |         | extended |              | 
 org_id     | character varying(100) |           |          |         | extended |              | 
 product_id | character varying(100) |           |          |         | extended |              | 
 qty        | bigint                 |           |          |         | plain    |              | 
 amount     | numeric(10,2)          |           |          |         | main     |              | 
Distribute By: HASH(day_id)
Location Nodes: ALL DATANODES

可以看到,多了Distribute By 和 Location Nodes两行。

peiybdb=> \h create table

[ 
  DISTRIBUTE BY { REPLICATION | ROUNDROBIN | { [HASH | MODULO ] ( column_name ) } } |
  DISTRIBUTED { { BY ( column_name ) } | { RANDOMLY } |
  DISTSTYLE { EVEN | KEY | ALL } DISTKEY ( column_name )
]
[ TO { GROUP groupname | NODE ( nodename [, ... ] ) } ]


如果DISTRIBUTE BY 后面有如下选项:
REPLICATION,则是复制模式,其余则是分片模式
ROUNDROBIN 指的是按照轮询的方式分布数据
HASH 指的是按照指定列的哈希值分布数据
MODULO 指的是按照指定列的取摩运算分布数据

TO GROUP|NODE 指定了数据分布的节点范围,如果没有指定则默认所有数据节点参与数据分布。
如果没有指定分布模式,即使用普通的CREATE TABLE语句,PGXL会默认采用分片模式将数据分布到所有数据节点。

初始化机构表、销量表数据

peiybdb=> select * from dm_org;
  id   |  name   
-------+---------
 10001 | A经营部
 10003 | C经营部
 10004 | D经营部
 10002 | B经营部
 10005 | E经营部
 10006 | F经营部
(6 rows)

peiybdb=> insert into dm_org(id,name) values('10001','A经营部'),('10002','B经营部'),('10003','C经营部'),('10004','D经营部'),('10005','E经营部'),('10006','F经营部');

peiybdb=> insert into fact_sales(day_id,time_id,org_id,product_id,qty,amount)
          select t_day.day_id,
		         '' as time_id,
				 t_org.id,
				 ((random()*100000)::int8)::varchar as product_id,
				 t_sale.qty,
				 t_sale.amount
		    from (select to_char(generate_series('2018-10-01'::date,'2018-10-25','1 day'),'yyyymmdd') as day_id) t_day,
                 (SELECT (random()*10000)::int8 as qty, (random()*10000)::numeric(10,2) as amount FROM generate_series(1,10000)  ) as t_sale,
				 (select id,name from dm_org) t_org
		   where 1=1
		   ;

查询一:B经营部 在 20181010 的总的销售量和销售额。

peiybdb=> explain select sale.day_id,
                 org.name,
				 sum(qty),
				 sum(amount)
            from dm_org org,
			     fact_sales sale
		   where 1=1
		     and org.name='B经营部'
			 and org.id=sale.org_id
			 and sale.day_id='20181010'
		   group by sale.day_id,
                    org.name
                                                         QUERY PLAN                                                     
     
-----------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=100.00..123.95 rows=1 width=320)
   Group Key: sale.day_id, org.name
   ->  Remote Subquery Scan on all (node1,node2,node3,node4)  (cost=100.00..123.92 rows=1 width=320)
         ->  Partial GroupAggregate  (cost=0.00..23.91 rows=1 width=320)
               Group Key: sale.day_id, org.name
               ->  Nested Loop  (cost=0.00..23.89 rows=1 width=280)
                     Join Filter: ((org.id)::text = (sale.org_id)::text)
                     ->  Remote Subquery Scan on all (node1,node2,node3,node4)  (cost=100.00..112.57 rows=1 width=436)
                           Distribute results by H: id
                           ->  Seq Scan on dm_org org  (cost=0.00..12.12 rows=1 width=436)
                                 Filter: ((name)::text = 'B经营部'::text)
                     ->  Materialize  (cost=100.00..112.04 rows=1 width=280)
                           ->  Remote Subquery Scan on all (node1,node2,node3,node4)  (cost=100.00..112.03 rows=1 width=280)
                                 Distribute results by H: org_id
                                 ->  Seq Scan on fact_sales sale  (cost=0.00..11.75 rows=1 width=280)
								 
peiybdb=> select sale.day_id,
                 org.name,
				 sum(qty),
				 sum(amount)
            from dm_org org,
			     fact_sales sale
		   where 1=1
		     and org.name='B经营部'
			 and org.id=sale.org_id
			 and sale.day_id='20181010'
		   group by sale.day_id,
                    org.name	
  day_id  |  name   |   sum    |     sum     
----------+---------+----------+-------------
 20181010 | B经营部 | 49843700 | 49806627.37
(1 row)					

执行计划中出现的 Remote Subquery Scan on all (node1,node2,node3,node4),表明在sql在所有节点都执行了。

登录任何一个 coordinator ,都视为登录同一台 postgresql database cluster。

另外:
维度表一般都是小表,可以在每个节点均保存完整数据,在查询时效率更高。
比如 create table dm_org1(id varchar(100),name varchar(100)) DISTRIBUTE BY REPLICATION ;

查询一 中使用 dm_org、dm_org1对比,前者需要 Time: 3416.255 ms (00:03.416),后者只需要 Time: 485.845 ms,差异还是比较明显。

事实表从一开始是就应该设计成分片,确保在每个节点的数据量尽可能相同。

参考:
https://www.postgres-xl.org/documentation/
https://www.postgres-xl.org/documentation/sql-createtable.html

猜你喜欢

转载自blog.csdn.net/ctypyb2002/article/details/83380507