Chatting with some ClickHouse users, I was surprised to learn that many small and medium-sized companies still run the standalone version. Presumably this is because ClickHouse's single-node performance is already blazing fast, and with modest data volumes the cost of reloading data after a failure is low, so the standalone setup is good enough for them.
Now to the topic. Today we set up a 2 × 2 ClickHouse cluster: two shards, each with two replicas, four nodes in total. The logical topology is as follows:
host | shard | replica |
---|---|---|
hadoop2 | 01 | 01 |
hadoop5 | 01 | 02 |
hadoop6 | 02 | 01 |
hadoop7 | 02 | 02 |
1. Install clickhouse-server
Install clickhouse-server on all nodes; for the steps, see the earlier post on installing ClickHouse on CentOS.
2. Modify config.xml
Three sections are involved: remote_servers, zookeeper, and macros. The remote_servers and zookeeper sections are identical on all nodes; only macros differs, with each node setting shard and replica according to its role in the topology table above. The configuration for hadoop2 is given below. (Note that the port in each replica entry must match that server's tcp_port; the default is 9000, and these nodes use 9666.)
<remote_servers incl="clickhouse_remote_servers">
<perftest_2shards_2replicas>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>hadoop2</host>
<port>9666</port>
</replica>
<replica>
<host>hadoop5</host>
<port>9666</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>hadoop6</host>
<port>9666</port>
</replica>
<replica>
<host>hadoop7</host>
<port>9666</port>
</replica>
</shard>
</perftest_2shards_2replicas>
</remote_servers>
<zookeeper>
<node index="1">
<host>hadoop1</host>
<port>2181</port>
</node>
<node index="2">
<host>hadoop2</host>
<port>2181</port>
</node>
<node index="3">
<host>hadoop3</host>
<port>2181</port>
</node>
<node index="4">
<host>hadoop4</host>
<port>2181</port>
</node>
<node index="5">
<host>hadoop5</host>
<port>2181</port>
</node>
</zookeeper>
<macros>
<shard>01</shard>
<replica>01</replica>
</macros>
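The other nodes differ only in their macros values, following the topology table. For example, hadoop5, the second replica of shard 01, would use:
<macros>
    <shard>01</shard>
    <replica>02</replica>
</macros>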
Start clickhouse-server on all nodes:
service clickhouse-server start
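Once all four servers are up, you can optionally verify that each node sees the full cluster by querying the system.clusters table (a quick sanity check; the cluster name must match the one defined under remote_servers):
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'perftest_2shards_2replicas'
This should list all four hosts, grouped into two shards.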
3. Create the tables
Create t_s2_r2 on every node. There is no need to manually substitute shard and replica values: the {shard} and {replica} placeholders are expanded from each node's macros, so the table registers itself in ZooKeeper under the correct shard and replica when it is created.
CREATE TABLE t_s2_r2
(
    dt Date,
    path String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_s2_r2', '{replica}', dt, dt, 8192)
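The engine arguments above use the legacy syntax (ZooKeeper path, replica name, date column, primary key, index granularity). On recent ClickHouse versions the same table would more commonly be declared with explicit PARTITION BY and ORDER BY clauses; a roughly equivalent sketch:
CREATE TABLE t_s2_r2
(
    dt Date,
    path String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_s2_r2', '{replica}')
PARTITION BY toYYYYMM(dt)
ORDER BY dt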
Create a distributed table t_s2_r2_all. It can be created on any single node; t_s2_r2_all acts like a view pointing at all shards, while the real data lives in each node's local t_s2_r2 table.
CREATE TABLE t_s2_r2_all AS t_s2_r2 ENGINE = Distributed(perftest_2shards_2replicas, default, t_s2_r2, rand())
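As an optional sanity check, you can confirm that the replicas registered themselves in ZooKeeper through the system.zookeeper table (it requires an explicit path condition; the path below is for shard 01, matching the {shard} macro value):
SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/t_s2_r2/replicas'
It should return the replica names 01 and 02.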
Insert data
insert into t_s2_r2_all values('2019-07-21','path1')
insert into t_s2_r2_all values('2019-07-22','path1')
insert into t_s2_r2_all values('2019-07-23','path1')
insert into t_s2_r2_all values('2019-07-24','path1')
View the data:
hadoop7 :) select * from t_s2_r2_all
SELECT *
FROM t_s2_r2_all
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
4 rows in set. Elapsed: 0.009 sec.
Check the local t_s2_r2 data on each node. (The Distributed table shards rows by rand(), so the split of three rows on shard 02 and one on shard 01 is simply how the random keys fell this time.)
hadoop2 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
hadoop5 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
1 rows in set. Elapsed: 0.007 sec.
hadoop6 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
3 rows in set. Elapsed: 0.002 sec.
hadoop7 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
3 rows in set. Elapsed: 0.002 sec.
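Replication health can also be inspected through the system.replicas table before and during the failover test (a minimal sketch; column names as in recent ClickHouse releases):
SELECT database, table, is_leader, total_replicas, active_replicas
FROM system.replicas
WHERE table = 't_s2_r2'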
You can see that hadoop2 and hadoop5 hold identical data, as do hadoop6 and hadoop7. Now test high availability by killing the node on hadoop2:
service clickhouse-server stop
t_s2_r2_all can still be queried, because shard 01 still has a surviving replica on hadoop5:
hadoop7 :) select * from t_s2_r2_all
SELECT *
FROM t_s2_r2_all
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
4 rows in set. Elapsed: 0.008 sec.
Data can still be inserted as well:
insert into t_s2_r2_all values('2019-07-29','path2')
This row happens to land on shard 01:
hadoop5 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-29 │ path2 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
2 rows in set. Elapsed: 0.002 sec.
Now restart clickhouse-server on hadoop2; the row just inserted is automatically synchronized to hadoop2's local t_s2_r2:
hadoop2 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-29 │ path2 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
2 rows in set. Elapsed: 0.003 sec.
If clickhouse-server is killed on both hadoop2 and hadoop5 at the same time, t_s2_r2_all becomes unavailable, because shard 01 then has no live replica. That should be easy to understand, so I did not test it.
End