Chatting with some ClickHouse users, I was surprised to learn that many small and medium-sized companies still run the standalone version. Presumably this is because ClickHouse's single-node performance is already blazing fast, and with modest data volumes the cost of reloading data after a failure is low, so the standalone setup is good enough for them.
Now to the topic. Today we set up a 2 × 2 ClickHouse cluster: two shards, each with two replicas, four nodes in total. The logical topology is as follows:
host | shard | replica |
---|---|---|
hadoop2 | 01 | 01 |
hadoop5 | 01 | 02 |
hadoop6 | 02 | 01 |
hadoop7 | 02 | 02 |
1. Install clickhouse-server
Install clickhouse-server on all nodes; for the steps, see the earlier post on installing ClickHouse on CentOS.
2. Modify config.xml
Three sections are involved: remote_servers, zookeeper, and macros. The remote_servers and zookeeper sections are identical on all nodes; only macros differs, with each node setting shard and replica according to its role in the topology table above. The configuration for hadoop2 is given below. (Note that the port in each replica entry must match that server's tcp_port; the default is 9000, and these nodes use 9666.)
<remote_servers incl="clickhouse_remote_servers">
<perftest_2shards_2replicas>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>hadoop2</host>
<port>9666</port>
</replica>
<replica>
<host>hadoop5</host>
<port>9666</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>hadoop6</host>
<port>9666</port>
</replica>
<replica>
<host>hadoop7</host>
<port>9666</port>
</replica>
</shard>
</perftest_2shards_2replicas>
</remote_servers>
<zookeeper>
<node index="1">
<host>hadoop1</host>
<port>2181</port>
</node>
<node index="2">
<host>hadoop2</host>
<port>2181</port>
</node>
<node index="3">
<host>hadoop3</host>
<port>2181</port>
</node>
<node index="4">
<host>hadoop4</host>
<port>2181</port>
</node>
<node index="5">
<host>hadoop5</host>
<port>2181</port>
</node>
</zookeeper>
<macros>
<shard>01</shard>
<replica>01</replica>
</macros>
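The other nodes differ only in their macros values, following the topology table. For example, hadoop5, the second replica of shard 01, would use:
<macros>
    <shard>01</shard>
    <replica>02</replica>
</macros>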
Start clickhouse-server on all nodes:
service clickhouse-server start
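Once all four servers are up, you can optionally verify that each node sees the full cluster by querying the system.clusters table (a quick sanity check; the cluster name must match the one defined under remote_servers):
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'perftest_2shards_2replicas'
This should list all four hosts, grouped into two shards.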
3. Create the tables
Create t_s2_r2 on every node. There is no need to manually substitute shard and replica values: the {shard} and {replica} placeholders are expanded from each node's macros, so the table registers itself in ZooKeeper under the correct shard and replica when it is created.
CREATE TABLE t_s2_r2
(
    dt Date,
    path String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_s2_r2', '{replica}', dt, dt, 8192)
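The engine arguments above use the legacy syntax (ZooKeeper path, replica name, date column, primary key, index granularity). On recent ClickHouse versions the same table would more commonly be declared with explicit PARTITION BY and ORDER BY clauses; a roughly equivalent sketch:
CREATE TABLE t_s2_r2
(
    dt Date,
    path String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_s2_r2', '{replica}')
PARTITION BY toYYYYMM(dt)
ORDER BY dt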
Create a distributed table t_s2_r2_all. It can be created on any single node; t_s2_r2_all acts like a view pointing at all shards, while the real data lives in each node's local t_s2_r2 table.
CREATE TABLE t_s2_r2_all AS t_s2_r2 ENGINE = Distributed(perftest_2shards_2replicas, default, t_s2_r2, rand())
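As an optional sanity check, you can confirm that the replicas registered themselves in ZooKeeper through the system.zookeeper table (it requires an explicit path condition; the path below is for shard 01, matching the {shard} macro value):
SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/t_s2_r2/replicas'
It should return the replica names 01 and 02.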
Insert data
insert into t_s2_r2_all values('2019-07-21','path1')
insert into t_s2_r2_all values('2019-07-22','path1')
insert into t_s2_r2_all values('2019-07-23','path1')
insert into t_s2_r2_all values('2019-07-24','path1')
View the data:
hadoop7 :) select * from t_s2_r2_all
SELECT *
FROM t_s2_r2_all
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
4 rows in set. Elapsed: 0.009 sec.
Check the local t_s2_r2 data on each node. (The Distributed table shards rows by rand(), so the split of three rows on shard 02 and one on shard 01 is simply how the random keys fell this time.)
hadoop2 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
hadoop5 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
1 rows in set. Elapsed: 0.007 sec.
hadoop6 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
3 rows in set. Elapsed: 0.002 sec.
hadoop7 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
3 rows in set. Elapsed: 0.002 sec.
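Replication health can also be inspected through the system.replicas table before and during the failover test (a minimal sketch; column names as in recent ClickHouse releases):
SELECT database, table, is_leader, total_replicas, active_replicas
FROM system.replicas
WHERE table = 't_s2_r2'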
You can see that hadoop2 and hadoop5 hold identical data, as do hadoop6 and hadoop7. Now test high availability by killing the node on hadoop2:
service clickhouse-server stop
t_s2_r2_all can still be queried, because shard 01 still has a surviving replica on hadoop5:
hadoop7 :) select * from t_s2_r2_all
SELECT *
FROM t_s2_r2_all
┌─────────dt─┬─path──┐
│ 2019-07-21 │ path1 │
│ 2019-07-22 │ path1 │
│ 2019-07-24 │ path1 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
4 rows in set. Elapsed: 0.008 sec.
Data can still be inserted as well:
insert into t_s2_r2_all values('2019-07-29','path2')
This row happens to land on shard 01:
hadoop5 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-29 │ path2 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
2 rows in set. Elapsed: 0.002 sec.
Now restart clickhouse-server on hadoop2; the row just inserted is automatically synchronized to hadoop2's local t_s2_r2:
hadoop2 :) select * from t_s2_r2
SELECT *
FROM t_s2_r2
┌─────────dt─┬─path──┐
│ 2019-07-29 │ path2 │
└────────────┴───────┘
┌─────────dt─┬─path──┐
│ 2019-07-23 │ path1 │
└────────────┴───────┘
2 rows in set. Elapsed: 0.003 sec.
If clickhouse-server is killed on both hadoop2 and hadoop5 at the same time, t_s2_r2_all becomes unavailable, because shard 01 then has no live replica. That should be easy to understand, so I did not test it.
End