ClickHouse and Its Friends (13): The ReplicatedMergeTree table engine and its synchronization mechanism

Original source: https://bohutang.me/2020/09/13/clickhouse-and-friends-replicated-merge-tree/

Last Update: 2020-09-13

In MySQL, high availability and data safety are achieved with a master-slave setup, where data is synchronized through the binlog.

In ClickHouse, we can use the ReplicatedMergeTree engine, where data synchronization is coordinated through zookeeper.

This article starts by building a multi-replica cluster, and then takes a first look at the underlying synchronization mechanism.

1. Cluster construction

Set up a 2-replica test cluster. Due to limited resources, both clickhouse-server replicas and a single zookeeper instance run on the same physical machine; to avoid conflicts, the two replicas use different ports.

1.1 zookeeper

docker run  -p 2181:2181 --name some-zookeeper --restart always -d zookeeper

1.2 replica cluster

replica-1 config.xml:

   <zookeeper>
      <node index="1">
         <host>172.17.0.2</host>
         <port>2181</port>
      </node>
   </zookeeper>

   <remote_servers>
      <mycluster_1>
         <shard_1>
            <internal_replication>true</internal_replication>
            <replica>
               <host>s1</host>
               <port>9000</port>
            </replica>
            <replica>
               <host>s2</host>
               <port>9001</port>
            </replica>
         </shard_1>
      </mycluster_1>
   </remote_servers>

   <macros>
      <cluster>mycluster_1</cluster>
      <shard>1</shard>
      <replica>s1</replica>
   </macros>


   <tcp_port>9101</tcp_port>
   <interserver_http_port>9009</interserver_http_port>
   <path>/cluster/d1/datas/</path>

replica-2 config.xml:

   <zookeeper>
      <node index="1">
         <host>172.17.0.2</host>
         <port>2181</port>
      </node>
   </zookeeper>

   <remote_servers>
      <mycluster_1>
         <shard_1>
            <internal_replication>true</internal_replication>
            <replica>
               <host>s1</host>
               <port>9000</port>
            </replica>
            <replica>
               <host>s2</host>
               <port>9001</port>
            </replica>
         </shard_1>
      </mycluster_1>
   </remote_servers>

   <macros>
      <cluster>mycluster_1</cluster>
      <shard>1</shard>
      <replica>s2</replica>
   </macros>

   <tcp_port>9102</tcp_port>
   <interserver_http_port>9010</interserver_http_port>
   <path>/cluster/d2/datas/</path>

1.3 Create a test table

CREATE TABLE default.rtest1 ON CLUSTER 'mycluster_1'
(
    `id` Int64,
    `p` Int16
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/replicated/test', '{replica}')
PARTITION BY p
ORDER BY id
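
The `'{replica}'` placeholder in the engine arguments is expanded from the `<macros>` section of each server's config.xml, which is why both replicas can run the identical DDL yet register under different names. A minimal Python sketch of that substitution, using the macro values from the two configs above (illustrative only, not ClickHouse code):

```python
# Sketch: how ClickHouse expands {macro} placeholders in ReplicatedMergeTree
# engine arguments from each replica's <macros> config section.

def expand_macros(template: str, macros: dict) -> str:
    """Replace every {name} placeholder with its value from `macros`."""
    for name, value in macros.items():
        template = template.replace("{" + name + "}", value)
    return template

# Macro values from replica-1 and replica-2 config.xml above.
replica1 = {"cluster": "mycluster_1", "shard": "1", "replica": "s1"}
replica2 = {"cluster": "mycluster_1", "shard": "1", "replica": "s2"}

# Both replicas share the same zookeeper table path, but each resolves
# '{replica}' to its own name, which becomes its node under .../replicas.
print(expand_macros("{replica}", replica1))  # s1
print(expand_macros("{replica}", replica2))  # s2
```

Because the shard and replica are parameterized this way, the same `CREATE TABLE ... ON CLUSTER` statement works unchanged on every node.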

1.4 View zookeeper

docker exec -it some-zookeeper bash
./bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 17] ls /clickhouse/tables/replicated/test/replicas
[s1, s2]

Both replicas have been registered to zookeeper.

2. Synchronization principle 

If a write is performed on replica-1:

replica-1> INSERT INTO rtest1 VALUES (33, 33);

How is the data synchronized to replica-2?

s1.  replica-1> StorageReplicatedMergeTree::write --> ReplicatedMergeTreeBlockOutputStream::write(const Block & block)
s2.  replica-1> storage.writer.writeTempPart, write the part to local disk
s3.  replica-1> ReplicatedMergeTreeBlockOutputStream::commitPart
s4.  replica-1> StorageReplicatedMergeTree::getCommitPartOp, commit a LogEntry to zookeeper; it includes:
    ReplicatedMergeTreeLogEntry {
     type: GET_PART,
     source_replica: replica-1,
     new_part_name: part->name,
     new_part_type: part->getType
    }
s5.  replica-1> zkutil::makeCreateRequest(zookeeper_path + "/log/log-0000000022"), update the log in zookeeper

s6.  replica-2> StorageReplicatedMergeTree::queueUpdatingTask(), a periodically scheduled pull task
s7.  replica-2> ReplicatedMergeTreeQueue::pullLogsToQueue, pull the logs
s8.  replica-2> zookeeper->get(replica_path + "/log_pointer"), read from zookeeper the offset this replica has already synchronized to
s9.  replica-2> zookeeper->getChildrenWatch(zookeeper_path + "/log"), fetch all LogEntry records from zookeeper
s10. replica-2> based on the log_pointer offset, filter out of all LogEntries the ones that still need to be synchronized, and write them to the queue
s11. replica-2> StorageReplicatedMergeTree::queueTask, consume tasks from the queue
s12. replica-2> StorageReplicatedMergeTree::executeLogEntry(LogEntry & entry), dispatch on the LogEntry type
s13. replica-2> StorageReplicatedMergeTree::executeFetch(LogEntry & entry)
s14. replica-2> StorageReplicatedMergeTree::fetchPart, download the part directory from replica-1 via its interserver_http_port
s15. replica-2> MergeTreeData::renameTempPartAndReplace, move the files into place locally and update the in-memory meta information
s16. replica-2> data synchronization is complete
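
The consumer side of steps s6-s10 can be modeled as a simple pull loop: the replica reads its own log_pointer, lists the shared /log entries, and queues every entry it has not yet processed. A toy Python sketch of that model (all names and data here are illustrative, not the actual ClickHouse or zookeeper API):

```python
# Toy model of pull-based replication (steps s6-s10 above).
# shared_log stands in for the znodes under zookeeper_path + "/log",
# written by whichever replica commits a part.
shared_log = {
    "log-0000000021": {"type": "GET_PART", "source_replica": "s1", "part": "33_1_1_0"},
    "log-0000000022": {"type": "GET_PART", "source_replica": "s1", "part": "33_2_2_0"},
}

def pull_logs_to_queue(log_pointer: int, queue: list) -> int:
    """Queue every LogEntry whose index >= this replica's log_pointer,
    and return the advanced pointer (which ClickHouse persists back to
    replica_path + "/log_pointer")."""
    for name in sorted(shared_log):
        index = int(name.split("-")[1])
        if index >= log_pointer:
            queue.append(shared_log[name])
            log_pointer = index + 1
    return log_pointer

queue: list = []
pointer = pull_logs_to_queue(22, queue)  # already synced up to log-0000000021
# queue now holds only the entry for part 33_2_2_0; a queueTask-style
# consumer would then fetch that part from the source replica over
# interserver_http_port (steps s11-s15).
```

The key design point is that replicas never push data to each other: each replica independently subscribes to the shared log and pulls what it is missing.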

You can also go into the zookeeper docker container and inspect a LogEntry directly:

[zk: localhost:2181(CONNECTED) 85] get /clickhouse/tables/replicated/test/log/log-0000000022
format version: 4
create_time: 2020-09-13 16:39:05
source replica: s1
block_id: 33_2673203974107464807_7670041793554220344
get
33_2_2_0
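
The entry is stored as plain text. A rough sketch of pulling the fields out of the dump above, with the layout inferred from this single sample rather than from the ClickHouse source:

```python
# Parse the textual LogEntry dump shown above into a dict.
# The field layout is assumed from this one sample only.

def parse_log_entry(raw: str) -> dict:
    entry = {}
    bare_lines = []
    for line in raw.strip().splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)  # e.g. "source replica: s1"
            entry[key] = value
        else:
            bare_lines.append(line)
    # The trailing bare lines are the action ("get") and the part name.
    entry["action"], entry["part_name"] = bare_lines[-2], bare_lines[-1]
    return entry

raw = """format version: 4
create_time: 2020-09-13 16:39:05
source replica: s1
block_id: 33_2673203974107464807_7670041793554220344
get
33_2_2_0"""

entry = parse_log_entry(raw)
print(entry["action"], entry["part_name"])  # get 33_2_2_0
```

The "get" action and part name correspond to the GET_PART LogEntry that replica-2 consumes in step s12.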

3. Summary

Taking a write (INSERT) as the example, this article analyzed how ClickHouse's ReplicatedMergeTree works under the hood; the logic is not complicated.

Synchronizing data between replicas requires zookeeper for metadata coordination (someone in the community is currently working on an etcd integration, pr#10376 (https://github.com/ClickHouse/ClickHouse/pull/10376)). It is a subscription/consumption model; the actual part data still has to be downloaded from the corresponding replica through its interserver_http_port.

Replica synchronization is based on file directories, which brings an advantage: we can easily implement separation of storage and compute for ClickHouse. Multiple clickhouse-servers can mount the same data and compute on it at the same time, and each of these server nodes is writable. Brother Hu has implemented a working prototype; for details, see the next article, <Storage and Computing Separation Scheme and Implementation>.

4. Reference

[1] StorageReplicatedMergeTree.cpp (https://github.com/ClickHouse/ClickHouse/blob/f37814b36754bf11b52bd9c77d0e15f4d1825033/src/Storages/StorageReplicatedMergeTree.cpp)

[2] ReplicatedMergeTreeBlockOutputStream.cpp (https://github.com/ClickHouse/ClickHouse/blob/f37814b36754bf11b52bd9c77d0e15f4d1825033/src/Storages/MergeTree/ReplicatedMut)

[3] ReplicatedMergeTreeLogEntry.cpp (https://github.com/ClickHouse/ClickHouse/blob/f37814b36754bf11b52bd9c77d0e15f4d1825033/src/Storages/MergeTree/ReplicatedMergeT)

[4] ReplicatedMergeTreeQueue.cpp (https://github.com/ClickHouse/ClickHouse/blob/f37814b36754bf11b52bd9c77d0e15f4d1825033/src/Storages/MergeTree/ReplicatedMergeTree)

The end.

Enjoy ClickHouse :)
