6. Replication and Sharding of ClickHouse

10. Replicas and shards

10.1 Replicas and shards

1. Distinction at the data level:

Assume that a ClickHouse cluster has two nodes, host1 and host2, and that both nodes have a table with the same structure.

If the data in host1's table differs from the data in host2's table, the two tables are shards of each other;

If the data in host1's table is identical to the data in host2's table, the two tables are replicas of each other.

So, apart from the difference in table engines, from a purely data perspective there is sometimes only a thin line between replicas and shards.

2. Distinction at the functional level:

Replica: prevents data loss and increases the redundancy of data storage;

Shard: implements horizontal partitioning of data and improves the performance of data writes and reads.


10.2 Data replicas

Adding the prefix Replicated in front of MergeTree yields a new variant engine, the ReplicatedMergeTree replicated table.


Only tables that use the ReplicatedMergeTree family of engines gain the ability to replicate; in other words, a data table that uses ReplicatedMergeTree is itself a replica.

ReplicatedMergeTree is a derivative engine of MergeTree that adds distributed collaboration capabilities on top of MergeTree.


In MergeTree, a data partition passes through two types of storage areas from creation to completion:
(1) Memory: the data is first written into a memory buffer.
(2) Local disk: the data is then written to a tmp temporary directory partition, and once everything is complete the temporary directory is renamed to the official partition.

ReplicatedMergeTree adds a ZooKeeper part on top of this: it creates a series of listener nodes in ZooKeeper and uses them to communicate between multiple instances. Throughout this communication, ZooKeeper is never involved in transmitting the table data itself.

Characteristics of replicas

As the main implementation carrier of data replicas, ReplicatedMergeTree has some notable design features.
❑ Reliance on ZooKeeper: when executing INSERT and ALTER queries, ReplicatedMergeTree uses ZooKeeper's distributed collaboration capabilities to synchronize the replicas; ZooKeeper is not needed, however, when querying a replica. More on this will be covered in detail later.
❑ Table-level replicas: replicas are defined at the table level, so each table's replica configuration can be customized to its actual needs, including the number of replicas and where they are placed in the cluster.
❑ Multi master: INSERT and ALTER queries can be executed on any replica with the same effect. These operations are distributed to every replica and executed locally with the help of ZooKeeper's collaboration capabilities.
❑ Block data blocks: when an INSERT command writes data, the data is split into several Block data blocks according to max_insert_block_size (1048576 rows by default). The Block data block is therefore the basic unit of writing, and writes have atomicity and uniqueness at the Block level (see the query sketch after this list).
❑ Atomicity: when data is written, either all the data in a Block is written successfully or all of it fails.
❑ Uniqueness: when a Block data block is written, a hash digest is computed and recorded from the data order, data rows and data size of the current Block. If a Block to be written later has the same hash digest as one already written (that is, the same data order, data size and data rows), that Block is ignored. This design prevents a Block from being written repeatedly due to abnormal situations such as retries. If these characteristics do not seem intuitive from the descriptions alone, do not worry; they will be expanded step by step with concrete examples below.
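As a quick way to see the block size that drives this splitting behavior, the current value of max_insert_block_size can be read from the standard system.settings table; a minimal sketch:

-- Inspect the Block size used when splitting INSERT data (1048576 rows by default)
SELECT name, value
FROM system.settings
WHERE name = 'max_insert_block_size';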

How to configure ZooKeeper

ClickHouse uses a set of zookeeper tags to define the related configuration. By default they can be defined in the global configuration file config.xml. However, since each replica usually uses the same ZooKeeper configuration, it is more common to extract this part of the configuration into a separate file, which makes it easier to copy the configuration between nodes.

​ First, create a configuration file named metrika.xml in the /etc/clickhouse-server/config.d directory of the server:

<yandex>
  <zookeeper-servers>
    <node index="1">
      <host>10.0.0.31</host>
      <port>2181</port>
    </node>
    <node index="2">
      <host>10.0.0.32</host>
      <port>2181</port>
    </node>
    <node index="3">
      <host>10.0.0.33</host>
      <port>2181</port>
    </node>
  </zookeeper-servers>
</yandex>

Next, in the global configuration config.xml, use the <include_from> tag to import the file just defined, and reference the ZooKeeper configuration through a zookeeper element with an incl attribute:

<!-- If element has 'incl' attribute, then for it's value will be used corresponding substitution from another file.
     By default, path to file with substitutions is /etc/metrika.xml. It could be changed in config in 'include_from' element.
     Values for substitutions are specified in /yandex/name_of_substitution elements in that file.
  -->

  <include_from>/etc/clickhouse-server/config.d/metrika.xml</include_from>
  <zookeeper incl="zookeeper-servers" optional="true" />

Note that the value of the incl attribute and the node name in the metrika.xml configuration file must correspond to each other. At this point, the entire configuration process is complete.

ClickHouse provides a proxy table named zookeeper among its system tables. Through this table, SQL queries can be used to read data from the remote ZooKeeper. One thing to note is that the SQL statement must specify a path condition, for example querying the root path:

SELECT * FROM system.zookeeper where path = '/';
SELECT name,value,czxid,mzxid FROM system.zookeeper where path = '/clickhouse';

The definition form of replicas

Using replicas increases the redundant storage of data and therefore reduces the risk of data loss; in addition, because replicas adopt a multi-master architecture, every replica instance can serve as an entry point for reads and writes, which spreads the load across nodes.

When using replicas there is no need to rely on any cluster configuration; ReplicatedMergeTree combined with ZooKeeper can do all the work.

The definition of ReplicatedMergeTree is as follows:

ENGINE = ReplicatedMergeTree('zk_path', 'replica_name')

zk_path specifies the path in ZooKeeper under which the data table is created. The path name is user-defined and there is no fixed rule; users can set any path they like. Even so, ClickHouse provides some configuration templates for reference, for example:

/clickhouse/tables/{shard}/table_name

Among them:
❑ /clickhouse/tables/ is a fixed path prefix established by convention, indicating the root path for data tables.
❑ {shard} represents the shard number, usually replaced by a numeric value such as 01, 02, 03. A data table can have multiple shards, and each shard has its own replicas.
❑ table_name represents the name of the data table. For ease of maintenance it is usually the same as the physical table name (although ClickHouse does not require the table name in the path to match the physical table name).

replica_name defines the name of the replica in ZooKeeper; it is the unique identifier that distinguishes different replica instances. A common naming convention is to use the domain name of the server the replica runs on.

For zk_path, different replicas of the same shard of the same data table should define the same path; for replica_name, different replicas of the same shard of the same data table should define different names.

Node structure in ZooKeeper

ReplicatedMergeTree relies on ZooKeeper's event-listening mechanism to achieve collaboration between replicas. Therefore, when each ReplicatedMergeTree table is created, a set of listener nodes is created for it in ZooKeeper under the root path zk_path. By function, these nodes can be roughly divided into the following categories (a query sketch for inspecting them follows the list):

(1) Metadata:
❑ /metadata: saves metadata information, including the primary key, partition key, sampling expression, etc.
❑ /columns: saves column information, including column names and data types.
❑ /replicas: saves the names of the replicas, corresponding to the replica_name parameter in the engine definition.

(2) Judgment flags:
❑ /leader_election: used to elect the leader replica, which coordinates MERGE and MUTATION operations (ALTER DELETE and ALTER UPDATE). After the leader replica completes these tasks, ZooKeeper is used to distribute message events to the other replicas.
❑ /blocks: records the hash digests of Block data blocks and the corresponding partition_id. The hash digest is used to decide whether a Block is a duplicate; the partition_id is used to find the data partition that needs to be synchronized.
❑ /block_numbers: records partition_id values in the order the partitions were written. When each replica performs a local MERGE, it follows the same block_numbers order.
❑ /quorum: records the quorum count. A write operation is considered successful only after at least the quorum number of replicas have written it successfully. The quorum count is controlled by the insert_quorum parameter, whose default value is 0.

(3) Operation logs:
❑ /log: the regular operation log node (INSERT, MERGE and DROP PARTITION). It is the most important part of the whole working mechanism and stores the task instructions that the replicas need to execute. /log uses ZooKeeper persistent sequential nodes, and each instruction is named with the log- prefix, for example log-0000000000, log-0000000001, and so on. Every replica instance listens on the /log node; when a new instruction is added, each replica adds the instruction to its own task queue and executes it. The execution logic will be expanded further below.

❑ /mutations: the MUTATION operation log node, similar in function to /log. When ALTER DELETE and ALTER UPDATE queries are executed, operation instructions are added to this node. /mutations also uses ZooKeeper persistent sequential nodes, but the names carry no prefix; each instruction is stored directly as an increasing number, such as 0000000000, 0000000001, and so on. The corresponding execution logic will also be expanded later.

❑ /replicas/{replica_name}/*: a group of listener nodes under each replica's own node, used to guide that replica in executing specific task instructions locally. The more important ones are as follows:
❍ /queue: the task queue node, used to execute specific operation tasks. When the replica picks up an operation instruction from the /log or /mutations node, it adds the execution task to this node and works through it as a queue.
❍ /log_pointer: the log pointer node, which records the subscript of the last log executed; for example log_pointer: 4 corresponds to /log/log-0000000003 (counting from 0).
❍ /mutation_pointer: the mutation pointer node, which records the name of the last mutation log executed; for example mutation_pointer: 0000000000 corresponds to /mutations/0000000000.
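These nodes can be browsed directly from ClickHouse through the system.zookeeper proxy table introduced earlier. A minimal sketch, assuming a table was created with the zk_path /clickhouse/tables/01/table_name (append a child such as /replicas or /log to the path to drill down):

-- List the listener nodes created under a table's zk_path
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/table_name';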

The core execution process of INSERT

When an INSERT query is executed on ReplicatedMergeTree to write data, it enters the core INSERT process described below.


Create the first replica instance
Suppose we start with the CH5 node. After the following statement is executed on CH5, the first replica instance is created:

CREATE TABLE replicated_sales_1 (
    id String,
    price Float64,
    create_time DateTime
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/replicated_sales_1','ch5.nauu.com')
PARTITION BY toYYYYMM(create_time)
ORDER BY id ;

During the creation process, ReplicatedMergeTree performs some initialization operations, such as:

❑ Initializing all the ZooKeeper nodes according to zk_path.
❑ Registering its own replica instance ch5.nauu.com under the /replicas/ node.
❑ Starting the listener task and listening on the /log node.
❑ Participating in the replica election and electing the leader replica. The election works by inserting child nodes into /leader_election/; the first replica to insert successfully becomes the leader.

Then, execute the following statement on the CH6 node to create the second replica instance. The table structure and zk_path must be the same as those of the first replica, and replica_name must be set to the domain name of CH6:

CREATE TABLE replicated_sales_1 (
    id String,
    price Float64,
    create_time DateTime
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/replicated_sales_1','ch6.nauu.com')
PARTITION BY toYYYYMM(create_time)
ORDER BY id ;

During the creation process, the second ReplicatedMergeTree also performs some initialization operations, such as:
❑ Registering its own replica instance ch6.nauu.com under the /replicas/ node.
❑ Starting the listener task and listening on the /log node.
❑ Participating in the replica election. In this example, the CH5 replica becomes the leader (this can be confirmed with the query sketch below).
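One way to check which replica currently holds the leader role is to query the standard system.replicas system table; a minimal sketch for the example table:

-- Show the replica name and leader flag for the example table on each node
SELECT database, table, replica_name, is_leader
FROM system.replicas
WHERE table = 'replicated_sales_1';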

Now try to write data to the first replica, CH5. Execute the following command:

INSERT INTO TABLE replicated_sales_1 VALUES ('A001',100,'2019-05-10 00:00:00');

After the above command is executed, the partition directory 201905_0_0_0 is first written locally.

Then the block_id of this data partition is written to the /blocks node.

The block_id serves as the basis for later deduplication: if the same INSERT statement is executed again, attempting to write duplicate data, the replica detects that the block_id already exists under /blocks and ignores the write.

In other words, a replica automatically ignores data whose block_id it has already recorded.

In addition, if the insert_quorum parameter is set (the default is 0) and insert_quorum >= 2, CH5 will also monitor the number of replicas that have completed the write; the whole write operation is considered successful only when the number of replicas written is greater than or equal to insert_quorum.
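As a sketch of how a quorum write could be requested for this table (the second row of values is hypothetical; insert_quorum is the setting described above and only makes the write succeed once enough replicas acknowledge it):

-- Require at least 2 replicas to acknowledge the write before it is considered successful
SET insert_quorum = 2;
INSERT INTO replicated_sales_1 VALUES ('A002', 200, '2019-05-10 00:00:00');  -- hypothetical example row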

Push the Log from the first replica instance

After the first three steps are completed, the replica that executed the INSERT continues by pushing an operation log to the /log node; in this example the first replica, CH5, takes on this task. The log entry is numbered /log/log-0000000000, and its core LogEntry attributes record, among other things, the source replica, the operation type and the data partition concerned.

From the log content it can be seen that the operation type is get (download) and the partition that needs to be downloaded is 201905_0_0_0. All the other replicas will execute commands in the same order based on the Log.
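The pushed log entries can also be observed through the system.zookeeper proxy table; a minimal sketch, assuming the zk_path used when the table was created:

-- Read the instructions pushed under the table's /log node
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/replicated_sales_1/log';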

The second replica instance pulls the Log

The CH6 replica constantly listens for changes on the /log node. When CH5 pushes /log/log-0000000000, CH6 triggers the log pull task and updates its log_pointer to point to the latest log subscript.


After the LogEntry is pulled, it is not executed immediately; instead it is converted into a task object and placed into the queue.

This is because, in more complex situations, many LogEntry items may be received within the same period of time, so digesting tasks through a queue is the more reasonable design. Note that the pulled LogEntry covers an interval, again because several LogEntry items may arrive in a row.
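The per-replica queue and log pointer can be watched from SQL as well; a sketch using the standard system.replication_queue and system.replicas tables:

-- Pending replication tasks queued on each replica
SELECT replica_name, type, new_part_name
FROM system.replication_queue
WHERE table = 'replicated_sales_1';

-- Log pointer and queue size per replica
SELECT replica_name, log_pointer, queue_size
FROM system.replicas
WHERE table = 'replicated_sales_1';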

The second replica instance initiates a download request to another replica

CH6 starts to execute tasks from its /queue. When it sees that the type is get, ReplicatedMergeTree understands that the data partition has already been written successfully on some remote replica and that it needs to synchronize this data.

The second replica instance on CH6 then selects a remote replica as the download source. The selection algorithm is roughly as follows:

(1) Get the list of all replicas from the /replicas node.

(2) Traverse these replicas and select one of them. The selected replica must have the largest log_pointer subscript and the smallest number of child nodes under /queue. The largest log_pointer means that replica has executed the most log entries and its data should be the most complete; the smallest /queue means that replica currently carries the lightest task load.

In this example, the replica chosen by the algorithm is CH5. CH6 therefore issues an HTTP request to CH5, asking to download the partition 201905_0_0_0.

If the first download request fails, by default CH6 will retry 4 more times, for a total of 5 attempts (controlled by the max_fetch_partition_retries_count parameter, whose default value is 5).
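This retry limit is an ordinary setting, whose name is taken from the paragraph above; a sketch of how its current value can be checked:

-- Inspect the retry limit for the partition download request
SELECT name, value
FROM system.settings
WHERE name = 'max_fetch_partition_retries_count';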

The DataPartsExchange port service on CH5 receives the call request, parses the request parameters, and sends the local partition 201905_0_0_0 back to CH6:

​ Sending part 201905_0_0_0

After the CH6 replica receives the partition data from CH5, it first writes it to a temporary directory:

​ tmp_fetch_201905_0_0_0

After all the data has been received, the temporary directory is renamed to the official partition directory 201905_0_0_0.

​ At this point, the entire writing process is over.
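To double-check on CH6 that the fetched partition has landed as an active data part, the standard system.parts table can be queried; a minimal sketch:

-- Active parts of the example table on the local node
SELECT partition, name, active
FROM system.parts
WHERE table = 'replicated_sales_1' AND active;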

It can be seen that ZooKeeper does not transmit any actual data during the INSERT write process. Following the principle that whoever executes is responsible, in this example CH5 first writes the partition data locally; that same replica is then responsible for pushing the Log and notifying the other replicas to download the data. If insert_quorum is set and insert_quorum >= 2, that replica also monitors the number of replicas that have completed the write. After the other replicas receive the Log, they select the most suitable remote replica and download the partition data from it point-to-point.

ALTER's core execution process

When an ALTER operation is performed on ReplicatedMergeTree to modify metadata, such as adding or deleting table columns, it enters the ALTER part of the logic. The core flow of ALTER is described below.

Compared with the previous processes, the flow of ALTER is much simpler, and its execution does not involve distributing /log entries. The whole process proceeds in chronological order from top to bottom and is roughly divided into 3 steps, which are explained one by one below.

Modify the shared metadata
On the CH6 node, try to add a column to the table by executing an ALTER ADD COLUMN statement.
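The exact statement from the original screenshot is not preserved here; a minimal sketch of such an ALTER, using a hypothetical column name id_float, could look like this:

-- Add a new column on one replica; the change will propagate to the others
ALTER TABLE replicated_sales_1 ADD COLUMN id_float Float64;  -- id_float is a hypothetical example column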

After execution, CH6 modifies the shared metadata node in ZooKeeper.

When the data is modified, the version number of the node is bumped at the same time.

At the same time, CH6 is also responsible for monitoring whether all replicas have completed the modification.

Listen for changes to the shared metadata and perform the modification locally

Both replicas, CH5 and CH6, listen for changes to the shared metadata. When a change is detected, each compares its local metadata version number with the shared version number; in this case they find that the local version is lower than the shared one, so each starts to apply the update locally.
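Once each replica has applied the change locally, the new column should be visible on both CH5 and CH6. A quick check, reusing the hypothetical id_float column from the earlier sketch:

-- Confirm that the new column now exists in the local table definition
SELECT name, type
FROM system.columns
WHERE table = 'replicated_sales_1' AND name = 'id_float';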

Confirm that all replicas have completed the modification

Finally, CH6 confirms that all replicas have completed the modification.

At this point, the entire ALTER process is over.
It can be seen that ZooKeeper does not transmit any actual data during the execution of ALTER either; all ALTER operations are ultimately completed locally by each replica. Following the principle that whoever executes is responsible, in this example CH6 is responsible for modifying the shared metadata and for monitoring the modification progress of each replica.

