Features and Usage of Replicas

Abstract

Replicas are the foundation of the distributed and high-availability characteristics of the Yunxi database. This article walks through the entire process of replica changes. It first explains how replicas are classified, and introduces in detail the features of the replica types that Yunxi adds on top of the raft algorithm. Configure zone is the main means of controlling the number, type and distribution of replicas; we cover how to use it, from syntax to fields to features. Through replica addition, removal and rebalancing, the number, type and distribution of replicas are brought into line with the configure zone as far as possible, improving the performance and resource utilization of the whole database. Finally, the article discusses how Yunxi manages data sharding: how range splits and merges are decided and carried out.

Part 1 - Types of Replicas

 

1. Replica classification

                        

Figure 1. How replicas are categorized

(1) Replicas are divided into voters and non-voters according to whether they have voting rights (note: voting rights cover both raft leader election votes and raft log commit votes).

(2) Voters are divided into universal replicas (Universal) and log-only replicas (Logonly) according to whether they store user data.

(3) Non-voters currently include the columnar replica (Column), which uses a columnar storage engine to provide consistent read service.

(4) Voters are further divided into strong-sync replicas (Synchronizer) and ordinary voters according to whether they hold a one-vote veto over log commits.

 

2. Strong-sync replica features

(1) Any voter replica can be configured as strong-sync.

(2) It holds a one-vote veto when logs are committed.

Raft's original commit rule: a log entry commits once more than half of the replicas have acknowledged it.

With strong-sync replicas introduced, the new commit rule requires acknowledgement from more than half of the replicas AND from every strong-sync replica.

If a strong-sync replica fails, writes are blocked until the failure is identified. Once the failure is identified, the previously blocked writes commit successfully and nothing needs to be rolled back.

This feature can be used for synchronous writes across regions.
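
To make the rule concrete, here is a minimal sketch of the commit check in Go, under the assumption of hypothetical data structures (it is not the actual Yunxi implementation):

package sketch

// canCommit reports whether a raft log entry may commit under the new rule:
// it needs acks from more than half of all voters AND from every strong-sync
// replica that has not been identified as failed.
func canCommit(acked map[int]bool, voters []int, strongSync, failed map[int]bool) bool {
	n := 0
	for _, id := range voters {
		if acked[id] {
			n++
		}
	}
	if 2*n <= len(voters) {
		return false // no majority yet
	}
	for id := range strongSync {
		if !failed[id] && !acked[id] {
			return false // a healthy strong-sync replica still withholds its vote
		}
	}
	return true
}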

(3) Failure identification and recovery

Failure identification of strong-sync replicas is based on raft heartbeats. When heartbeats sent to a strong-sync replica go unanswered for the strong-sync timeout period, the leader marks the strong-sync replica as failed, and log commits ignore the failed replica's veto.

The node hosting the leader writes a log line stating which store, which table and which range hit the strong-sync failure, and records a strong-sync failure event naming the failed store (viewable in AdminUI > Metrics > Events, or via select * from system.eventlog;).

Figure 2. Strong sync failure log

After the strong-sync replica resumes work, its strong-sync marker is restored, and corresponding logs and events are generated as well.

The strong-sync timeout defaults to 100 (in ticks; equivalent to 20s, i.e. one tick is 200ms). It is user-configurable and must be greater than 5 ticks, the heartbeat interval.

Use case: set cluster setting raft.synchronizer_timeout_ticks = 50;
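
The detection itself is tick-driven. The sketch below shows the idea with hypothetical names (not the real code): the leader counts ticks since the last heartbeat reply from each strong-sync replica and suspends its veto once the timeout is reached.

package sketch

type syncState struct {
	ticksSinceReply map[int]int // ticks since last heartbeat reply, per strong-sync replica
	failed          map[int]bool
}

// tick runs once per raft tick on the leader; timeoutTicks corresponds to
// raft.synchronizer_timeout_ticks (default 100 ticks, i.e. 20s).
func (s *syncState) tick(timeoutTicks int) {
	for id := range s.ticksSinceReply {
		s.ticksSinceReply[id]++
		if s.ticksSinceReply[id] >= timeoutTicks {
			s.failed[id] = true // the veto is ignored until the replica recovers
		}
	}
}

// onHeartbeatResp restores the strong-sync marker once the replica replies.
func (s *syncState) onHeartbeatResp(id int) {
	s.ticksSinceReply[id] = 0
	s.failed[id] = false
}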

 

3. Log-only replica features

(1) Stores only raft logs, not user data.

(2) Has voting rights and can be elected raft leader, but cannot send snapshots.

(3) Cannot be the lease holder.

Because a log-only replica can never hold the lease, the following is guaranteed:

● Data is never read from or written through a log-only replica;

● Since the leader tries to stay colocated with the lease holder, when a log-only replica is elected leader, as long as there is a universal replica with up-to-date data, leadership is immediately transferred to that universal replica.

(4) Log retention on log-only replicas

● The original raft log truncation strategy:

Step 1. Compute the maximum index of truncatable logs.

(i) Ensure that the logs to be truncated have been appended to a majority of replicas.

(ii) For live replicas, ensure the logs have been appended to every one of them, to avoid having to send snapshots.

(iii) When a dead replica exists, protect that replica's last index until the log size exceeds 4MB.

(iv) Protect the index of any pending snapshot.

Step 2. If the number of truncatable log indexes is at least 100, or the number is greater than 0 and the actual log size is at least 64KB, issue a TruncateLogRequest; each replica applies the truncateState immediately.

Figure 3. The truncateState type, containing index and term fields
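
A sketch of the Step 2 trigger, with the thresholds taken from the text above (names hypothetical):

package sketch

// shouldTruncate mirrors Step 2: truncate when at least 100 log indexes can
// go, or when any can go and they occupy at least 64KB on disk.
func shouldTruncate(truncatableIndexes int, truncatableBytes int64) bool {
	const maxIndexes = 100
	const maxBytes = 64 << 10 // 64KB
	return truncatableIndexes >= maxIndexes ||
		(truncatableIndexes > 0 && truncatableBytes >= maxBytes)
}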

 

● The new strategy:

The conditions for issuing a TruncateLogRequest remain unchanged.

A log-only replica intercepts the truncateState rather than applying it right away, merging the intercepted records by the hour; it then scans the intercepted records, finds the latest record that has reached the log-only log cleanup time (user-configurable, minimum -1), applies it and clears the records.

Figure 4. Log cleanup logic for log replicas

● The log cleanup time of a log-only replica defaults to 24, meaning the logs on a log-only replica are actually retained for at least 24 hours.

● set cluster setting raft.logonly_truncate_hours = -1; disables the log-retention feature of log-only replicas.
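
A minimal sketch of this deferred truncation, under hypothetical structures (the real code differs):

package sketch

import "time"

type truncateState struct {
	index, term uint64
	received    time.Time
}

type logonlyReplica struct {
	pending []truncateState // intercepted truncateStates, merged per hour
}

// intercept keeps only the newest truncateState per hour instead of applying it.
func (r *logonlyReplica) intercept(ts truncateState) {
	if n := len(r.pending); n > 0 &&
		r.pending[n-1].received.Truncate(time.Hour).Equal(ts.received.Truncate(time.Hour)) {
		r.pending[n-1] = ts // merge records within the same hour
		return
	}
	r.pending = append(r.pending, ts)
}

// maybeApply applies the latest record that has reached the retention time
// (raft.logonly_truncate_hours); -1 disables retention, so records apply at once.
func (r *logonlyReplica) maybeApply(retentionHours int, apply func(truncateState)) {
	if len(r.pending) == 0 {
		return
	}
	last := -1
	if retentionHours < 0 {
		last = len(r.pending) - 1
	} else {
		cutoff := time.Now().Add(-time.Duration(retentionHours) * time.Hour)
		for i, ts := range r.pending {
			if ts.received.Before(cutoff) {
				last = i
			}
		}
	}
	if last >= 0 {
		apply(r.pending[last])
		r.pending = r.pending[last+1:]
	}
}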

 

(5) Off-site restart of a log-only replica

● Deployment mode: two data centers, three replicas (one universal, one strong-sync, one log-only). The universal replica and the strong-sync replica are placed on high-spec machines (or the majority of machines) in DC1 and DC2 respectively; the log-only replica is placed on a low-spec machine (or a small number of machines) in DC1 or DC2, and its log increments are replicated to a low-spec machine (or a small number of machines) in the other data center.

Figure 5. Deployment mode for off-site restart of a log-only replica

● Disaster recovery: when a data-center-level failure loses two replicas (one universal and one log-only), the node storing the log-only replica is manually restarted in the other data center using the log data obtained through incremental replication, and cluster availability is restored.

Figure 6. Disaster recovery by restarting a log-only replica off-site

4. Non-voter replica features

(1) Stores raft logs and user data.

(2) Has no voting rights; cannot be elected raft leader or become the lease holder.

(3) Provides the learner-read consistent read service.

System parameters for tuning learner reads:

kv.closed_timestamp.target_duration: the interval at which the closed timestamp is updated; default 3s.

kv.closed_timestamp.learner_read_retries: the number of retries after a learner read fails; default 8.
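
For example, to adjust them (illustrative values, using the set cluster setting form shown elsewhere in this article):

set cluster setting kv.closed_timestamp.target_duration = '1s';
set cluster setting kv.closed_timestamp.learner_read_retries = 10;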

 

Part 2 - Configure Zone Usage

 

1. The minimum data sharding unit of the Yunxi database is the range; range size, GC time and replica distribution are all controlled through the configure zone.

2. Configure zone syntax

Figure 7. Alter Configure Zone syntax diagram

● object_name: the name of a database, table, index, partition or range.

● COPY FROM PARENT: copies the value of this field from the parent object.

● DISCARD: clears the configure zone.
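
A few usage sketches, assuming statement forms consistent with the syntax diagram in Figure 7 and the fields described below (object and field values are illustrative):

alter table t configure zone using num_replicas = 3;
alter table t configure zone using num_replicas = COPY FROM PARENT;
alter table t configure zone discard;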

 

3. Replica-related fields

(1) num_replicas: the number of universal replicas.

(2) num_logonlys: the number of log-only replicas.

(3) num_columns: the number of columnar replicas.

(4) constraints: a list of constraints on universal replicas, used together with node startup parameters such as locality and attrs. The default is empty, in which case replicas are distributed evenly at random.

● Startup parameter format:

--locality=key1=value1,key2=value2,…

--store=storedir,attrs=attr1:attr2:…

● Two formats for constraints:

(i) constraints='[+(-)key1=value1,+(-)key2=value2,…,+(-)attr1,+(-)attr2,…]'

Use case: '[+region=US,-dc=dc1,+ssd]' places all universal replicas on stores with region=US, dc≠dc1 and the ssd attribute.

(ii) constraints='{"+region=west,+raid": 2, "+region=east,+dc=dc3,+ssd": 1}' places 2 replicas on stores with region=west and the raid attribute, and 1 replica on stores with region=east, dc=dc3 and the ssd attribute.

(5) logonly_constraints: a list of constraints on log-only replicas; usage is the same as constraints.

(6) column_constraints: a list of constraints on columnar replicas; usage is the same as constraints.

(7) lease_preferences: a list of lease holder preference constraints.

Format: '[[+(-)key1=value1,…,+(-)attr1,…],[+(-)key1=value2,…,+(-)attr2,…],…]'

Use case: '[[+region=east],[+region=west,-dc=dc3,+ssd]]' means: if a universal replica exists on a store with region=east, one of those replicas becomes the lease holder; otherwise, if universal replicas exist on stores with region=west, dc≠dc3 and ssd, one of those is selected; if no universal replica satisfies either constraint, the lease holder is assigned automatically.

(8) strong_synchronizations: strong-sync replica constraints; the format is the same as lease_preferences, the difference being that every replica matching the constraint list is configured as a strong-sync replica.

Use case: strong_synchronizations='[[+region=east],[+region=west,-dc=dc3,+ssd]]'

configures the replicas on stores with region=east, and on stores with region=west, dc≠dc3 and ssd, as strong-sync replicas;

strong_synchronizations='[[+region=east]]' cancels one of the two strong-sync configurations;

strong_synchronizations='[]' cancels all strong-sync configurations.
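
Putting the fields together, a sketch of a complete configuration (hypothetical table and localities; statement form as in Figure 7):

alter table orders configure zone using
    num_replicas = 3,
    num_logonlys = 1,
    constraints = '{"+region=east": 2, "+region=west": 1}',
    logonly_constraints = '[+region=west]',
    lease_preferences = '[[+region=east]]',
    strong_synchronizations = '[[+region=west]]';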

 

4. Configure zone features

(1) Default inheritance: any configure zone that is not set, or has been cleared, inherits its parent's configure zone by default; range default, which has no parent, is the root of the hierarchy.

 

Figure 8. Example of Configure zone inheritance relationship

(2) Incremental modification: fields included in an alter configure zone statement are modified; fields not included remain unchanged.

(3) Replica change detection: ReplicateQueue, which controls replica changes, scans all ranges and checks one by one whether replica changes are needed; in addition, when a modification to the replica-related fields of a range's configure zone is detected, replica changes for that range are triggered directly, so the changes take effect faster.

(4) Ordered configuration: the voter replica counts and the constraint-related fields must be configured following the topological order shown in the figure below.

 

Figure 9. Topological order of Configure zone configuration

 

(5) Error checking:

● The minimum value of num_replicas is 1.

● num_replicas must be greater than num_logonlys.

● num_replicas + num_logonlys must equal 1 or be greater than or equal to 3.

● The total number of replicas pinned by constraints cannot exceed num_replicas, and the total pinned by logonly_constraints cannot exceed num_logonlys.

● Every constraint in strong_synchronizations must also appear in constraints or logonly_constraints.

● Every key=value or attr appearing in a constraint must match at least one node/store.

● range_min_bytes and range_max_bytes must be configured at the same time.
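
For example (hypothetical table; statement form as in Figure 7):

alter table t configure zone using num_replicas = 3, num_logonlys = 2;
-- accepted: 3 > 2 and 3 + 2 >= 3

alter table t configure zone using num_replicas = 2, num_logonlys = 2;
-- rejected: num_replicas is not greater than num_logonlys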

(6) Associated inheritance: to prevent a configure zone from bypassing error checking through inheritance and causing inconsistency, the fields num_replicas, num_logonlys, constraints, logonly_constraints and strong_synchronizations are treated as a group: only when all of them are empty are they inherited, together, from the parent.

 

5. Some special ranges and their configure zone defaults

● range default: the default configuration; it does not correspond to an actual range. 3 replicas, GC time 90000s.

● range meta: metadata, rangeID=1. Default 5 replicas, GC time 3600s.

● range liveness: node liveness status, rangeID=2. Default 5 replicas, GC time 600s.

● range system: system ranges, rangeID=3 and 5. Default 5 replicas, GC time 90000s.

● range timeseries: time series data, rangeID=4. Not configured by default.

● database system: the system database. Default 5 replicas, GC time 90000s.

● table system.jobs: the jobs table. Default 5 replicas, GC time 600s.

 

6. Viewing configured configure zones

Figure 10. Syntax diagram for viewing configure zones

 

● object_name: including database, table, index, partition and range.

● show all zone configurations; lists every configured configure zone.
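
Usage sketches (hypothetical object names; the exact forms are given in Figure 10):

show all zone configurations;
show zone configuration for table t;
show zone configuration for database db1;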

 

7. Querying replica distribution and whether the configure zone has taken effect

(1)show experimental_ranges from table table_name;

Figure 11. View the range of the specified table

 

(2)select * from zbdb_internal.ranges;

(Note: this statement also returns ranges that have been deleted but not yet garbage collected.)

Figure 12. View all range cases

 

Part 3 - Replica Changes

 

1. Replica changes are performed by the replicateQueue

(1) From the existing replicas, derive have (the current replica count), deadReplicas (dead replicas) and decommissionReplicas (replicas being decommissioned).

(2) From the number of available nodes and the configure zone, compute need (the required replica count).

(3) From have, need, deadReplicas and decommissionReplicas, decide which change to make: add, remove, or rebalance (a sketch follows this list).

Figure 13. Decision tree calculating which replica changes should be made

(4) When an odd number of voters is configured, ending up with an even number of voters is avoided.

(5) Replica changes cannot be made when fewer than half of the voters survive.
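
A simplified sketch of the decision in steps (1)-(3), with hypothetical names (see Figure 13 for the full decision tree):

package sketch

type action int

const (
	noop action = iota
	addReplica
	removeDead
	removeDecommissioning
	removeExtra
	considerRebalance
)

// decide repairs under-replication first, then removes dead and
// decommissioning replicas, then extras, and finally considers rebalancing.
func decide(have, need, dead, decommissioning int) action {
	switch {
	case have < need:
		return addReplica
	case dead > 0:
		return removeDead
	case decommissioning > 0:
		return removeDecommissioning
	case have > need:
		return removeExtra
	default:
		return considerRebalance
	}
}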

 

2. Replica rebalancing mechanism: check whether a better location exists for a new replica that would replace an existing one. The comparison criteria, in priority order, are as follows (a sketch follows the list):

(1) Whether the replica location satisfies the constraints

(2) Remaining disk space

(3) Whether the replica is necessary

(4) Diversity score: the more spread out the replicas, the higher the score

(5) Convergence score: the closer the range count moves toward the average, the higher the score

(6) Balance score: computed from range count, disk usage and QPS

(7) Range count: fewer is better
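
A sketch of the lexicographic comparison (hypothetical fields; the real allocator weighs more state):

package sketch

// candidate scores a potential store for a replica; the fields mirror the
// priority list above.
type candidate struct {
	valid       bool    // (1) satisfies the constraints
	fullDisk    bool    // (2) disk nearly full
	necessary   bool    // (3) the replica is necessary here
	diversity   float64 // (4) higher = replicas more spread out
	convergence float64 // (5) higher = range count closer to the average
	balance     float64 // (6) from range count, disk usage and QPS
	rangeCount  int     // (7) fewer is better
}

// better compares two candidates criterion by criterion, in priority order.
func better(a, b candidate) bool {
	if a.valid != b.valid {
		return a.valid
	}
	if a.fullDisk != b.fullDisk {
		return !a.fullDisk
	}
	if a.necessary != b.necessary {
		return a.necessary
	}
	if a.diversity != b.diversity {
		return a.diversity > b.diversity
	}
	if a.convergence != b.convergence {
		return a.convergence > b.convergence
	}
	if a.balance != b.balance {
		return a.balance > b.balance
	}
	return a.rangeCount < b.rangeCount
}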

 

3. Addition, removal and rebalancing across replica types

(1) Replicas cannot be converted directly between the universal, log-only and columnar types.

(2) When the configured replica counts/constraints for multiple types are not satisfied, and one type of replica occupies the slot of another type, an additional spare node is required to complete the additions, removals and rebalancing.

(3) When nodes are insufficient, universal replicas take priority.

 

Part 4 - Range Split and Merge

 

1. Conditions that trigger Range splitting

(1) A new database, table, etc. is created.

(2) The range size exceeds range_max_bytes.

(3) Range QPS is too high: when QPS exceeds kv.range_split.load_qps_threshold (default 250, configurable), the range becomes a split candidate.

(4) The configure zone of an index or partition is modified so that it becomes independent of its parent.

(5) When a large amount of data is imported, it is automatically split into multiple ranges.

(6) During data import, empty ranges are pre-split for data that may be imported later.

(7) Manual split: alter table table_name split at values (key1, key2, …)

values takes primary key values; for a composite primary key, multiple values can be written, but no more than the number of primary key columns.
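
Use case (hypothetical table t whose composite primary key is (id, ts)):

alter table t split at values (100), (200);
-- splits at id=100 and id=200

alter table t split at values (300, '2021-01-01');
-- splits at (id, ts) = (300, '2021-01-01')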

 

 2. The impact of Range splitting on replicas

(1) If the split occurs within a table and is not caused by modifying the configure zone of an index or partition: the newly split range is initialized with the same replica count, locations, types and strong-sync configuration as before.

(2) Otherwise: the newly split range is initialized with the same number and locations of voter replicas, but all of them become universal replicas, with no non-voter replicas and no strong-sync markers. If necessary, replica changes are then made according to the configure zone that applies to this range.

 

3. Conditions for Range merging

Figure 14. Conditions that need to be met for range merging

Merging can be disabled with the following command:

SET CLUSTER SETTING kv.range_merge.queue_enabled = false;
