Essentials | HBase Replication Explained in Detail

Starting from the big picture, this article explains in detail how HBase Replication and Replication Endpoints work and how they are used in practice.

Replication: Replication refers to continuously copying the same data to multiple places for storage. It is a common and important concept in storage systems: it can refer to replication between a master database and its slaves, to replication between clusters of a distributed system, or to replication among multiple replicas within one distributed system. Its difficulty lies in the fact that data is usually changing constantly; changes must be continuously propagated to all copies while keeping those copies completely consistent.

In general, replicating data into multiple copies brings the following benefits:

  • Multiple copies improve data reliability

  • Replication between master and slave databases, or between active and standby clusters, lets OLTP and OLAP requests be separated

  • Higher availability: even if one copy goes down, the remaining copies can still serve reads and writes

  • Scalability: more read and write requests can be served by adding copies

  • With cross-region replication between data centers, clients can reduce request latency by reading and writing the nearest data center

Replication in HBase refers to replication between active and standby clusters, and is used to copy the primary cluster's write records to the standby cluster. HBase currently supports three kinds of Replication: asynchronous Replication, serial Replication, and synchronous Replication.

Asynchronous Replication

To understand Replication in HBase, you first need to understand HBase's architecture.

An HBase cluster is composed of a group of processes, divided by role into Master and RegionServer. The Master is responsible for DDL operations such as creating and deleting tables, while the RegionServer is responsible for DML operations such as reading and writing data. From the data point of view, a Table in HBase is split by key range into multiple Regions, and different RegionServers serve different Regions to the outside world.

[Figure: HBase cluster architecture (Master, RegionServers, Regions)]
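To make the Master/RegionServer split concrete, here is a minimal sketch using the HBase 2.x Java client API (the table and column names are made up for illustration): the DDL call goes through Admin and is handled by the Master, while the Put goes through Table and is served by the RegionServer hosting the target Region.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class DdlDmlExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      TableName tn = TableName.valueOf("demo_table"); // made-up table name

      // DDL (create table): handled by the Master
      try (Admin admin = conn.getAdmin()) {
        admin.createTable(TableDescriptorBuilder.newBuilder(tn)
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
            .build());
      }

      // DML (write data): routed to the RegionServer hosting the target Region
      try (Table table = conn.getTable(tn)) {
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);
      }
    }
  }
}
```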

A RegionServer is mainly composed of BlockCache, MemStore, and WAL. Note that each Column Family of each Region has its own exclusive MemStore, whereas BlockCache and WAL are shared by multiple Regions. WAL (Write-Ahead Logging) is a common technique in databases: every modification must be persisted to the WAL before being written to the database, so that after a failure, successfully written data can be replayed from the WAL.

[Figure: RegionServer internals (BlockCache, MemStore, WAL)]

Replication in HBase is also based on the WAL. Each RegionServer process of the primary cluster runs a ReplicationSource thread responsible for Replication, and each RegionServer of the standby cluster runs a ReplicationSink thread that receives the replicated data. ReplicationSource keeps track of the WAL queue that needs to be synchronized and continuously reads the WAL contents, filtering entries according to the Replication configuration (for example, whether a given table should be replicated at all), and then ships them to a RegionServer of the standby cluster via the replicateWALEntry RPC. The ReplicationSink thread on the standby cluster converts the received data into put/delete operations and writes them into the standby cluster in batches.

[Figure: asynchronous replication, with ReplicationSource on the primary cluster and ReplicationSink on the standby cluster]

Because a background thread reads the WAL asynchronously and copies it to the standby cluster, this type of replication is called asynchronous replication. Under normal circumstances, the delay before the standby cluster sees the latest writes is on the order of seconds.
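As a concrete illustration, the following is a minimal sketch (assuming the HBase 2.x Admin API) of registering an asynchronous replication peer; the peer id and the standby cluster's ZooKeeper quorum are placeholders. Column families to be replicated additionally need their REPLICATION_SCOPE set to 1, otherwise ReplicationSource filters their WAL entries out.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddAsyncPeerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Cluster key of the standby cluster: "<zk quorum>:<zk client port>:<znode parent>"
      ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
          .setClusterKey("standby-zk1,standby-zk2,standby-zk3:2181:/hbase") // placeholder
          .build();
      // ReplicationSource threads on the primary cluster will start shipping
      // WAL entries of replicated column families to this peer.
      admin.addReplicationPeer("1", peer);
    }
  }
}
```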

Serial Replication

Serial Replication means that, for a given Region, write records are replicated to the standby cluster in strictly the same order in which they were written on the primary cluster. It is a special type of Replication; the default asynchronous replication is not serial. The main reason is that Regions can move: HBase moves Regions during load balancing, for example. Suppose RegionA is first on RegionServer1 and is then moved to RegionServer2. Because of the delay in asynchronous replication, the last records written to RegionA on RegionServer1 may not yet have been fully replicated to the standby cluster when the Region moves. After the move, RegionServer2 starts receiving new writes for RegionA and replicating them, so RegionServer1 and RegionServer2 replicate to the standby cluster at the same time, and the order in which write records reach the standby cluster becomes nondeterministic.

[Figure: out-of-order replication after RegionA moves from RegionServer1 to RegionServer2]

In an extreme case like the one shown in the figure above, this can even lead to data inconsistency between the active and standby clusters. For example, suppose the last not-yet-replicated write on RegionServer1 is a Put, and the first write after RegionA moves to RegionServer2 is a Delete; on the primary cluster the order is Put first, then Delete. If the Delete from RegionServer2 is replicated to the standby cluster first, and the standby cluster then runs a major compaction that removes the Delete marker, then when the Put is later synchronized to the standby cluster it will never be masked by the already-compacted-away Delete, so the standby cluster ends up with more data than the primary cluster.

The key to solving this problem is to ensure that new writes on RegionServer2 are replicated only after the writes on RegionServer1 have been replicated. Serial replication therefore introduces the concept of a Barrier: whenever a Region is opened, a new Barrier is written whose value is the maximum SequenceId read when opening the Region, plus 1. SequenceId is an important concept in HBase: each Region has its own SequenceId, which increases strictly monotonically as data is written, and the SequenceId is written into the WAL with every write. When a Region is moved, it is reopened on the new RegionServer and a new Barrier is written. After a Region has moved several times, multiple Barriers have been written, dividing the Region's writes into multiple intervals. In addition, each Region maintains a lastPushedSequenceId, the SequenceId of the last write that has been successfully pushed for that Region. With the list of Barriers and the lastPushedSequenceId, it is possible to decide whether a given write in the WAL may be replicated to the standby cluster.

[Figure: Barriers dividing a Region's writes into intervals, with lastPushedSequenceId]

Taking the figure above as an example, the pending write records must wait until lastPushedSequenceId has been pushed to Barrier2 before their replication can start. Since each interval has exactly one RegionServer responsible for replication, only the RegionServer whose interval contains lastPushedSequenceId may replicate, and lastPushedSequenceId keeps advancing as its entries are replicated successfully. RegionServers responsible for later intervals must wait until lastPushedSequenceId reaches the starting Barrier of their own interval before they can start replicating. This guarantees that a Region's writes are replicated to the standby cluster in strictly the same order as they were written on the primary cluster.
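The following is an illustrative sketch only, not the actual HBase internal code, of the decision just described: given a Region's Barrier list and its lastPushedSequenceId, a WAL entry may be pushed only once every earlier interval has been fully replicated.

```java
import java.util.List;

/**
 * Illustrative sketch (not the real HBase internals): decide whether a WAL entry
 * of a Region may be pushed to the standby cluster under serial replication.
 */
public class SerialReplicationCheck {

  /**
   * @param barriers        Barrier SequenceIds in ascending order, one written each
   *                        time the Region was opened on some RegionServer
   * @param lastPushedSeqId SequenceId of the last write already replicated for this Region
   * @param entrySeqId      SequenceId of the WAL entry a RegionServer wants to push
   */
  static boolean canPush(List<Long> barriers, long lastPushedSeqId, long entrySeqId) {
    // Find the start of the interval [barrier_i, barrier_{i+1}) containing this entry.
    long intervalStart = 0L;
    for (long barrier : barriers) {
      if (barrier <= entrySeqId) {
        intervalStart = barrier;
      } else {
        break;
      }
    }
    // The entry may be pushed only after every write before its interval has been
    // replicated, i.e. lastPushedSequenceId has reached the interval's starting Barrier.
    return lastPushedSeqId >= intervalStart - 1;
  }
}
```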

Synchronous Replication

Synchronous Replication is the counterpart of asynchronous Replication: a write to the primary cluster must also be written synchronously to the standby cluster. The biggest problem with asynchronous replication is the replication lag: if the entire primary cluster goes down, the standby cluster does not yet hold all of the data that was successfully written. For businesses that require strong consistency, reads and writes cannot simply be switched over to the standby cluster, because some recently written data might not be readable there. The core idea of synchronous replication is therefore to write a Remote WAL on the standby cluster at the same time as the primary cluster's WAL, and a write is acknowledged as successful only after both have been written. When the primary cluster goes down, all of its write records can be replayed on the standby cluster from the Remote WAL, ensuring that the standby cluster stays consistent with the primary cluster.

[Figure: synchronous replication write path with a Remote WAL on the standby cluster]

Note that synchronous replication is built on top of asynchronous replication: the asynchronous replication link is kept, and writing the Remote WAL is added as an extra step. As for the implementation, a Sync replication state is introduced, with three possible values: Active, Downgrade Active, and Standby. Their transition relationships are shown in the figure below. When a Standby cluster is promoted, it must first become Downgrade Active before it can become Active, while an Active cluster can be downgraded directly to Standby. This state is currently stored in ReplicationPeerConfig and indicates which state a cluster is in for a given ReplicationPeer.

[Figure: Sync replication state transitions between Active, Downgrade Active, and Standby]
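As a sketch of how this looks from the Admin API (assuming HBase 2.1+, where synchronous replication was introduced), a peer is created with a Remote WAL directory, and state transitions follow the diagram above: a Standby cluster first goes to Downgrade Active and only then to Active. The peer id, cluster key, table name, and paths below are placeholders.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
import org.apache.hadoop.hbase.replication.SyncReplicationState;

public class SyncReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // A synchronous replication peer declares where the Remote WAL is written
      // and which tables participate.
      Map<TableName, List<String>> tableCfs = new HashMap<>();
      tableCfs.put(TableName.valueOf("demo_table"), null); // null = all column families
      ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
          .setClusterKey("standby-zk1,standby-zk2,standby-zk3:2181:/hbase") // placeholder
          .setRemoteWALDir("hdfs://standby-cluster/hbase/remoteWALs")       // placeholder
          .setReplicateAllUserTables(false)
          .setTableCFsMap(tableCfs)
          .build();
      admin.addReplicationPeer("sync_peer", peer);

      // Promotion follows the state diagram: Standby -> Downgrade Active -> Active.
      admin.transitReplicationPeerSyncReplicationState("sync_peer",
          SyncReplicationState.DOWNGRADE_ACTIVE);
      admin.transitReplicationPeerSyncReplicationState("sync_peer",
          SyncReplicationState.ACTIVE);
    }
  }
}
```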

A DualAsyncFSWAL is then implemented to write the primary cluster's WAL and the standby cluster's Remote WAL at the same time. Writing a WAL is an RPC to HDFS, with three possible outcomes: success, failure, or timeout. On a timeout the result is indeterminate from HBase's point of view; the data may or may not have reached the WAL or the Remote WAL. Only when both writes succeed, or both fail, do the primary and standby clusters hold identical WALs. If the primary cluster's WAL is written successfully but the Remote WAL write fails or times out, the primary cluster's WAL may contain more data than the standby cluster's Remote WAL. Conversely, if the Remote WAL write succeeds but the primary cluster's WAL write fails or times out, the standby cluster's Remote WAL may contain more data than the primary cluster. When both time out, it is unclear which side has more. The key to synchronous replication is therefore to guarantee eventual consistency between the two clusters under all of these situations: whenever the client switches between the active and standby clusters, it should always see consistent data, and while the two clusters are still in an inconsistent intermediate state, restrictions are needed so that the client cannot read such intermediate results. To sum up, the active and standby clusters are eventually consistent, but from the client's point of view they are strongly consistent: successfully written data can always be read, regardless of which cluster serves the request. For the implementation details, see HBaseCon Asia 2018: HBase at Xiaomi [1].

[Figure: DualAsyncFSWAL writing the primary cluster's WAL and the standby cluster's Remote WAL]

Compared with asynchronous replication, synchronous replication mainly affects the write path. According to our test results, writes see roughly a 14% performance degradation. Follow-up optimization is planned in HBASE-20422 [2].

Custom Replication Endpoints

In addition to the three types of Replication above, HBase also supports pluggable Replication Endpoints, which can be customized to implement many different functions. At Xiaomi, for example, we implemented a Replication Endpoint that replicates between differently named tables: table A on the primary cluster can be replicated to table B on the standby cluster, which is useful for cluster migration and for renaming namespaces or tables. To support Point-in-time Recovery, we also built a Replication Endpoint that synchronizes data to Talos, Xiaomi's message queue. When some point in time t1 needs to be restored, we first restore the cold-backup Snapshot closest to t1 and then replay the data in the message queue up to t1, achieving Point-in-time Recovery.
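For illustration, here is a minimal sketch of such a pluggable endpoint (the exact lifecycle hooks vary slightly across HBase versions): instead of calling replicateWALEntry on a standby RegionServer, it could hand WAL entries to an external system such as a message queue. The sendToQueue helper is hypothetical, and such a class would typically be wired in through the peer's replication endpoint implementation class in ReplicationPeerConfig.

```java
import java.util.List;
import java.util.UUID;

import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.replication.ReplicationEndpoint;
import org.apache.hadoop.hbase.wal.WAL;

/** Illustrative sketch of a custom endpoint that ships WAL entries to a message queue. */
public class MessageQueueReplicationEndpoint extends BaseReplicationEndpoint {

  @Override
  public UUID getPeerUUID() {
    // Identifies the downstream "cluster" so replication loops can be detected;
    // a fixed id is enough for this sketch.
    return UUID.nameUUIDFromBytes("mq-endpoint".getBytes());
  }

  @Override
  public boolean replicate(ReplicationEndpoint.ReplicateContext context) {
    List<WAL.Entry> entries = context.getEntries();
    for (WAL.Entry entry : entries) {
      // A real implementation would serialize entry.getKey()/entry.getEdit()
      // and push it downstream, retrying until it succeeds.
      // sendToQueue(entry); // hypothetical helper
    }
    // Returning true tells the ReplicationSource that these entries are durably
    // shipped, so its position in the WAL queue can advance past them.
    return true;
  }

  @Override
  protected void doStart() {
    notifyStarted();
  }

  @Override
  protected void doStop() {
    notifyStopped();
  }
}
```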

Finally, we also implemented something similar to DynamoDB Streams [3]: the change log of a user table is replicated to the user's own message queue, and the user can then build streaming data processing on top of it. In addition, the current implementation of HBase's Read Replica feature relies on HBase Replication, using a pluggable Replication Endpoint to replicate data from the primary Replica to the other Read Replicas; see HBASE-10070 [4] for details.

These are the various kinds of Replication in HBase. If there are any mistakes, corrections are welcome.

You are welcome to use them in your own business scenarios as needed, to implement new Replication Endpoints for your particular scenarios, and to contribute them back to the community.

Reference link:

  1. https://www.slideshare.net/MichaelStack4/hbaseconasia2018-track13-hbase-at-xiaomi

  2. https://jira.apache.org/jira/browse/HBASE-20422

  3. https://docs.aws.amazon.com/zh_cn/amazondynamodb/latest/developerguide/Streams.html

  4. https://issues.apache.org/jira/browse/HBASE-10070

  5. https://mapr.com/blog/in-depth-look-hbase-architecture/
