Geo-Replication in Apache Pulsar, Part 1: Concepts and Features

Summary:

This article introduces an enterprise-grade feature of Apache Pulsar: geo-replication. Geo-replication is an essential part of the disaster prevention and recovery strategy of today's businesses. While most other messaging systems support replication only between two data centers, Pulsar can extend replication to as many data centers as needed. The article also introduces the two types of geo-replication that Pulsar supports, synchronous and asynchronous, and the scenarios each is suited to.

Main text:

Disaster recovery planning, and ideally disaster prevention planning, cannot be overemphasized, as headlines confirm week after week. Regardless of industry, when an unforeseen event disrupts day-to-day operations, an organization needs to get back up and running as quickly as possible to keep serving its customers. From data security breaches to natural disasters, a plan must be in place to respond quickly and flexibly to catastrophic events. Without an effective disaster recovery plan, an organization is exposed to a variety of risks, such as large financial losses, reputational damage, and even more severe risks to its customers and users.

In a multi-faceted enterprise software system, disaster prevention and recovery planning require deployment across multiple geographically dispersed data centers. In such multi-data-center deployments, geo-replication is often used to provide additional redundancy in case a data center failure or other event disrupts normal operation.

This article and the next introduce another enterprise-grade feature of Apache Pulsar: geo-replication. Apache Pulsar is a messaging system built on the scalable stream storage of Apache BookKeeper. It can span multiple data centers and supports both synchronous geo-replication (through Apache BookKeeper) and asynchronous geo-replication (through Broker-level configuration). This first article covers the basic concepts and features; the next focuses on specific deployment practices.

Concepts

Geo-replication is a typical disaster recovery mechanism. Although many data systems claim to support geo-replication, they usually replicate only between two data centers, and replicating to more locations often comes with limitations. This can be confusing for users, and it makes replicating across multiple data centers needlessly cumbersome. Before describing geo-replication in Apache Pulsar, this section first introduces some basic concepts.

The geo-replication mechanisms used by many data systems can be mainly divided into two categories: synchronous geo-replication and asynchronous geo-replication. Apache Pulsar supports both types. Figure 1 below shows the differences between synchronous and asynchronous geo-replication.

This example assumes three data centers, us-west, us-central and us-east, with clients issuing write requests to us-central. With synchronous geo-replication, the data a client writes to us-central is also replicated to the two other data centers, us-west and us-east. The client typically receives an acknowledgment for a write request only after a majority of the data centers have confirmed that the write completed successfully. In this example, at least two data centers must confirm the write. The mechanism is called "synchronous geo-replication" because data is replicated to multiple data centers synchronously, and clients must wait for confirmation from other data centers.

Figure 1: Synchronous Geo-Replication and Asynchronous Geo-Replication

In contrast, with asynchronous geo-replication the client does not wait for a response from other data centers; it receives a response as soon as us-central has completed the write. The data is then replicated from us-central to the two other data centers, us-west and us-east, asynchronously.
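The difference in acknowledgment latency between the two modes can be made concrete with a small model. The sketch below is illustrative only: the function names and latency figures are assumptions rather than measurements, and it assumes every data center needs the same time to persist a write.

```python
def sync_ack_latency_ms(write_ms, remote_rtts_ms):
    """Time until a majority of data centers have confirmed a write.

    write_ms: time one data center needs to persist the write
              (assumed equal in every data center).
    remote_rtts_ms: round-trip times from the local data center to each remote one.
    """
    # The local data center confirms after write_ms; each remote data center
    # confirms after its round trip plus its own write time.
    confirmations = sorted([write_ms] + [rtt + write_ms for rtt in remote_rtts_ms])
    majority = len(confirmations) // 2 + 1
    # The client is acknowledged when the majority-th confirmation arrives.
    return confirmations[majority - 1]


def async_ack_latency_ms(write_ms):
    """With asynchronous replication only the local write is awaited."""
    return write_ms


# Hypothetical numbers: 2 ms local write, 35 ms RTT to us-west, 30 ms RTT to us-east.
print(sync_ack_latency_ms(2, [35, 30]))  # 32
print(async_ack_latency_ms(2))           # 2
```

With three data centers the majority is two, so the synchronous write waits for the faster of the two remote data centers, while the asynchronous write returns after the local write alone.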

Synchronous geo-replication provides the highest availability: all available physical data centers together form a single global logical instance of the data system. Applications can run in any data center and access the data at any time. This model also provides stronger data consistency across data centers, so applications keep running without any manual intervention if one data center fails. However, applications pay additional cross-data-center latency, typically on the order of tens of milliseconds for communication between the east and west coasts of the United States.

Asynchronous geo-replication has lower latency because clients do not wait for responses from other data centers. However, it provides weaker consistency guarantees. Asynchronous replication always has some replication lag, meaning that at any moment some data has not yet been copied from the source to the target. When a disaster takes out an entire data center (a flood, fire, earthquake, or power outage, for example), the data that has not yet been replicated is lost. Because of replication lag, applications are often designed or configured to tolerate such a whole-data-center failure. Asynchronous geo-replication is mainly used where consistency requirements are less strict, such as messaging systems or non-database storage systems.
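As a rough, back-of-the-envelope illustration of what replication lag means for data loss, the amount of acknowledged but not yet replicated data is roughly the publish rate multiplied by the lag. The function name and the workload numbers below are hypothetical.

```python
def messages_at_risk(publish_rate_per_s, replication_lag_ms):
    """Estimate how many acknowledged messages have not yet reached the target.

    These are the messages that would be lost if the source data center
    failed completely at this moment.
    """
    return publish_rate_per_s * replication_lag_ms / 1000


# Hypothetical workload: 10,000 messages/s with 50 ms of replication lag.
print(messages_at_risk(10_000, 50))  # 500.0
```

The estimate assumes a steady publish rate and a constant lag; during a network partition the lag, and with it the amount of data at risk, keeps growing.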

Apache Pulsar, which relies on Apache BookKeeper for persistent message storage, supports both geo-replication methods.

Before going into the details, it helps to look briefly at a typical Pulsar installation, which explains how Apache Pulsar can support both synchronous and asynchronous geo-replication.

Figure 2 shows a typical installation of Apache Pulsar. A Pulsar cluster consists of two layers: a stateless serving layer, made up of a set of Brokers that serve pub/sub traffic, and a stateful persistence layer, made up of a set of BookKeeper bookies that provide durable message storage.

Figure 2: Typical installation of Apache Pulsar

This architecture separates storage from the serving of pub/sub traffic, which brings several advantages. Because Brokers are stateless, load balancing and traffic shifting are cheap. The architecture has proven to be key to Pulsar's multi-tenancy capabilities (covered in more detail in a separate blog post). It is also what allows Apache Pulsar to support both synchronous and asynchronous geo-replication.

Synchronous Geo-Replication in BookKeeper

For Pulsar, synchronous geo-replication is implemented at the storage layer by Apache BookKeeper. A Pulsar installation that provides synchronous geo-replication typically looks like Figure 3.

Figure 3: Pulsar environment using synchronous geo-replication

A Pulsar environment with synchronous geo-replication consists of a global ZooKeeper ensemble (a single ZooKeeper installation running across multiple data centers), a cluster of bookies running in multiple data centers, and a cluster of Brokers also running in multiple data centers. In addition, BookKeeper must be configured with a region-aware placement policy, which Pulsar Brokers use to store data across data centers while guaranteeing consistency for writes (for example, requiring a write to reach at least two data centers before it is acknowledged).
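To give a feel for what a region-aware placement policy does, here is a deliberately simplified model. It is not BookKeeper's actual placement algorithm: the function names, the bookie names, and the round-robin selection are all assumptions made for illustration.

```python
from itertools import zip_longest


def pick_ensemble(bookies_by_region, ensemble_size):
    """Pick bookies by cycling through regions so the ensemble spans many regions."""
    per_region = [sorted(names) for _, names in sorted(bookies_by_region.items())]
    # Take the first bookie of every region, then the second of every region, ...
    interleaved = [b for column in zip_longest(*per_region) for b in column if b is not None]
    if len(interleaved) < ensemble_size:
        raise ValueError("not enough bookies for the requested ensemble size")
    return interleaved[:ensemble_size]


def can_acknowledge(acked_bookies, region_of, min_regions=2):
    """A write is acknowledged only once bookies in min_regions regions hold it."""
    return len({region_of[b] for b in acked_bookies}) >= min_regions


# Hypothetical bookies, two per data center.
bookies = {"us-west": ["w1", "w2"], "us-central": ["c1", "c2"], "us-east": ["e1", "e2"]}
region_of = {b: region for region, names in bookies.items() for b in names}

print(pick_ensemble(bookies, 3))                  # ['c1', 'e1', 'w1']
print(can_acknowledge(["c1", "c2"], region_of))   # False: both copies are in us-central
print(can_acknowledge(["c1", "e1"], region_of))   # True: copies in two data centers
```

The point of the sketch is the acknowledgment rule: two copies inside the same data center are not enough, because losing that data center would lose the write.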

Figure 4: How a synchronous geo-replicated Pulsar environment responds to a data center failure

If a data center fails, a synchronously geo-replicated Pulsar environment continues to function normally, and the applications running on it are essentially unaffected. Adding new data centers and decommissioning old ones are also transparent to applications, and can be done without downtime while applications are running. This type of configuration is ideal for mission-critical applications that can tolerate higher latency.

Asynchronous Geo-Replication in Pulsar

In contrast, an asynchronously geo-replicated Pulsar environment consists of multiple physical clusters located in multiple data centers. Pulsar Brokers replicate data between these clusters asynchronously. Figure 5 shows an asynchronously geo-replicated Pulsar environment; compare it with the synchronous environment in Figure 3.

Figure 5: Asynchronous Geo-Replicated Pulsar Environment

With asynchronous geo-replication, a new message produced to a Pulsar topic is first persisted in the local cluster and then replicated asynchronously to the remote clusters. In most cases, when the network connection is healthy, the message is replicated immediately, at the same time as it is delivered to local consumers. End-to-end delivery latency is typically determined by the network round-trip time (RTT) between data centers. Applications can create Producers and Consumers in any cluster, even when a remote cluster is unreachable (for example, during a network partition).
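The flow described above (persist locally, acknowledge, replicate later, keep accepting messages during a partition) can be sketched with a toy model. The Cluster class, its fields, and the message names are hypothetical and exist only for this illustration; they are not Pulsar APIs.

```python
class Cluster:
    def __init__(self, name):
        self.name = name
        self.published = []   # messages produced and persisted locally
        self.received = []    # messages replicated in from other clusters
        self.cursors = {}     # remote name -> number of local messages already sent

    def publish(self, message):
        # The producer is acknowledged after the local write, whether or not
        # any remote cluster is reachable.
        self.published.append(message)
        return True

    def replicate_to(self, remote, reachable=True):
        """Send pending locally published messages to remote; return how many were sent."""
        if not reachable:
            return 0
        start = self.cursors.get(remote.name, 0)
        pending = self.published[start:]
        remote.received.extend(pending)
        self.cursors[remote.name] = len(self.published)
        return len(pending)


west, east = Cluster("us-west"), Cluster("us-east")
west.publish("m1")
west.replicate_to(east)                    # link is up: m1 is copied
west.publish("m2")
west.publish("m3")
west.replicate_to(east, reachable=False)   # partition: nothing is copied, publishing still worked
west.replicate_to(east)                    # link restored: the backlog m2, m3 drains
print(east.received)                       # ['m1', 'm2', 'm3']
```

In this model only locally published messages are forwarded, so replicating in both directions between two clusters does not bounce messages back and forth.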

In Pulsar, asynchronous geo-replication is enabled at the tenant level. Asynchronous geo-replication between clusters can be enabled only for a tenant that has been created with access to all of the clusters involved. Although geo-replication must be enabled per tenant because of permissions, it is managed at the namespace level. For example, if a tenant has access to data centers A, B, C, and D, it can create one namespace that is replicated between A and B, another that is replicated between C and D, and a third with full-mesh replication among A, B, C, and D.
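The relationship between tenant permissions and namespace-level replication can be modeled in a few lines. This is a sketch of the rule only, not of Pulsar's admin API: the Tenant class, its method, and the tenant and namespace names are hypothetical.

```python
class Tenant:
    def __init__(self, name, allowed_clusters):
        self.name = name
        self.allowed_clusters = frozenset(allowed_clusters)
        self.namespaces = {}  # namespace name -> set of replication clusters

    def set_replication_clusters(self, namespace, clusters):
        """Replicate a namespace across clusters the tenant is allowed to use."""
        requested = frozenset(clusters)
        denied = requested - self.allowed_clusters
        if denied:
            raise PermissionError(f"tenant {self.name} has no access to {sorted(denied)}")
        self.namespaces[namespace] = requested


# The example from the text: one tenant with access to A, B, C, and D.
tenant = Tenant("my-tenant", ["A", "B", "C", "D"])
tenant.set_replication_clusters("ns-ab", ["A", "B"])             # replicated between A and B
tenant.set_replication_clusters("ns-cd", ["C", "D"])             # replicated between C and D
tenant.set_replication_clusters("ns-all", ["A", "B", "C", "D"])  # full mesh across all four
```

A request that names a cluster outside the tenant's allowed set, such as a hypothetical cluster E, is rejected with a PermissionError in this model.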

Pulsar gives tenants great flexibility in customizing replication strategies. Applications can set up master-slave replication, active-active bidirectional replication, or full-mesh replication across multiple data centers (Figure 5 shows full-mesh replication across three data centers). Replication is performed automatically by the Pulsar Brokers and is completely transparent to applications, whereas other pub/sub messaging systems often require additional, complex processes to mirror messages between data centers. Pulsar's geo-replication can be enabled, disabled, or changed at any time at runtime (for example, from master-slave to active-active bidirectional replication), and each of these operations requires only a single administrative command. Using asynchronous geo-replication in Pulsar for failover, failback, and other best practices will be covered in detail in the next article.
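One way to picture these replication strategies is as sets of directed links between clusters. The helper names below are made up for this sketch; `one_way` corresponds to what the text calls master-slave replication.

```python
from itertools import permutations


def one_way(source, target):
    """Master-slave in the text: data flows from source to target only."""
    return {(source, target)}


def active_active(a, b):
    """Bidirectional replication between two clusters."""
    return {(a, b), (b, a)}


def full_mesh(clusters):
    """Every cluster replicates to every other cluster."""
    return set(permutations(clusters, 2))


print(sorted(one_way("us-west", "us-east")))
# [('us-west', 'us-east')]
print(sorted(active_active("us-west", "us-east")))
# [('us-east', 'us-west'), ('us-west', 'us-east')]
print(len(full_mesh(["us-west", "us-central", "us-east"])))
# 6
```

Seen this way, switching from master-slave to active-active simply adds the reverse link, which is why such a change can be a single configuration step rather than a redeployment.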

Yahoo's Multi-Data Center Replication

Since 2015, Yahoo has deployed Pulsar in more than a dozen data centers around the world, using a full-mesh asynchronous geo-replication configuration. Geo-replication is used mainly for business-critical services such as Mail, Finance, Gemini advertising, and Sherpa (Yahoo's distributed key-value service). The system as a whole serves more than 1.4 million topics and replicates tens of billions of messages every day.

At this scale, the system must be flexible enough, and provide the necessary tools, for users to manage replication efficiently. Everything must be accounted for: adding regions to or removing them from the replication set of a namespace, understanding exactly where data is stored and how much data is being replicated, and being able to monitor replication and find out why it is falling behind.

Last but not least, with geo-replication the likelihood of experiencing network partitions or degraded network performance between data centers is much higher than when operating within a single data center. The messaging and storage components must therefore be able to withstand backlogs that build up over long periods, from hours to days. Equally important, once the network issue is resolved, the backlog must be drained faster than new messages arrive, without affecting live traffic.
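The requirement that the backlog drain faster than new messages arrive can be expressed as simple arithmetic: the backlog shrinks only at the difference between the replication rate and the incoming rate. The function name and all figures below are hypothetical.

```python
def drain_time_s(backlog_msgs, replication_rate_per_s, incoming_rate_per_s):
    """Seconds needed to clear a replication backlog while new messages keep arriving."""
    if replication_rate_per_s <= incoming_rate_per_s:
        # Replication cannot keep up, so the backlog never shrinks.
        return float("inf")
    return backlog_msgs / (replication_rate_per_s - incoming_rate_per_s)


# Hypothetical case: a 2-hour partition at 10,000 messages/s leaves 72 million messages behind.
backlog = 2 * 3600 * 10_000
print(drain_time_s(backlog, 30_000, 10_000))  # 3600.0 (one hour)
print(drain_time_s(backlog, 10_000, 10_000))  # inf
```

In this example the replicator needs three times the live publish rate to recover from a two-hour partition within one hour.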

Conclusion

Apache Pulsar is a messaging system that builds on the scalable stream storage of Apache BookKeeper and supports both synchronous geo-replication (through Apache BookKeeper) and asynchronous geo-replication (configured at the Broker level). This article introduced the two types of geo-replication commonly used in data systems and discussed the differences and trade-offs between them. Pulsar supports both, through different mechanisms. We hope this article has helped you better understand Apache Pulsar and its geo-replication capabilities. The next article will introduce several common patterns and practices for using asynchronous geo-replication in Apache Pulsar.

If you are interested in Pulsar, you can also join the Pulsar community.

For general information about the Apache Pulsar project, please visit the official website at http://pulsar.incubator.apache.org/ and follow the Twitter account @apache_pulsar.

Author: Sijie Guo. Original English article: Geo-replication in Apache Pulsar, part 1: concepts and features
