Data replication: a key component in building large-scale distributed systems

Data replication is critical to building reliable, large-scale distributed systems. In this issue, we'll explore common replication strategies and the key factors in choosing the right one.

Throughout, we will use a database as the running example. Note that replication is not limited to databases: it also applies to cache servers such as Redis and to application servers that hold critical in-memory data structures.

So, what is replication? It is a method of copying data from one place to another. We use this to ensure our data is available where and when it is needed. It helps us improve data durability and availability, reduce latency, and increase bandwidth and throughput.

But choosing a replication strategy isn't always easy. Each strategy has its own advantages and disadvantages, and a strategy that suits one use case may be a poor fit for another.

Specifically, we'll explore the three main replication strategies: Leader-Follower, Multi-Leader, and Leaderless. We'll explain what each strategy is, how it works, and in which situations it is most effective, along with the trade-offs involved, so that we can make an informed choice about the strategy that is best for our system.

So, let’s dive into the world of data replication.

Introduction

Let's consider, at a high level, why replication is necessary. As mentioned above, we will use a database as the example, but the same reasoning applies to other types of data stores.

Improve durability

Improving durability is probably the most important reason for data replication. When a single database server fails, it can result in catastrophic data loss and downtime. If data is replicated to other database servers, the data will be preserved even if one server goes down. Certain replication strategies, such as asynchronous replication, may still result in small amounts of data loss, but overall durability is greatly improved.
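To make that trade-off concrete, here is a minimal sketch in Python (the Leader and Follower classes are hypothetical, not any real database's API) of the difference between synchronous and asynchronous replication: a synchronous leader acknowledges a write only after every follower has applied it, while an asynchronous leader acknowledges immediately, so a leader crash can lose the records still in flight.

```python
import queue
import threading

class Follower:
    """A minimal in-memory replica that applies records it receives."""
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

class Leader:
    """Toy leader illustrating sync vs. async replication."""
    def __init__(self, followers, mode="async"):
        self.followers = followers
        self.mode = mode
        self.log = []
        self._pending = queue.Queue()
        if mode == "async":
            threading.Thread(target=self._replicate_loop, daemon=True).start()

    def write(self, record):
        self.log.append(record)        # durable on the leader
        if self.mode == "sync":
            for f in self.followers:   # block until all replicas have it
                f.apply(record)
        else:
            self._pending.put(record)  # ack now; a crash here loses this record
        return "ack"

    def _replicate_loop(self):
        while True:
            record = self._pending.get()
            for f in self.followers:
                f.apply(record)

followers = [Follower(), Follower()]
leader = Leader(followers, mode="sync")
leader.write({"key": "a", "value": 1})
print(followers[0].log)  # [{'key': 'a', 'value': 1}]
```

The sync mode trades write latency for zero-loss durability; the async mode does the reverse, which is exactly the "small amounts of data loss" mentioned above.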


You may be thinking: Aren't regular data backups enough to ensure durability? Backups certainly allow data to be recovered after a disaster such as a hardware failure. But relying solely on backups has durability limitations. Backups are scheduled, so some data loss may occur between backup cycles. Restoring data from backups is also slow and can cause downtime. Combined with backup, replication provides additional durability by eliminating (or greatly reducing) data loss windows and allowing faster failover. Backup and replication work together to provide data recovery and minimize downtime.

Improve availability

Another key reason to replicate data is to improve the overall availability and resiliency of your system. When a database server goes offline or is under heavy load, it can be challenging to keep your application running smoothly.

Simply redirecting traffic to a new server is not a trivial matter. The new node needs to already have a nearly identical copy of the data in order to take over quickly. Switching databases in the background requires careful failover orchestration while maintaining continuous application and user uptime.

Replication enables seamless failover by keeping a standby server with an up-to-date copy of the data. Applications can redirect traffic to replicas in the event of a problem, minimizing downtime. A well-designed system automatically handles redirection and failure recovery through monitoring, load balancing, and replication configurations.
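As an illustration of that failover orchestration, here is a hedged sketch (the Node and FailoverRouter classes are made up; a real system would probe the server over the network and check replication lag before promoting): traffic goes to the primary while it passes a health check, and a healthy standby is promoted when it does not.

```python
class Node:
    """A database server as seen by the router."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

class FailoverRouter:
    """Send traffic to the primary while it passes health checks;
    promote a healthy standby replica when it stops responding."""
    def __init__(self, primary, standbys):
        self.primary = primary
        self.standbys = standbys

    def _health_check(self, node):
        # A real check would ping the server and inspect replication lag.
        return node.healthy

    def route(self, request):
        if not self._health_check(self.primary):
            healthy = [n for n in self.standbys if self._health_check(n)]
            if not healthy:
                raise RuntimeError("no healthy replica to fail over to")
            self.primary = healthy[0]          # promote a standby
            self.standbys.remove(self.primary)
        return f"{request} -> {self.primary.name}"

router = FailoverRouter(Node("db-primary"),
                        [Node("db-replica-1"), Node("db-replica-2")])
print(router.route("SELECT 1"))  # db-primary
router.primary.healthy = False   # simulate a primary crash
print(router.route("SELECT 1"))  # db-replica-1, after automatic failover
```

The promotion only works because the standby already holds an up-to-date copy of the data, which is the point of keeping replicas warm.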

Of course, replication has its own overhead and complexities. But without replication, a single server failure can mean lengthy downtime. Replication maintains availability in the event of failure.


Increase throughput

Replicating data between multiple database instances also increases overall system throughput and scalability by distributing load across nodes.

A single database server can handle only so many concurrent reads and writes before performance degrades. By copying data to multiple servers, application requests can be distributed among replicas. More replicas means more capacity to handle the load.

This distribution of requests spreads the workload across nodes. It allows the overall system to sustain much higher throughput than a single server, and capacity can be expanded by adding replicas as needed.
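A common way to exploit replicas for throughput is read/write splitting: writes go to the leader, and reads are spread round-robin across replicas. A minimal sketch (the endpoint names are illustrative; real connection pools and database proxies implement the same idea):

```python
import itertools

class ReplicatedPool:
    """Route writes to the leader and spread reads across replicas
    round-robin, so adding replicas adds read throughput."""
    def __init__(self, leader, replicas):
        self.leader = leader
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql):
        target = self.leader if self._is_write(sql) else next(self._replicas)
        return f"{sql!r} -> {target}"

    @staticmethod
    def _is_write(sql):
        return sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}

pool = ReplicatedPool("leader:5432", ["replica-1:5432", "replica-2:5432"])
print(pool.execute("SELECT * FROM users"))    # -> replica-1:5432
print(pool.execute("SELECT * FROM orders"))   # -> replica-2:5432
print(pool.execute("INSERT INTO users ..."))  # -> leader:5432
```

Note that this scales reads; scaling writes requires strategies such as Multi-Leader or Leaderless replication, discussed later.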

Replication itself has an associated overhead and can become a bottleneck if not managed properly. Factors such as inter-node network bandwidth, replication lag, and write coordination should all be monitored.
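A simple way to watch replication lag is to compare the leader's latest log position with the position each replica has applied (the numbers and threshold below are illustrative; PostgreSQL, for example, exposes equivalent positions through its replication statistics):

```python
def replication_lag(leader_position, replica_position):
    """Lag, measured in log positions the replica still has to apply."""
    return leader_position - replica_position

# Hypothetical positions read from the leader and one replica.
lag = replication_lag(leader_position=1_000_000, replica_position=999_250)
if lag > 500:  # illustrative alerting threshold
    print(f"replica is {lag} positions behind the leader -- investigate")
```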

But proper replication configuration allows horizontal scaling of read and write capacity. This enables massive aggregate throughput and workload scalability beyond the limitations of a single server.


Reduce latency

Data replication also reduces latency by placing data close to users. For example, replicating a database to multiple geographic regions brings copies of the data close to local users. This reduces the physical network distance over which data must travel compared to a single centralized database location.

Shorter network distance means lower transmission latency. As a result, users' read and write requests will see faster response times when requests are routed to a nearby replicated instance rather than routed to a distant instance. Multi-region replication enables localized processing and avoids the high latency of cross-border or intercontinental network routing.
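Here is a sketch of region-aware routing (the region names and endpoints are made up for illustration): reads are served from the replica in the user's region, falling back to a default region when no local replica exists.

```python
# Hypothetical mapping of regions to local replica endpoints.
REPLICAS = {
    "us-east": "db.us-east.example.com",
    "eu-west": "db.eu-west.example.com",
    "ap-south": "db.ap-south.example.com",
}

def nearest_replica(user_region, default="us-east"):
    """Serve the user from a replica in their region when one exists."""
    return REPLICAS.get(user_region, REPLICAS[default])

print(nearest_replica("eu-west"))  # db.eu-west.example.com: a short local hop
print(nearest_replica("sa-east"))  # no local replica; falls back to us-east
```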

Note that distributing replicas across regions introduces complexities such as replica synchronization, consistency, and conflict resolution with concurrent multi-site updates. Solutions such as consistency models, conflict resolution logic, and replication protocols help manage this complexity.

Where applicable, multi-region replication provides major latency improvements for geographically distributed users and workloads through localized processing. Lower latency also improves user experience and productivity.

