High availability for distributed services: replication | JD Logistics technical team

1. Why replicate

We can consider the following questions:

  1. When the data volume or the read/write load exceeds what a single server can handle, how do we balance the load?

  2. How does the service keep running when a single server goes down?

  3. When the users of the service are spread around the world and expect no large delays when accessing it, how do we give all of them a consistent interactive experience?

All of these problems can be solved by replication: keeping copies of the same data on multiple nodes to provide redundancy. If some nodes become unavailable, the remaining nodes can still serve the data, and the nodes can be deployed in different geographic locations to improve performance. The three problems above are addressed as follows:

  1. Adopt a shared-nothing architecture and scale horizontally: distribute data across multiple servers to balance the load effectively and improve the service's scalability
  2. Deploy multiple servers, so that when one goes down another can take over at any time, achieving high availability
  3. Deploy the service in multiple geographic locations so that users access the nearest one, avoiding long delays and keeping the experience consistent

replication overview.png

2. Single-master replication

Single-master replication is the most common replication scheme in practice. Each node that stores a copy of the database is called a replica. Every write to the database must be propagated to every replica, otherwise the replicas diverge. It works as follows:

  • One of the replicas is designated the leader, also known as the master library. When a client wants to write to the database, it must send the request to the leader
  • The other replicas are called followers, also known as slave libraries or read-only replicas. Whenever the leader writes data to its local storage, it pushes the change to all followers in the form of a replication log or change stream, and each follower applies the writes in the same order as the leader (a minimal sketch follows)
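
To make the mechanism concrete, here is a minimal sketch in Python (all class and method names are hypothetical, not any real database's API): the leader appends each write to a replication log, applies it locally, and pushes the change stream to its followers, which apply entries in exactly the leader's order.

```python
class Replica:
    def __init__(self):
        self.data = {}           # key -> value
        self.applied_offset = 0  # position in the replication log

    def apply(self, offset, key, value):
        assert offset == self.applied_offset + 1  # apply strictly in log order
        self.data[key] = value
        self.applied_offset = offset

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        self.log = []  # replication log / change stream

    def write(self, key, value):
        offset = self.applied_offset + 1
        self.log.append((offset, key, value))
        self.apply(offset, key, value)     # write to local storage first
        for f in self.followers:           # then push the change stream
            f.apply(offset, key, value)    # (asynchronous in practice)

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("user:1234", {"name": "Alice"})
assert all(f.data == leader.data for f in followers)
```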

2.1 Data synchronization between nodes

Data synchronization can be synchronous replication or asynchronous replication. The advantage of synchronous replication is that the slave library is guaranteed to hold data consistent with the master library, so if the master fails, the data can still be found on the slave. Its drawback is just as obvious: the master must wait for the slave to confirm each write, and if the synchronous slave does not respond, the master can no longer process new writes and blocks.

In read-heavy, write-light scenarios we usually add slave nodes to balance the read load, but making all of them synchronous replicas is impractical and unreliable: a single node failure or network interruption would block all writes.

In practice, when a database enables synchronous replication, it usually means that one slave library is synchronous while the others are asynchronous. If the synchronous slave fails, one of the asynchronous replicas is promoted to synchronous, guaranteeing that at least two nodes hold the latest copy of the data. This configuration is known as semi-synchronous.
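
A minimal sketch of the semi-synchronous idea, reusing the hypothetical Replica class from the earlier sketch (a simplification, not MySQL's actual protocol): the leader blocks on an acknowledgment from exactly one synchronous follower, while the remaining followers replicate asynchronously.

```python
# Semi-synchronous write: block on one follower's ack, let the rest lag.
def semi_sync_write(entry, sync_follower, async_followers):
    offset, key, value = entry
    sync_follower.apply(offset, key, value)  # wait for this ack before confirming
    for f in async_followers:                # in practice these run in the background
        f.apply(offset, key, value)
    return "ok"  # at least two nodes (leader + sync follower) now hold the entry

# If the synchronous follower stops responding, promote one of the
# asynchronous followers to the synchronous role before continuing.
```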

Typically, leader-based replication is configured to be fully asynchronous. As shown in the figure below, when user 1234 modifies their information, there is a lag before the change propagates from the master library to the slave libraries.

asynchronous replication.png

This means that if the master library fails at this moment, any data not yet copied to the slaves is lost: the write is not durable even though the client received a success confirmation. It also means that a read served by a lagging slave (Follower 2 in the figure) returns old data, making it look to the user as if their earlier write was lost. This is the read-after-write consistency problem, which we explain in detail later.

In practice, however, production systems mostly run on asynchronous replication, which shows that strong consistency is not a mandatory guarantee and that system throughput is often the higher priority. Under this mechanism, even if a slave library falls far behind, the master does not wait for it before acknowledging a write; the slave eventually catches up and converges with the master. This weak guarantee is called eventual consistency.

2.2 Replication lag problems

As the previous section showed, with asynchronous replication there is a lag between a write reaching the master library and its arrival at the slaves, which gives rise to a series of problems. We examine them in more detail here.

  • The master node fails after a write completes, before the slave nodes finish synchronizing

When the master node fails, a failover is required: one slave library is promoted to master. The best candidate is usually the slave with the most up-to-date copy of the data (ZooKeeper follows the same principle when comparing transaction IDs during election). The new master then continues serving clients, and the other slave libraries synchronize from it.

If the new master had not finished synchronizing before the old master failed, the usual practice is to discard the old master's unreplicated writes, so data loss occurs at this point.
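
A minimal sketch, under the same hypothetical structures as above, of choosing the failover candidate: promote the follower whose replication log is most complete, mirroring ZooKeeper's rule of comparing transaction IDs (zxids) during election.

```python
def pick_new_leader(followers):
    # each follower exposes the offset of the last replication-log entry it applied
    return max(followers, key=lambda f: f.applied_offset)
```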

When the old master recovers, it must recognize the new master's existence and demote itself to a slave library. If multiple nodes in the cluster each believe they are the master, the "split brain" phenomenon, the situation is very dangerous: several masters accept writes, and without a conflict resolution mechanism the data may be corrupted.

When ZooKeeper encounters split brain, it compares epoch numbers (the epoch is incremented each time a failover completes a new round of election): slave nodes reject requests from an old master carrying a stale epoch, ensuring the data is not corrupted.
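
A minimal sketch of epoch fencing, greatly simplified from what ZooKeeper actually does: every election bumps the epoch, and a replica rejects any request stamped with an older epoch, so a deposed master can no longer corrupt the data.

```python
class FencedReplica:
    def __init__(self):
        self.epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.epoch:
            raise RuntimeError("stale leader: request rejected")
        self.epoch = epoch  # remember the newest epoch seen
        self.data[key] = value

r = FencedReplica()
r.write(epoch=1, key="a", value=1)      # old master, epoch 1
r.write(epoch=2, key="a", value=2)      # new master after failover, epoch 2
try:
    r.write(epoch=1, key="a", value=3)  # old master comes back: rejected
except RuntimeError as e:
    print(e)
```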


  • Read-after-write consistency (read-your-writes consistency)

read-after-write consistency.png

As shown in the figure above, if the user reads data immediately after writing it, the new data may not yet have reached the read-only slave library, and it appears that the just-submitted data has been lost. This can be resolved in the following ways:

  • For content the user may have modified themselves, always read from the master library. This requires a way to know, without querying, whether the user changed certain data. For example, a social network profile is usually edited only by its owner, so a rule can be defined: read your own profile from the master library and other people's profiles from the slaves
  • If most of the application's content is user-editable, routing most queries to the master library defeats read scalability. In that case, record the time of the last update: for example, read from the master within one minute after an update and from the slaves afterwards
  • The client records the timestamp of its latest write, and the system ensures that a slave serving that user's reads has already applied changes up to that timestamp. If the current slave is not yet that fresh, the read is routed to another replica or waits for the slave to catch up. A sketch combining these strategies follows
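
A minimal sketch combining the strategies above (all names here are hypothetical, and the node objects are assumed to expose simple read/write methods): reads go to the master for a fixed freshness window after the client's last write, and otherwise to a slave.

```python
import time

class ReadYourWritesRouter:
    """Route reads so a user always sees their own recent writes (sketch)."""
    def __init__(self, master, slaves, freshness_window_s=60):
        self.master, self.slaves = master, slaves
        self.window = freshness_window_s
        self.last_write = {}  # user_id -> timestamp of that user's last write

    def write(self, user_id, key, value):
        self.last_write[user_id] = time.time()
        self.master.write(key, value)

    def read(self, user_id, key):
        wrote_recently = time.time() - self.last_write.get(user_id, 0.0) < self.window
        node = self.master if wrote_recently else self.slaves[0]
        return node.read(key)
```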

  • Monotonic reads

monotonic reads.png

As shown in the figure above, user 1234 writes a comment, and user 2345 reads the comments twice. The first request lands on Follower 1, which has finished synchronizing, so the comment is visible. The second request lands on Follower 2, which has not yet caught up, so the previously seen comment disappears: time appears to move backward.

Avoiding this phenomenon requires monotonic reads: once a user has read newer data, they never subsequently read older data. One way to achieve this is to route all of one user's read requests to the same replica node, for example by choosing the replica from a hash of the user ID rather than at random.
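
A minimal sketch of this routing rule (the replica names are placeholders): hash the user ID to pick a replica, so all of one user's reads land on the same node.

```python
import hashlib

def replica_for(user_id: str, replicas: list):
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]  # stable choice per user

replicas = ["follower-1", "follower-2", "follower-3"]
assert replica_for("user-2345", replicas) == replica_for("user-2345", replicas)
```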

2.3 Data synchronization for a new slave library

New slave libraries are usually added to improve the system's read scalability. When a new slave synchronizes with the master library, simply copying the data files to the new node is not enough, because the data is constantly changing and a file copy cannot capture the full data set. The general process is as follows:

  1. Take a consistent snapshot of the master library at some moment and copy it to the new slave node
  2. The slave connects to the master and pulls all data changes that occurred after the snapshot. This requires the snapshot to be associated with an exact position in the master's replication log; MySQL does this via binary log coordinates (binlog coordinates)
  3. Once the slave has processed the backlog of changes since the snapshot, it is said to have caught up with the master, and from then on it can process the master's changes as they occur

If a slave library fails, the same steps run after it restarts: from its log it knows the last transaction processed before the failure, requests all data changes that occurred while it was disconnected (steps 2 and 3 above), and gradually catches up with the master.
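
A minimal sketch of the bootstrap-and-catch-up process, using the hypothetical log format from the earlier sketches; here snapshot_offset plays the role of MySQL's binlog coordinates.

```python
def bootstrap_follower(snapshot, snapshot_offset, leader_log):
    data = dict(snapshot)                  # 1. copy the consistent snapshot
    for offset, key, value in leader_log:  # 2. replay changes after the snapshot
        if offset > snapshot_offset:
            data[key] = value
    return data                            # 3. now caught up with the master

snapshot = {"a": 1}
log = [(1, "a", 1), (2, "b", 2), (3, "a", 3)]
assert bootstrap_follower(snapshot, snapshot_offset=1, leader_log=log) == {"a": 3, "b": 2}
```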

3. Multi-master replication

With single-master replication, every write request must go through the data center where the master node lives. As write volume grows, the poor scalability of a single master becomes a limitation, and users around the world all have to send their writes to that one node, potentially suffering long delays. The natural extension of the single-master architecture that solves these problems is multi-master replication, in which each master node is simultaneously a slave library of the other masters.

In practice, multi-master replication is usually not the tool for scaling beyond a single master; that problem is solved by data partitioning, because the complexity multi-master brings outweighs the benefit. In some scenarios, however, multi-master replication is still the right fit.

The multi-master replication architecture of multiple data centers is shown in the following figure:

multi-master replication.png

Replicas of the database are spread across multiple data centers, each with its own master library, and within each data center replication is master-slave. A write is processed in the local data center and then synchronized to the masters of the other data centers. This makes the network delay between data centers transparent to users, meaning performance may be better and tolerance of network problems is higher. Deploying the data centers in different geographic regions also improves the user experience. And if the local data center fails, requests can be redirected to another data center; once the local one recovers and its replication catches up, it can resume serving.

3.1 Application scenarios of multi-master replication

  • Applications that keep working while disconnected

If your phone and computer live in the same ecosystem, changes to, say, a memo normally synchronize across devices. Architecturally, each device is a data center and every data center accepts writes, which matches the multi-master model, except that the network between these "data centers" is extremely unreliable. If you edit the memo on the computer while the phone is offline, then when the phone reconnects, the devices must synchronize: this is asynchronous multi-master replication at work.


  • Online collaborative document editing

When a user edits a document, the change is immediately and asynchronously replicated to the server and to any other users working on the document. Each user's local copy of the document acts as a data center, which resembles the offline-memo scenario above. In this case, however, to make collaboration fast and the editing experience smooth, the write conflicts caused by simultaneous edits must be resolved.

3.2 Resolving write conflicts

Although multi-master replication brings many benefits, as noted above (write scalability, better fault tolerance, and lower geographic latency), the accompanying configuration complexity and write conflicts are problems we must face.

As shown in the figure below, user 1 changes the title to B while user 2 changes it to C, producing a write conflict. It is hard to say whose result deserves to be designated the final one, yet the values in the multi-master databases still have to converge to a consistent state.

Multi-Master Replication Conflict.png

Last write wins (LWW) is one of the more common approaches: attach a timestamp or unique ID to each write, pick the largest value as the final result, and discard the rest. However, this easily loses data. Because of the unreliable-clocks problem in distributed systems, a later write may carry an earlier timestamp, in which case the value we expected to win is the one that gets discarded.

Another approach is to assign each master library an ID number and give writes from higher-numbered masters priority, but this also loses data.
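
A minimal sketch of LWW that combines both rules (the tuple layout is an assumption for illustration): the version with the largest (timestamp, replica ID) pair wins and everything else is silently discarded, which is precisely where data gets lost.

```python
def lww_merge(versions):
    # versions: iterable of (timestamp, replica_id, value)
    return max(versions)[2]  # keep one winner, silently drop the rest

conflict = [(1700000000.5, 1, "B"), (1700000000.5, 2, "C")]
assert lww_merge(conflict) == "C"  # equal timestamps: higher replica ID wins
```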

If data loss is unacceptable, the conflicting values can be combined in some way. For the title edits in the figure above, the result could be spliced together as B/C, leaving it to the user to clean up afterwards. In a similar spirit, the database can record all conflicting versions explicitly and prompt the user to resolve them.

Version vectors are another way to resolve conflicts. Taking a key-value cache as an example: we maintain a version number per key; every write first reads the key and must merge together all values seen in that read, marking deleted values with a tombstone so that deleted data does not resurface after a merge. The version number is incremented when the write completes and is stored alongside the written value. When multiple replicas accept writes concurrently, each replica also maintains its own version number and increments it on every write it processes. The collection of version numbers from all replicas is called a version vector. It travels between client and server with every read and write and lets the database distinguish overwrites from concurrent writes. Version vectors make it safe to read from one replica and subsequently write back to another.
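
A minimal sketch of version-vector comparison (simplified; real systems also track tombstones as described above): one write supersedes another only if its vector is at least as large in every replica's slot; otherwise the two writes are concurrent and both values must be kept as siblings for merging.

```python
def dominates(a: dict, b: dict) -> bool:
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def resolve(v1, vec1, v2, vec2):
    if dominates(vec1, vec2):
        return [v1]       # v1 overwrote v2
    if dominates(vec2, vec1):
        return [v2]       # v2 overwrote v1
    return [v1, v2]       # concurrent: siblings, caller must merge

# replica r1 and replica r2 each accepted a write without seeing the other's
assert resolve("B", {"r1": 2, "r2": 1}, "C", {"r1": 1, "r2": 2}) == ["B", "C"]
```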

That said, despite all these conflict resolution techniques, avoiding conflicts is the best strategy. For example, if all writes to a particular record are guaranteed to go through the same master, conflicts on that record never arise.

A note on concurrency: on a single server we can use timestamps to decide whether two events happened at the same time, but in a distributed system, with its unreliable clocks, this is genuinely hard to tell, so literal overlap in time is not what matters. Concurrency really concerns whether two events know of each other's existence: if neither is aware of the other, that is, neither happens before the other, then the two events are concurrent, and concurrent writes are exactly the conflicts that must be resolved.

4. Masterless replication

Masterless replication uses a different mechanism from single-master and multi-master replication: it abandons the concept of a master library altogether, with no division of responsibilities between master and slave, and every database node can handle write requests. It suits applications that need high availability and low latency and can tolerate occasionally reading stale values.

Another advantage of this replication mode is that there is no failover: when a node goes down, the application simply routes requests to the other healthy nodes. After a downed node comes back, it can catch up on the writes it missed in two ways:

  • Read repair: suitable for frequently read values. When a client reads from multiple nodes in parallel and detects a stale value, it writes the newer value back over the stale one (a sketch follows this list)
  • Anti-entropy: a background process continuously looks for data differences between replicas and copies missing data from one replica to another
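
A minimal sketch of read repair with a toy node interface (the get/put methods and the (version, value) layout are assumptions for illustration, not a real client API):

```python
class Node:
    """Toy replica storing a (version, value) pair per key."""
    def __init__(self, data=None):
        self.data = data or {}
    def get(self, key):
        return self.data.get(key)
    def put(self, key, versioned_value):
        self.data[key] = versioned_value

def read_with_repair(key, nodes):
    responses = [(n, n.get(key)) for n in nodes]
    latest = max(v for _, v in responses if v is not None)  # newest version wins
    for n, v in responses:
        if v != latest:
            n.put(key, latest)  # repair the stale or missing copy
    return latest[1]

stale, fresh = Node({"k": (1, "old")}), Node({"k": (2, "new")})
assert read_with_repair("k", [stale, fresh]) == "new"
assert stale.get("k") == (2, "new")  # the stale replica was repaired
```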

Every node in a masterless system can handle both read and write requests, but a write is not considered successful after reaching a single node, nor is a single node's answer taken as the result of a read. Reads and writes follow a quorum principle, similar in spirit to the fault-tolerant consensus that ZooKeeper uses to process write requests.

In general, with n replicas, each write must be acknowledged by w nodes to count as successful, and each read must query r nodes. As long as w + r > n, we can expect a read to return the latest value, because at least one of the r nodes read must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes. A common configuration is to make n (the number of nodes) odd and set w = r = (n + 1) / 2 (rounded up), which guarantees that the set of nodes written to and the set of nodes read from overlap, so at least one node in any read holds the latest value.

As shown in the figure below, user 1234 sends the write request to all three database replicas and considers the write successful once two of them return success, ignoring the fact that the downed replica missed the write. When user 2345 reads, the request also goes to all replicas, and the newest of the returned values is taken as the result.

Masterless replication read and write.png
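
A minimal sketch of quorum reads and writes matching the figure, with n = 3 and w = r = 2 (values carry a version number; failure handling and real networking are omitted):

```python
N, W, R = 3, 2, 2
replicas = [dict(), dict(), None]  # the third replica is down

def quorum_write(key, versioned_value):
    acks = 0
    for rep in replicas:           # send the write to every replica
        if rep is not None:
            rep[key] = versioned_value
            acks += 1
    return acks >= W               # success iff at least w nodes acknowledged

def quorum_read(key):
    answers = [rep[key] for rep in replicas if rep is not None and key in rep]
    assert len(answers) >= R       # need at least r responses
    return max(answers)            # highest version number wins

assert quorum_write("user:1234", (7, "new value"))  # (version, value)
assert quorum_read("user:1234") == (7, "new value")
```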

Each replication mode has its trade-offs. Single-master replication is the most popular: it is easy to understand and needs no conflict resolution, since writes go through a single master node. Multi-master and masterless replication are more robust in the face of node failures and large network delays, but they offer only weaker consistency guarantees.



Author: JD Logistics Wang Yilong

Source: JD Cloud developer community, Ziqishuo Tech (my.oschina.net/u/4090830/blog/10092358)
