Distributed consensus algorithms for cross-spatial domain data management: current status, challenges and prospects


Li Youming1, Li Tong1,2, Zhang Dafang1, Dai Longchao1,2, Chai Yunpeng1,2

1 School of Information, Renmin University of China, Beijing 100872

2 Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing 100872

Abstract: With the rapid development of the digital economy, the continuous improvement of infrastructure such as the National Integrated Big Data Center and the "Eastern Data, Western Computing" project, and the general trend toward circulation of data elements, data services have gradually shifted from data management within a single spatial domain to data management across spatial domains. Cross-domain data management relies on distributed consensus algorithms to maintain data consistency. However, existing distributed consensus algorithms consider only a single data center and ignore the uncertainty of network communication between data centers, so in cross-spatial domain scenarios they suffer from long log synchronization delays, low system throughput, and other problems. This paper systematically reviews the current status of, and new challenges for, distributed consensus algorithms across spatial domains, and looks ahead to technical routes for addressing these challenges.

Keywords: Cross-spatial domain data management; distributed consensus algorithm; log replication; leader election


Paper citation format:

Li Weiming, Li Tong, Zhang Dafang, et al. Distributed consensus algorithm for cross-spatial domain data management: current situation, challenges and prospects [J]. Big Data, 2023, 9(4): 2-15.

LI Q M, LI T, ZHANG D F, et al. Distributed consensus algorithms for cross-domain data management: state-of-the-art, challenges and perspectives[J]. Big Data Research, 2023, 9(4): 2-15.


0 Preface

A distributed consensus algorithm is an algorithm used by a distributed system to keep multiple replicas on multiple nodes consistent. It allows a distributed system with a multi-replica architecture to appear to the outside world as a single logical copy, while ensuring that the system's read and write operations satisfy atomicity. As a result, the application layer can ignore the synchronization of the multiple underlying data replicas. In addition, when a small number of nodes in the distributed system fail, the distributed consensus algorithm ensures that the system as a whole can still provide correct service like a stand-alone system, guaranteeing high availability.

A distributed consensus algorithm contains two core modules: log replication and leader election. Log replication means that after a node in the system receives data, it copies the data to one or more other nodes; log replication provides data backup and high system availability. The leader is a special node elected by all nodes: a node that receives votes from a majority of nodes becomes the leader. The leader is responsible for communicating with clients and for coordinating the replication and synchronization of data replicas on the other nodes. Introducing a leader simplifies the algorithm and makes it easier to implement in engineering practice.

At the end of 2020, China officially issued the "Guiding Opinions on Accelerating the Construction of a National Integrated Big Data Center Collaborative Innovation System", which states: "By 2025, data centers nationwide will form an integrated, green, intensive and reasonably laid-out infrastructure pattern." The "National Integrated Big Data Center Collaborative Innovation System Computing Power Hub Implementation Plan" released in 2021 further emphasizes accelerating the "Eastern Data, Western Computing" project and improving cross-regional computing power scheduling. This series of major initiatives provides important infrastructure for cross-domain data sharing and collaboration in the digital world, and establishes the basic conditions for cross-domain data management. Data management is shifting from management within a single data center to cross-domain sharing and collaborative management; distributed databases will therefore increasingly be deployed in cross-spatial domain scenarios.

Unlike deployment within a single data center, deploying distributed databases across spatial domains introduces many new challenges. First, the absolute value of cross-spatial domain network delay increases, which causes heavy network overhead when the algorithm performs log replication. When the nodes of a distributed system are in the same data center, the network latency between nodes is low, and replication does not significantly affect system performance. However, when data is replicated between data centers located on different continents, the communication latency between data centers can reach tens or even hundreds of milliseconds, which is tens or even hundreds of times the communication latency within a single data center. Second, the difference in network delay between nodes across spatial domains cannot be ignored. In cross-spatial domain deployments, nodes are separated by different distances, so the network delays between different pairs of nodes also differ. Electing the leader at random, as earlier consensus algorithm designs do, may therefore make the algorithm inefficient. Finally, cross-spatial domain network delays change dynamically. Cross-spatial domain networks are not as stable as networks within a single data center: their communication delays are uncertain and subject to fluctuation. This also creates new challenges for log replication and leader election in consensus algorithms.

Given the new challenges faced by distributed consensus algorithms across spatial domains, optimization can proceed from two aspects: log replication and leader election. Log replication can be optimized with methods such as sending data as early as possible and sending data in a time-sharing manner. Leader election can be optimized with strategies such as electing the optimal leader and having the leader yield proactively. These optimizations can improve the performance and reliability of distributed consensus algorithms in cross-spatial domain environments.

1 Introduction to distributed consensus algorithm

The goal of a distributed consensus algorithm is to reach agreement among a group of distributed processes, that is, all nodes eventually agree on the same value. The process usually involves operations such as message passing, data synchronization, and node state updates. Distributed consensus algorithms are crucial to building reliable, secure and high-performance distributed systems. Under the coordination of the consensus algorithm, the distributed system can present a single-logical-copy abstraction to upper-layer applications.

Distributed consensus algorithms are divided into Byzantine and non-Byzantine algorithms. Byzantine fault-tolerant consensus algorithms must account for possible malicious behavior in the system and tolerate such behavior. There have been some studies on Byzantine fault-tolerant consensus across spatial domains. For example, Zamani et al. proposed RapidChain, a sharding-based public blockchain protocol with Byzantine fault tolerance; Amiri et al. proposed Saguaro, a permissioned blockchain system optimized for edge computing networks that reduces wide-area communication costs by exploiting the hierarchical structure of edge networks; Ziziphus, also by Amiri et al., partitions Byzantine fault-tolerant servers into several fault-tolerance domains, each of which processes transaction requests generated by nearby clients, reducing the number of cross-domain transactions.

Non-Byzantine consensus algorithms assume that nodes in the system cannot behave maliciously and that these nodes can only be operated by the owner or operator of the system. It should be noted that all consensus algorithms involved in this article are non-Byzantine distributed consensus algorithms.

Distributed consensus algorithms allow multiple nodes to agree on a value through voting. A group of independently running processes or nodes communicate and vote; once the voting result is agreed by more than half of the nodes or processes, the consensus process can be considered complete. The process must also handle issues such as node disconnection, node failure, and message retransmission. The current mainstream distributed consensus algorithms include Paxos/Multi-Paxos, ZAB, and Raft, all of which have mature industrial implementations. The following subsections introduce these algorithms from the perspectives of log replication and leader election.

1.1 Paxos/Multi-Paxos algorithm

In 1990, Lamport proposed the Paxos algorithm and gave a proof that it guarantees consistency under all circumstances. Paxos remains one of the algorithms that effectively solve distributed consensus. However, because Paxos is difficult to understand and to implement in engineering practice, Lamport et al. improved it and proposed the Multi-Paxos algorithm. Unlike Paxos, Multi-Paxos has a leader process or node, and all requests are processed and forwarded by the leader. The main components of Multi-Paxos are log replication and leader election.

1.1.1 Log replication

The log replication process of Multi-Paxos is shown in Figure 1. The client sends a request to the leader; the leader forms a proposal locally and sends it to the followers. Specifically, the leader sends an accept message to the other replicas, containing the proposal number and the proposal content. When a replica receives the proposal, if it has not already responded to a proposal with a larger proposal number, it accepts the proposal and returns a reply. After receiving success replies from more than half of the replicas, the leader executes the proposal locally and returns a success reply to the client, and then sends a commit message to the other replicas so that they also execute the proposal locally.


Figure 1 Multi-Paxos algorithm
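The leader-side flow described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the Replica class is a toy stub, and real systems exchange these messages over the network rather than through method calls.

```python
# Minimal sketch of the Multi-Paxos accept phase seen from the leader:
# send the proposal, count replies, commit once a majority has accepted.
from dataclasses import dataclass

@dataclass
class Proposal:
    number: int   # proposal number
    value: str    # proposal content (the client request)

class Replica:
    """Toy follower: accepts a proposal unless it already saw a larger number."""
    def __init__(self):
        self.max_seen = 0
        self.committed = []

    def accept(self, p: Proposal) -> bool:
        if p.number < self.max_seen:
            return False
        self.max_seen = p.number
        return True

    def commit(self, number: int):
        self.committed.append(number)

class Leader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_number = 0

    def handle_request(self, value: str) -> str:
        self.next_number += 1
        p = Proposal(self.next_number, value)
        acks = 1 + sum(r.accept(p) for r in self.replicas)   # leader counts itself
        majority = (len(self.replicas) + 1) // 2 + 1          # majority of all nodes
        if acks >= majority:
            for r in self.replicas:
                r.commit(p.number)        # tell replicas to execute the proposal
            return "OK"                   # success reply to the client
        return "RETRY"

leader = Leader([Replica() for _ in range(4)])
print(leader.handle_request("x=1"))       # -> OK
```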

1.1.2 Leader election

During the election process of the Multi-Paxos algorithm, each replica will generate a term number (term), which represents the sequence number of the current election cycle. The replica broadcasts the term number to other replicas and waits for their replies. If more than half of the replicas respond in agreement, the replica will become the new leader and start performing the corresponding tasks. At the same time, the leader election is triggered through a timeout mechanism, that is, if a replica does not receive a reply from another replica within a certain period of time, a new round of election process will be initiated.

1.2 ZAB algorithm

The ZAB algorithm is a distributed consensus algorithm proposed by Yahoo. The ZAB algorithm is similar to the Multi-Paxos algorithm and is mainly divided into two parts: log replication and leader election.

1.2.1 Log replication

Similar to Multi-Paxos, all write requests in the ZAB algorithm are processed by the leader. As shown in Figure 2, after the leader receives the client request, it encapsulates the request into a transaction (proposal) and assigns a number to the transaction. The transaction number includes the term and the transaction number within the term. The leader then sends the transaction to all replicas. After the replica receives the transaction, it writes the transaction to the local disk and then sends a reply message to the leader. If more than half of the replicas respond with successful replication, the leader commits the transaction and sends a message to all replicas to commit the transaction.


Figure 2 ZAB algorithm
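In ZooKeeper, the best-known implementation of ZAB, the transaction number described above is a 64-bit zxid whose high 32 bits hold the epoch (term) and whose low 32 bits hold the counter within that epoch, so that simple numeric comparison orders transactions. The snippet below is a sketch of that encoding, not ZooKeeper's code.

```python
# Sketch of a ZAB-style transaction number: epoch (term) in the high bits,
# per-epoch counter in the low bits, so numeric comparison orders transactions.
def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

a = make_zxid(epoch=3, counter=7)
b = make_zxid(epoch=4, counter=1)
assert b > a                      # a later epoch always wins
assert epoch_of(b) == 4 and counter_of(b) == 1
```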

1.2.2 Leader election

In a distributed system, nodes may go down at any time. When the leader goes down or the network is offline, the ZAB algorithm needs to re-elect a new leader. In the ZAB algorithm, the process has three states: leader state, follower state, and election state. Under normal circumstances, replicas send heartbeat messages to the leader. When the leader does not receive heartbeat messages from more than half of the replicas within a period of time, the leader transitions its state to the election state. When the replica discovers that the leader has entered the election state, the replica also enters the election state and converts itself into the candidate role.

The candidate sends a request voting message to other replicas. The message includes the replica process number and the last transaction number local to the replica process. After the replica receives the candidate's voting message, if the transaction number of the candidate process is larger, the replica process will vote for it. When more than half of the processes vote for the same candidate, the candidate becomes the new leader.

1.3 Raft algorithm

The Raft algorithm is a distributed consensus algorithm proposed at Stanford University in 2014 with the explicit goal of being easy to understand. Raft provides the same functionality as Paxos and maintains the consistency of data across multiple replicas even under network partitions and node failures. Since its proposal, Raft has been widely adopted in practical distributed systems such as TiDB, PolarDB, and CockroachDB. Raft assigns each node in the system one of three roles: leader, follower, or candidate.

The leader is mainly responsible for receiving client commands and forwarding the received commands to followers. Followers receive and save commands sent by the leader and respond to the leader. Candidates are transformed from followers. When there is no leader in the system, a follower will transform its role into a candidate and send voting request information to all other nodes to elect a new leader.

1.3.1 Log replication

The log replication process of the Raft algorithm is shown in Figure 3. Once a node is elected leader, it starts receiving requests from clients. The leader appends these requests as log entries to its local log and then issues remote procedure calls (RPCs) to the other servers in parallel to replicate the entries. Only after the entries have been replicated to a majority of servers does the leader apply them to its state machine and return the execution results to the client. If a follower crashes, runs slowly, or loses packets, the leader keeps retrying until all followers have eventually replicated all log entries.


Figure 3 Raft algorithm

In practice, to improve log replication performance, the Raft algorithm supports batching and pipelining. Specifically, the leader does not forward every received request immediately; instead, it combines multiple requests into a batch and sends the batch to the followers. Furthermore, the leader does not wait for the previous batch's results to return before sending the next batch, but sends multiple batches continuously. Batching is combined with the heartbeat mechanism: the leader sends log entries to the followers along with the heartbeat signal, and when it has no new logs to send, it sends empty heartbeat packets.
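The batching and pipelining idea can be sketched as follows. This is a simplified model under assumed abstractions (the Follower stub and the fire-and-forget append_entries_async call are illustrative, not part of any Raft library): requests accumulate in a buffer, are flushed as one batch on each heartbeat tick, and the next tick is not delayed by waiting for replies to the previous batch.

```python
# Sketch of leader-side batching and pipelining in a Raft-like replication loop.
from collections import deque

class Follower:
    """Toy follower that just records the batches it receives."""
    def __init__(self):
        self.received = []

    def append_entries_async(self, batch):
        self.received.append(batch)       # fire-and-forget in this sketch

class BatchingLeader:
    def __init__(self, followers):
        self.followers = followers
        self.pending = deque()            # requests not yet shipped
        self.next_index = 0

    def submit(self, command):
        self.pending.append((self.next_index, command))
        self.next_index += 1

    def heartbeat_tick(self):
        batch = list(self.pending)        # an empty batch doubles as a heartbeat
        self.pending.clear()
        for f in self.followers:
            f.append_entries_async(batch) # next tick does not wait for replies

leader = BatchingLeader([Follower(), Follower()])
leader.submit("set a 1"); leader.submit("set b 2")
leader.heartbeat_tick()                   # both commands travel in one batch
```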

1.3.2 Leader election

In the initial state of the system, all nodes are followers. Each follower has an election timer whose duration is randomly generated; it determines how long the follower waits before attempting to become leader. When a follower's timer expires, the node starts a leader election and becomes the leader after winning a majority of votes. The leader then periodically sends heartbeat signals to the followers to reset their timers and prevent them from starting new elections. The term number is a variable maintained locally by each node; its design is inspired by the terms of presidential elections. When a follower becomes a candidate and starts an election, it increments its local term number by one.
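A minimal sketch of this election trigger is shown below. It models only the randomized timeout and the term bump, not the vote counting; the 150-300 ms range is the one commonly suggested for Raft and is an illustrative choice here.

```python
# Sketch of the Raft election trigger: each follower waits a randomized
# timeout; whoever times out first becomes a candidate and bumps its term.
import random

class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.role = "follower"
        self.reset_election_timer()

    def reset_election_timer(self):
        # Randomizing in a range such as 150-300 ms makes it unlikely that
        # two followers time out at the same moment.
        self.election_timeout = random.uniform(0.150, 0.300)

    def on_heartbeat(self):
        # A heartbeat from the leader resets the countdown.
        self.reset_election_timer()

    def on_timeout(self):
        # No heartbeat arrived in time: start an election for the next term.
        self.current_term += 1
        self.role = "candidate"
        return {"type": "RequestVote", "term": self.current_term,
                "candidate": self.node_id}

n = RaftNode("node-1")
print(n.on_timeout())   # {'type': 'RequestVote', 'term': 1, 'candidate': 'node-1'}
```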

To sum up, the Multi-Paxos, ZAB, and Raft algorithms have much in common; their main components are log replication and leader election. In fact, all three are Paxos-family algorithms that impose stricter constraints and give more detailed, standardized descriptions on top of Paxos.

2 New challenges faced by distributed consensus algorithms across spatial domains

The aforementioned distributed consensus algorithms were designed with all distributed nodes in the same data center. However, the single data center architecture now faces many problems. The first is performance: as business volume grows, the infrastructure of a single data center can no longer cope with the resulting load. The second is fault tolerance: if a single data center fails due to force majeure (such as an earthquake or explosion), business and systems become unavailable, causing serious losses to a company's image and revenue. At the same time, with a single data center, users accessing from other regions experience large delays, which hurts user experience. Therefore, many companies need to deploy distributed systems across regions, and distributed consensus algorithms must also work in cross-spatial domain scenarios. In such scenarios, however, consensus algorithms face new problems and challenges. As shown in Figure 4, across cross-border links, inter-provincial links, and links inside a data center, the variation in network delay cannot be ignored. The network delay of inter-provincial links is tens of milliseconds, while the network delay of cross-border links may exceed one hundred milliseconds. At the same time, network delays differ between different pairs of nodes, and the delay between nodes changes dynamically. As shown in Figure 4, on cross-border links the network delay may suddenly rise above 200 milliseconds or suddenly drop to just over 100 milliseconds, which poses new challenges to the correctness and performance of consensus algorithms. Generally speaking, WAN data transmission is uncertain, and this uncertainty is reflected in the following three aspects.


Figure 4 Network latency comparison

2.1 The absolute value of cross-spatial domain network delay increases

In a single data center, the network latency between nodes is low, usually a few milliseconds. However, in cross-regional scenarios, the network delay between nodes increases significantly, often reaching hundreds of milliseconds or even seconds. Experimental tests show that access latency within the same machine room or the same region is usually at the millisecond level, while cross-regional access latency is an order of magnitude higher.

The increase in network latency has a huge impact on the performance of distributed consensus algorithms. Taking Raft as an example, during log replication the nodes or processes of the distributed system must perform a large number of data copy and synchronization operations with other nodes. When network latency increases, the efficiency of data replication and synchronization drops sharply, degrading the performance of the distributed system. Likewise, during Raft leader election, candidates must send requests to other nodes and collect their votes; increased network delay seriously slows leader election and can leave the system unavailable for a long time.

2.2 The difference in network delay between nodes across spatial domains cannot be ignored

In the context of cross-spatial domains, network conditions between different nodes are diverse. The network latency between some nodes that are geographically close to each other is relatively low, while there are some nodes that are geographically far away from other nodes and therefore have larger network latency.

In this scenario, for algorithms such as Multi-Paxos, ZAB, and Raft, the leader is the node that most affects system performance. When the leader is far away from the other nodes, performing log replication and obtaining responses from more than half of the nodes becomes more difficult, and algorithm performance deteriorates. However, the aforementioned distributed consensus algorithms have no mechanism to prevent this from happening.

2.3 Dynamic changes in cross-spatial domain network delay

In a single data center, the network conditions between nodes are not only low-latency, but also very stable. However, in cross-spatial domain scenarios, there is a certain degree of volatility in network conditions between nodes. Therefore, the network latency between nodes may experience extreme peaks during certain periods, and behave normally during other periods. Large fluctuations in network conditions can also have an impact on the performance of distributed systems.

For example, when a system running Multi-Paxos, ZAB, or Raft is initialized, if the network delay between the leader node and the other nodes is low, system performance will be good. Over time, however, the leader node's network condition may deteriorate and its delay to other nodes may grow large, greatly reducing system performance. Previous distributed consensus algorithms did not consider this situation, so new mechanisms are needed to handle it. Fluctuations in network conditions also significantly affect the data replication process of algorithms such as Raft. At some times the network latency is high and the network is congested, and the leader is slow to send data to the replica nodes; at other times the network is relatively idle and latency is low, and the leader sends data to the replicas quickly. This characteristic of network fluctuation can therefore be exploited to optimize the data replication process.

In summary, the absolute value of cross-spatial domain network delay increases, the difference in cross-spatial domain network delay cannot be ignored, and the dynamic change of cross-spatial domain network delay will have a serious impact on the distributed consensus algorithm. There are already some studies on consensus algorithms in cross-space domains. Section 3 of this article will introduce the research progress of existing cross-space domain consensus algorithms.

3 Research progress on cross-spatial domain distributed consensus algorithms

There are two technical routes for optimizing distributed consensus algorithms across spatial domains. One is deterministic network technology, and the other is optimizing distributed consensus algorithms to be more suitable for wide area networks.

3.1 Deterministic network technology

There is uncertainty in wide area network data transmission. This uncertainty is mainly reflected in three aspects: the increase in the absolute value of cross-spatial domain network delay, the non-negligible difference in cross-spatial domain network delay, and the dynamic change of cross-spatial domain network delay. There are already some deterministic network technologies trying to solve these problems. Deterministic networks are used to provide real-time data transmission and ensure certain communication service quality, such as ultra-low upper bound delay, jitter, and packet loss rate, controllable upper and lower bound bandwidth, and ultra-high lower bound reliability. Deterministic networks can meet high-quality communication needs.

The earliest deterministic network technology is IEEE 802.1 TSN (time-sensitive networking). TSN is a relatively mature deterministic network standard designed by IEEE at the data link layer (L2) of the OSI reference model, and the industry has launched chips, switches, and industrial terminals that support TSN. However, TSN does not extend well to wide area networks. First, in terms of overhead, TSN requires per-flow state maintenance, which may be unacceptable for WAN transmission. Second, in terms of deployment, TSN relies on accurate time synchronization; in cross-domain scenarios with large scale, long-distance transmission, and complex networking, accurate time synchronization is difficult to achieve. Therefore, TSN cannot be directly applied to WANs to achieve deterministic low latency.

TSC (time-sensitive communication) is a 5G-related deterministic network technology introduced by 3GPP in the R16 standard released in July 2020. TSC extends the application scope of TSN from wired to wireless networks. Specifically, TSC integrates the 5G system into the TSN system as a TSN bridge; through network slicing, deterministic forwarding, TSN management collaboration, and network topology discovery, it assists TSN in providing deterministic transmission services in business scenarios where fixed-network coverage is difficult or where mobility is required. As an extension of TSN, TSC likewise cannot be directly applied to WANs.

DetNet (deterministic networking) extends deterministic network technology to the network layer (L3) of the OSI reference model. Through technologies such as resource allocation, service protection, and explicit routing, it achieves deterministic packet forwarding and routing, providing a technical foundation for deterministic cross-domain transmission. DetNet is suitable for networks under single administrative control or within closed administrative control groups, such as campus networks and private WANs. For public wide area networks, DetNet faces the same problems of high overhead and difficult deployment as TSN.

New IP is a deterministic network technology proposed by Huawei, which provides a preliminary framework for deterministic low latency. New IP not only includes the deterministic IP technology of L3, but also provides a preliminary definition of the new transport layer (L4) technology based on deterministic IP. For deterministic IP technology, New IP introduces an asynchronous periodic scheduling mechanism to strictly avoid the existence of micro-bursts, thereby ensuring deterministic low-latency data forwarding capabilities. But on the one hand, New IP requires the redesign of network intermediate nodes, which makes large-scale deployment more difficult; on the other hand, New IP is only a preliminary basic network architecture, and there are still a lot of technical details that need to be completed by the industry.

To sum up, TSN is only applicable to local area networks, TSC is likewise only applicable to wireless local area networks, DetNet is only applicable to campus networks, and New IP is relatively difficult to deploy. Therefore, existing deterministic network technologies are difficult to apply to WANs, and there is still a long way to go before deterministic WAN data transmission is achieved. By contrast, optimizing cross-spatial domain distributed consensus algorithms is more practical and deployable. The following subsections introduce the current optimization progress of cross-spatial domain distributed consensus protocols and give some optimization ideas.

3.2 Cross-space domain distributed consensus algorithm optimization

Existing cross-spatial domain distributed consensus algorithms (such as CURP [17] and EPaxos [18]) focus on the high-network-latency problem described above, attempting to reduce the number of communication rounds for data replication in order to improve system performance. Meanwhile, algorithms represented by Raft-Plus [19] optimize leader election according to the system's network and hardware environment, selecting nodes with better networks and better performance as leaders to improve system performance. DPaxos [20] allocates the data shards required by users to the data center closest to where user requests are issued, thereby reducing cross-regional data access delays.

Both CURP and EPaxos are based on the assumption that "most operations are commutative". CURP adds the role of witnesses, which allows the algorithm to reduce the number of communication rounds. CURP replicates all commutative operations to the witnesses, and the witnesses only guarantee the durability of the data without ordering it. In CURP, the client replicates each operation to one or more witnesses while sending the request to the leader. The leader can execute the operation and reply to the client without waiting for the data to be replicated to the other followers. This allows a data operation to complete within one communication round, improving system performance.
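The one-round fast path can be sketched as follows. This is a simplified illustration, not CURP's actual code: the Witness and CurpLeader classes are toy stands-ins, and the real protocol contacts the witnesses and the leader in parallel and uses a conflict test rather than accepting every record.

```python
# Sketch of a CURP-style fast path: record the operation on witnesses and send
# it to the leader in the same round; any witness conflict forces the ordered path.
class Witness:
    def __init__(self):
        self.saved = []
    def record(self, op) -> bool:
        # Real witnesses reject operations that conflict with saved ones;
        # this toy witness accepts everything and only guarantees durability.
        self.saved.append(op)
        return True

class CurpLeader:
    def execute_speculatively(self, op):
        return f"speculative result of {op}"      # reply without follower acks
    def execute_with_full_replication(self, op):
        return f"ordered result of {op}"          # normal, fully ordered path

def curp_execute(leader, witnesses, op):
    witness_ok = all(w.record(op) for w in witnesses)  # parallel in the real protocol
    result = leader.execute_speculatively(op)
    if witness_ok:
        return result                                  # fast path: one round trip
    return leader.execute_with_full_replication(op)    # slow path on conflict

print(curp_execute(CurpLeader(), [Witness(), Witness()], "put k v"))
```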

Unlike CURP, EPaxos is a leaderless distributed consensus algorithm. All replicas can accept requests from clients and need only one round of communication to commit a request, so the client can send its request to a nearby replica. To prevent proposals put forward by different replicas from conflicting with each other, EPaxos organizes the log as a two-dimensional matrix in which each replica appends to its own row (a one-dimensional array); every replica maintains such a matrix. When operations do not conflict, EPaxos needs only one round of inter-replica communication; otherwise it needs two.
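The two-dimensional log can be pictured as one row per replica; before committing on the fast path, a replica checks whether the new command interferes with commands already recorded in any row. The sketch below uses a hypothetical key-overlap rule as the interference test, which is not how a production EPaxos determines conflicts in general.

```python
# Sketch of EPaxos-style per-replica log rows with a simple interference check.
class EPaxosLog:
    def __init__(self, replica_ids):
        # One row (list of commands) per replica: the "two-dimensional" log.
        self.rows = {rid: [] for rid in replica_ids}

    def interferes(self, cmd):
        # Hypothetical conflict rule: two commands interfere if they touch the same key.
        return [c for row in self.rows.values() for c in row
                if c["key"] == cmd["key"]]

    def propose(self, replica_id, cmd):
        deps = self.interferes(cmd)
        self.rows[replica_id].append(cmd)
        # No conflicts: one communication round suffices (fast path);
        # otherwise a second round is needed to agree on dependencies.
        if not deps:
            return {"path": "fast"}
        return {"path": "slow", "deps": deps}

log = EPaxosLog(["r1", "r2", "r3"])
print(log.propose("r1", {"key": "x", "op": "set", "val": 1}))   # fast path
print(log.propose("r2", {"key": "x", "op": "set", "val": 2}))   # slow path with deps
```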

During a Raft-Plus election, a follower does not vote directly for the first candidate that arrives, but instead collects candidates' requests over a period of time. The follower measures the network speed to these candidates and then votes for the candidate with the lowest network latency. In addition, a candidate carries some parameters when sending its request: the follower favors the node with the strongest processing power, the one elected leader most often, and the one that received the most client requests in the previous term. Raft-Plus also introduces a disapproval mechanism: if a follower finds that the leader's network delay exceeds a threshold, it sends a negative vote to the leader, and when the leader receives negative votes from more than half of the nodes, it switches back to the follower role.

DPaxos (dynamic Paxos) is a Paxos-based consensus protocol for edge computing systems, targeting latency-sensitive application scenarios (such as AR/VR). The protocol dynamically allocates the data fragments required by a user to the data center closest to where the user's requests are issued, thereby reducing cross-regional data access latency. DPaxos proposes zone-centric quorums that keep the replication quorum small and close to users. At the same time, DPaxos allows quorums to expand, so that both the replication quorum and the leader election quorum can grow dynamically and can be enlarged quickly when conflicts arise. These improvements enable DPaxos to manage data better and achieve significant performance improvements in real deployments.

At present, optimization of wide area networks is still imperfect, and problems such as high network latency, significant differences in network latency, and dynamic changes in network latency remain difficult to solve fundamentally. Existing cross-spatial domain distributed consensus algorithms such as CURP and EPaxos rely heavily on the commutativity assumption and lack generality. Meanwhile, Raft-Plus's optimization of Raft's leader election has not been concretely analyzed and demonstrated, and DPaxos targets edge computing systems and likewise lacks generality. Therefore, although these algorithms provide some solutions to cross-spatial domain consensus problems, more general and effective optimization ideas are still needed to improve the efficiency and reliability of cross-spatial domain distributed consensus algorithms.

4 Prospects for Research on Cross-Spatial Domain Distributed Consensus Algorithms

As mentioned before, existing cross-spatial domain distributed consensus algorithms cannot well solve problems such as high network latency, large network latency differences between nodes, and network fluctuations. This article gives some suggestions and ideas for the optimization of distributed consensus algorithms across spatial domains from the aspects of log replication and leader election of distributed consensus algorithms.

As shown in Figure 5, optimization of the cross-spatial domain consensus algorithm starts from two aspects: log replication and leader election. Log replication can be optimized through mechanisms such as sending data as early as possible and sending data in a time-sharing manner; leader election can be optimized through mechanisms such as electing the optimal leader and having the leader yield proactively.


Figure 5 Cross-spatial domain consensus algorithm optimization

4.1 Log replication optimization

In cross-spatial domain scenarios, the high cross-domain network latency makes the log replication module the part of a distributed consensus algorithm that is affected most. There are two main ideas for shortening log replication time: one is to reduce the number of communication rounds between the leader and the followers; the other is to reduce the cross-spatial domain network delay experienced by each transmission. This article proposes sending data as early as possible and sending data in a time-sharing manner to shorten the data request response time.

4.1.1 Send data as early as possible

One idea to shorten log replication time is to send data as early as possible. Most consensus algorithms use a heartbeat mechanism to trigger the sending of data, where the heartbeat time is a fixed system parameter. However, in cross-spatial domain scenarios, the network delays between the leader and other nodes vary greatly. Therefore, the leader can set a different heartbeat interval for each node.

In a cross-spatial domain consensus algorithm, the leader can use a smaller heartbeat interval for nodes with higher network latency, so that when the leader has data it can push the data to remote nodes as quickly as possible, thereby shortening the time of cross-spatial domain data requests. This method can effectively shorten the log replication time and improve the performance and reliability of the system.
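One way to realize this idea is to scale each follower's heartbeat (and thus data-push) interval inversely with its measured latency, so that far-away replicas are contacted more eagerly. The sketch below is only illustrative; the base interval, floor, and scaling constant are assumed values, not parameters from any existing system.

```python
# Sketch: assign shorter heartbeat intervals to higher-latency followers so
# new log entries are pushed to remote data centers as early as possible.
BASE_INTERVAL = 0.100   # 100 ms baseline heartbeat (assumed)
MIN_INTERVAL  = 0.010   # never go below 10 ms (assumed)

def heartbeat_interval(rtt_seconds: float) -> float:
    # The farther the node (larger RTT), the smaller its heartbeat interval,
    # so pending entries spend less time waiting in the leader's buffer.
    scale = 1.0 / (1.0 + rtt_seconds / 0.050)   # roughly halve the interval at 50 ms RTT
    return max(MIN_INTERVAL, BASE_INTERVAL * scale)

print(heartbeat_interval(0.002))   # nearby node: close to 100 ms
print(heartbeat_interval(0.200))   # cross-border node: about 20 ms
```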

4.1.2 Sending data in a time-sharing manner

Because of the volatility of cross-spatial domain networks, network delays vary across time periods, so sending data from the leader to the replicas may become a bottleneck for system performance. To solve this problem, a peak-shaving and valley-filling strategy can be adopted. During network congestion, the leader sends data only to a majority of nodes chosen from those with the lowest network latency; during idle periods, it sends the latest snapshot to the lagging nodes that were skipped, allowing them to catch up quickly to the leader's latest state. Compared with previous algorithms that always send data to all replicas, this greatly reduces the pressure on the leader during congested periods and reduces the amount of data sent.
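A sketch of this scheduling decision is given below. The follower records are plain dicts and the congestion flag is assumed to be supplied by some external monitor; neither the data layout nor the thresholds come from an existing implementation.

```python
# Sketch of time-shared ("peak shaving, valley filling") replication scheduling:
# during congestion, replicate only to the fastest majority; when the WAN is
# idle, also ship a snapshot to lagging replicas so they catch up.
def plan_replication(followers, congested: bool):
    by_latency = sorted(followers, key=lambda f: f["rtt_ms"])
    majority = len(followers) // 2 + 1        # together with the leader this forms a quorum

    if congested:
        # Peak shaving: push new entries only to the fast majority.
        return {"append_to": [f["id"] for f in by_latency[:majority]],
                "snapshot_to": []}
    # Valley filling: everyone gets new entries; lagging nodes also get a snapshot.
    return {"append_to": [f["id"] for f in by_latency],
            "snapshot_to": [f["id"] for f in by_latency if f["lag"] > 0]}

followers = [{"id": "bj",  "rtt_ms": 2,   "lag": 0},
             {"id": "gz",  "rtt_ms": 35,  "lag": 0},
             {"id": "fra", "rtt_ms": 180, "lag": 120},
             {"id": "nyc", "rtt_ms": 220, "lag": 300}]
print(plan_replication(followers, congested=True))    # fast majority only
print(plan_replication(followers, congested=False))   # full fan-out plus snapshots
```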

4.2 Leader election optimization

Existing distributed consensus algorithms such as Multi-Paxos, Raft, and CURP are designed so that every node in the system has the same probability of being elected leader. However, in cross-spatial domain scenarios, nodes may have different software, hardware, and network conditions, so leader election should take these differences into account.

4.2.1 Elect the optimal leader

In the cross-spatial domain scenario, network conditions vary depending on the geographical location between nodes. Therefore, there are some nodes in the system that have short communication times with most nodes, while other nodes have high communication delays with other nodes in the system due to their geographical distance. In consensus algorithms, the network delay between the leader and other nodes is a key factor affecting system performance. In order to improve system performance, the consensus algorithm should design some mechanisms to make it easier for nodes with shorter communication time with other nodes to be elected as leaders.

In a distributed system, nodes can communicate in pairs to measure network latency. The median of the network delay between a node and other nodes can represent the network condition of the node. The smaller the median, the better the node's network condition. Therefore, a method can be adopted to allow nodes with better network conditions to initiate elections first and make them the leaders, thus improving the performance of the system.
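One possible realization of this idea is to let each node stagger its election timeout by the median of its measured peer RTTs, so that better-connected nodes tend to start an election first. The base delay, weight, and jitter below are illustrative assumptions.

```python
# Sketch: use the median peer RTT to decide how soon a node may start an election.
# Nodes with a smaller median latency wait less and therefore tend to win.
import random
import statistics

def election_delay(peer_rtts_ms, base_ms=150, jitter_ms=50, weight=2.0):
    median_rtt = statistics.median(peer_rtts_ms)
    # Better-connected node (small median) => shorter delay => elected earlier.
    return base_ms + weight * median_rtt + random.uniform(0, jitter_ms)

well_connected   = election_delay([5, 8, 12, 40, 60])        # median 12 ms
poorly_connected = election_delay([80, 95, 120, 150, 200])   # median 120 ms
print(well_connected, poorly_connected)    # the first is usually much smaller
```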

4.2.2 Leaders take the initiative to give in

Because of fluctuations and changes in network latency, a node initially selected as the leader may see its network conditions deteriorate over time and become unsuitable to remain leader; alternatively, the system may contain a node whose network conditions are significantly better than those of the current leader. Therefore, dynamic switching of the leader node is needed to maintain high system performance. The leader can monitor the network latency between each node and the other nodes. When the leader detects that a follower's network conditions are better than its own, for example when that node's delay to most nodes is small and much smaller than the current leader's delay to most nodes, the leader designates that node to initiate an election for a new leader. In addition, leader switching has a time cost, so a trade-off must be made between switching frequency and switching cost so that the leader does not switch too frequently.
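A sketch of such a yield check is given below: the leader compares its own median latency against each follower's and only hands over when the gap exceeds a margin and the last handover was long enough ago, so that leadership does not bounce during transient fluctuations. The margin and cool-down values are assumed parameters, not measured ones.

```python
# Sketch of the "leader yields proactively" check, with a cool-down to keep
# leadership from switching too frequently under short-lived fluctuations.
import statistics
import time

MARGIN_MS  = 20.0     # a follower must beat the leader by at least this much (assumed)
COOLDOWN_S = 300.0    # minimum time between voluntary handovers (assumed)

def maybe_yield(leader_rtts_ms, follower_rtts_ms, last_handover_ts):
    if time.time() - last_handover_ts < COOLDOWN_S:
        return None                                   # switched too recently
    leader_median = statistics.median(leader_rtts_ms)
    for node, rtts in follower_rtts_ms.items():
        if statistics.median(rtts) + MARGIN_MS < leader_median:
            return node                               # designate this node to run for leader
    return None

better = maybe_yield([40, 55, 60, 180],
                     {"B": [10, 12, 15, 20], "C": [90, 100, 120, 150]},
                     last_handover_ts=0.0)
print(better)    # -> "B": the leader asks B to initiate a new election
```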

To present the above optimization ideas more clearly, Table 1 summarizes and analyzes them. The optimization ideas still proceed from two directions: log replication and leader election. In the log replication direction, sending data as early as possible shortens cross-spatial domain request times, but requires monitoring the state of the cluster in real time and choosing appropriate heartbeat intervals; sending data in a time-sharing manner reduces the pressure on the leader during congested periods, but requires reasonably accurate prediction of network conditions. In the leader election direction, both electing the optimal leader and having the leader yield proactively allow the node with the best network to be elected leader and improve algorithm performance; however, the election and switching process incurs extra overhead, and the timing of elections and switches needs further exploration.

Table 1 Analysis of optimization ideas for the cross-spatial domain consensus algorithm

5 Conclusion

The absolute value of network delay increases, the difference in network delay between nodes cannot be ignored, and the dynamic change of network delay makes the design of cross-space domain distributed consensus protocols face new challenges. Focusing on the two core modules of log replication and leader election in the distributed consensus protocol, this paper proposes a design idea for a cross-spatial domain distributed consensus algorithm, which can provide a reference for research in the field of cross-spatial domain data management.

About the Author

Li Weiming (1999-), male, is a master's student at the School of Information, Renmin University of China. His main research direction is distributed consensus protocols.

Li Tong (1989-), male, Ph.D., associate professor at the School of Information, Renmin University of China. His main research directions are new generation Internet architecture, cross-domain data management and big data.

Zhang Dafang (1998-), male, is a master's student at the School of Information, Renmin University of China. His main research direction is distributed consensus protocols.

Dai Longchao (1996-), male, is a master's student at the School of Information, Renmin University of China. His main research directions are cross-domain data management and big data.

Chai Yunpeng (1983-), male, Ph.D., professor and doctoral supervisor at the School of Information, Renmin University of China, deputy director of the Department of Science and Engineering, and director of the Department of Computer Science and Technology, Renmin University of China. His main research directions are database management systems, storage systems, cloud computing.

