[Distributed]-Raft paper study

Preface

This article is a study of the Raft paper. The parts of the original paper that evaluate Paxos and experimentally evaluate Raft's properties are skipped. The highlighted text and quotations in this article are my personal understanding, or relevant content I have read elsewhere, added as supplements.

Summary

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos and is about as efficient, but its structure is different, which makes it easier to understand and provides a better foundation for building practical systems.

To improve understandability, Raft separates the key elements of consensus, including leader election, log replication, membership changes, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered.

Raft also provides a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety during the change.

Introduction

A consensus algorithm allows a collection of machines to work as a coherent group that can survive the failure of some of its members: even if some machines fail, the group as a whole keeps operating correctly. Because of this, consensus algorithms play a key role in building reliable large-scale software systems.

For a long time, Paxos dominated almost every discussion of consensus algorithms, but it is notoriously difficult to understand, so we set out to find a new, easier-to-understand consensus algorithm that could provide a better basis for system building. Our primary goal was unusual: understandability. Could we define a consensus algorithm for practical systems that is significantly easier to learn than Paxos? More importantly, we wanted the algorithm to help system builders develop intuitions, so that they understand not only that the algorithm works, but also why it works.

The result is the Raft consensus algorithm. To improve understandability, we applied two important ideas: decomposition (dividing the whole algorithm into separate parts such as leader election and log replication) and state-space reduction (reducing the number of ways servers can be inconsistent with each other, i.e. the amount of soft state).

Raft is similar in many ways to existing consensus algorithms, but it has some unique characteristics:

  • Strong leader : log entries flow only from the leader to the other servers, which simplifies management of the replicated log;
  • Leader election mechanism : Raft uses randomized timers to elect leaders; a follower starts an election when its timer expires;
  • Membership changes : Raft uses a new joint consensus approach for changing the set of servers in the cluster, in which the majorities of the old and new configurations overlap during the transition, allowing the cluster to continue operating normally while the configuration changes.

Raft is simpler and easier to understand than other consensus algorithms, and it is fully sufficient to meet the needs of a practical system.

Replicated state machines

In the replicated state machine approach, state machines on a collection of servers compute identical copies of the same state, so that even if some of the servers are down, the system as a whole can continue to operate. Replicated state machines are used to solve a variety of fault-tolerance problems in distributed systems.

Replicated state machines are usually implemented with a replicated log. Each server stores a log containing a series of commands, and its state machine executes them in order. As long as every state machine executes the same commands, they evolve through the same states and produce the same sequence of outputs.

Keeping the replicated log consistent is the job of the consensus algorithm : the consensus module on a server receives commands from clients, appends them to its log, and communicates with the consensus modules on the other servers so that every log eventually contains the same commands in the same order, even if some servers fail.

Once commands are properly replicated, each server's state machine processes them in log order, and the outputs are returned to clients. As a result, the cluster appears to the outside world as a single, highly reliable state machine.
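To make this concrete, here is a minimal Go sketch of a replicated state machine: a toy key-value store that applies commands from its log in order. The types and names (Command, KVStateMachine) are my own illustration, not part of the paper.

```go
package raftnotes

// Command is an illustrative state-machine command for a toy key-value store.
type Command struct {
	Key   string
	Value string
}

// KVStateMachine is a toy state machine; every replica that applies the
// same commands in the same order ends up with the same Data map.
type KVStateMachine struct {
	Data map[string]string
}

// Apply executes one committed command and returns its output.
func (sm *KVStateMachine) Apply(cmd Command) string {
	if sm.Data == nil {
		sm.Data = make(map[string]string)
	}
	sm.Data[cmd.Key] = cmd.Value
	return cmd.Value
}

// ApplyLog replays a prefix of the replicated log in index order.
func (sm *KVStateMachine) ApplyLog(log []Command) {
	for _, cmd := range log {
		sm.Apply(cmd)
	}
}
```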

For a consensus algorithm to be applied in a real system, it needs to have the following properties:

  • Safety : an incorrect result is never returned, even under network delays, partitions, or packet loss;
  • Availability : the system remains available as long as a majority of the servers are operational and can communicate with each other and with clients. For example, a cluster of five servers can tolerate the failure of any two; failed servers may later recover from state on stable storage and rejoin the cluster;
  • Timing independence : the consistency of the logs must not depend on timing, and agreement need not be reached within a bounded time, because faulty clocks and extreme message delays can, at worst, only cause availability problems;
  • Performance : in the common case, a command completes as soon as a majority of the cluster has responded to a single round of RPCs; a minority of slow servers should not affect overall system performance.

Designing for understandability

On one hand, we believe Paxos is too difficult to understand to provide a good foundation for education or for building practical systems; on the other hand, consensus algorithms are essential in large software systems. So we tried to design an alternative consensus algorithm with better properties than Paxos. The result is the Raft algorithm.

We have several goals for Raft:

  1. It must provide a complete and practical foundation for system building, reducing the amount of extra design work required of developers;
  2. It must be safe under all conditions and available under typical operating conditions;
  3. It must be efficient for common operations;
  4. Most important and most difficult, it must be understandable: not only should most people be able to follow the whole algorithm, but they should develop enough intuition about it that system builders can make the extensions that are inevitably needed in real implementations.

Consensus Algorithm Raft

Raft first elects a leader, which is responsible for managing the replicated log: it accepts log entries from clients and replicates them to the other servers. The leader mechanism simplifies log management. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a single direction, from the leader to the other servers. When a leader fails or becomes disconnected, a new leader is elected.

Through the leader mechanism, Raft decomposes the consistency problem into three relatively independent sub-problems:

  1. Leader election : a new leader must be chosen when the existing leader fails;
  2. Log replication : the leader accepts log entries from clients and replicates them across the cluster, forcing the other logs to agree with its own;
  3. Safety : the key safety property in Raft is the State Machine Safety Property: if a server has applied a log entry at a given index to its state machine, no other server may apply a different command at that index.

The basics of Raft

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any time, each server is in exactly one of three states: leader , follower , or candidate . In normal operation there is exactly one leader and all other servers are followers.
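As a rough sketch of the per-server bookkeeping this implies, the struct below groups the state listed in Figure 2 of the paper (currentTerm, votedFor, the log, commitIndex, and so on); the Go layout and field names are my own assumption.

```go
// Role is one of the three server states.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// LogEntry holds one command together with the term in which the leader received it.
// In the sketches below, the entry at Raft index i is stored at log[i-1].
type LogEntry struct {
	Term    int
	Command []byte
}

// Server groups the state described in the paper's Figure 2.
type Server struct {
	id int // this server's ID

	// Persistent state (written to stable storage before responding to RPCs).
	currentTerm int
	votedFor    int // candidate ID voted for in currentTerm, or -1 if none
	log         []LogEntry

	// Volatile state on all servers.
	commitIndex int
	lastApplied int

	// Volatile state on leaders (reinitialized after election).
	nextIndex  []int
	matchIndex []int

	role Role
}
```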

Followers are passive. They do not issue requests themselves, but only respond to requests from the leader and candidate.

The leader handles all client requests. If a client contacts a follower, the follower redirects the request to the leader.

Raft divides time into terms of arbitrary length, numbered with consecutive integers. Each term begins with an election , not with the appearance of a leader: during the election phase, one or more candidates attempt to become leader, and the candidate that wins serves as leader for the rest of the term.

In some cases the election result is split. The term then ends with no leader, and a new election, and with it a new term, begins shortly afterwards.

Raft guarantees that there will be at most one leader in a given term.

Different servers may observe transitions between terms at different times, and sometimes a server may not observe an election, or even an entire term.

A term acts as a logical clock in Raft: it allows servers to detect obsolete information, such as a stale leader. Each server stores the current term number, which increases monotonically over time.

Current term numbers are exchanged whenever servers communicate. If one server's term is smaller than another's, it updates its term to the larger value; if a candidate or leader discovers that its term is out of date, it immediately reverts to the follower state; and if a server receives a request carrying a stale term number, it rejects the request.

Raft servers communicate using RPCs . The basic consensus algorithm requires only two RPC types: RequestVote RPCs , initiated by candidates during elections, and AppendEntries RPCs , initiated by the leader to replicate log entries and to serve as a heartbeat.

A third type of RPC, used to transfer snapshots between servers, is described later.

Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
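The two basic RPC message shapes might look like the following sketch; the fields mirror those listed in the paper, but the Go struct form and names (RequestVoteArgs and so on) are assumptions of mine. It reuses the LogEntry type from the earlier sketch.

```go
// RequestVoteArgs is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // receiver's currentTerm, so the candidate can update itself
	VoteGranted bool
}

// AppendEntriesArgs is sent by the leader to replicate entries and as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}
```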

Leader election

Raft uses a heartbeat mechanism to trigger leader election.

Servers start as followers. A server remains a follower as long as it receives valid RPCs from a leader or candidate. The leader sends periodic heartbeats, AppendEntries RPCs that carry no log entries, to all followers in order to maintain its authority.

If a follower receives no communication over a period of time called the election timeout , it assumes there is no viable leader and begins an election to choose a new one.

To begin an election, a follower increments its current term and transitions to the candidate state. It then votes for itself and issues RequestVote RPCs in parallel to the other servers in the cluster.
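Continuing the sketch above, starting an election could look roughly like this; sendRequestVote is a placeholder for the actual RPC transport, and a real implementation would issue the RPCs in parallel rather than in a loop.

```go
// lastLogTerm returns the term of the last entry, or 0 if the log is empty.
func (s *Server) lastLogTerm() int {
	if len(s.log) == 0 {
		return 0
	}
	return s.log[len(s.log)-1].Term
}

// startElection is a simplified sketch of the follower-to-candidate transition.
// It assumes the caller holds any lock protecting the Server fields.
func (s *Server) startElection(peers []int, sendRequestVote func(peer int, args RequestVoteArgs) RequestVoteReply) {
	s.currentTerm++    // increment current term
	s.role = Candidate // transition to candidate state
	s.votedFor = s.id  // vote for self
	votes := 1

	args := RequestVoteArgs{
		Term:         s.currentTerm,
		CandidateID:  s.id,
		LastLogIndex: len(s.log), // 1-based index of the last entry (0 when empty)
		LastLogTerm:  s.lastLogTerm(),
	}
	for _, p := range peers {
		reply := sendRequestVote(p, args) // done in parallel in a real implementation
		if reply.VoteGranted {
			votes++
		} else if reply.Term > s.currentTerm {
			s.currentTerm = reply.Term // observed a newer term: step down
			s.role = Follower
			return
		}
	}
	if votes > (len(peers)+1)/2 { // majority of the full cluster
		s.role = Leader
	}
}
```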

A candidate will maintain its own candidate status until the following three situations occur:

  1. It wins the election, in which case it becomes leader;
  2. Another server establishes itself as leader, in which case it returns to the follower state;
  3. A period of time passes with no winner, in which case it times out and starts a new election.

A candidate wins the election if it receives votes from a majority of the servers in the full cluster for the same term . Each server votes for at most one candidate in a given term, on a first-come-first-served basis .

The majority rule ensures that at most one candidate can win the election for a particular term; this is the Election Safety Property.

Once a candidate wins the election, it becomes the leader and will send heartbeat messages to other servers to establish its identity and prevent new elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader's term in the RPC is at least as large as the candidate's current term, the candidate recognizes the leader as legitimate and returns to the follower state. If the term in the RPC is smaller than the candidate's current term, the candidate rejects the RPC and remains a candidate.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes may be split so that no candidate obtains a majority . When this happens, each candidate times out, increments its term, and starts a new election with another round of RequestVote RPCs. Without additional measures, however, split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes occur rarely and can be processed quickly.

Each server chooses its election timeout at random from a fixed interval, for example 150~300ms. This spreads out the servers' timeouts so that in most cases only a single server times out first; it then has a chance to win the election and send heartbeats before any other server times out.

Each candidate also restarts its randomized election timeout at the start of an election and waits for that timeout to elapse before starting the next election. This reduces the likelihood of another split vote in the new election.
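A small, self-contained sketch of drawing a randomized election timeout from the 150~300ms range used as an example above; the helper names are mine.

```go
package raftnotes

import (
	"math/rand"
	"time"
)

// randomElectionTimeout returns a fresh timeout drawn uniformly from
// [150ms, 300ms), so that in most cases only one server times out first.
func randomElectionTimeout() time.Duration {
	const min = 150 * time.Millisecond
	const max = 300 * time.Millisecond
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

// resetElectionTimer is how a follower or candidate would typically rearm
// its timer after receiving a valid RPC or when starting a new election.
func resetElectionTimer(t *time.Timer) {
	if !t.Stop() {
		select {
		case <-t.C: // drain a timer that already fired
		default:
		}
	}
	t.Reset(randomElectionTimeout())
}
```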

Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines . The leader appends the command to its log as a new entry and then issues AppendEntries RPCs in parallel to the other servers so that they replicate the entry.

When the entry has been safely replicated, the leader applies it to its state machine and returns the result to the client. If followers crash, run slowly, or packets are lost, the leader retries AppendEntries RPCs indefinitely until all followers eventually store all log entries, even after it has already responded to the client; applying an entry therefore does not require that every follower has stored it, only that it has been safely replicated.

Each log entry stores a state machine command together with the term number of the leader when it received the entry. These term numbers are used to detect inconsistencies between logs.

Each log entry also carries an integer index , identifying its position in the log. Indexes are unique, but different entries may share the same term number.

The leader decides when it is safe to apply a log entry to the state machines; such an entry, one that is safe to apply, is called committed . Raft guarantees that committed entries are durable and will eventually be executed by all available state machines.

A log entry is committed once the leader that created it has replicated it on a majority of the servers; this also commits all preceding entries in the leader's log.

The leader keeps track of the highest index it knows to be committed and includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually learn it. Once a follower learns that a log entry is committed, it applies the entry to its local state machine.

Raft maintains the Log Matching Property, which has two parts: if two entries in different logs have the same index and term, then they store the same command; and if two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given index in a given term, and log entries never change their position in the log;

The second property is guaranteed by a simple consistency check performed by AppendEntries RPCs: when sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry with that index and term in its own log, it rejects the new entries. Whenever AppendEntries returns successfully, the leader therefore knows that the follower's log is identical to its own up through the new entries.
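A sketch of the follower-side consistency check just described, continuing the earlier types and treating Raft index i as stored at log[i-1]; term comparison and persistence are omitted for brevity.

```go
// handleAppendEntries shows only the log-consistency part of the follower's
// AppendEntries handling; term checks and persistence are left out.
func (s *Server) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	// Reject if the log has no entry at prevLogIndex whose term matches prevLogTerm.
	if args.PrevLogIndex > 0 {
		if len(s.log) < args.PrevLogIndex ||
			s.log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
			return AppendEntriesReply{Term: s.currentTerm, Success: false}
		}
	}
	// Delete conflicting entries and append any entries not already in the log.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if len(s.log) >= idx && s.log[idx-1].Term != e.Term {
			s.log = s.log[:idx-1] // drop the conflicting suffix
		}
		if len(s.log) < idx {
			s.log = append(s.log, e)
		}
	}
	// Advance commitIndex based on the leader's commit index.
	if args.LeaderCommit > s.commitIndex {
		last := args.PrevLogIndex + len(args.Entries)
		if args.LeaderCommit < last {
			s.commitIndex = args.LeaderCommit
		} else {
			s.commitIndex = last
		}
	}
	return AppendEntriesReply{Term: s.currentTerm, Success: true}
}
```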

During normal operation the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent: the old leader may not have fully replicated all of the entries in its log before crashing. With a series of leader and follower crashes, these inconsistencies can compound: a follower may be missing entries that the leader has, it may have extra entries that the leader does not have, or both. Missing and extraneous entries may span multiple terms.

In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own, which means that conflicting entries in follower logs are overwritten with entries from the leader's log.

To bring a follower's log into agreement with its own, the leader must find the latest index at which the two logs agree, that is, the highest index such that the entries at that index and at all earlier indexes are identical in both logs . It then deletes any entries in the follower's log after that index and sends the follower all of its own entries after that point.

These actions all happen in response to the consistency check described above. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a server first becomes leader, it initializes all nextIndex values to the index just after the last entry in its log.

If a follower's log is inconsistent with the leader's, the AppendEntries consistency check fails. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex reaches a point where the leader's and follower's logs match; the AppendEntries RPC then succeeds, which removes any conflicting entries in the follower's log and appends entries from the leader's log. Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it remains that way for the rest of the term.
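And the matching leader-side sketch: back up nextIndex on each rejection and retry until the consistency check passes. sendAppendEntries is again a placeholder, and the loop is deliberately simplified (synchronous, no batching or term handling).

```go
// replicateTo keeps retrying AppendEntries to one follower, decrementing
// nextIndex on each rejection until the logs match.
func (s *Server) replicateTo(peer int, sendAppendEntries func(peer int, args AppendEntriesArgs) AppendEntriesReply) {
	for {
		next := s.nextIndex[peer]
		args := AppendEntriesArgs{
			Term:         s.currentTerm,
			LeaderID:     s.id,
			PrevLogIndex: next - 1,
			Entries:      s.log[next-1:], // everything from nextIndex onward
			LeaderCommit: s.commitIndex,
		}
		if args.PrevLogIndex > 0 {
			args.PrevLogTerm = s.log[args.PrevLogIndex-1].Term
		}
		reply := sendAppendEntries(peer, args)
		if reply.Success {
			s.nextIndex[peer] = next + len(args.Entries)
			s.matchIndex[peer] = s.nextIndex[peer] - 1
			return
		}
		s.nextIndex[peer]-- // back up and retry the consistency check
	}
}
```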

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs: when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term . With this information, the leader can decrement nextIndex to skip all of the conflicting entries in that term, so one AppendEntries RPC is required per term with conflicting entries rather than one per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

With this mechanism, a newly elected leader does not need to take any special action to restore log consistency. It just begins normal operation, and the logs automatically converge as the AppendEntries consistency check fails and then succeeds.

A leader never overwrites or deletes entries in its own log; it only appends. This is the Leader Append-Only Property.

This log replication mechanism exhibits the desirable consensus properties: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated to a majority of the cluster with a single round of RPCs; and a single slow follower does not affect overall performance.

Safety

The previous sections described how Raft elects leaders and replicates log entries, but the mechanisms discussed so far are not sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries; it could later be elected leader and overwrite those committed entries with new ones, and as a result different state machines might execute different command sequences.

This section completes Raft by adding a restriction on which servers may be elected leader. The restriction ensures that the leader of any given term contains all entries committed in previous terms; this is the Leader Completeness Property. Given this restriction, the rules for commitment can also be made more precise.

Election restriction

In any leader-based consensus algorithm, the leader must eventually store all committed log entries. Raft uses a simple approach to guarantee that every leader already holds all entries committed in previous terms from the moment of its election, without needing to have those entries shipped to it . This means that log entries only flow in one direction, from leader to followers, and that leaders never overwrite existing entries in their logs.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must be supported by a majority of the cluster in order to be elected, and every committed entry must be present on at least one server in that majority, since an entry is only committed once it is stored on a majority of servers. If the candidate's log is at least as up-to-date as every log in that majority, then it is also at least as up-to-date as the logs that hold every committed entry, so it holds all committed entries itself.

The RequestVote RPC implements this restriction: the RPC includes information about the candidate's log, and the voter denies its vote if its own log is more up-to-date than the candidate's.

Raft determines which of two logs is more up-to-date by comparing the index and term of their last entries. If the last entries have different terms, the log with the larger term is more up-to-date; if the terms are the same, the longer log is more up-to-date.
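The "at least as up-to-date" comparison is small enough to show directly; this sketch reuses the LogEntry type from above, and the function name is my own.

```go
// logIsAtLeastAsUpToDate reports whether a candidate whose last entry has
// (candLastTerm, candLastIndex) is at least as up-to-date as the voter's log.
func logIsAtLeastAsUpToDate(candLastTerm, candLastIndex int, voterLog []LogEntry) bool {
	voterLastIndex := len(voterLog)
	voterLastTerm := 0
	if voterLastIndex > 0 {
		voterLastTerm = voterLog[voterLastIndex-1].Term
	}
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm // higher last term wins
	}
	return candLastIndex >= voterLastIndex // same term: longer log wins
}
```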

Committing entries from previous terms

As described earlier, a leader knows that an entry from its current term is committed once it is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating it.

However, a leader cannot immediately conclude that an entry from a previous term is committed merely because it is stored on a majority of servers: an old entry that is stored on a majority of servers can still be overwritten by a future leader that never received it. To eliminate problems like this, Raft never commits entries from previous terms by counting replicas; only entries from the leader's current term are committed by counting replicas. Once an entry from the current term has been committed in this way, the Log Matching Property ensures that all earlier entries are committed indirectly.

This still provides a useful form of robustness. Suppose an entry is replicated to a majority of servers and should be committed, but the leader crashes before committing it. A subsequent leader will still hold that entry, because by the election rule a server can only become leader if its log is at least as up-to-date as a majority of the cluster. The new leader does not have to do anything special for the uncommitted entry; it simply operates normally. When it replicates its next entry to the other servers, the Log Matching Property and the consistency check ensure that servers missing the old entry also receive it, so replication of the old entry continues. When a new entry from the current term is committed, every entry up to and including it is committed as well, so the uncommitted entry is not lost. In this sense, whether an entry is ultimately committed still comes down to whether it is replicated to a majority of servers.
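A sketch of the resulting commit rule on the leader: advance commitIndex only to entries from the current term that a majority stores, which then commits earlier entries indirectly. The helper name and the clusterSize parameter are my own.

```go
// maybeAdvanceCommitIndex finds the highest index N > commitIndex such that
// a majority of servers store the entry at N and log[N].Term == currentTerm.
func (s *Server) maybeAdvanceCommitIndex(clusterSize int) {
	for n := len(s.log); n > s.commitIndex; n-- {
		if s.log[n-1].Term != s.currentTerm {
			break // never commit entries from previous terms by counting replicas
		}
		count := 1 // the leader itself stores the entry
		for _, m := range s.matchIndex { // matchIndex tracks each follower
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			s.commitIndex = n
			return
		}
	}
}
```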

There are situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. This approach makes log entries easier to reason about, since they keep the same term number over time and across logs.

Safety argument

We can now argue more precisely that the Leader Completeness Property holds. We assume that it does not hold and derive a contradiction.

Suppose the leader of term T (leaderT) commits an entry in its term, but that entry is not stored by the leader of some later term. Let U be the smallest term greater than T whose leader (leaderU) does not store the entry; then the leaders of all terms between T and U do store it.

  1. The committed entry must have been absent from leaderU's log at the time of its election (leaders never delete or overwrite entries).
  2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Therefore, at least one server (the "voter") both accepted the entry from leaderT and voted for leaderU. This voter is the key to the contradiction.
  3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected leaderT's AppendEntries request.
  4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry, leaders never remove entries, and followers only remove entries when they conflict with the leader.
  5. The voter granted its vote to leaderU, so leaderU's log must have been at least as up-to-date as the voter's. This leads to one of two contradictions.
  6. First, if the voter and leaderU shared the same last log term, then leaderU's log must have been at least as long as the voter's, so it contained every entry in the voter's log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.
  7. Otherwise, leaderU's last log term must have been larger than the voter's. Moreover, it was larger than T, since the voter's last log term was at least T. The earlier leader that created leaderU's last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU's log must also contain the committed entry, which is again a contradiction.
  8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
  9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly.

Given the Leader Completeness Property, we can prove the State Machine Safety Property: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

When a server applies a log entry to its state machine, its log must be identical to the leader's log up through that entry, and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders of all higher terms store that same entry, so servers that apply the index in later terms apply the same value. Therefore, the State Machine Safety Property holds.

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers apply exactly the same set of log entries to their state machines in the same order.

Follower and candidate crashes

So far we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and both are handled in the same way.

If a follower or candidate crashes, future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, the RPC will complete successfully.

If a server crashes after completing an RPC but before responding, it will receive the same RPC again after it restarts. Raft RPCs are idempotent , so this causes no harm. For example, if a follower receives an AppendEntries request containing entries already present in its log, it ignores those entries and accepts only the ones it is missing.

Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results merely because some event happens faster or slower than expected.

However, availability (the ability of the system to respond to clients in a timely manner) inevitably depends on timing. For example, if message exchanges take longer than the typical time between server failures, candidates will not stay up long enough to win an election: a candidate may fail before its RequestVote RPCs even return. Without a stable leader, Raft cannot make progress.

Leader election is the aspect of Raft where timing matters most. Raft can elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime << electionTimeout << MTBF

In this inequality, broadcastTime refers to the average time it takes for a server to send RPCs to each server in the cluster in parallel and receive their replies;

electionTimeout is the time that a follower does not receive the leader's heartbeat before trying to initiate an election;

MTBF is a single server's average time between failures (mean time between failures).

The broadcast time should be an order of magnitude less than the election timeout so that the leader can reliably send the necessary heartbeat messages to prevent followers from starting elections. Taking into account the randomization method used in the election timeout, this inequality also makes vote splitting unlikely to occur.

The election timeout should be an order of magnitude less than the MTBF, so that the system can provide services stably without frequent leader election and replacement.

When the leader goes down, the system will be unavailable for approximately the duration of the election timeout, which we hope will only be a small fraction of the total time.

Broadcast time and MTBF are properties of the underlying system, which we cannot easily change, while the election timeout is something we must choose. Raft's RPCs require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms depending on the storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

Cluster membership changes

Up to now we have assumed that the cluster configuration , the set of servers participating in the consensus algorithm, is fixed. In practice it is occasionally necessary to change the configuration, for example to replace failed servers or to change the degree of replication.

Although this can be done by taking the entire cluster offline, updating configuration files, and restarting the cluster, that would leave the cluster unavailable during the changeover, and any manual steps risk operator error.

To avoid these problems, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

For the configuration change mechanism to be safe, there must be no point during the transition at which two leaders can be elected for the same term.

Unfortunately, any approach that switches servers directly from the old configuration to the new one is unsafe: it is impossible to switch all of the servers atomically, so the cluster could potentially split into two independent majorities during the transition.

To ensure safety, configuration changes must use a two-phase approach. There are many ways to implement the two phases.

For example, some systems use the first phase to disable the old configuration so that it cannot process client requests, and the second phase to enable the new configuration.

In Raft, the cluster first switches to a transitional configuration that we call joint consensus ; once the joint consensus configuration has been committed, the system transitions to the new configuration. The joint consensus combines the old and new configurations:

  • Log entries are replicated to all servers in both configurations;
  • Any server from either configuration may serve as leader;
  • Agreement (for elections and for entry commitment) requires separate majorities from both the old and the new configurations.

Joint consensus allows individual servers to transition between configurations at different times without compromising safety, and it allows the cluster to continue serving client requests throughout the configuration change.
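A sketch of how the quorum rule changes while the joint configuration is in effect: agreement requires independent majorities from both the old and the new configuration. The Config type and helper names are illustrative.

```go
// Config lists the voting members of one configuration.
type Config struct {
	Members []int
}

// hasMajority reports whether the set of acknowledging servers covers a majority of cfg.
func hasMajority(cfg Config, acks map[int]bool) bool {
	count := 0
	for _, id := range cfg.Members {
		if acks[id] {
			count++
		}
	}
	return count > len(cfg.Members)/2
}

// jointQuorum is the rule used while the joint configuration is in effect:
// both the old and the new configuration must independently grant a majority.
func jointQuorum(oldCfg, newCfg Config, acks map[int]bool) bool {
	return hasMajority(oldCfg, acks) && hasMajority(newCfg, acks)
}
```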

Cluster configurations are stored and communicated using special entries in the replicated log.

When the leader receives a request to change the configuration, it stores the joint consensus configuration as a log entry and replicates it as usual. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions; a server always uses the latest configuration in its log, regardless of whether the corresponding entry has been committed. This means that the leader uses the joint consensus rules to determine when the joint consensus entry itself is committed.

If the leader crashes, a new leader will be elected under either the old configuration or the joint consensus, depending on whether the winning candidate has received the joint consensus entry. In either case, servers that appear only in the new configuration cannot make unilateral decisions during this period.

Once the joint consensus has been committed, neither the old nor the new configuration can make decisions without the other's agreement, and the Leader Completeness Property ensures that only servers whose logs contain the joint consensus entry can be elected leader.

At this point it is safe for the leader to create an entry describing the new configuration and replicate it to the cluster. Again, this configuration takes effect on each server as soon as it is seen.

When the new configuration is committed, the old configuration is irrelevant, and servers not in the new configuration can be shut down.

At no point can the old and new configurations both make unilateral decisions, which guarantees safety.

There are three remaining issues with reconfiguration:

The first is that new servers may not initially store any log entries. If they join the cluster in this state, it can take a while for them to catch up, and during that time the cluster may be unable to commit new entries.

To avoid this availability gap, Raft introduces an additional phase before the configuration change, during which new servers join the cluster as non-voting members : the leader replicates log entries to them, but they are not counted toward majorities. Once the new servers have caught up with the rest of the cluster, the reconfiguration proceeds as described above.

The second issue is that the cluster leader may not be part of the new configuration. In that case the leader steps down, returning to the follower state, once it has committed the new configuration entry. This means that there is a period of time (while it is committing the new configuration entry) in which the leader is managing a cluster that does not include itself: it replicates entries but does not count itself in majorities.

The leader hands over at the point when the new configuration is committed, because that is the first point at which the new configuration can operate on its own (it will always be possible to elect a leader from the new configuration). Before this point, it may be that only a server from the old configuration can be elected leader.

The third issue is that removed servers, those not in the new configuration, can disrupt the cluster. These servers no longer receive heartbeats, so they time out and start elections, sending RequestVote RPCs with new term numbers, which causes the current leader to revert to the follower state.

A new leader will eventually be elected in the new configuration, but the removed servers will time out again and the process repeats, reducing availability.

To prevent this, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.

This does not affect normal elections, where every server waits at least a minimum election timeout before starting an election, so RequestVote RPCs arriving within that window can safely be ignored. But it does avoid disruption from removed servers: as long as a leader can get heartbeats to its cluster, it will not be deposed by larger term numbers.

Single-server membership change is another way to alter the cluster: the cluster changes by only one server at a time. For example, to grow a 3-server cluster into a 5-server cluster, two changes are performed, first to a 4-server cluster and then to a 5-server cluster. Changing one server at a time guarantees that any majority of the old configuration and any majority of the new configuration overlap, so two leaders cannot be elected for the same term and the configuration can be changed safely. With this approach there is no need for a transitional joint consensus configuration.

Log compaction

Raft's log grows as it absorbs more client requests during normal operation, but in a practical system it cannot be allowed to grow without bound. As the log grows longer, it occupies more space and takes more time to replay. Without some mechanism to discard obsolete information that accumulates in the log, this eventually causes availability problems.

Snapshotting is the simplest approach to compaction: the entire current system state is written to a snapshot on stable storage, and the log up to that point is discarded. Snapshotting is used in both Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.

Incremental approaches to compaction, such as log cleaning and log-structured merge trees (LSM trees), are also possible. These operate on a fraction of the data at a time, spreading the load of compaction more evenly over time.

They first select a region of data that has accumulated many deleted or overwritten objects, rewrite the live objects from that region more compactly, and then free the region. This requires significant additional mechanism and complexity compared with snapshotting, which simplifies the problem by always operating on the entire data set.

In Raft's snapshotting mechanism, each server takes snapshots independently, covering just the committed entries in its own log. Most of the work is done by the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot:
the last included index , the index of the last entry the snapshot replaces (i.e. the last entry the state machine has applied); and the
last included term , the term of that entry.

This metadata is preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that check needs the previous log index and term. To support cluster membership changes, the snapshot also includes the latest configuration in the log . Once a server has written a snapshot, it may delete all log entries up through the last included index, as well as any earlier snapshots.
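A sketch of the snapshot metadata and the InstallSnapshot message shape described here; the field names follow the paper's description, but the struct layout is my own, and the chunking fields of the paper's full RPC are omitted.

```go
// Snapshot carries the state machine image plus the metadata Raft needs to
// keep the AppendEntries consistency check working after log truncation.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot replaces
	LastIncludedTerm  int    // term of that entry
	Configuration     []int  // server IDs in the latest configuration included in the snapshot
	Data              []byte // serialized state machine state
}

// InstallSnapshotArgs is what the leader sends to a follower that is too far
// behind for ordinary log replication.
type InstallSnapshotArgs struct {
	Term     int
	LeaderID int
	Snapshot Snapshot
}

type InstallSnapshotReply struct {
	Term int
}
```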

Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry it needs to send to a follower. Fortunately this is unlikely in normal operation: a follower that has kept up with the leader will already have that entry. An exceptionally slow follower or a new server joining the cluster would not, however, so the way to bring such a follower up to date is for the leader to send it a snapshot over the network.

The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are far behind.

When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries.

Usually the snapshot contains new information not already in the follower's log. In this case the follower discards its entire log, which is entirely superseded by the snapshot and may even contain uncommitted entries that conflict with it. If instead the snapshot describes only a prefix of the follower's log, the entries covered by the snapshot are deleted, but the entries following the snapshot are still valid and are retained.

This snapshotting approach departs from Raft's strong-leader principle, since followers can take snapshots without the leader's knowledge. We think this departure is justified.

While a leader helps avoid conflicting decisions when reaching consensus, consensus has already been reached by the time a snapshot is taken, so no decisions conflict. Data still flows only from leaders to followers; followers merely reorganize the data they already have.

We considered an alternative leader-based approach in which only the leader creates snapshots and then sends them to its followers. This has two disadvantages:

First, sending a snapshot to each follower wastes network bandwidth and slows the snapshotting process. Each follower already has the information needed to produce its own snapshot, and it is typically much cheaper for a server to produce a snapshot from its local state than to send and receive one over the network. The main purpose of snapshotting is to compact data, not to establish consistency.

Second, the leader's implementation would be more complex: for example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.

Two further issues affect snapshotting performance. First, each server must decide when to snapshot. If it snapshots too often, it wastes disk bandwidth and resources; if it snapshots too rarely, it risks exhausting its storage capacity and increases the time required to replay the log on restart. A simple strategy is to take a snapshot when the log reaches a fixed size in bytes; if that size is set much larger than the expected size of a snapshot, the disk bandwidth overhead of snapshotting is small.

The second performance issue is that writing a snapshot can take a long time, and we do not want it to delay normal operations. The solution is to use copy-on-write techniques, so that new updates can be accepted without affecting the snapshot being written; the operating system's copy-on-write support can be used for this.

Client interaction

This section describes how the client interacts with Raft, including how the client finds the leader of the cluster and how Raft supports linearizable semantics .

Linearizable semantics, informally, means linearizability (strong consistency): once an operation completes, its effect is immediately visible to everyone, and the operation appears to execute exactly once.

These problems apply to all consensus-based systems, and Raft's solution is similar to other systems.

Clients in Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If that server is not the leader, it rejects the client's request and supplies information about the most recent leader it has heard from, since AppendEntries requests include the leader's network address. If the leader crashes, client requests time out, and clients then retry with randomly chosen servers.

Our goal for Raft is linearizable semantics: each operation appears to execute exactly once, at some point between its invocation and its response. However, as described so far, Raft could execute a command multiple times: for example, if the leader crashes after committing a log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time.

The solution is for clients to assign a unique serial number to every command . The state machine then tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
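A sketch of that deduplication inside the state machine: it remembers, per client, the highest serial number executed and the cached reply. Client IDs, serial numbers, and the key-value commands here are illustrative assumptions.

```go
// session records the most recent command a client has had executed.
type session struct {
	lastSerial int
	lastReply  string
}

// DedupStateMachine wraps a toy key-value state machine with client sessions.
type DedupStateMachine struct {
	data     map[string]string
	sessions map[int64]session // keyed by client ID
}

// Apply executes a command at most once per (clientID, serial) pair.
func (sm *DedupStateMachine) Apply(clientID int64, serial int, key, value string) string {
	if sm.data == nil {
		sm.data = make(map[string]string)
		sm.sessions = make(map[int64]session)
	}
	if s, ok := sm.sessions[clientID]; ok && serial <= s.lastSerial {
		return s.lastReply // duplicate: reply without re-executing
	}
	sm.data[key] = value
	sm.sessions[clientID] = session{lastSerial: serial, lastReply: value}
	return value
}
```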

Read-only operations can be handled without writing anything to the log. Without additional measures, however, this would risk returning stale data, since the leader responding to the request might have been superseded by a newer leader it is unaware of. Linearizable reads must not return stale data, so Raft needs two extra precautions to guarantee this without using the log.

First, a leader must have the latest information about which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term it may not know which those are. To find out, it needs to commit an entry from its own term.

Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term; once the no-op commits, the leader has up-to-date information about which entries are committed.

Second, a leader must check whether it has been deposed before processing a read-only request, because its information may be stale if a more recent leader has been elected.

Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease, but this would depend on timing for safety (it assumes bounded clock skew).
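Putting the two precautions together, serving a read-only request on the leader might look roughly like this; the three callbacks stand in for the mechanisms described above (the committed no-op, the heartbeat round, and the local state machine read) and are assumptions of mine, not the paper's API.

```go
package raftnotes

import "errors"

// LinearizableRead sketches how a leader can serve a read-only request
// without appending it to the log.
func LinearizableRead(
	hasCommittedInCurrentTerm func() bool, // true once this term's no-op entry is committed
	stillLeader func() bool, // exchanges heartbeats with a majority to confirm leadership
	readState func() (string, error), // reads from the local state machine
) (string, error) {
	// 1. The leader must know the latest commit point: it must have committed
	//    at least one entry (the no-op) in its own term.
	if !hasCommittedInCurrentTerm() {
		return "", errors.New("leader has not yet committed an entry in its term")
	}
	// 2. The leader must confirm it has not been deposed before serving the read.
	if !stillLeader() {
		return "", errors.New("leadership not confirmed; retry with the current leader")
	}
	return readState()
}
```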

Origin blog.csdn.net/Pacifica_/article/details/127863287