A detailed, illustrated explanation of Raft

References:

  1. Raft lecture (Raft user study) - YouTube

  2. Raft PDF

  3. Detailed explanation of Raft algorithm - Zhihu (zhihu.com)

Today, let's walk through the Raft protocol in detail.

Raft is a protocol for solving the consensus problem. So what is consensus?

In a distributed system, consensus means that multiple nodes agree on some piece of state, and a consensus algorithm is what keeps the system consistent: for example, agreeing on the order in which events occurred, the value associated with a certain key, or who the leader is.


So how to achieve consensus?

There are two main approaches today. The first is symmetric and leaderless: all servers are equal, any of them can accept and replicate log entries, and the client may interact with any server.

The second is asymmetric and leader-based: one server in the cluster is responsible for overall management, the other servers simply accept its decisions passively, and the client interacts directly with the leader.

Raft is leader-based. It decomposes the problem into two parts: normal operation while a leader is in place, and electing a new leader when the current one crashes. The advantage is that the system stays simple and efficient during normal operation, because there is never a need to resolve conflicts between competing leaders.


The overall goal of Raft is to replicate a log across the servers of a cluster and then apply that log to state machines. Suppose you have a program or application that you want to run reliably; one way is to run it on a group of machines and make sure they all execute it in exactly the same way. This is the idea of a replicated state machine.

The log can help ensure that these state machines execute the same commands in the same order.

If a client wants to execute a command such as z = 6, it sends the command to the consensus module of one of these servers. That server stores the command in its local log and also forwards it to the other servers, which store it locally as well. Once the command has been safely replicated into the logs, it is passed to the state machines for execution, and the result is returned to the client. So as long as the logs are the same, the state machines on these different machines execute the same commands in the same order and produce the same results.

The task of the consensus module, then, is to manage these logs: make sure they are replicated correctly and decide when it is safe to pass a command to the state machine for execution. It is called consensus because not every node has to be up at any given moment; only a majority of nodes is required.
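
To make the replicated-state-machine idea concrete, here is a minimal, self-contained Go sketch (all type and function names are illustrative, not from any real Raft implementation): a tiny key-value state machine that applies committed log entries in order, so every server that applies the same log ends up in the same state.

```go
package main

import "fmt"

// Command is one client operation, e.g. "z = 6".
type Command struct {
	Key   string
	Value int
}

// LogEntry pairs a command with the term in which it was created.
type LogEntry struct {
	Term    int
	Command Command
}

// StateMachine is a trivial key-value store. Every server applies the
// same committed entries in the same order, so all copies converge.
type StateMachine struct {
	kv map[string]int
}

// Apply executes one committed entry and returns the result for the client.
func (s *StateMachine) Apply(e LogEntry) int {
	s.kv[e.Command.Key] = e.Command.Value
	return s.kv[e.Command.Key]
}

func main() {
	sm := &StateMachine{kv: make(map[string]int)}
	log := []LogEntry{{Term: 1, Command: Command{Key: "z", Value: 6}}}
	for _, e := range log { // entries are applied strictly in log order
		fmt.Println(sm.Apply(e)) // prints 6 on every server
	}
}
```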

[Figure: replicated state machines driven by a shared log]


Raft is introduced here in six parts.

The first is leader election: how one of the servers is chosen as leader, and how a new leader is chosen after a crash.

The second is how the system operates normally once there is a leader, for example accepting client requests and replicating the log.

The third is leader changes, which are the key to keeping the system correct. We will first discuss what safety means in Raft and how it is guaranteed, and then how the new leader resolves log inconsistencies.

The fourth covers another aspect of leader changes: how to handle an old leader that never actually died and later returns to the cluster.

The fifth discusses how clients interact with the cluster, how a client handles server crashes, and how Raft ensures that each client command is executed only once.

Finally, we discuss changes in system membership, such as adding or removing servers.

[Figure: the six parts of Raft]


Before getting into Raft itself, let's first look at server states. At any time, a server is in exactly one of the following three states (a small sketch follows the list):

  • Leader: manages the whole cluster, handles client interaction and drives log replication
  • Follower: completely passive; it issues no RPCs of its own and only responds to the RPCs it receives
  • Candidate: an intermediate state between the two, used to elect a leader
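
As a minimal sketch, part of a hypothetical `raft` package that the remaining examples build on (names are illustrative, not from any real library):

```go
package raft

// ServerState: a server is always in exactly one of these three states.
type ServerState int

const (
	Follower  ServerState = iota // passive: only answers RPCs
	Candidate                    // trying to win an election
	Leader                       // handles clients and drives log replication
)
```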

When the system is running normally there is exactly 1 leader and N-1 followers. The figure below shows the conditions for transitions between the three states; I won't go into the details here, since they come up naturally in the rest of the discussion.

[Figure: state transitions between follower, candidate and leader]


Time is divided into terms, and each term has a number that increases monotonically. A term has two phases. At the beginning of the term there is an election, which chooses a leader for that term. If the election succeeds, the term enters its second phase: normal operation, serving requests. Each term has at most one leader, and some terms have no leader at all; this usually happens when a split vote leaves no candidate with a majority, in which case the system simply retries and moves on to a new term. Each server maintains a current term value, which acts as a logical clock for the whole sequence and is useful for identifying outdated messages.
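
Continuing the hypothetical sketch, each server could keep roughly this state; `currentTerm` plays the role of the logical clock used to recognize stale messages. `Peer`, `RequestVoteArgs`, `AppendEntriesArgs` and `LogEntry` are assumed types that appear in the later sketches.

```go
// Peer is an assumed RPC stub for another server in the cluster.
type Peer interface {
	RequestVote(args RequestVoteArgs) bool     // true if the vote is granted
	AppendEntries(args AppendEntriesArgs) bool // true if the entries were accepted
}

type Server struct {
	id    int
	state ServerState
	peers []Peer // all other servers in the cluster

	// Persistent state, written to stable storage before answering RPCs.
	currentTerm int        // latest term this server has seen, starts at 0
	votedFor    int        // candidate voted for in currentTerm, -1 if none
	log         []LogEntry // the replicated log (sketched below)

	// Volatile state.
	commitIndex int // highest log index known to be committed
}
```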

[Figure: time divided into terms, each beginning with an election]


The following figure essentially summarizes the entire Raft protocol.

[Figure: summary of the Raft protocol]


OK, now let's look at the first of the six parts: leader election. Raft must ensure that at most one machine acts as the real leader at any time. When a server starts up it is a follower. In this state it issues no RPCs of its own and only replies to the RPCs it receives. To remain a follower, it must be convinced that a leader currently exists, and the way it is convinced is by receiving messages from a candidate or a leader. Therefore, to maintain its authority, the leader must keep communicating with the other servers; when there is nothing specific to send, it does so with heartbeats. If a follower receives no RPC for some period of time, it concludes that there is no leader and starts an election to see whether it should become the leader itself. That waiting period is the election timeout, usually somewhere between 100 and 500 ms. At start-up the whole cluster has no leader, so every server waits out its election timeout and then starts an election.

When a server starts an election, the first thing it does is increment its current term, entering a new term whose first order of business is this election, and change itself from follower to candidate. A candidate's job is to try to become leader, which means winning a majority of the votes. The candidate first votes for itself and then sends RequestVote RPCs to the other servers. Three outcomes are possible. First, the candidate receives a majority of the votes: it becomes leader and immediately sends heartbeats to the other servers. Second, the candidate receives an RPC from a valid leader: it steps down and becomes a follower again. Third, nobody wins, for instance because two followers started elections at the same time and the vote split so that no one obtained a majority; if no leader has been elected when the election timeout expires again, the candidate increments its term once more and starts a new election.
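
A sketch of the candidate side, continuing the hypothetical package (RPCs are sent sequentially and there is no locking or timeout handling, purely to show the flow; `sendHeartbeats` is an assumed helper):

```go
// RequestVoteArgs is what a candidate sends when asking for a vote.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of candidate's last log entry (1-based, 0 if empty)
	LastLogTerm  int // term of candidate's last log entry
}

// startElection runs when the election timer fires.
func (s *Server) startElection() {
	s.state = Candidate
	s.currentTerm++   // enter a new term
	s.votedFor = s.id // the first vote goes to ourselves
	votes := 1

	lastIndex := len(s.log)
	lastTerm := 0
	if lastIndex > 0 {
		lastTerm = s.log[lastIndex-1].Term
	}

	for _, peer := range s.peers {
		granted := peer.RequestVote(RequestVoteArgs{
			Term:         s.currentTerm,
			CandidateID:  s.id,
			LastLogIndex: lastIndex,
			LastLogTerm:  lastTerm,
		})
		if granted {
			votes++
		}
	}
	if votes > (len(s.peers)+1)/2 { // majority of the whole cluster
		s.state = Leader
		s.sendHeartbeats() // assumed helper: empty AppendEntries to everyone
	}
}
```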


[Figure: election example with three nodes]

Here is an election demo. There are three nodes at the start. After a random election timeout, node C's timer fires first, so it initiates an election: its term goes from 0 to 1, it votes for itself, and it sends RequestVote RPCs to nodes A and B. On receiving them, A and B update their current term, record whom they voted for, and reply to node C. Node C receives more than half of the votes and is elected leader, and then sends heartbeats. After receiving a heartbeat, nodes A and B reset their timers, record the new leader, and acknowledge node C.


The election process has to satisfy two properties: safety and liveness. Safety means that each term elects at most one leader. Raft achieves this by allowing each server to cast only one vote per term, which ensures that two different candidates can never both win a majority in the same term. The second property is liveness: someone must eventually win the election. With repeated split votes it is in principle possible that a leader is never elected. Raft's answer is randomized election timeouts: after a split vote, the next timeout is drawn from the interval [T, 2T]. Randomizing the timeout makes it unlikely that two servers wake up at the same moment; whichever wakes up first has enough time to send requests to the other servers and complete the election, especially when T is much larger than the broadcast time.
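
A randomized election timeout could be drawn like this (the base value is illustrative; the only real requirement is that it is well above the broadcast time):

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout returns a value in [T, 2T), so that two servers
// rarely time out at the same moment after a split vote.
func randomElectionTimeout() time.Duration {
	const T = 150 * time.Millisecond // base timeout, must be >> broadcast time
	return T + time.Duration(rand.Int63n(int64(T)))
}
```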


The second part of Raft is how the leader replicates the log during normal operation. First look at the log structure. Each server keeps its own log. A log is a sequence of log entries, addressed by index. Each entry is a pair (term, command): the command is what the client asked for, and the term is the term of the leader when the entry was first created. The log lives on stable storage such as disk, so whenever a server changes its log the change must be persisted. An entry like entry 7 in the figure, which is stored on a majority of the servers, is considered committed. Once an entry is committed it can safely be passed to the state machine for execution: Raft guarantees the entry is durable, and before long it will be executed by the state machine on every server. In fact this definition of committed is not quite safe enough; it will be adjusted slightly later, when we discuss keeping logs consistent.
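
Continuing the sketch, a log entry and the (provisional) commit rule could look like this; the rule is refined later when leader changes are discussed.

```go
// Command is a client operation, e.g. "z=6" (kept as an opaque string here).
type Command string

// LogEntry pairs the term in which an entry was created with the command.
// Entries are addressed by 1-based index within the log.
type LogEntry struct {
	Term    int
	Command Command
}

// committed reports whether an entry stored on `replicas` servers out of
// `clusterSize` is safe to treat as committed (provisional definition).
func committed(replicas, clusterSize int) bool {
	return replicas > clusterSize/2
}
```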

[Figure: log structure, with entries indexed and tagged with terms]

Normal operation is very simple: the client sends a request to the leader, asking for a command to be executed on all the state machines. The leader first appends the command to its own log and then sends AppendEntries RPCs to the followers. Once it has responses from a majority it considers the entry committed, applies the command to its state machine, and returns the result to the client. The leader also tells the followers which entries are committed, and each follower applies them to its own state machine once it learns they are committed. If a follower has crashed or is very slow, so that the leader gets no reply, the leader simply resends the RPC after a timeout. Of course, the leader does not need a reply from every server, only from a majority of nodes, which keeps overall performance high: one slow server does not slow the whole system down.
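
A sketch of the leader's path for one client command, continuing the hypothetical package. `apply` is an assumed helper that feeds one entry to the local state machine, and `replicateTo` is the per-follower replication routine sketched in the log-repair section below.

```go
// handleClientCommand is the leader's normal-operation path.
func (s *Server) handleClientCommand(cmd Command) int {
	// 1. Append the command to the leader's own log.
	s.log = append(s.log, LogEntry{Term: s.currentTerm, Command: cmd})
	newIndex := len(s.log) // 1-based index of the new entry

	// 2. Replicate it by sending AppendEntries RPCs to the followers.
	acks := 1 // the leader itself already stores the entry
	for _, peer := range s.peers {
		if s.replicateTo(peer) {
			acks++
		}
	}

	// 3. Once a majority stores the entry it is committed...
	if acks > (len(s.peers)+1)/2 {
		s.commitIndex = newIndex
	}

	// 4. ...so the leader applies it and answers the client. Followers
	// learn the new commitIndex in later AppendEntries RPCs and apply
	// the entry to their own state machines.
	return s.apply(s.log[newIndex-1]) // assumed helper
}
```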


Raft maintains a high degree of consistency between logs, and this page lists some properties that hold at all times. First, the combination of index and term uniquely identifies a log entry: if two entries on different servers have the same index and term, they store the same command. Moreover, all of the entries before them are also identical, so going one step further, an (index, term) pair uniquely identifies an entire log prefix. Second, if an entry is committed, then all the entries before it are also committed: for example, if entry 5 is committed it is stored on a majority of servers, and by the first property so is everything before it.

[Figure: log matching properties]

So how are these properties guaranteed? They are enforced by the AppendEntries consistency check. When the leader sends an AppendEntries RPC to a follower, in addition to the new log entry it includes two values: the index and term of the entry that immediately precedes the new one. When the follower receives the RPC, it accepts the new entry only if its own log matches at that position. Let's look at an example. The leader has just received z<-0 and sends an RPC to the followers containing the new entry together with (term, index) = (2, 4); each follower checks its own log accordingly. In the figure, the follower on top matches and accepts, while the one below does not match and rejects.

This consistency check is very important. By a simple induction argument, a new entry can only be appended when the preceding entry matches, and that entry in turn was only appended when its own predecessor matched, and so on. So if a follower accepts a log entry from the leader, the follower's log is an exact match of the leader's log from the very beginning up to and including that entry.
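
The follower side of the consistency check might look like this (continuing the sketch; updating the term when a newer one is seen is shown separately in part four):

```go
// AppendEntriesArgs is what the leader sends to a follower.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	PrevLogIndex int        // index of the entry immediately before Entries
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // new entries to append (empty for a heartbeat)
	LeaderCommit int        // leader's commitIndex
}

// handleAppendEntries applies the consistency check on the follower.
func (s *Server) handleAppendEntries(args AppendEntriesArgs) bool {
	if args.Term < s.currentTerm {
		return false // stale leader: reject
	}
	// Consistency check: do we hold the preceding entry with the same term?
	if args.PrevLogIndex > 0 {
		if len(s.log) < args.PrevLogIndex ||
			s.log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
			return false // mismatch: the leader will retry with a smaller index
		}
	}
	// Matched: drop any conflicting suffix and append the new entries.
	s.log = append(s.log[:args.PrevLogIndex], args.Entries...)
	if args.LeaderCommit > s.commitIndex {
		s.commitIndex = args.LeaderCommit
	}
	return true
}
```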

[Figure: AppendEntries consistency check example]

That concludes normal operation. Let's move on to leader changes.


Leader changes: when a new leader is elected, the logs it faces may not be tidy, because the previous leader may have crashed before finishing replication. Raft does not run a special clean-up step to deal with this; the new leader simply starts normal operation, and log repair happens as part of it. Why no special measures? Because when the new leader takes over, some crashed servers may take a long time to recover, so even an explicit clean-up could not synchronize every log immediately. The normal log replication that Raft runs must therefore be designed so that it achieves eventual consistency on its own. Raft's starting point is that the leader's log is always the most up to date and complete; what the leader has to do is bring the followers' logs into agreement with its own. Of course, during this process the new leader may itself crash, and this can repeat several times, leaving quite a few messy log entries behind.

As mentioned above, term and index uniquely identify a log entry, so the figure shows only these two values rather than the full entries. In the figure below, servers 4 and 5 were elected leader for terms 2, 3 and 4, but for some reason never replicated their entries and then crashed; then servers 4 and 5 were partitioned from servers 1-3 for a while, servers 1-3 elected leaders for terms 5, 6 and 7, replicated some more entries, and the result is a messy set of logs. The key point is that only the circled portion, log entries 1, 2 and 3, matters: only these are committed and must be preserved. The others have neither been applied to a state machine nor had their results returned to clients, so they are not important. However server 2 ends up as the term 7 leader, it will eventually make the other servers' logs identical to its own. Before discussing how to repair these messy logs, let's talk about safety first: how can we guarantee that, with logs in such a state, we discard superfluous entries without losing any important information, so that the system keeps running correctly? That is the safety question.

[Figure: messy logs left behind by repeated leader crashes]

Safety: every system that replicates a log must guarantee one thing: once a state machine has received and applied a log entry, no other state machine may apply a different value for that entry. Given the Raft replication process described so far, this requirement is equivalent to saying that any two committed entries at the same index must be the same. To achieve this, Raft enforces a safety property: if a leader has decided that a log entry is committed, then Raft guarantees that the entry will be present in the logs of all future leaders. If Raft satisfies this property, the safety requirement above is guaranteed. How is it done?

  1. Leaders never overwrite entries in their logs; they only append.
  2. Only entries in the leader's log can be committed. // so no other value can be committed at that index
  3. Entries must be committed before they are applied to the state machine.

These rules work together toward the requirement above, but the Raft process described so far is still not enough to satisfy the safety property in the picture below (committed implies present in future leaders' logs), so the protocol has to be modified. First, the election process is modified to ensure that the newly elected leader has the most complete log. Second, the commitment rule is modified: sometimes commitment must be delayed until it is known to be safe.

[Figure: safety property: committed entries appear in all future leaders' logs]

Pick Up-to-Date Leader: how do we elect a leader that holds all committed entries? In fact we cannot directly tell which server holds all committed entries, because we cannot always tell which entries are committed. As the figure shows, if server 3 is unavailable, whether entry 5 is committed depends on whether it is stored on that unavailable server. So what we do instead is choose as leader the candidate that is most likely to hold all of the committed entries.

[Figure: whether an entry is committed may be impossible to determine]

The concrete mechanism is to compare logs. When a candidate sends RequestVote RPCs, it includes the index and term of its last log entry, which, as shown earlier, uniquely identifies its entire log. When a voting server V receives the RPC, it compares the candidate's log with its own local log to see which is more complete: if the voter's last term is greater than the candidate's last term, it rejects the vote; if the last terms are equal but the voter's last index is larger, it also rejects. Otherwise the candidate's log is considered at least as complete, and the vote is granted. This ensures that whoever wins the election has a log at least as up to date as a majority of the cluster, and therefore contains every committed entry.
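
The voter side could be sketched like this, continuing the hypothetical package: compare last terms first, last indexes second, and grant at most one vote per term (the term-update rule from part four is assumed to have already run for a newer term).

```go
// handleRequestVote decides whether this server grants its vote.
func (s *Server) handleRequestVote(args RequestVoteArgs) bool {
	if args.Term < s.currentTerm {
		return false // stale candidate
	}
	myLastIndex := len(s.log)
	myLastTerm := 0
	if myLastIndex > 0 {
		myLastTerm = s.log[myLastIndex-1].Term
	}
	// The candidate's log must be at least as up-to-date as ours.
	upToDate := args.LastLogTerm > myLastTerm ||
		(args.LastLogTerm == myLastTerm && args.LastLogIndex >= myLastIndex)

	// Only one vote per term.
	if upToDate && (s.votedFor == -1 || s.votedFor == args.CandidateID) {
		s.votedFor = args.CandidateID
		return true
	}
	return false
}
```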

[Figure: comparing last term and last index when voting]


Next, let's see how this election rule plays out in practice. The first case: the new leader commits an entry from its own current term. For example, s1, as the leader of term 2, has replicated entry 4 to server 3, which already makes a majority, so it can declare the entry committed and safely apply it to the state machine. Why is this safe? Because entry 4 is certain to appear in all future leaders' logs: s4 and s5 cannot be elected, one because of its last term and the other because of its last index. Even if s1 then crashes, s2 or s3 will win the election and finish the replication.

[Figure: committing an entry from the leader's current term]

The second case: the leader tries to commit an entry from an earlier term. Here the term 2 leader replicated entry 3 only to s1 and s2 before crashing. For some reason s5 never received entry 3, created some local entries of its own, and then also crashed. Then s1 becomes leader again in term 4 and wants to synchronize the logs, so it has s3 copy entry 3 from term 2. At this point entry 3 is stored on a majority of the nodes, yet it is still not safe to commit it. The reason is that if entry 3 were committed and s1 then crashed again, it is quite possible that s5 would be elected leader, and while synchronizing logs it would overwrite the supposedly committed entry 3 from term 2.

[Figure: why committing an entry from an earlier term is unsafe]

To solve this problem the commitment rule is modified: besides being stored on a majority of nodes, at least one entry from the current leader's term must also be stored on a majority before anything is committed; put another way, the new leader may commit old entries only after it has committed an entry from its own current term. Looking at the example again, after entry 3 is replicated to a majority it still cannot be committed; the leader must wait until entry 4 is committed, and entry 3 is committed along with it. At that point s5 no longer has any chance of being elected leader, so there is no safety risk. The election rule and the commitment rule together make the Raft protocol satisfy safety: once a leader commits an entry, that entry is certain to appear in the log of the next leader, and by induction in the log of every future leader.
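
The modified commitment rule can be sketched as an extra condition on top of the majority count (continuing the hypothetical package):

```go
// maybeAdvanceCommit is called by the leader once it knows how many
// servers (including itself) store the entry at `index`.
func (s *Server) maybeAdvanceCommit(index, replicas int) {
	majority := (len(s.peers)+1)/2 + 1
	// Besides being on a majority, the entry must come from the leader's
	// own current term; committing it then commits every earlier entry too.
	if replicas >= majority && s.log[index-1].Term == s.currentTerm {
		s.commitIndex = index
	}
}
```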

[Figure: the modified commitment rule]

Now that safety has been dealt with, we know that the leader's log is always correct, complete and up to date. So how do we make a follower's log match the leader's? First look at the many ways logs can be inconsistent: a follower may be missing entries that the leader has, it may have extra entries that the leader does not have, or both at once (followers (a) through (f) in the figure). What we need to do is delete the extraneous, irrelevant entries and fill in the missing ones.

[Figure: possible follower log inconsistencies]

Repairing Follower Logs: the leader maintains a nextIndex for each follower, which is the index of the next entry to send to that follower. When a new leader is elected, it initializes every nextIndex to its own last log index + 1; in the figure, nextIndex = 10 + 1 = 11. In each AppendEntries RPC, the (term, index) pair sent for the consistency check is actually the term and index of log[nextIndex - 1]. When the consistency check fails, the leader decrements nextIndex and sends the RPC again: in the example it is rejected at index 11, reduced to 10, and so on, until nextIndex reaches 5, where term and index finally match; from there the follower can accept entry 5 and everything after it, completing the copy.
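
A sketch of the repair loop, continuing the hypothetical package; this is also the `replicateTo` routine used in the normal-operation sketch earlier.

```go
// replicateTo brings one follower's log up to date with the leader's by
// walking nextIndex backwards until the consistency check succeeds.
func (s *Server) replicateTo(peer Peer) bool {
	next := len(s.log) + 1 // initial nextIndex: leader's last log index + 1
	for next > 0 {
		prevIndex := next - 1
		prevTerm := 0
		if prevIndex > 0 {
			prevTerm = s.log[prevIndex-1].Term
		}
		ok := peer.AppendEntries(AppendEntriesArgs{
			Term:         s.currentTerm,
			PrevLogIndex: prevIndex,
			PrevLogTerm:  prevTerm,
			Entries:      s.log[prevIndex:], // everything the follower may be missing
			LeaderCommit: s.commitIndex,
		})
		if ok {
			return true // the follower's log now matches the leader's
		}
		next-- // consistency check failed: step back one entry and retry
	}
	return false
}
```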

[Figure: repairing a follower's log by walking nextIndex backwards]

For follower (b), once an inconsistent entry is overwritten, the useless entries after it are deleted automatically as well. That is the whole of the third part, leader changes. We focused on two issues. The first is safety: when a new leader is elected, how it must commit entries so that nothing committed is ever lost. The second is how, once the new leader is chosen, it brings the followers' logs into line with its own, a process that relies mainly on the AppendEntries consistency check.

[Figure: extraneous follower entries removed during repair]

The fourth part of Raft is also a leader-change question: an old leader may not really be dead. For example, a network partition temporarily separates the leader from the rest of the cluster, the rest elect a new leader, and after a while the old leader reconnects. It does not know that an election took place or that there is a new leader, so it just keeps acting as leader the way it did before, for example trying to replicate the log and talking to clients. How do we stop it? We use terms. Every RPC carries the sender's term; the receiver compares it with its own current term, and if the sender's term is older, it rejects the RPC and sends back a response containing its own term, whereupon the sender steps down, and vice versa. The election process necessarily updated the term on a majority of the servers, so even if the old leader comes back it cannot assemble a majority and cannot commit log entries. In short, if anything is out of date, the term will expose it.
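
The term rule can be sketched as a single helper run on every RPC request and reply (continuing the hypothetical package):

```go
// observeTerm is called with the term carried by any incoming RPC or reply.
// Seeing a newer term means we are out of date: an old leader or candidate
// steps down to follower here; a sender with an older term is rejected by
// the handlers shown earlier.
func (s *Server) observeTerm(peerTerm int) {
	if peerTerm > s.currentTerm {
		s.currentTerm = peerTerm
		s.votedFor = -1
		s.state = Follower
	}
}
```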


Next is the fifth part, which mainly discusses how clients interact with the system. The client sends a command to the leader and gets a reply. If it does not know who the leader is, it can send the command to any server; that server will tell it who the leader is, and the client simply resends the request there. The leader replies to the client after the command has been logged, committed and executed. The harder question is what happens if the leader crashes somewhere in this process. The client will conclude that the leader has crashed, reissue the command to another server, eventually receive the address of the new leader, and send the request there. This ensures that a client command is eventually executed.

The figure below shows the stages of log replication; the leader may crash at any of them.

[Figure: stages of log replication at which the leader may crash]

One situation deserves special attention: the data reaches the leader and is replicated to some or all of the followers, but the leader crashes before it can respond to the client. In this case the command may be executed twice. After the leader crashes, the newly elected leader may well be a follower that already contains the v=3 entry, and it will finish replicating, committing and executing it. Meanwhile the client, not knowing whether the request succeeded, may time out and resend; the new leader cannot tell whether this is the same command or a second identical one, and replicating and executing it again would make the same command run twice.

To solve this problem, the client attaches a unique id to every command, and the server stores this id in the log entry. Before accepting a command, the leader checks whether its log already contains that id; if it does, the leader simply ignores the new command and returns the previous result. This makes commands idempotent and solves the problem of repeated execution.
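
A sketch of the de-duplication, continuing the hypothetical package; `applied` is a map from request id to cached result, passed in here for simplicity (a real system would rebuild it from the log and bound its size).

```go
// ClientRequest carries a client-chosen id that stays the same on retries.
type ClientRequest struct {
	ID      string
	Command Command
}

// execute runs a client request exactly once per id.
func (s *Server) execute(req ClientRequest, applied map[string]int) int {
	if result, seen := applied[req.ID]; seen {
		return result // duplicate retry: return the cached result, do not re-run
	}
	result := s.handleClientCommand(req.Command) // the normal Raft path
	applied[req.ID] = result
	return result
}
```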

[Figure: deduplicating client commands with unique ids]

For the last part, we need some way to change the system configuration. The configuration here mainly means the set of server ids and network addresses, which determines what counts as a majority of the cluster. Membership changes, or configuration changes, are needed mainly to replace failed machines or to change the replication factor.

First we need to realize that we cannot switch the cluster's membership configuration directly. In the example below, Cold is {1, 2, 3} and Cnew is {1, 2, 3, 4, 5}, and we want to switch from Cold to Cnew. The servers cannot all switch at the same instant, so there may be a moment at which servers 1 and 2 form a majority of the old cluster while, at the same time, servers 3, 4 and 5 form a majority of the new cluster. What happens then? Leader election or replication can go wrong: two leaders may be elected, or a log entry may be committed incorrectly.

[Figure: a direct switch from Cold to Cnew allows two disjoint majorities]

The solution is a two-phase protocol. Raft stipulates that the old configuration first moves to a transitional configuration called joint consensus, whose membership is old ∪ new; in this state, a majority means a majority of the old configuration and a majority of the new one. The process is shown in the figure. Our initial configuration is called Cold. As with any other request, the client sends the change to the leader, and the leader stores the joint configuration, Cold,new, in its log and replicates it to the followers with the usual AppendEntries RPCs, just like normal log replication. The only difference is that a configuration entry takes effect immediately: as soon as a server has the entry in its local log it uses it as its current cluster configuration, without waiting for it to be committed. When is Cold,new committed? When it is stored on a majority of the old servers and a majority of the new servers, as stated above. During the phase in which Cold is still in effect, a new leader may be elected under Cold, but this does not affect correctness, because by the log comparison rule the elected leader will be a machine that already contains the configuration-change entry. Once Cold,new is committed, the whole cluster is running under joint consensus, and the leader can then switch the membership to Cnew, replicating it as a log entry just as before. After a while Cnew is committed and the membership change is complete; from then on, majorities are determined by Cnew alone.
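
The joint-consensus rule boils down to a quorum check that consults both configurations; a minimal sketch (types are illustrative):

```go
// Config describes the cluster membership. During joint consensus both
// Old and New are populated; otherwise New is empty.
type Config struct {
	Old []int // server ids in the old configuration
	New []int // server ids in the new configuration
}

// quorumReached reports whether the servers in `acks` form a quorum:
// a majority of Old and, while in joint consensus, also a majority of New.
func (c Config) quorumReached(acks map[int]bool) bool {
	count := func(ids []int) int {
		n := 0
		for _, id := range ids {
			if acks[id] {
				n++
			}
		}
		return n
	}
	oldOK := count(c.Old) > len(c.Old)/2
	if len(c.New) == 0 {
		return oldOK // plain single-configuration operation
	}
	return oldOK && count(c.New) > len(c.New)/2
}
```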

Why two phases? Because there is no constraint relating C_old and C_new, so C_old and C_new could each form a disjoint majority and elect two leaders. The two-phase process ensures that Cold and Cnew can never produce two conflicting majorities. An alternative is single-server changes: restrict each change to adding or removing one member at a time, which makes two disjoint majorities impossible, and do not allow the next membership change to start until the previous one has succeeded.


That's all six parts. To briefly review: in the first part, leader election, we ensure that at most one server can be elected leader in a given term. The second part covers normal operation once a leader is in place, including accepting client requests and log replication, and introduces the important AppendEntries consistency check, which relies on the fact that an index and term uniquely identify a log prefix. The third part covers the two problems raised by leader changes: how to guarantee safety, namely that once a log entry is committed it appears in every future leader's log, and how to make the followers' logs consistent with the leader's. The fourth part shows how an old leader that is not really dead is prevented from affecting the system. The fifth part briefly discusses what clients do, and how availability is preserved when the leader crashes. Finally, we discussed how to make membership changes safely.


Source: blog.csdn.net/qq_47865838/article/details/129368121