Foreword
ZooKeeper uses its own leader/learner model to implement the Zab agreement protocol: one Leader plus learners (Followers and Observers). A Leader election is needed in several cases:
- Case 1: The Leader must be elected while the cluster is starting up
- Case 2: After the cluster has started normally, the Leader goes down because of a failure and a new Leader must be elected
- Case 3: The number of Followers is too small to pass the majority check, so the Leader shuts itself down and a new Leader is elected
- Case 4: The cluster is working normally and a new Follower joins
This post traces the source code for each of these four cases.
Program entry
QuorumPeer.java
Each QuorumPeer corresponds to one server node in the cluster. Its start() method completes the start-up of the current node. The source code is as follows:
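// todo: we are now in the QuorumPeer class; think of it as one node in the cluster
@Override
public synchronized void start() {
// todo: load data from disk into memory
loadDataBase();
// todo: start the connection factory; it is a thread class that accepts client requests
cnxnFactory.start();
// todo: start the leader election work
startLeaderElection();
// todo: determine the server's role; this starts the run() method of this class (around line 900)
super.start();
}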
First, loadDataBase(); restores the data from disk into memory.
Second, cnxnFactory.start(); lets the current node accept connection requests sent by clients (Java code or the console).
Third, startLeaderElection(); nominally starts the leader election, but in fact it only initializes a series of helper classes that support the election. No election actually happens here.
QuorumPeer inherits from a thread class (ZooKeeperThread), so it is itself a thread, and super.start(); starts its run() method. That method contains a while loop. At start-up the default state of every node is LOOKING, so the loop enters the LOOKING branch, and that branch is where the real leader election takes place.
Summary
From this entry point, the rest of the article focuses on two questions: what does startLeaderElection(); actually do, and how is the leader elected in the LOOKING branch?
Case 1: Electing the Leader while the cluster is starting up
Step into the startLeaderElection(); method. The source code follows. It mainly does two things:
- It initializes the field maintained by QuorumPeer.java (volatile private Vote currentVote;)
- It calls createElectionAlgorithm() to create the leader election algorithm. In practice only one algorithm is not deprecated today, namely FastLeaderElection
// todo: start the work of voting for a Leader
synchronized public void startLeaderElection() {
try {
// todo: create an object that wraps the vote: myid, the largest zxid, and the current epoch
// todo: vote for ourselves first
// todo: follow its constructor
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
} catch(IOException e) {
RuntimeException re = new RuntimeException(e.getMessage());
re.setStackTrace(e.getStackTrace());
throw re;
}
for (QuorumServer p : getView().values()) {
if (p.id == myid) {
myQuorumAddr = p.addr;
break;
}
}
if (myQuorumAddr == null) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
if (electionType == 0) {
try {
udpSocket = new DatagramSocket(myQuorumAddr.getPort());
responder = new ResponderThread();
responder.start();
} catch (SocketException e) {
throw new RuntimeException(e);
}
}
// todo: create the leader election algorithm; only one implementation is still in use, fast leader election
this.electionAlg = createElectionAlgorithm(electionType);
}
Next, follow createElectionAlgorithm(electionType). This method does three things:
- creates the QuorumCnxManager
- obtains the QuorumCnxManager's Listener and starts it
- creates the FastLeaderElection
protected Election createElectionAlgorithm(int electionAlgorithm){
Election le=null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 0:
le = new LeaderElection(this);
break;
case 1:
le = new AuthFastLeaderElection(this);
break;
case 2:
le = new AuthFastLeaderElection(this, true);
break;
case 3:
// todo: create the QuorumCnxManager, the connection manager
qcm = createCnxnManager();
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
// todo: start the listener here
listener.start();
// todo: instantiate the leader election algorithm
le = new FastLeaderElection(this, qcm);
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
Preparing the election environment
QuorumCnxManager
The figure is the class diagram of QuorumCnxManager. It has six inner classes; apart from Message, the others are thread classes that run independently.
This class plays a pivotal role: it is the communication helper shared by the election logic on every node in the cluster. What exactly does it do? Leader election is decided by voting, and since the nodes vote for each other, any two nodes in the cluster must be able to establish a connection. QuorumCnxManager is responsible for maintaining that communication between the nodes.
It maintains two kinds of queues, shown in the source below. The first kind is stored in a ConcurrentHashMap whose key is a node's myid (its serverId); each value is the queue of votes that the current server will send to that other server.
The second is the queue that receives the messages sent over by other servers.
// todo: key = serverId (myid), value = the queue of messages the current server sends to that server
final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;
// todo: all received data ends up in this queue
public final ArrayBlockingQueue<Message> recvQueue;
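To make the two kinds of queues concrete, here is a minimal sketch of the layout described above. The class name ElectionQueues, the method enqueueFor, and the capacity values are made up for illustration, and the "drop the oldest message when full" behaviour is an assumption of this sketch rather than a quote from the ZooKeeper source.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the queue layout; not the real QuorumCnxManager.
class ElectionQueues {
    static final int SEND_CAPACITY = 1;   // assumed capacity for this sketch
    static final int RECV_CAPACITY = 100; // assumed capacity for this sketch

    // key = target server id, value = messages waiting to be sent to that server
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();
    // every message received from any peer ends up in this single queue
    final ArrayBlockingQueue<ByteBuffer> recvQueue =
            new ArrayBlockingQueue<>(RECV_CAPACITY);

    // Queue a message for one peer, creating that peer's queue on first use.
    void enqueueFor(long sid, ByteBuffer msg) {
        ArrayBlockingQueue<ByteBuffer> q = queueSendMap.computeIfAbsent(
                sid, k -> new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));
        // Assumption: when the queue is full, drop the oldest message so that
        // only the most recent vote for this peer is kept.
        while (!q.offer(msg)) {
            q.poll();
        }
    }
}
```

With one send queue per peer, a slow or unreachable peer only backs up its own queue, while everything that arrives is funnelled into one place for the election logic to read.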
The figure above is a hand-drawn diagram of QuorumCnxManager.java. The three thread classes inside it are the most visible part, so what does the run() method of each of them do?
SendWorker's run(): the current node looks up the queue for the corresponding sid and then sends out the data in that queue.
public void run() {
threadCnt.incrementAndGet();
try {
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq == null || isSendQueueEmpty(bq)) {
ByteBuffer b = lastMessageSent.get(sid);
if (b != null) {
LOG.debug("Attempting to send lastMessage to sid=" + sid);
send(b);
}
}
} catch (IOException e) {
LOG.error("Failed to send last message. Shutting down thread.", e);
this.finish();
}
try {
while (running && !shutdown && sock != null) {
ByteBuffer b = null;
try {
// todo: fetch the queue that holds the messages for this sid
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq != null) {
// todo: poll the next message from bq, waiting up to 1000 ms
b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
} else {
LOG.error("No queue of incoming messages for " +
"server " + sid);
break;
}
if(b != null){
lastMessageSent.put(sid, b);
// todo: send the message to the peer
send(b);
}
} catch (InterruptedException e) {
LOG.warn("Interrupted while waiting for message on queue",
e);
}
}
RecvWorker's run(): it receives a message and then puts it into the recvQueue queue.
public void run() {
threadCnt.incrementAndGet();
try {
while (running && !shutdown && sock != null) {
/**
* Reads the first int to determine the length of the
* message
*/
int length = din.readInt();
if (length <= 0 || length > PACKETMAXSIZE) {
throw new IOException(
"Received packet with invalid packet: "
+ length);
}
/**
* Allocates a new ByteBuffer to receive the message
*/
// todo: read the message bytes from the stream into the array
byte[] msgArray = new byte[length];
din.readFully(msgArray, 0, length);
// todo: wrap the array in a ByteBuffer
ByteBuffer message = ByteBuffer.wrap(msgArray);
// todo: add it to recvQueue
addToRecvQueue(new Message(message.duplicate(), sid));
}
...
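The receive loop above relies on length-prefixed framing: a 4-byte length first, then exactly that many bytes of payload. Here is a minimal, self-contained sketch of that framing. FrameCodec, MAX_PACKET, and roundTrip are hypothetical names for illustration, and the size limit is an assumed value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper showing length-prefixed framing; not ZooKeeper code.
class FrameCodec {
    static final int MAX_PACKET = 512 * 1024; // assumed upper bound for this sketch

    // Write the payload length as an int, followed by the payload itself.
    static byte[] writeFrame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return bos.toByteArray();
    }

    // Read one frame: validate the length, then read exactly that many bytes.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length <= 0 || length > MAX_PACKET) {
            throw new IOException("Received packet with invalid length: " + length);
        }
        byte[] msg = new byte[length];
        in.readFully(msg, 0, length);
        return msg;
    }

    // Convenience for trying it out: encode a payload and decode it again.
    static byte[] roundTrip(byte[] payload) {
        try {
            byte[] frame = writeFrame(payload);
            return readFrame(new DataInputStream(new ByteArrayInputStream(frame)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Validating the length before allocating the array is what keeps a corrupt or hostile peer from forcing a huge allocation.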
Listener's run(): it uses the election port that we configured for the cluster in the configuration file (for example 3888) to establish connections between the nodes.
It also shows that the nodes in the cluster talk to each other over traditional blocking sockets.
InetSocketAddress addr;
while((!shutdown) && (numRetries < 3)){
try {
// todo: create the ServerSocket
ss = new ServerSocket();
ss.setReuseAddress(true);
if (listenOnAllIPs) {
int port = view.get(QuorumCnxManager.this.mySid)
.electionAddr.getPort();
// todo: the address it reads here is the one we added in the configuration file when configuring the cluster, port 3888...
addr = new InetSocketAddress(port);
} else {
addr = view.get(QuorumCnxManager.this.mySid)
.electionAddr;
}
LOG.info("My election bind port: " + addr.toString());
setName(view.get(QuorumCnxManager.this.mySid)
.electionAddr.toString());
// todo: bind the port
ss.bind(addr);
while (!shutdown) {
// todo: block until another server initiates a connection
Socket client = ss.accept();
setSockOpts(client);
LOG.info("Received connection request "
+ client.getRemoteSocketAddress());
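// todo: if quorum SASL authentication is enabled, receive and process the connection request asynchronously
// todo: this is required because SASL server authentication may take several seconds to finish, which could delay the next peer connection request
if (quorumSaslAuthEnabled) {
// todo: accept the connection asynchronously
receiveConnectionAsync(client);
} else {
// todo: follow this method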
receiveConnection(client);
}
numRetries = 0;
}
Continue with the source code and return to the createElectionAlgorithm() method in QuorumPeer.java; the relevant excerpt is repeated below. Once the QuorumCnxManager has been created, the Listener is started. Starting the Listener means that any two nodes in the cluster are now able to establish communication with each other. Listener is itself a thread class, and its run() method is the code shown above.
FastLeaderElection
After the Listener has been started, the leader election algorithm object is instantiated: new FastLeaderElection(this, qcm)
...
break;
case 3:
// todo: create the QuorumCnxManager, the connection manager
qcm = createCnxnManager();
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
// todo: start the listener here
listener.start();
// todo: instantiate the leader election algorithm
le = new FastLeaderElection(this, qcm);
} else {
LOG.error("Null listener when initializing cnx manager");
}
Below is the class diagram of FastLeaderElection.
Three inner classes are immediately visible:
- Messenger (which has two inner thread classes)
  - WorkerReceiver
  - WorkerSender
- Notification
  - A newly started node is usually in the LOOKING state and initiates a vote; when another node receives it, that node replies with a Notification to tell it which leader it trusts
- ToSend
  - Messages sent to other nodes. A message may be a notification, or an ack for a received notification
Corresponding to the two kinds of queues maintained by QuorumCnxManager, FastLeaderElection likewise maintains two queues: one is sendqueue and the other is recvqueue.
LinkedBlockingQueue<ToSend> sendqueue;
LinkedBlockingQueue<Notification> recvqueue;
How do they work together? As shown in the figure below.
While a node is starting up, its outgoing votes are placed in FastLeaderElection's sendqueue and are then sent out over the socket by QuorumCnxManager's SendWorker. In the opposite direction, votes from other nodes are received by QuorumCnxManager's RecvWorker and put into QuorumCnxManager's recvQueue. The messages in that queue are then taken out by WorkerReceiver, an inner thread class of FastLeaderElection, and stored in FastLeaderElection's recvqueue.
Tracing the code shows that both of Messenger's inner threads are started as daemon threads.
Messenger(QuorumCnxManager manager) {
// todo: WorkerSender runs as a new thread
this.ws = new WorkerSender(manager);
Thread t = new Thread(this.ws,
"WorkerSender[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
//todo------------------------------------
// todo: WorkerReceiver runs as a new thread
this.wr = new WorkerReceiver(manager);
t = new Thread(this.wr,
"WorkerReceiver[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
}
Summary
At this point the preparations for the leader election are complete. In other words, the startLeaderElection(); call in QuorumPeer.java's start() method only prepares the environment for the election, as shown in the figure above.
The real election begins
Now look at the run() method of QuorumPeer.java, which executes once this thread class has been started. An excerpt follows; the part we care about is its lookForLeader() call.
while (running) {
switch (getPeerState()) {
/**
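* todo: four possible states; after the leader election, different servers have different roles
* todo: that is, different servers enter different branches below
* LOOKING: leader election in progress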
* Observing
* Following
* Leading
*/
case LOOKING:
// todo: in the LOOKING state, the node enters the leader election phase
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {
LOG.info("Attempting to start ReadOnlyZooKeeperServer");
// Create read-only server but don't start it immediately
final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
logFactory, this,
new ZooKeeperServer.BasicDataTreeBuilder(),
this.zkDb);
// Instead of starting roZk immediately, wait some grace period before we decide we're partitioned.
// Thread is used here because otherwise it would require changes in each of election strategy classes which is
// unnecessary code coupling.
Thread roZkMgr = new Thread() {
public void run() {
try {
// lower-bound grace period to 2 secs
sleep(Math.max(2000, tickTime));
if (ServerState.LOOKING.equals(getPeerState())) {
// todo: start the read-only server created above
roZk.startup();
}
} catch (InterruptedException e) {
LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
} catch (Exception e) {
LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
}
}
};
try {
roZkMgr.start();
setBCVote(null);
// todo: the code above does not matter here; go straight to its lookForLeader() method
// todo: stepping in lands on an interface, so look at its implementation class
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
setPeerState(ServerState.LOOKING);
} finally {
// If the thread is in the the grace period, interrupt
// to come out of waiting.
roZkMgr.interrupt();
roZk.shutdown();
}
Here is an interpretation of the lookForLeader() source code.
To be honest this method is quite long, but it is really important: the bits and pieces about Leader election that you find summarized around the web all come from this method.
First point: every node casts its first vote for itself. Put plainly, new Vote(myid, getLastLoggedZxid(), getCurrentEpoch()); wraps the node's own myid, its largest zxid, and the current epoch together. There is one detail, though: even though the vote is for itself, the node still sends this vote to the other nodes through the socket.
Votes from other nodes are received by QuorumCnxManager's RecvWorker thread, which adds them to the recvQueue queue. A vote that a node casts for itself does not take this route; it is added directly to the recvQueue queue.
The code below contains the line HashMap<Long, Vote> recvset = new HashMap<Long, Vote>(); This map can be thought of as a small ballot box that every node maintains. It may hold the node's own vote for itself, other nodes' votes for this node, and other nodes' votes for someone else. Counting the votes in this ballot box decides whether a particular node can become leader. The source that uses the ballot box is as follows:
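The self-vote shortcut described above can be sketched as follows. VoteRouter and its members are hypothetical names, and votes are plain strings here purely to keep the sketch short; it illustrates the routing idea, not the real QuorumCnxManager code.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of routing a vote: a vote addressed to ourselves skips the socket.
class VoteRouter {
    final long mySid;
    // votes that have "arrived" at this node
    final Queue<String> recvQueue = new ArrayDeque<>();
    // per-peer queues of votes still waiting to be sent over the network
    final Map<Long, Queue<String>> sendQueues = new HashMap<>();

    VoteRouter(long mySid) {
        this.mySid = mySid;
    }

    void toSend(long sid, String vote) {
        if (sid == mySid) {
            // our own vote: put it straight into the receive queue
            recvQueue.add(vote);
        } else {
            // someone else's copy: queue it for the sender thread of that peer
            sendQueues.computeIfAbsent(sid, k -> new ArrayDeque<String>()).add(vote);
        }
    }
}
```

For a three-node cluster, calling toSend for sids 1, 2, and 3 on node 1 leaves one vote in recvQueue and one vote queued for each of the two peers.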
// todo: based on the other nodes' votes and our own, decide whether the proposed leader has enough votes to become leader
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// todo: reaching here means the proposed leader is already the leader-elect
// Verify if there is any change in the proposed leader
// todo: double-check whether the leader has changed
while ((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null) {
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
recvqueue.put(n);
break;
}
}
if (n == null) {
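// todo: check whether this node is the leader; if so set its state to LEADING, otherwise set it to OBSERVING or FOLLOWING according to the configuration
// todo: once the leader is elected, the while loop in QuorumPeer's run() iterates again and each server enters the branch for its role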
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING : learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
Inside termPredicate() is the call self.getQuorumVerifier().containsQuorum(set); and its implementation is shown below. This is the majority check. The conclusion is that when a node holds votes from more than half of the nodes in the cluster, it changes its state to LEADING, and the other nodes change their state to FOLLOWING or OBSERVING as appropriate.
public boolean containsQuorum(Set<Long> set){
return (set.size() > half);
}
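A minimal sketch of this majority check, assuming that half is simply the number of voting members divided by two with integer division. MajorityCheck is a hypothetical class for illustration, not the ZooKeeper verifier itself.

```java
import java.util.Set;

// Hypothetical illustration of a strict-majority check.
class MajorityCheck {
    final int half;

    MajorityCheck(int votingMembers) {
        // assumption of this sketch: integer division, so 3 -> 1, 4 -> 2, 5 -> 2
        this.half = votingMembers / 2;
    }

    // True only when strictly more than half of the voting members are in the set.
    boolean containsQuorum(Set<Long> set) {
        return set.size() > half;
    }
}
```

Because the comparison is strict, a four-node ensemble needs three votes, the same failure tolerance as a three-node ensemble, which is why odd ensemble sizes are the usual choice.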
The logical clock, logicalclock, records which round of voting the node is in.
It is a variable of type AtomicLong. What is it used for? The logic can be seen in the code below. When the clock carried by a received vote is larger than the node's own clock, the node may have missed a round of voting for some reason, so it updates its own clock, clears its ballot box, and re-evaluates its own vote against the received one. By the same reasoning, if the clock carried by a received vote is smaller than the node's current clock, the vote is stale and is simply discarded.
if (n.electionEpoch > logicalclock.get()) {
// todo: move our own clock forward to the newer round
logicalclock.set(n.electionEpoch);
// todo: clear our own ballot box
recvset.clear();
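The three-way comparison between a received vote's round and the local clock can be summarized in a small sketch. RoundCheck and its Action names are hypothetical; they only label the three branches described above.

```java
// Hypothetical sketch that labels the three possible outcomes of the clock comparison.
class RoundCheck {
    enum Action {
        ADOPT_NEWER_ROUND,  // we are behind: update the clock, clear the ballot box, re-vote
        DISCARD_STALE,      // the vote belongs to an older round: ignore it
        COMPARE_SAME_ROUND  // same round: compare the vote with our current proposal
    }

    static Action classify(long receivedElectionEpoch, long localLogicalClock) {
        if (receivedElectionEpoch > localLogicalClock) {
            return Action.ADOPT_NEWER_ROUND;
        } else if (receivedElectionEpoch < localLogicalClock) {
            return Action.DISCARD_STALE;
        }
        return Action.COMPARE_SAME_ROUND;
    }
}
```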
So what decides whether a node votes for itself or for someone else? The answer lies in the information wrapped in the vote: epoch, zxid, and myid. The node with the larger epoch is preferred as Leader; with the same epoch, the node with the larger zxid is preferred; and if the zxid is also the same, the node with the larger myid wins.
When a node finds that another node is more suitable as leader than its current choice, it changes its vote and votes for the more suitable node.
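The ordering rule (epoch first, then zxid, then myid) can be written as a single boolean expression. The sketch below follows the comparison quoted in the comments of the full source; VoteOrder and prefersNew are hypothetical names.

```java
// Hypothetical helper: returns true when the "new" vote should replace the current proposal.
class VoteOrder {
    static boolean prefersNew(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid) && (newId > curId))));
    }
}
```

For example, with equal epochs a vote carrying zxid 0x20 beats a proposal carrying zxid 0x10 regardless of the ids, and the ids only matter when epoch and zxid are both equal.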
Complete source code
// todo: this is the implementation in FastLeaderElection.java
public Vote lookForLeader() throws InterruptedException {
try {
// todo: create the JMX bean for the leader election
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = Time.currentElapsedTime();
}
try {
// todo: each server's own ballot box, a map holding the votes sent by other servers
// todo: the Long key (sid) marks who cast the vote; the Vote value is the vote itself
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized (this) {
// todo: the clock, an AtomicLong
logicalclock.incrementAndGet();
// todo: at start-up all arguments are the node's own values, which amounts to voting for itself
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
// todo: send it out, voting for ourselves
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*/
// todo: keep looping as long as this node stays in the LOOKING state
while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
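// todo: try to fetch vote information from other servers
// todo: take one message from the receive queue (the first entry in this queue is the vote this node cast for itself)
// todo: in QuorumCnxManager.java's send logic, a vote addressed to ourselves is added straight to recvQueue without going through the socket
// todo: so what is taken out here first is our own vote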
Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
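// todo: on the first pass this is not null
if (n == null) {
// todo: on the second pass there is no vote left, n is null, and we enter this branch
// todo: suppose the cluster has three servers and only one has been started so far, with the other two still down
// todo: then there are 3 votes: 1 goes straight into recvQueue, and the other 2 must be sent to the other two machines; that is what is checked here
// todo: the check fails because the two queues in queueSendMap are not empty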
if (manager.haveDelivered()) {
sendNotifications();
} else {
// todo: this is the branch that is taken
manager.connectAll();
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout * 2;
notTimeout = (tmpTimeOut < maxNotificationInterval ?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
} else if (validVoter(n.sid) && validVoter(n.leader)) {
// todo: once a vote from another server has been received, it is handled in the branch below
/*
* Only proceed if the vote comes from a replica in the
* voting view for a replica in the voting view.
*/
switch (n.state) {
case LOOKING:
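// todo: the server that sent this vote is also in the LOOKING state
// If notification > current, replace and send messages out
// todo: compare the epoch of the received vote with the current clock
// todo: received vote > this server's clock
// todo: this server may have missed some voting rounds because of a failure, so it has to vote again
if (n.electionEpoch > logicalclock.get()) {
// todo: move our own clock forward to the newer round
logicalclock.set(n.electionEpoch);
// todo: clear our own ballot box
recvset.clear();
// todo: compare the other node's information with our own and pick the better leader candidate; if we are still the better one keep our vote, otherwise change the vote to the other node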
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
sendNotifications();
// todo: received vote < this server's clock
// todo: this vote is stale and can no longer be used
} else if (n.electionEpoch < logicalclock.get()) {
if (LOG.isDebugEnabled()) {
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
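// todo: the other node's vote carries the same clock as ours
// todo: if totalOrderPredicate holds, change the current vote and vote again
/**
* totalOrderPredicate compares the two and decides which one is the better candidate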
* ((newEpoch > curEpoch) ||
* ((newEpoch == curEpoch) &&
* ((newZxid > curZxid) ||
* ((newZxid == curZxid) &&
* (newId > curId)))));
*/
// todo: a return value of true means the other node is the better leader candidate
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if (LOG.isDebugEnabled()) {
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
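// todo: store the received vote in the ballot box, keyed by the sender's sid
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
// todo: based on the other nodes' votes and our own, decide whether the proposed leader has enough votes to become leader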
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// todo: reaching here means the proposed leader is already the leader-elect
// Verify if there is any change in the proposed leader
// todo: double-check whether the leader has changed
while ((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null) {
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
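// todo: check whether this node is the leader; if so set its state to LEADING, otherwise set it to OBSERVING or FOLLOWING according to the configuration
// todo: once the leader is elected, the while loop in QuorumPeer's run() iterates again and each server enters the branch for its role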
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING : learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
// todo: Observers are not allowed to take part in the vote
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if (n.electionEpoch == logicalclock.get()) {
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
if (ooePredicate(recvset, outofelection, n)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING : learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify
* a majority is following the same leader.
*/
outofelection.put(n.sid, new Vote(n.version,
n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch,
n.state));
if (ooePredicate(outofelection, outofelection, n)) {
synchronized (this) {
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING : learningState());
}
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
n.state, n.sid);
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
After the checks above, every node has settled into its role. Back in the run() loop of QuorumPeer.java, a node no longer enters the case LOOKING: block; instead each node performs the initialization and start-up that belong to its own role.
Case 2: After the cluster has started normally, the Leader goes down because of a failure and a new Leader is elected
What is the logic for this case?
Although the leader is down, a server in the Follower role is still executing the infinite while loop in the run() method of QuorumPeer.java. When follower.followLeader(); can no longer reach the leader, an exception is thrown and the finally block eventually runs. There you can see the node change its state back to LOOKING, after which a leader is elected again.
break;
case FOLLOWING:
// todo: this server was elected into the follower role
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
setPeerState(ServerState.LOOKING);
}
break;
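The important part of the FOLLOWING branch is the finally block: however followLeader() ends, the node falls back to LOOKING. A minimal simulation of that control flow is sketched below; FollowerLoopSketch and followOnce are hypothetical names, and the Runnable stands in for the real call to the leader.

```java
// Hypothetical simulation of the FOLLOWING branch's try/catch/finally shape.
class FollowerLoopSketch {
    enum State { LOOKING, FOLLOWING }

    // Runs one pass of the FOLLOWING branch and returns the state the node ends up in.
    static State followOnce(Runnable followLeader) {
        State state = State.FOLLOWING;
        try {
            // stands in for follower.followLeader(); throws when the leader is unreachable
            followLeader.run();
        } catch (RuntimeException e) {
            // the real code logs the exception and carries on to the finally block
        } finally {
            // whether followLeader() returned or threw, go back to leader election
            state = State.LOOKING;
        }
        return state;
    }
}
```

On the next iteration of the outer while loop the node therefore lands in the LOOKING branch and takes part in a new election.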
Case 3: The number of Followers is too small to pass the majority check, so the Leader shuts itself down and a new Leader is elected
Suppose the cluster has 2 Followers and 1 Leader. The Leader counts itself among the synced servers, so it still passes the majority check after losing one Follower (2 of 3). Once it can no longer stay in sync with the second Follower either, the majority check fails and a leader has to be elected again.
Back to the source: every time the leader enters case LEADING: it executes leader.lead();
case LEADING:
// todo: this server was successfully elected leader
LOG.info("LEADING");
try {
setLeader(makeLeader(logFactory));
// todo: follow lead()
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
setPeerState(ServerState.LOOKING);
}
break;
However, leader.lead(); repeatedly performs the following check while it runs. When the majority check is no longer satisfied, the leader shuts itself down, and eventually every node in the cluster changes its state to LOOKING and a new election takes place.
if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
//if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) {
// Lost quorum, shutdown
shutdown("Not sufficient followers synced, only synced with sids: [ "
+ getSidSetString(syncedSet) + " ]");
// make sure the order is the same!
// the leader goes to looking
return;
}
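A small sketch of the check the leader performs. It assumes, as in the 3.4.x sources as far as I can tell, that the leader adds its own id to the synced set before asking for a quorum. LeaderQuorumTick and leaderKeepsQuorum are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the leader's periodic quorum check.
class LeaderQuorumTick {
    static boolean leaderKeepsQuorum(long leaderId, Set<Long> syncedFollowers, int votingMembers) {
        Set<Long> syncedSet = new HashSet<>(syncedFollowers);
        // assumption: the leader counts itself among the synced servers
        syncedSet.add(leaderId);
        // strict majority of the voting members
        return syncedSet.size() > votingMembers / 2;
    }
}
```

In a three-node ensemble the leader with one synced follower still has a quorum (2 of 3), while a leader with no synced followers does not and would shut down and return to LOOKING.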
Case 4: The cluster is working normally and a new Follower joins
A newly added Follower starts in the LOOKING state. It also tries to elect a leader, and it likewise votes for itself first. In a stable cluster, however, the role of every node has already been settled. In this case the new node enters the following branch of the lookForLeader() method in FastLeaderElection.java, so that the newly added node directly recognizes the existing Leader.
case OBSERVING:
// todo: Observers are not allowed to take part in the vote
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if (n.electionEpoch == logicalclock.get()) {
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
if (ooePredicate(recvset, outofelection, n)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING : learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
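The idea behind this branch can be sketched in simplified form: a joining node accepts an existing leader once a majority of the replies name the same leader and that leader itself has confirmed it. This is a deliberately reduced illustration, not the real ooePredicate; JoinEstablishedEnsemble, canJoin, and the map layout are assumptions made for the sketch.

```java
import java.util.Map;

// Hypothetical, simplified version of "join an established ensemble".
class JoinEstablishedEnsemble {
    // reportedLeaderBySid: sid of the replying server -> the leader that server says it follows
    static boolean canJoin(Map<Long, Long> reportedLeaderBySid, long candidateLeader, int votingMembers) {
        int supporters = 0;
        for (long leader : reportedLeaderBySid.values()) {
            if (leader == candidateLeader) {
                supporters++;
            }
        }
        boolean majority = supporters > votingMembers / 2;
        // require that the candidate itself has replied and names itself as leader
        Long selfReport = reportedLeaderBySid.get(candidateLeader);
        boolean leaderConfirmed = selfReport != null && selfReport == candidateLeader;
        return majority && leaderConfirmed;
    }
}
```

Requiring the leader's own confirmation guards against following a leader that the other nodes still remember but that has already gone away.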
If you find a mistake, please point it out. If this post helped, a like is welcome as support.