一、概述
zookeeper集群中,每个zookeeper节点通过投票的方式,选举出1个Leader,确定Leader后,其余参与选举的zk节点则为Follower。
二、选举过程分析
最开始,当前zk节点会投票给自己,并决定选举用的算法。
QuorumPeer.startLeaderElection:
synchronized public void startLeaderElection() {
try {
//投票给自己
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
} catch(IOException e) {
RuntimeException re = new RuntimeException(e.getMessage());
re.setStackTrace(e.getStackTrace());
throw re;
}
for (QuorumServer p : getView().values()) {
if (p.id == myid) {
myQuorumAddr = p.addr;
break;
}
}
if (myQuorumAddr == null) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
if (electionType == 0) {
try {
udpSocket = new DatagramSocket(myQuorumAddr.getPort());
responder = new ResponderThread();
responder.start();
} catch (SocketException e) {
throw new RuntimeException(e);
}
}
//确定选举算法
this.electionAlg = createElectionAlgorithm(electionType);
}
选举算法默认是FastLeaderElection:
QuorumPeer.createEletionAlgorithm:
protected Election createElectionAlgorithm(int electionAlgorithm){
Election le=null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 0:
le = new LeaderElection(this);
break;
case 1:
le = new AuthFastLeaderElection(this);
break;
case 2:
le = new AuthFastLeaderElection(this, true);
break;
case 3:
//连接管理器--管理参与选举的ZooKeeper实例之间的连接
qcm = createCnxnManager();
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
listener.start();
//默认算法
le = new FastLeaderElection(this, qcm);
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
return le;
}
在QuorumPeer.run方法中,根据节点状态的不同,采取不同的行为。当状态为LOOKING,则进行选举:
while (running) {
switch (getPeerState()) {
case LOOKING:
LOG.info("LOOKING");
//无关代码省略
try {
setBCVote(null);
//状态为LOOKING,通过选举算法得出当前的Leader投票
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
break;
于是,选举的过程主要在FastLeaderElection.lookForLeader()方法中:
/**
* Starts a new round of leader election. Whenever our QuorumPeer
* changes its state to LOOKING, this method is invoked, and it
* sends notifications to all other peers.
*/
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = Time.currentElapsedTime();
}
try {
//本轮选举的所有投票 sid -> sid提议的Leader票
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
//当前节点为LEADING状态,其他节点投新Leader的票
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized(this){
//logicallock代表整个集群的第几次选举,初始为0,每次选举加1
logicalclock.incrementAndGet();
//初始提议自己为Leader
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
//通知集群所有节点(也发给自己),这轮选举,当前节点提议的Leader
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*/
//持续选举过程直到确定Leader为止
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
//等待接收集群节点的通知
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
//没收到消息
if(n == null){
//消息发出去了,却没收到响应,再发一次
if(manager.haveDelivered()){
sendNotifications();
} else {
//消息没发出去,说明连接可能有问题,重建连接
manager.connectAll();
}
/*
* Exponential backoff
*/
//网络不太好?延长等待时间
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
//接着,进入下一次while循环
}
else if(validVoter(n.sid) && validVoter(n.leader))
/*
* Only proceed if the vote comes from a replica in the
* voting view for a replica in the voting view.
*/
//投票节点的状态
switch (n.state) {
case LOOKING: //如果投票节点状态为LOOKING
// If notification > current, replace and send messages out
//节点n的选举轮数大于当前节点认为的轮数,即当前节点的投票过时了
if (n.electionEpoch > logicalclock.get()) {
//更新选举轮数
logicalclock.set(n.electionEpoch);
//这轮收到的投票作废,因为logicallock过时了
recvset.clear();
//和n.sid提议的leader pk
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
//n.leader胜出,投给n.leader
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
//自己胜出,投自己
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
//通知所有节点自己的提议
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
//通知过时了,直接忽略
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
//忽略啦
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
//同一轮选举,由于n.leader更优,改为投给n.leader
updateProposal(n.leader, n.zxid, n.peerEpoch);
//发通知
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
//记录n.sid的投票
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//是否可以结束投票?默认条件是,投票的zk实例超过集群半数
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
//在finalizeWait期间,投票发生变动
// Verify if there is any change in the proposed leader
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
//没再收到新的Notification,确定状态,
//如果推测是自己,则确认自己是Leader,否则为Follower
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
//返回选举结果
return endVote;
}
}
break;
case OBSERVING:
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING: //如果为LEADING状态
/*
* Consider all notifications from the same epoch
* together.
*/
//如果是同一轮选举
if(n.electionEpoch == logicalclock.get()){
//记录选票
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
//新Leader是否有效,根据投票是否过半,以及n.leader是否投自己为leader
//只要n.leader给自己投了leader,当前节点就会信任n.leader
if(ooePredicate(recvset, outofelection, n)) {
//切换n.leader为Leader
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
//记录新Leader的票
/*
* Before joining an established ensemble, verify
* a majority is following the same leader.
*/
outofelection.put(n.sid, new Vote(n.version,
n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch,
n.state));
//超过半数选举了新Leader,并且新Leader有效
if(ooePredicate(outofelection, outofelection, n)) {
synchronized(this){
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
n.state, n.sid);
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
LOG.debug("Number of connection processing threads: {}",
manager.getConnectionThreadCount());
}
}
三、为什么zookeeper集群至少3个节点?
从选举过程看出,同一轮选举过程中,节点自身(self)会和投票对象(proposal)pk,如果proposal pk胜出,则自己改投proposal,并发通知,否则不再发通知。
如果集群中只有两个节点 A和B,A的serverId小于B的serverId。
- A启动,投给自己。此时有:
A – votes: (A, A),proposal: A - B启动,投给自己 。此时有:
A – votes: (A, A) (B, B), proposal: B
B – votes: (B, B),proposal: B
此时,A推荐B为Leader,但B无法收到通知,于是无法确认自己是否当选了Leader,即产生了脑裂。分析可知,zk集群个数至少要 > 3。