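Raft leader election is driven by heartbeat timeouts. In etcd-raft, the periodic tick signal comes from a timer that is started in raftNode, the wrapper around raft.Node (the interface implemented by raft.node), rather than in the core algorithm. raftNode is the part an application is expected to customize when it embeds etcd-raft, so generating ticks there means the tick interval can be changed without touching the core Raft protocol code. EtcdServer's raftNode shows this; in the excerpt, the loop forwards each tick from r.ticker to r.Tick():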
// etcd-2.3.7/etcdserver/raft.go
// start prepares and starts raftNode in a new goroutine. It is no longer safe
// to modify the fields after it has been started.
// TODO: Ideally raftNode should get rid of the passed in server structure.
func (r *raftNode) start(s *EtcdServer) {
    // ...
    heartbeat := 200 * time.Millisecond
    if s.cfg != nil {
        heartbeat = time.Duration(s.cfg.TickMs) * time.Millisecond
    }
    // set up contention detectors for raft heartbeat message.
    // expect to send a heartbeat within 2 heartbeat intervals.
    r.td = contention.NewTimeoutDetector(2 * heartbeat)
    go func() {
        var syncC <-chan time.Time
        defer r.onStop()
        islead := false
        for {
            select {
            case <-r.ticker:
                r.Tick()
            // ...
            } // select
        } // for
    }()
}
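To make the separation concrete, here is a minimal sketch (not etcd code) of a wrapper that owns the timer and forwards each tick through a callback. The names runTicks, intervalMs and onTick are hypothetical; the only point is that the interval is chosen outside the tick consumer.

```go
package main

import (
    "fmt"
    "time"
)

// runTicks is a hypothetical stand-in for the raftNode loop: it owns the
// ticker and calls onTick once per tick, stopping after n ticks.
func runTicks(intervalMs int, n int, onTick func()) {
    ticker := time.NewTicker(time.Duration(intervalMs) * time.Millisecond)
    defer ticker.Stop()
    for i := 0; i < n; i++ {
        <-ticker.C
        onTick()
    }
}

func main() {
    ticks := 0
    // Changing the interval here requires no change to the tick consumer.
    runTicks(1, 5, func() { ticks++ })
    fmt.Println("ticks delivered:", ticks)
}
```

The consumer only counts ticks, so every timeout inside it is expressed as a number of ticks and scales with whatever interval the wrapper picks.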
The etcd-raft module receives this signal in raft.node.run: the tick arrives on the node.tickc channel, and run then calls raft.raft's tick:
// etcd-2.3.7/raft/node.go
// raftNode calls this method on node to feed in the periodic tick signal
func (n *node) Tick() {
    select {
    case n.tickc <- struct{}{}:
    case <-n.done:
    }
}
func (n *node) run(r *raft) {
    // ...
    for {
        // ...
        select {
        // ...
        case <-n.tickc:
            r.tick()
        // ...
        } // select
    } // for
}
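The select in Tick has two cases so that a caller never blocks forever once the node has stopped. The following sketch reproduces only that channel handshake; the bool return value, the limit parameter and the deliver helper are additions for illustration and are not part of etcd.

```go
package main

import "fmt"

// node mirrors only the two channels used by Tick in the excerpt above.
type node struct {
    tickc chan struct{}
    done  chan struct{}
}

// Tick reports whether the run loop accepted the tick (true) or the node
// had already stopped (false). The bool return is added for illustration.
func (n *node) Tick() bool {
    select {
    case n.tickc <- struct{}{}:
        return true
    case <-n.done:
        return false
    }
}

// run handles `limit` ticks and then stops the node by closing done.
func (n *node) run(limit int, tick func()) {
    for i := 0; i < limit; i++ {
        <-n.tickc
        tick()
    }
    close(n.done)
}

// deliver attempts more ticks than the loop will accept. attempts must be
// greater than limit so that the final Tick observes the closed done channel.
func deliver(limit, attempts int) (accepted, handled int) {
    n := &node{tickc: make(chan struct{}), done: make(chan struct{})}
    go n.run(limit, func() { handled++ })
    for i := 0; i < attempts; i++ {
        if n.Tick() {
            accepted++
        }
    }
    return accepted, handled
}

func main() {
    accepted, handled := deliver(3, 5)
    fmt.Printf("accepted=%d handled=%d\n", accepted, handled)
}
```

Because tickc is unbuffered, a tick is accepted only when the run loop is ready to receive it; after done is closed, every further Tick returns immediately.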
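After receiving a signal on tickc, raft.node ends up calling raft.tick. raft.tick is a function variable, so it points to a different function depending on the node's current role. This section covers elections, so the node is a follower or a candidate, and in both roles tick points to tickElection (only a leader uses tickHeartbeat). The following functions show where tick is reassigned: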
// etcd-2.3.7/raft/raft.go
func (r *raft) becomeFollower(term uint64, lead uint64) {
    r.step = stepFollower
    r.reset(term)
    r.tick = r.tickElection
    r.lead = lead
    r.state = StateFollower
    r.logger.Infof("%x became follower at term %d", r.id, r.Term)
}
func (r *raft) becomeCandidate() {
    // TODO(xiangli) remove the panic when the raft implementation is stable
    if r.state == StateLeader {
        panic("invalid transition [leader -> candidate]")
    }
    r.step = stepCandidate
    r.reset(r.Term + 1)
    r.tick = r.tickElection
    r.Vote = r.id
    r.state = StateCandidate
    r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}
func (r *raft) becomeLeader() {
    // TODO(xiangli) remove the panic when the raft implementation is stable
    if r.state == StateFollower {
        panic("invalid transition [follower -> leader]")
    }
    r.step = stepLeader
    r.reset(r.Term)
    r.tick = r.tickHeartbeat
    r.lead = r.id
    r.state = StateLeader
    ents, err := r.raftLog.entries(r.raftLog.committed+1, noLimit)
    if err != nil {
        r.logger.Panicf("unexpected error getting uncommitted entries (%v)", err)
    }
    for _, e := range ents {
        if e.Type != pb.EntryConfChange {
            continue
        }
        if r.pendingConf {
            panic("unexpected double uncommitted config entry")
        }
        r.pendingConf = true
    }
    r.appendEntry(pb.Entry{Data: nil})
    r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}
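The role-dependent behaviour above relies on nothing more than reassigning a function-valued field. A stripped-down sketch of the pattern, with hypothetical names (peer, role, become) and string return values in place of real work:

```go
package main

import "fmt"

type role int

const (
    follower role = iota
    candidate
    leader
)

// peer is a hypothetical miniature of raft: tick is a function variable
// that is swapped whenever the role changes.
type peer struct {
    state role
    tick  func() string
}

func (p *peer) tickElection() string  { return "election" }
func (p *peer) tickHeartbeat() string { return "heartbeat" }

// become plays the part of becomeFollower/becomeCandidate/becomeLeader:
// followers and candidates count towards an election, leaders send heartbeats.
func (p *peer) become(r role) {
    p.state = r
    if r == leader {
        p.tick = p.tickHeartbeat
    } else {
        p.tick = p.tickElection
    }
}

func main() {
    p := &peer{}
    for _, r := range []role{follower, candidate, leader} {
        p.become(r)
        fmt.Printf("role=%d tick=%s\n", r, p.tick())
    }
}
```

The caller always invokes p.tick() and never checks the role itself, which is why node.run can call r.tick() unconditionally.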
Next, tickElection: once the election timeout has elapsed, it builds a MsgHup message and passes it to raft.Step:
// etcd-2.3.7/raft/raft.go
// tickElection is run by followers and candidates after r.electionTimeout.
func (r *raft) tickElection() {
    if !r.promotable() {
        r.electionElapsed = 0
        return
    }
    r.electionElapsed++
    if r.isElectionTimeout() {
        r.electionElapsed = 0
        r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
    }
}
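The timeout is counted in ticks, not in wall-clock time. The sketch below shows that counting under a simplifying assumption: isElectionTimeout is not shown in this section, so the check here simply fires once the elapsed ticks reach the timeout, whereas Raft implementations randomize the election timeout to avoid split votes. The timer type and ticksUntilHup are hypothetical names.

```go
package main

import "fmt"

// timer is a hypothetical miniature of the two counters used by tickElection.
type timer struct {
    electionElapsed int
    electionTimeout int
}

// tick advances the counter and reports whether an election should start.
// Assumption: a plain ">=" comparison with no randomization.
func (t *timer) tick() bool {
    t.electionElapsed++
    if t.electionElapsed >= t.electionTimeout {
        t.electionElapsed = 0
        return true
    }
    return false
}

// ticksUntilHup counts how many ticks pass before the first election fires.
func ticksUntilHup(timeout int) int {
    t := &timer{electionTimeout: timeout}
    n := 1
    for !t.tick() {
        n++
    }
    return n
}

func main() {
    fmt.Println("ticks before MsgHup with timeout 10:", ticksUntilHup(10))
}
```

With a 100 ms tick interval, for example, a timeout of 10 ticks corresponds to about one second without hearing from a leader.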
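Now look at raft.Step. For a MsgHup message, the branch taken depends on the node's current state. The node has not started campaigning yet, so its state is StateFollower (the state is set in becomeFollower and becomeCandidate). Since that is not StateLeader, Step calls campaign to start competing for leadership: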
// etcd-2.3.7/raft/raft.go
func (r *raft) Step(m pb.Message) error {
    if m.Type == pb.MsgHup {
        if r.state != StateLeader {
            r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
            r.campaign()
        } else {
            r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
        }
        return nil
    }
    switch {
    case m.Term == 0:
        // local message
    case m.Term > r.Term:
        lead := m.From
        if m.Type == pb.MsgVote {
            lead = None
        }
        r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        r.becomeFollower(m.Term, lead)
    case m.Term < r.Term:
        // ignore
        r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        return nil
    }
    r.step(r, m)
    return nil
}
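In campaign, the first step is to switch from follower to candidate. The node then records a vote for itself, and unless that single vote already forms a quorum, it loops over every known peer and sends each one a vote request: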
// etcd-2.3.7/raft/raft.go
func (r *raft) campaign() {
    r.becomeCandidate()
    // poll is also called when vote responses are handled (TODO: follow up)
    if r.quorum() == r.poll(r.id, true) {
        r.becomeLeader()
        return
    }
    for id := range r.prs {
        if id == r.id {
            continue
        }
        r.logger.Infof("%x [logterm: %d, index: %d] sent vote request to %x at term %d",
            r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), id, r.Term)
        // message passing between peers is not the focus of this section, so it is not covered here
        r.send(pb.Message{To: id, Type: pb.MsgVote, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm()})
    }
}
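The early return in campaign only triggers when the candidate's own vote is already a majority. r.quorum() is not shown in this section, so the sketch assumes the usual majority formula n/2 + 1 over the number of known peers; quorum and winsImmediately are illustrative names.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voters.
// Assumption: majority is n/2 + 1 using integer division.
func quorum(n int) int {
    return n/2 + 1
}

// winsImmediately reports whether a candidate's own vote is enough,
// which is the case campaign handles before sending any MsgVote.
func winsImmediately(n int) bool {
    return quorum(n) == 1
}

func main() {
    for _, n := range []int{1, 2, 3, 4, 5} {
        fmt.Printf("cluster=%d quorum=%d self-vote wins=%v\n",
            n, quorum(n), winsImmediately(n))
    }
}
```

Under this assumption only a single-node cluster becomes leader without any network round trip; every larger cluster needs at least one MsgVoteResp.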
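The vote request built in campaign carries four main fields:
- Type: the message type, MsgVote, used for elections
- Index: the index of the candidate's last log entry
- LogTerm: the term of the candidate's last log entry
- To: the id of the peer the message is sent to
The candidate's own id (From) is filled in by send, which guarantees that every outgoing message carries the sender's id: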
// etcd-2.3.7/raft/raft.go
// send persists state to stable storage and then sends to its mailbox.
func (r *raft) send(m pb.Message) {
    // fill in the sender's (candidate's) id
    m.From = r.id
    // do not attach term to MsgProp
    // proposals are a way to forward to the leader and
    // should be treated as local message.
    if m.Type != pb.MsgProp {
        m.Term = r.Term
    }
    // append the message to the outgoing queue; it is eventually sent to the peer through streamWriter
    r.msgs = append(r.msgs, m)
}
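A small sketch of the stamping behaviour in send, using hypothetical stand-in types (message, sender, msgProp, msgVote) rather than the real pb.Message. It shows that From is always set, while Term is left at zero for proposals so that the receiver treats them as local messages.

```go
package main

import "fmt"

type msgType int

const (
    msgProp msgType = iota
    msgVote
)

// message is a hypothetical, reduced stand-in for pb.Message.
type message struct {
    From, To, Term uint64
    Type           msgType
}

// sender is a hypothetical, reduced stand-in for raft.
type sender struct {
    id, term uint64
    msgs     []message
}

// send stamps the sender id on every message and the current term on
// everything except proposals, then queues the message.
func (s *sender) send(m message) {
    m.From = s.id
    if m.Type != msgProp {
        m.Term = s.term
    }
    s.msgs = append(s.msgs, m)
}

func main() {
    s := &sender{id: 1, term: 5}
    s.send(message{To: 2, Type: msgVote})
    s.send(message{To: 2, Type: msgProp})
    for _, m := range s.msgs {
        fmt.Printf("type=%d from=%d to=%d term=%d\n", m.Type, m.From, m.To, m.Term)
    }
}
```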
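So far we have followed the path from detecting that the leader has failed to a follower becoming a candidate and campaigning. Next, how does a raft node handle a vote request that it receives? Vote requests have type MsgVote, and they are processed by raft's Step function: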
// etcd-2.3.7/raft/raft.go
func (r *raft) Step(m pb.Message) error {
    if m.Type == pb.MsgHup {
        if r.state != StateLeader {
            r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
            r.campaign()
        } else {
            r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
        }
        return nil
    }
    switch {
    case m.Term == 0:
        // local message
    case m.Term > r.Term:
        // a vote request with a higher term arrived: switch to follower, then run the follower's step
        lead := m.From
        if m.Type == pb.MsgVote {
            lead = None
        }
        r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        r.becomeFollower(m.Term, lead)
    case m.Term < r.Term:
        // ignore
        r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        return nil
    }
    // like tick, step is a function variable that points to a different function per role;
    // for a follower, becomeFollower sets step to stepFollower
    r.step(r, m)
    return nil
}
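The term switch in Step can be summarized as a three-way classification. The sketch below reduces it to a pure function; classify and its string results are illustrative labels, not etcd identifiers.

```go
package main

import "fmt"

// classify mirrors the term comparison in Step for any message other than
// MsgHup. The returned labels are made up for this example.
func classify(msgTerm, localTerm uint64) string {
    switch {
    case msgTerm == 0:
        return "local" // no term attached: treated as a local message
    case msgTerm > localTerm:
        return "become-follower" // adopt the higher term, then handle
    case msgTerm < localTerm:
        return "ignore" // stale sender: drop the message
    }
    return "handle" // same term: pass straight to r.step
}

func main() {
    fmt.Println(classify(0, 3))
    fmt.Println(classify(4, 3))
    fmt.Println(classify(2, 3))
    fmt.Println(classify(3, 3))
}
```

A candidate always sends MsgVote with its freshly incremented term, so a receiver that is still on the previous term takes the "become-follower" branch before its step function sees the request.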
stepFollower looks like this:
// etcd-2.3.7/raft/raft.go
func stepFollower(r *raft, m pb.Message) {
    switch m.Type {
    // ...
    // handle a vote request
    case pb.MsgVote:
        if (r.Vote == None || r.Vote == m.From) && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
            r.electionElapsed = 0
            r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] voted for %x [logterm: %d, index: %d] at term %d",
                r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.From, m.LogTerm, m.Index, r.Term)
            r.Vote = m.From
            r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp})
        } else {
            r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected vote from %x [logterm: %d, index: %d] at term %d",
                r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.From, m.LogTerm, m.Index, r.Term)
            r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp, Reject: true})
        }
    }
}
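As the code above shows, a follower grants its vote only if it has not already voted for a different node in this term and the candidate's log is at least as up to date as its own. The log comparison is in raftLog.isUpToDate: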
func (l *raftLog) isUpToDate(lasti, term uint64) bool {
    return term > l.lastTerm() || (term == l.lastTerm() && lasti >= l.lastIndex())
}
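A few worked cases make the comparison easier to read. The sketch copies the isUpToDate expression onto a hypothetical logPos struct and combines it with the vote check from stepFollower; grantVote and the constant none (assumed to be 0, meaning "no vote cast yet") are illustrative names.

```go
package main

import "fmt"

// logPos is a hypothetical stand-in for the tail of a raftLog.
type logPos struct {
    lastTerm, lastIndex uint64
}

// isUpToDate uses the same expression as raftLog.isUpToDate: a higher last
// term wins outright; with equal terms the longer (or equal) log wins.
func (l logPos) isUpToDate(lasti, term uint64) bool {
    return term > l.lastTerm || (term == l.lastTerm && lasti >= l.lastIndex)
}

// none marks "no vote cast yet" (assumed to be 0 for this example).
const none uint64 = 0

// grantVote combines the two conditions checked in stepFollower.
func grantVote(votedFor, from uint64, local logPos, candIndex, candTerm uint64) bool {
    return (votedFor == none || votedFor == from) && local.isUpToDate(candIndex, candTerm)
}

func main() {
    local := logPos{lastTerm: 3, lastIndex: 10}
    fmt.Println(local.isUpToDate(5, 4))   // higher term, shorter log
    fmt.Println(local.isUpToDate(10, 3))  // same term, same index
    fmt.Println(local.isUpToDate(9, 3))   // same term, shorter log
    fmt.Println(local.isUpToDate(100, 2)) // lower term, longer log
    fmt.Println(grantVote(2, 3, local, 12, 3)) // already voted for node 2
}
```

The first two comparisons are true and the next two are false: the last term dominates, and the index only breaks ties. The final call is false even though the candidate's log is up to date, because the vote for this term has already gone to another node.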
If both conditions are met, the follower replies with a MsgVoteResp message whose Reject field is left at its default, false:
r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp})
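The node competing for leadership is in the candidate role. Since a follower handles messages in stepFollower, it is a fair guess that a candidate handles them in stepCandidate, and becomeCandidate confirms this by setting step to stepCandidate. Its code: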
// etcd-2.3.7/raft/raft.go
func stepCandidate(r *raft, m pb.Message) {
    switch m.Type {
    // ...
    // a vote request from a competing candidate is rejected outright (Reject: true)
    case pb.MsgVote:
        r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected vote from %x [logterm: %d, index: %d] at term %d",
            r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.From, m.LogTerm, m.Index, r.Term)
        r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp, Reject: true})
    // handle a vote response; the tallying logic is in poll
    case pb.MsgVoteResp:
        gr := r.poll(m.From, !m.Reject)
        r.logger.Infof("%x [quorum:%d] has received %d votes and %d vote rejections", r.id, r.quorum(), gr, len(r.votes)-gr)
        // runs once per vote response: if rejections reach quorum, become follower;
        // if grants reach quorum, become leader and broadcast an append to announce it
        switch r.quorum() {
        case gr:
            r.becomeLeader()
            r.bcastAppend()
        case len(r.votes) - gr:
            r.becomeFollower(r.Term, None)
        }
    }
}
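poll is where the candidate records and counts votes. It is called once for every vote response received, and once in campaign for the candidate's own vote: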
func (r *raft) poll(id uint64, v bool) (granted int) {
    // ... logging omitted
    if _, ok := r.votes[id]; !ok {
        // id is the voter's node id; v is true for a granted vote and false for a
        // rejection (the caller passes !m.Reject); only the first response per node counts
        r.votes[id] = v
    }
    for _, vv := range r.votes {
        if vv {
            granted++
        }
    }
    // return the number of granted votes
    return granted
}