etcd-raft Leader Election (etcd 2.3.7)

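Raft leader election is driven by heartbeat timeouts. In etcd-raft the periodic tick signal comes from a timer started in the extended raftNode type (raft.node implements the raft.Node interface), not inside the raft package itself. raftNode is usually a customizable wrapper that adapts etcd-raft to an application's needs, so generating ticks there lets the heartbeat interval be changed without touching the core Raft protocol code. Here is EtcdServer's raftNode: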

// etcd-2.3.7/etcdserver/raft.go
// start prepares and starts raftNode in a new goroutine. It is no longer safe
// to modify the fields after it has been started.
// TODO: Ideally raftNode should get rid of the passed in server structure.
func (r *raftNode) start(s *EtcdServer) {
    // ...

    heartbeat := 200 * time.Millisecond
    if s.cfg != nil {
        heartbeat = time.Duration(s.cfg.TickMs) * time.Millisecond
    }   
    // set up contention detectors for raft heartbeat message.
    // expect to send a heartbeat within 2 heartbeat intervals.
    r.td = contention.NewTimeoutDetector(2 * heartbeat)

    go func() {
        var syncC <-chan time.Time

        defer r.onStop()
        islead := false

        for {
            select {
            case <-r.ticker:
                r.Tick()
            //....
            } // select
        } // for
    }()
}

The etcd-raft module receives this signal in raft.node.run: it reads from the node.tickc channel and then calls raft.raft's tick:

// etcd-2.3.7/raft/node.go
// raftNode calls this method to push each periodic tick into node.
func (n *node) Tick() {
    select {
    case n.tickc <- struct{}{}:
    case <-n.done:
    }
}

func (n *node) run(r *raft) {
    // ...

    for {
        // ...
        select {
        // ...
        case <-n.tickc:
            r.tick()
        // ...
        } // select
    } // for
}
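The Tick/run handoff above is just an unbuffered channel plus a done channel. The sketch below uses a hypothetical `miniNode` type (not etcd's `node`) to show the consequence: `Tick` blocks until `run` consumes the signal, and returns instead of blocking forever once the node has stopped. The `stop` channel is an assumption standing in for etcd's own shutdown path.

```go
package main

import "fmt"

// miniNode is a hypothetical, stripped-down analogue of node in raft/node.go.
type miniNode struct {
	tickc chan struct{}
	done  chan struct{}
	ticks int // written only by the run goroutine before done is closed
}

func newMiniNode() *miniNode {
	return &miniNode{tickc: make(chan struct{}), done: make(chan struct{})}
}

// Tick hands one tick to the run loop, or gives up if the node has stopped.
func (n *miniNode) Tick() {
	select {
	case n.tickc <- struct{}{}:
	case <-n.done:
	}
}

// run consumes ticks until stop is closed, then closes done.
func (n *miniNode) run(stop <-chan struct{}) {
	defer close(n.done)
	for {
		select {
		case <-n.tickc:
			n.ticks++ // etcd calls r.tick() here
		case <-stop:
			return
		}
	}
}

func main() {
	n := newMiniNode()
	stop := make(chan struct{})
	go n.run(stop)
	n.Tick()
	n.Tick()
	close(stop)
	<-n.done
	n.Tick() // returns immediately: done is closed and nobody reads tickc
	fmt.Println("ticks:", n.ticks) // ticks: 2
}
```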

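After raft.node receives a signal on tickc it calls raft.tick. raft.tick is a function variable whose target depends on the node's current role. Because this section is about elections, the node of interest is a follower (or a candidate whose election has not concluded yet); in both roles tick points to tickElection, and only a leader uses tickHeartbeat. The following functions show how tick is reassigned: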

// etcd-2.3.7/raft/raft.go

func (r *raft) becomeFollower(term uint64, lead uint64) {
    r.step = stepFollower
    r.reset(term)
    r.tick = r.tickElection
    r.lead = lead
    r.state = StateFollower
    r.logger.Infof("%x became follower at term %d", r.id, r.Term)
}

func (r *raft) becomeCandidate() {
    // TODO(xiangli) remove the panic when the raft implementation is stable
    if r.state == StateLeader {
        panic("invalid transition [leader -> candidate]")
    }   
    r.step = stepCandidate
    r.reset(r.Term + 1)
    r.tick = r.tickElection
    r.Vote = r.id
    r.state = StateCandidate
    r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}
func (r *raft) becomeLeader() {
    // TODO(xiangli) remove the panic when the raft implementation is stable
    if r.state == StateFollower {
        panic("invalid transition [follower -> leader]")
    }
    r.step = stepLeader
    r.reset(r.Term)
    r.tick = r.tickHeartbeat
    r.lead = r.id
    r.state = StateLeader
    ents, err := r.raftLog.entries(r.raftLog.committed+1, noLimit)
    if err != nil {
        r.logger.Panicf("unexpected error getting uncommitted entries (%v)", err)
    }

    for _, e := range ents {
        if e.Type != pb.EntryConfChange {
            continue
        }
        if r.pendingConf {
            panic("unexpected double uncommitted config entry")
        }
        r.pendingConf = true
    }
    r.appendEntry(pb.Entry{Data: nil})
    r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}
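Reassigning `r.tick` (and `r.step`) in each `become*` function is a small state-pattern trick: callers always invoke `r.tick()` and the current role decides what it means. A minimal sketch with hypothetical, heavily simplified fields; etcd's `reset`, which zeroes the elapsed counters, is deliberately left out:

```go
package main

import "fmt"

type role int

const (
	follower role = iota
	candidate
	leader
)

// node is a hypothetical simplification of raft.raft: only tick dispatch is modelled.
type node struct {
	state            role
	tick             func()
	electionElapsed  int
	heartbeatElapsed int
}

func (n *node) tickElection()  { n.electionElapsed++ }
func (n *node) tickHeartbeat() { n.heartbeatElapsed++ }

// The become* methods swap the tick target; reset() is omitted for brevity.
func (n *node) becomeFollower() { n.state, n.tick = follower, n.tickElection }

func (n *node) becomeCandidate() {
	if n.state == leader {
		panic("invalid transition [leader -> candidate]")
	}
	n.state, n.tick = candidate, n.tickElection
}

func (n *node) becomeLeader() {
	if n.state == follower {
		panic("invalid transition [follower -> leader]")
	}
	n.state, n.tick = leader, n.tickHeartbeat
}

func main() {
	n := &node{}
	n.becomeFollower()
	n.tick() // counts toward an election timeout
	n.becomeCandidate()
	n.tick()
	n.becomeLeader()
	n.tick() // now counts heartbeat intervals instead
	fmt.Println(n.electionElapsed, n.heartbeatElapsed) // 2 1
}
```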

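Next, tickElection. On each tick it increments electionElapsed; once the election timeout has elapsed it resets the counter, builds a MsgHup message, and passes it to raft.Step: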

// etcd-2.3.7/raft/raft.go

// tickElection is run by followers and candidates after r.electionTimeout.
func (r *raft) tickElection() {
    if !r.promotable() {
        r.electionElapsed = 0
        return
    }
    r.electionElapsed++
    if r.isElectionTimeout() {
        r.electionElapsed = 0
        r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
    }
}
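tickElection is essentially a counter compared against a timeout. The sketch below assumes a fixed `electionTimeout` of a few ticks so the behaviour is deterministic; etcd's isElectionTimeout randomizes the effective timeout to reduce split votes, which is not modelled. `promotable()` is assumed to be always true, and `electionTimer`/`hupCount` are hypothetical names.

```go
package main

import "fmt"

// electionTimer is a hypothetical model of the election-tick logic in raft.go.
type electionTimer struct {
	electionElapsed int
	electionTimeout int // fixed here; etcd randomizes the effective timeout
	hupCount        int // how many MsgHup messages would have been stepped
}

func (t *electionTimer) tickElection() {
	t.electionElapsed++
	if t.electionElapsed >= t.electionTimeout {
		t.electionElapsed = 0
		t.hupCount++ // etcd: r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
	}
}

func main() {
	t := &electionTimer{electionTimeout: 3}
	for i := 0; i < 7; i++ {
		t.tickElection()
	}
	fmt.Println(t.hupCount, t.electionElapsed) // 2 1
}
```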

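Now raft.Step. For a MsgHup message it branches on the node's current state: any node that is not already leader starts an election. Since the node has not started campaigning yet, its state is StateFollower (see becomeFollower and becomeCandidate above), so it calls campaign to compete for leadership: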

// etcd-2.3.7/raft/raft.go

func (r *raft) Step(m pb.Message) error {
    if m.Type == pb.MsgHup {
        if r.state != StateLeader {
            r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
            r.campaign()
        } else {
            r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
        }
        return nil 
    }   

    switch {
    case m.Term == 0:
        // local message
    case m.Term > r.Term:
        lead := m.From
        if m.Type == pb.MsgVote {
            lead = None
        }
        r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        r.becomeFollower(m.Term, lead)
    case m.Term < r.Term:
        // ignore
        r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        return nil 
    }   
    r.step(r, m)
    return nil 
}
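The term handling at the top of Step can be restated as a small decision function. This is a hypothetical sketch (`msg`, `msgType`, and `classify` are not etcd identifiers) covering the MsgHup shortcut and the three term branches; note how a MsgVote at a higher term does not record the sender as leader.

```go
package main

import "fmt"

type msgType int

const (
	msgHup msgType = iota
	msgVote
	msgApp
)

type msg struct {
	typ  msgType
	term uint64
	from uint64
}

const none uint64 = 0

// classify mirrors the branches at the top of raft.Step.
// It returns what Step would do and, for the higher-term case, the lead it records.
func classify(curTerm uint64, isLeader bool, m msg) (string, uint64) {
	if m.typ == msgHup {
		if isLeader {
			return "ignore hup (already leader)", none
		}
		return "campaign", none
	}
	switch {
	case m.term == 0:
		return "local message: step", none
	case m.term > curTerm:
		lead := m.from
		if m.typ == msgVote {
			lead = none // a vote request does not mean the sender is leader
		}
		return "become follower, then step", lead
	case m.term < curTerm:
		return "ignore stale message", none
	}
	return "step", none // same term
}

func main() {
	fmt.Println(classify(5, false, msg{typ: msgVote, term: 6, from: 2}))
	fmt.Println(classify(5, false, msg{typ: msgApp, term: 6, from: 3}))
	fmt.Println(classify(5, false, msg{typ: msgApp, term: 4, from: 3}))
}
```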

Next, the implementation of campaign. Its first step is switching from follower to candidate; it then loops over every known peer and sends each one a vote request:

// etcd-2.3.7/raft/raft.go

func (r *raft) campaign() {
    r.becomeCandidate()

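    // poll is also called when vote responses are handled (see stepCandidate).
    // If the self-vote alone is a quorum (a single-member cluster), become leader at once.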
    if r.quorum() == r.poll(r.id, true) {
        r.becomeLeader()
        return
    }
    for id := range r.prs {
        if id == r.id {
            continue
        }
        r.logger.Infof("%x [logterm: %d, index: %d] sent vote request to %x at term %d",
            r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), id, r.Term)
        // Message transport between peers is outside the scope of this section.
        r.send(pb.Message{To: id, Type: pb.MsgVote, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm()})
    }
}
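The control flow of campaign can be sketched as follows. This is a hypothetical simplification: peers are a plain map, messages are plain structs, and `quorum` is the majority `len(peers)/2 + 1`, matching how etcd-raft computes it. The sketch fills in From and Term directly, whereas etcd leaves that to send. The single-node case shows why the candidate polls itself before sending anything.

```go
package main

import "fmt"

type voteReq struct {
	To, From, Term, Index, LogTerm uint64
}

// candidate is a hypothetical model of the fields campaign touches.
type candidate struct {
	id, term            uint64
	lastIndex, lastTerm uint64
	peers               map[uint64]bool // all cluster members, including self
	votes               map[uint64]bool
	isLeader            bool
	outbox              []voteReq
}

func (c *candidate) quorum() int { return len(c.peers)/2 + 1 }

// poll records the first answer per voter and returns the number of grants.
func (c *candidate) poll(id uint64, v bool) int {
	if _, ok := c.votes[id]; !ok {
		c.votes[id] = v
	}
	granted := 0
	for _, vv := range c.votes {
		if vv {
			granted++
		}
	}
	return granted
}

func (c *candidate) campaign() {
	// becomeCandidate: bump the term and clear votes from the previous term.
	c.term++
	c.votes = map[uint64]bool{}
	if c.quorum() == c.poll(c.id, true) {
		c.isLeader = true // only possible in a single-member cluster
		return
	}
	for id := range c.peers {
		if id == c.id {
			continue
		}
		// From and Term are stamped by send in etcd; filled in directly here.
		c.outbox = append(c.outbox, voteReq{To: id, From: c.id, Term: c.term,
			Index: c.lastIndex, LogTerm: c.lastTerm})
	}
}

func main() {
	solo := &candidate{id: 1, peers: map[uint64]bool{1: true}}
	solo.campaign()
	fmt.Println("single node leader:", solo.isLeader)

	three := &candidate{id: 1, lastIndex: 10, lastTerm: 2,
		peers: map[uint64]bool{1: true, 2: true, 3: true}}
	three.campaign()
	fmt.Println("vote requests sent:", len(three.outbox), "term:", three.term)
}
```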

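The vote request built in campaign carries four main fields:

  • Type: the message type MsgVote, used for elections
  • Index: the index of the candidate's last log entry
  • LogTerm: the term of the candidate's last log entry
  • To: the peer the request is sent to

The candidate's own id (From), together with its current Term, is filled in by send, which guarantees that every outgoing message carries the sender's id: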
// etcd-2.3.7/raft/raft.go

// send persists state to stable storage and then sends to its mailbox.
func (r *raft) send(m pb.Message) {
    // Fill in the sender's (here, the candidate's) id.
    m.From = r.id
    // do not attach term to MsgProp
    // proposals are a way to forward to the leader and
    // should be treated as local message.
    if m.Type != pb.MsgProp {
        m.Term = r.Term
    }   
    // Append to the outgoing queue; streamWriter eventually delivers it to the peer.
    r.msgs = append(r.msgs, m)
}
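Centralizing the From/Term stamping in send means no caller can forget it. A minimal sketch with hypothetical types (`message`, `sender`); as in the original, MsgProp is the one type left without a term:

```go
package main

import "fmt"

type msgType int

const (
	msgVote msgType = iota
	msgProp
)

type message struct {
	Type           msgType
	To, From, Term uint64
}

// sender is a hypothetical stand-in for the raft fields send uses.
type sender struct {
	id, term uint64
	msgs     []message // outgoing queue, drained later by the transport
}

func (s *sender) send(m message) {
	m.From = s.id
	if m.Type != msgProp { // proposals are forwarded as local messages
		m.Term = s.term
	}
	s.msgs = append(s.msgs, m)
}

func main() {
	s := &sender{id: 1, term: 4}
	s.send(message{Type: msgVote, To: 2})
	s.send(message{Type: msgProp, To: 3})
	for _, m := range s.msgs {
		fmt.Printf("%+v\n", m)
	}
}
```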

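The above covers the path from detecting that the leader has failed to a follower becoming a candidate and campaigning. Next, let's see how a raft node handles a vote request when it receives one. Vote requests have type MsgVote and are processed by raft.Step: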

// etcd-2.3.7/raft/raft.go

func (r *raft) Step(m pb.Message) error {
    if m.Type == pb.MsgHup {
        if r.state != StateLeader {
            r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
            r.campaign()
        } else {
            r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
        }
        return nil
    }

    switch {
    case m.Term == 0:
        // local message
    case m.Term > r.Term:
        // A vote request with a higher term makes this node step down to follower, then run its role-specific step.
        lead := m.From
        if m.Type == pb.MsgVote {
            lead = None
        }
        r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        r.becomeFollower(m.Term, lead)
    case m.Term < r.Term:
        // ignore
        r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
            r.id, r.Term, m.Type, m.From, m.Term)
        return nil
    }
    // Like tick, step is a function variable that points to a different function per role; as becomeFollower shows, a follower's step is stepFollower.
    r.step(r, m)
    return nil
}
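One consequence of the higher-term branch is easy to miss: becomeFollower calls reset(term), and in etcd-raft reset clears r.Vote when the term changes (reset itself is not shown in this article). So a node that already voted in term 5 can vote again when a term-6 request arrives. A hypothetical sketch (`voter`, `onVoteRequest`), with the log-freshness check deliberately left out:

```go
package main

import "fmt"

const none uint64 = 0

// voter is a hypothetical model of the term/vote fields on the receiving node.
type voter struct {
	term, vote uint64
}

// reset mimics the relevant part of raft.reset: a new term clears the vote.
func (v *voter) reset(term uint64) {
	if v.term != term {
		v.term = term
		v.vote = none
	}
}

// onVoteRequest applies the term rules from Step, then a simplified
// stepFollower check that ignores log freshness.
func (v *voter) onVoteRequest(from, term uint64) bool {
	if term < v.term {
		return false // stale request; etcd ignores it without replying
	}
	if term > v.term {
		v.reset(term) // becomeFollower(m.Term, None)
	}
	if v.vote == none || v.vote == from {
		v.vote = from
		return true
	}
	return false
}

func main() {
	v := &voter{term: 5, vote: 2}      // already voted for node 2 in term 5
	fmt.Println(v.onVoteRequest(3, 5)) // false: same term, vote already cast
	fmt.Println(v.onVoteRequest(3, 6)) // true: the new term cleared the old vote
}
```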

stepFollower looks like this:

// etcd-2.3.7/raft/raft.go

func stepFollower(r *raft, m pb.Message) {
    switch m.Type {
    //...
    // Handle vote requests.
    case pb.MsgVote:
        if (r.Vote == None || r.Vote == m.From) && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
            r.electionElapsed = 0
            r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] voted for %x [logterm: %d, index: %d] at term %d",
                r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.From, m.LogTerm, m.Index, r.Term)
            r.Vote = m.From
            r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp})
        } else {
            r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected vote from %x [logterm: %d, index: %d] at term %d",
                r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.From, m.LogTerm, m.Index, r.Term)
            r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp, Reject: true})
        }
    }
}   
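Putting the MsgVote branch of stepFollower together: a follower grants at most one vote per term, only to a candidate whose log is at least as up to date as its own, and resets its election timer when it grants. Hypothetical sketch: the log check is inlined from isUpToDate, and the reply is a plain `voteResp` struct rather than pb.Message.

```go
package main

import "fmt"

const none uint64 = 0

type voteResp struct {
	To     uint64
	Reject bool
}

// follower is a hypothetical model of the fields stepFollower reads and writes.
type follower struct {
	vote                uint64
	lastIndex, lastTerm uint64
	electionElapsed     int
}

func (f *follower) isUpToDate(lasti, term uint64) bool {
	return term > f.lastTerm || (term == f.lastTerm && lasti >= f.lastIndex)
}

// handleVote mirrors the pb.MsgVote case of stepFollower.
func (f *follower) handleVote(from, index, logTerm uint64) voteResp {
	if (f.vote == none || f.vote == from) && f.isUpToDate(index, logTerm) {
		f.electionElapsed = 0 // granting a vote postpones our own election
		f.vote = from
		return voteResp{To: from}
	}
	return voteResp{To: from, Reject: true}
}

func main() {
	f := &follower{lastIndex: 8, lastTerm: 3, electionElapsed: 4}
	fmt.Printf("%+v\n", f.handleVote(2, 7, 3)) // log behind: rejected
	fmt.Printf("%+v\n", f.handleVote(3, 8, 3)) // equal log: granted
	fmt.Printf("%+v\n", f.handleVote(4, 9, 4)) // already voted for 3: rejected
}
```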

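As the code shows, a follower grants its vote only if it has not already voted for a different candidate in this term and the candidate's log passes raftLog.isUpToDate: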

func (l *raftLog) isUpToDate(lasti, term uint64) bool {
    return term > l.lastTerm() || (term == l.lastTerm() && lasti >= l.lastIndex())
}
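isUpToDate is Raft's log-freshness rule: compare the terms of the two last entries first, and only when they are equal compare the last indexes. Below, the same expression is extracted into a free function with plain parameters (a hypothetical refactoring, since the original reads the local side from the raftLog receiver) and run over a few cases:

```go
package main

import "fmt"

// isUpToDate reports whether a candidate log ending at (candIndex, candTerm)
// is at least as up to date as a local log ending at (myIndex, myTerm).
func isUpToDate(candIndex, candTerm, myIndex, myTerm uint64) bool {
	return candTerm > myTerm || (candTerm == myTerm && candIndex >= myIndex)
}

func main() {
	cases := []struct {
		name                                 string
		candIndex, candTerm, myIndex, myTerm uint64
	}{
		{"higher last term, shorter log", 3, 5, 10, 4},
		{"same term, longer log", 12, 4, 10, 4},
		{"same term, same length", 10, 4, 10, 4},
		{"same term, shorter log", 9, 4, 10, 4},
		{"lower last term, longer log", 20, 3, 10, 4},
	}
	for _, c := range cases {
		fmt.Printf("%-30s -> %v\n", c.name, isUpToDate(c.candIndex, c.candTerm, c.myIndex, c.myTerm))
	}
}
```

The first three cases are granted and the last two rejected: a longer log never compensates for an older last term.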

If both conditions hold, the follower replies with a MsgVoteResp message whose Reject field is false (its zero value):

r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp})

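The node campaigning for leadership is in the candidate role. Since a follower handles messages in stepFollower, we can reasonably guess that a candidate uses stepCandidate, and becomeCandidate confirms it (r.step = stepCandidate). The code is: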

// etcd-2.3.7/raft/raft.go

func stepCandidate(r *raft, m pb.Message) {
    switch m.Type {
    // ...
    // Reject vote requests from rival candidates at the same term (Reject: true).
    case pb.MsgVote:
        r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected vote from %x [logterm: %d, index: %d] at term %d",
            r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.From, m.LogTerm, m.Index, r.Term)
        r.send(pb.Message{To: m.From, Type: pb.MsgVoteResp, Reject: true})
    // Handle vote responses; collecting and counting votes happens in poll.
    case pb.MsgVoteResp:
        gr := r.poll(m.From, !m.Reject)
        r.logger.Infof("%x [quorum:%d] has received %d votes and %d vote rejections", r.id, r.quorum(), gr, len(r.votes)-gr)
        // Runs once per vote response: if granted votes reach quorum, become leader and broadcast appends to announce it; if rejections reach quorum, step down to follower.
        switch r.quorum() {
        case gr:
            r.becomeLeader()
            r.bcastAppend()
        case len(r.votes) - gr:
            r.becomeFollower(r.Term, None)
        }
    }
}
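The `switch r.quorum()` in stepCandidate is a compact way of asking "has either side reached a majority?". Because votes arrive one at a time, each count passes through the quorum value exactly, so equality is enough; and since grants plus rejections can never exceed the cluster size, both cases cannot match at once. A hypothetical sketch (`outcome` is not an etcd function) that takes the cluster size and counts directly:

```go
package main

import "fmt"

// outcome mirrors `switch r.quorum() { case gr: ... case len(r.votes) - gr: ... }`.
// total is len(r.votes): every response recorded so far, including the self-vote.
func outcome(clusterSize, granted, total int) string {
	quorum := clusterSize/2 + 1
	switch quorum {
	case granted:
		return "leader"
	case total - granted:
		return "follower"
	}
	return "candidate" // keep waiting for more responses
}

func main() {
	// 5-node cluster, quorum 3. The candidate's own vote is already counted.
	fmt.Println(outcome(5, 2, 2)) // self + one grant: still candidate
	fmt.Println(outcome(5, 3, 4)) // third grant arrives: leader
	fmt.Println(outcome(5, 1, 4)) // three rejections: follower
}
```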

poll is how the candidate collects and tallies votes; it is called once for every vote received:

func (r *raft) poll(id uint64, v bool) (granted int) {
    // ... logging omitted
    if _, ok := r.votes[id]; !ok {
        // id is the voter's node id; v is !m.Reject, so true means granted and false means rejected.
        // Only the first response from each voter is recorded.
        r.votes[id] = v
    }
    for _, vv := range r.votes {
        if vv {
            granted++
        }
    }
    // Return the number of granted votes.
    return granted
}
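Because poll records only the first response from each voter (the `if _, ok := r.votes[id]; !ok` guard), a duplicated or retransmitted MsgVoteResp cannot be counted twice, and a later answer cannot overwrite an earlier one. A minimal sketch of that behaviour with a free-standing votes map (a hypothetical refactoring of the method):

```go
package main

import "fmt"

// poll records voter id's answer (true = granted) once, then counts grants.
func poll(votes map[uint64]bool, id uint64, granted bool) int {
	if _, ok := votes[id]; !ok {
		votes[id] = granted
	}
	n := 0
	for _, v := range votes {
		if v {
			n++
		}
	}
	return n
}

func main() {
	votes := map[uint64]bool{}
	fmt.Println(poll(votes, 1, true))  // candidate votes for itself: 1
	fmt.Println(poll(votes, 2, true))  // 2
	fmt.Println(poll(votes, 2, true))  // duplicate from node 2 ignored: 2
	fmt.Println(poll(votes, 3, false)) // a rejection leaves grants at 2
	fmt.Println(len(votes)-2, "rejection(s)")
}
```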


Reposted from blog.csdn.net/u010154685/article/details/80853530