The log of etcd's raft implementation

1. Introduction

I don't think I need to explain what raft is: there are plenty of articles about it on the Internet, and no shortage of implementations. I chose etcd's implementation for a simple reason: I dream of developing a distributed operating system, and the core of such a system requires highly reliable, high-performance object storage. etcd is a very good project, and many other good projects (such as kubernetes) are built on it, so I set out to understand etcd in more depth.

Before the main content begins, the author gives a brief introduction to several concepts (agreeing on terminology first makes the rest easier to follow):

  1. User: etcd encapsulates raft in an independent package that implements only the raft algorithm itself. The author also wants to analyze etcd's raft package as an independent package, so this article uses the concept of a user of that package. Readers who find this too abstract can simply read "user" as etcd.
  2. Index: This mainly refers to the log index, an unsigned 64-bit integer. Log indexes increase in the order in which entries are created, and the raft leader is responsible for keeping them ordered and monotonically increasing. Other articles explain this in detail.
  3. Commit: Readers can look up the two-phase commit protocol for background. Roughly, the leader commits a log entry once more than half of the nodes have acknowledged receiving it. Raft considers a committed entry reliable because a majority of nodes hold it. The commit index is the largest index among all committed entries.
  4. Apply: As mentioned in item 1, raft is an independent package with no business logic, and raft itself has nothing to do with object storage. etcd encapsulates a user's PUT operation in a log entry and replicates it to all etcd nodes through raft. The request is complete only when an etcd node obtains the entry and executes the PUT. "The etcd node obtains the entry and executes it" is what this article calls applying.
  5. Term: Item 2 mentioned the leader, so the concept of term needs a brief introduction. Raft chooses its leader through elections, and each round of election needs a unique identifier: the term. It is the same idea as a leader's term of office in the news. The term is also an incrementing unsigned 64-bit value and stays unique from the moment the cluster is created.
  6. Node: Because raft is a distributed consensus algorithm, it is generally used in clusters of 3, 5, 7, ... nodes. The raft user and raft itself are compiled and linked into one program, so the node raft runs on is the node the user runs on.
  7. Snapshot: A snapshot can be understood as the user's full state at a certain moment. Readers familiar with video codecs can think of it as a key frame. The user's current state equals the most recent snapshot with the log entries after it applied one by one, and the entries before the snapshot are no longer needed. This makes reliable log storage possible in limited space; otherwise an unbounded number of entries would consume unbounded storage. Snapshots also make recovery more efficient, since replaying every entry from index 0 would be far too slow. By default, etcd takes a snapshot every 10,000 entries.
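
The commit index from item 3 can be sketched in a few lines. This is only an illustration, not etcd's code: `quorumCommitIndex` and `matchIndexes` are hypothetical names, and the sketch assumes each value is the highest log index known to be stored on one node (the leader included).

```go
package main

import (
	"fmt"
	"sort"
)

// quorumCommitIndex returns the largest index stored on a majority of nodes.
// matchIndexes[i] is the highest log index known to be stored on node i.
// (Hypothetical helper for illustration; not part of etcd's raft package.)
func quorumCommitIndex(matchIndexes []uint64) uint64 {
	if len(matchIndexes) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), matchIndexes...)
	// Sort descending so that sorted[k] is stored on at least k+1 nodes.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Five nodes: three of them hold index 7 or higher, so 7 can be committed.
	fmt.Println(quorumCommitIndex([]uint64{9, 7, 8, 3, 2}))
}
```

The sketch ignores terms. The maybeCommit() function shown later in this article additionally compares the term of the entry before advancing the commit index.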

2. Analysis

2.1 What is log

This log is not that kind of log. The log this article focuses on is not the one used for debugging but raft's business log, and it holds binary data rather than text. The concept can be illustrated by the following figure:

The left side of the figure shows the state of the system at time A, and the right side shows the state at time B. Applying updates 1, 2, 3, ... to the system at time B brings it to the state at time A (this is why articles about raft mention state machines so often). Here the contents of 1, 2, 3, ... are log entries, their sequence numbers are the indexes, and the state at time B is the snapshot.

Note that raft does not implement any specific service; it only lets multiple nodes apply the same log in the same order so that their states stay consistent. Raft does not care about the content of an entry at all; that is the user's concern. This is visible in the protobuf definition of a single entry:
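
A minimal sketch of this idea, assuming a toy state machine that is just an integer counter. The `entry`, `snapshot`, and `replay` names are made up for illustration and have no counterpart in etcd:

```go
package main

import "fmt"

// entry is a toy log entry meaning "add Delta to the counter".
type entry struct {
	Index uint64
	Delta int
}

// snapshot is the full state of the toy system as of snapshot.Index.
type snapshot struct {
	Index uint64
	Value int
}

// replay rebuilds the state by starting from the snapshot and applying,
// in index order, every entry that comes after it.
func replay(s snapshot, ents []entry) int {
	state := s.Value
	for _, e := range ents {
		if e.Index <= s.Index {
			continue // already covered by the snapshot
		}
		state += e.Delta
	}
	return state
}

func main() {
	snap := snapshot{Index: 2, Value: 10} // the earlier state (time B)
	ents := []entry{{Index: 3, Delta: 5}, {Index: 4, Delta: -1}}
	fmt.Println(replay(snap, ents)) // the later state (time A)
}
```

Every node that starts from the same snapshot and applies the same entries in the same order ends up with the same value, which is the whole point of replicating the log.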

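// Source: go.etcd.io/etcd/raft/raftpb/raft.proto
// In etcd's raft a single log record is called an entry, while "log" is the collective name.
enum EntryType {
  EntryNormal     = 0;  // normal entry
  EntryConfChange = 1;  // configuration change entry
}
// Definition of a single entry: term, index, type, and the data itself. An Entry can be read as
// "the entry with index Index, produced by the leader of term Term, of type Type, with content Data".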
message Entry {
  optional uint64     Term  = 2 [(gogoproto.nullable) = false]; // must be 64-bit aligned for atomic operations
  optional uint64     Index = 3 [(gogoproto.nullable) = false]; // must be 64-bit aligned for atomic operations
  optional EntryType  Type  = 1 [(gogoproto.nullable) = false];
  optional bytes      Data  = 4;
}

As the code above shows, Entry.Data is a byte array. The user is responsible for serializing a business operation (for example an insert, delete, or update in etcd) into bytes, and for deserializing it and performing the corresponding operation when the entry is applied. The other fields of Entry are explained in later sections and in other articles.
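
The round trip through Entry.Data can be sketched as follows. The `putOp` type and the use of JSON are assumptions made purely for brevity; raft accepts any encoding because it never looks inside the bytes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// putOp is a hypothetical business operation; raft never looks inside it.
type putOp struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// encode turns the operation into the opaque bytes that would go in Entry.Data.
func encode(op putOp) ([]byte, error) {
	return json.Marshal(op)
}

// apply decodes the bytes of a committed entry and executes the operation.
func apply(store map[string]string, data []byte) error {
	var op putOp
	if err := json.Unmarshal(data, &op); err != nil {
		return err
	}
	store[op.Key] = op.Value
	return nil
}

func main() {
	data, err := encode(putOp{Key: "name", Value: "etcd"})
	if err != nil {
		panic(err)
	}
	store := map[string]string{}
	if err := apply(store, data); err != nil {
		panic(err)
	}
	fmt.Println(store["name"])
}
```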

2.2 log implementation

In fact, the log module in etcd's raft package only implements simple management of Entry. In C++ terms, it amounts to various operations on a std::vector<Entry>: new entries are appended at the tail, and entries are popped from the head and handed to etcd to apply. The whole process looks extremely simple, and in a single-machine (more precisely, single-process) system it really is. It becomes somewhat more complicated in a distributed stateful system. A simple example: when the leader sends entries to other nodes, should those nodes simply append them? What if an entry turns out to be invalid? Readers who do not see how an entry can become invalid can look up the two-phase and three-phase commit protocols. Raft's replication resembles a two-phase commit: the leader must confirm that more than half of the nodes have received an entry before it tells its peers to commit it. If the leader crashes before sending the commit, a new election starts, and the entries that the old leader replicated but never confirmed as committed may become invalid.

Although log management becomes more complicated, the core principle is still a set of operations on std::vector<Entry> in a slightly different form. Let's start the analysis from the definition of the log, shown in the following code:

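// Source: go.etcd.io/etcd/raft/log.go
type raftLog struct {
  // From the source comment: storage contains all stable entries since the last snapshot.
  // Readers may assume this is where the log is persisted. It does store the log, but it has no
  // persistence of its own; raft does not care how the log is persisted. Think of storage as a
  // cache of the log: raft reads recent entries through it. In other words, the entries in storage
  // also exist in the user's persistent store, and when raft needs them it does not have to touch
  // that store, which is efficient and fully decoupled from the persistence layer. How many entries
  // does storage cache? Everything from the last snapshot up to now, plus the last snapshot itself.
  // This can be seen in the MemoryStorage implementation of Storage, described in a later section.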
  storage Storage

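  // From the source comment: in contrast to storage, entries that the user has not yet persisted
  // are kept in unstable. storage is a cache of persisted entries, which are reliable because they
  // have been persisted. Persistence takes time, however, and raft hands the entries to the user
  // asynchronously (a later article covers the implementation). Until the user notifies raft that
  // persistence is complete, the entries are not yet reliable and are managed by unstable.
  // Once the user has persisted them, they are removed from unstable and recorded in storage.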
  unstable unstable

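  // From the source comment: the highest index among the entries held in raftLog.storage on more
  // than half of the peers. This value is the commit index as known to the current node. Note that
  // the commit index is a cluster-wide state rather than a per-node state, but the leader's
  // broadcasts of it reach nodes at different times for various reasons, so each node may know a
  // different commit index.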
  committed uint64
  
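  // "Apply" was introduced earlier: storing entries in order is not the goal, applying them in
  // order is. Committed entries must be applied by the user, which means deserializing the entry
  // data and executing it on the system. applied is the highest index this node has applied. The
  // applied index is a per-node state (not a cluster state) and depends on each node's apply speed.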
  applied uint64

  // Needs no explanation: used to print runtime logs.
  logger Logger

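  // raftLog has a method nextEnts() that returns all entries in (applied, committed]; it is
  // clearly called when applying entries. maxNextEntsSize caps the total size of the entries
  // returned, so that a single call does not produce an overly large batch to apply.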
  maxNextEntsSize uint64
}

As the definition of raftLog shows, apart from the committed and applied indexes, the core consists of storage and unstable. Both can be regarded as a std::vector<Entry>, so raft manages the log in layers, and each layer can be abstracted as basic queue operations.
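
As a side note on the maxNextEntsSize field in the definition above, the size cap can be sketched like this. `limitBySize` is a hypothetical stand-in for the limitSize() helper that slice() calls later in this article; the rule that at least one entry is always returned is an assumption of the sketch, chosen so that a caller always makes progress.

```go
package main

import "fmt"

// limitBySize returns the longest prefix of ents whose total byte size does
// not exceed maxSize, but never fewer than one entry when ents is non-empty.
// (Toy version working on raw byte slices instead of pb.Entry.)
func limitBySize(ents [][]byte, maxSize uint64) [][]byte {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(len(ents[0]))
	limit := 1
	for ; limit < len(ents); limit++ {
		size += uint64(len(ents[limit]))
		if size > maxSize {
			break
		}
	}
	return ents[:limit]
}

func main() {
	ents := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("cccc")}
	fmt.Println(len(limitBySize(ents, 8))) // two 4-byte entries fit in 8 bytes
	fmt.Println(len(limitBySize(ents, 1))) // one entry is returned regardless
}
```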

2.2.1 unstable

Before analyzing what unstable does, look at its definition:

// Source: go.etcd.io/etcd/raft/log_unstable.go
type unstable struct {
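    // This field shows that the earlier statement "the log is just a collection of Entry" is
    // incomplete: there is also the Snapshot. Snapshots are not the focus of this article; think
    // of one as the user's full state at some moment, serialized to binary. The author does not
    // dwell on snapshots because understanding them has little effect on understanding raft;
    // interested readers can explore them on their own.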
    snapshot *pb.Snapshot
    
    // The entry array, which confirms the earlier std::vector<Entry> view.
    entries []pb.Entry
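    // offset is the index of the first unstable entry. Readers may ask: why not just use
    // entries[0].Index? Because entries is often empty, for example right after startup or once
    // persistence completes, so a separate variable is needed. When entries is empty, offset is
    // the index that the next unstable entry will have. It is also handy for indexing:
    // entries[index - offset] is the Entry whose index is index.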
    offset  uint64
    // Used to print runtime logs.
    logger Logger
}

As the definition shows, unstable is fairly simple, and its methods basically just manipulate the entry array. The author has added brief comments to them; it really is not difficult.

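// Source: go.etcd.io/etcd/raft/log_unstable.go
// Note that maybeXxx() functions appear many times in later sections and in other raft articles.
// They are tentative operations that may succeed or fail, and the return value reports which.
// This function returns the first log index, and it is slightly misleading: what it returns is the
// first log index after the pending snapshot, that is, the snapshot's index plus one. Do not
// mistake it for the index of the first unstable entry.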
func (u *unstable) maybeFirstIndex() (uint64, bool) {
    if u.snapshot != nil {
        return u.snapshot.Metadata.Index + 1, true
    }
    return 0, false
}

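// Return the index of the last entry.
func (u *unstable) maybeLastIndex() (uint64, bool) {
    // If the entry array is non-empty, return the index of its last entry.
    if l := len(u.entries); l != 0 {
        return u.offset + uint64(l) - 1, true
    }
    // No entries. If there is a snapshot, return its index: in this state the snapshot has not
    // been persisted yet and no entries follow it, so the snapshot's index is the last index.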
    if u.snapshot != nil {
        return u.snapshot.Metadata.Index, true
    }
    return 0, false
}

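// Return the term of the entry at the given index: locate the entry by index, then return its term.
func (u *unstable) maybeTerm(i uint64) (uint64, bool) {
    // If the index is below offset, the snapshot is the only place to look; anything else is too
    // old to be found here.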
    if i < u.offset {
        if u.snapshot == nil {
            return 0, false
        }
        if u.snapshot.Metadata.Index == i {
            return u.snapshot.Metadata.Term, true
        }
        return 0, false
    }
    // If the index is beyond the last index, it is out of range and the lookup fails.
    last, ok := u.maybeLastIndex()
    if !ok {
        return 0, false
    }
    if i > last {
        return 0, false
    }
    
    return u.entries[i-u.offset].Term, true
}
// Called after the user has persisted unstable entries: entries up to index i are now stable.
func (u *unstable) stableTo(i, t uint64) {
    // Get the term of the entry at index i.
    gt, ok := u.maybeTerm(i)
    if !ok {
        return
    }
    // If the terms match, drop the unstable entries up to and including i from the array.
    if gt == t && i >= u.offset {
        u.entries = u.entries[i+1-u.offset:]
        u.offset = i + 1
        u.shrinkEntriesArray()
    }
}
// Called after a snapshot has been persisted.
func (u *unstable) stableSnapTo(i uint64) {
    if u.snapshot != nil && u.snapshot.Metadata.Index == i {
        u.snapshot = nil
    }
}
// Called after receiving a snapshot from the leader; the snapshot is held in unstable until the user persists it.
func (u *unstable) restore(s pb.Snapshot) {
    u.offset = s.Metadata.Index + 1
    u.entries = nil
    u.snapshot = &s
}

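// Store unstable entries. Called when an append message arrives from the leader: raft first keeps
// the entries in unstable until the user persists them. Why "append"? Because the log is ordered:
// entries from the leader usually directly follow this node's log or overlap it slightly, so the
// operation looks like a continuous append.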
func (u *unstable) truncateAndAppend(ents []pb.Entry) {
    after := ents[0].Index
    switch {
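    // The entries directly follow the current ones: the ideal append.
    case after == u.offset+uint64(len(u.entries)):
        u.entries = append(u.entries, ents...)
    // The new entries start at or before offset. Entries that were stored but never committed are
    // no longer recognized by the new leader, so all unstable entries are replaced.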
    case after <= u.offset:
        u.logger.Infof("replace the unstable entries from index %d", after)
        u.offset = after
        u.entries = ents
    // The entries overlap: keep the part before `after`, then overwrite the rest with the new entries.
    default:
        u.logger.Infof("truncate the unstable entries before index %d", after)
        u.entries = append([]pb.Entry{}, u.slice(u.offset, after)...)
        u.entries = append(u.entries, ents...)
    }
}
// Return the entries in [lo, hi).
func (u *unstable) slice(lo uint64, hi uint64) []pb.Entry {
    u.mustCheckOutOfBounds(lo, hi)
    return u.entries[lo-u.offset : hi-u.offset]
}
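
To see the three branches of truncateAndAppend() and the effect of stableTo() in action, here is a stripped-down model. `unstableLog` and `entry` are simplified stand-ins (no snapshot, no logger, no bounds panics), so treat it as a sketch of the logic above rather than the real type.

```go
package main

import "fmt"

// entry and unstableLog are simplified stand-ins for pb.Entry and unstable.
type entry struct {
	Index, Term uint64
}

type unstableLog struct {
	offset  uint64 // index of entries[0], or of the next entry when empty
	entries []entry
}

// truncateAndAppend mirrors the three cases of the function shown above.
func (u *unstableLog) truncateAndAppend(ents []entry) {
	after := ents[0].Index
	switch {
	case after == u.offset+uint64(len(u.entries)): // directly follows: append
		u.entries = append(u.entries, ents...)
	case after <= u.offset: // starts at or before offset: replace everything
		u.offset = after
		u.entries = ents
	default: // overlaps the tail: keep [offset, after) and append
		u.entries = append([]entry{}, u.entries[:after-u.offset]...)
		u.entries = append(u.entries, ents...)
	}
}

// stableTo drops entries up to and including index i if the term matches.
func (u *unstableLog) stableTo(i, t uint64) {
	if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
		return
	}
	if u.entries[i-u.offset].Term == t {
		u.entries = u.entries[i+1-u.offset:]
		u.offset = i + 1
	}
}

func main() {
	u := &unstableLog{offset: 5}
	u.truncateAndAppend([]entry{{5, 1}, {6, 1}, {7, 1}}) // plain append
	u.truncateAndAppend([]entry{{6, 2}})                 // overwrite from index 6
	fmt.Println(u.offset, u.entries)
	u.stableTo(5, 1) // the user reports that index 5 (term 1) is persisted
	fmt.Println(u.offset, u.entries)
}
```

After the second call, indexes 6 and 7 from term 1 are gone and index 6 carries term 2, which is exactly the "overwritten by a new leader" situation described earlier.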

2.2.2 Storage

Storage is an interface (an abstract class, in C++ terms), defined as follows:

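// Source: go.etcd.io/etcd/raft/storage.go
type Storage interface {
    // When constructing raft, the user must supply the initial state, which lives in stable
    // storage; the user passes it to raft through Storage. The definition of this state is beyond
    // the scope of this article and is covered in detail in other articles.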
    InitialState() (pb.HardState, pb.ConfState, error)
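    // Return the entries with index in [lo, hi), with total size limited to maxSize.
    Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
    // Return the term of the entry at index i.
    Term(i uint64) (uint64, error)
    // Return the index of the last entry.
    LastIndex() (uint64, error)
    // Return the index of the first entry.
    FirstIndex() (uint64, error)
    // Return the snapshot.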
    Snapshot() (pb.Snapshot, error)
}

The raft package ships only one Storage implementation, MemoryStorage. As the name suggests, it lives in memory. Isn't it just a variant of std::vector<Entry>?

// Source: go.etcd.io/etcd/raft/storage.go
// Simple enough that no further explanation is needed.
type MemoryStorage struct {
    sync.Mutex
    hardState pb.HardState
    snapshot  pb.Snapshot
    ents []pb.Entry
}

As for how MemoryStorage implements the Storage interface, readers can probably imagine it, so the author will not spend more ink on it here.
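
For readers who would still like something concrete, the following is a rough sketch of how such an in-memory store can work. It is not the real MemoryStorage: `memStorage`, `appendNext`, and the two error values are invented for this example, locking and snapshots are left out, and the "dummy first entry" layout is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

type entry struct {
	Index, Term uint64
}

var (
	errCompacted   = errors.New("index has been compacted")
	errUnavailable = errors.New("index is not available yet")
)

// memStorage keeps entries in a slice. ents[0] is a dummy entry recording the
// index and term of the last compacted position; real entries start at ents[1].
type memStorage struct {
	ents []entry
}

func newMemStorage() *memStorage { return &memStorage{ents: make([]entry, 1)} }

func (s *memStorage) firstIndex() uint64 { return s.ents[0].Index + 1 }
func (s *memStorage) lastIndex() uint64  { return s.ents[0].Index + uint64(len(s.ents)) - 1 }

// term returns the term of entry i, or an error if i is out of range.
func (s *memStorage) term(i uint64) (uint64, error) {
	off := s.ents[0].Index
	switch {
	case i < off:
		return 0, errCompacted
	case i > s.lastIndex():
		return 0, errUnavailable
	}
	return s.ents[i-off].Term, nil
}

// appendNext appends entries that are assumed to directly follow lastIndex().
func (s *memStorage) appendNext(ents ...entry) { s.ents = append(s.ents, ents...) }

// compact discards all entries before index i; entry i becomes the new dummy.
func (s *memStorage) compact(i uint64) error {
	off := s.ents[0].Index
	switch {
	case i <= off:
		return errCompacted
	case i > s.lastIndex():
		return errUnavailable
	}
	s.ents = append([]entry(nil), s.ents[i-off:]...)
	return nil
}

func main() {
	s := newMemStorage()
	s.appendNext(entry{1, 1}, entry{2, 1}, entry{3, 2})
	fmt.Println(s.firstIndex(), s.lastIndex())
	if err := s.compact(2); err != nil {
		panic(err)
	}
	fmt.Println(s.firstIndex(), s.lastIndex())
	_, err := s.term(1)
	fmt.Println(err)
}
```

Keeping a dummy entry means the term at the compaction point can still be answered after the entry itself is gone, which is what a term lookup at firstIndex() - 1 needs.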

2.2.3 raftLog

Let's start with a large block of annotated source code to see what raftLog implements. Judging from the definition of raftLog above, there are not many functions and none of them is complicated, so the author comments on them inline instead of giving a lengthy explanation.

// Source: go.etcd.io/etcd/raft/log.go
// Constructor of raftLog.
func newLogWithSize(storage Storage, logger Logger, maxNextEntsSize uint64) *raftLog {
    if storage == nil {
        log.Panic("storage must not be nil")
    }
    // Create the raftLog object.
    log := &raftLog{
        storage:         storage,
        logger:          logger,
        maxNextEntsSize: maxNextEntsSize,
    }
    // On startup the user must load the persisted snapshot and entries into storage; as mentioned
    // earlier, storage acts like a cache of the user's persistent store.
    firstIndex, err := storage.FirstIndex()
    if err != nil {
        panic(err) // TODO(bdarnell)
    }
    lastIndex, err := storage.LastIndex()
    if err != nil {
        panic(err) // TODO(bdarnell)
    }
    // This confirms what was said earlier: when unstable holds no entries, unstable.offset is the
    // index of the future first unstable entry.
    log.unstable.offset = lastIndex + 1
    log.unstable.logger = logger
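    // Initialize the commit and applied indexes. This is only initialization: raft sets both
    // values again after constructing raftLog, so ignore the assignments below if they look odd.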
    log.committed = firstIndex - 1
    log.applied = firstIndex - 1

    return log
}

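// Append entries; called after receiving an append message from the leader. Why "maybe"? More
// precisely, what can make it fail? The answer lies in the index and logTerm parameters. raft packs
// several entries (Entry) into one message (Message), and the message also carries index and
// logTerm, which are passed to this function under the same names. They are the index and term of
// the entry immediately preceding ents. Other articles describe how the leader sends entries to
// other nodes; here it is enough to know that the leader records, for each node, the index of the
// next entry to send, which is ents[0].Index, while index and logTerm are, informally,
// ents[-1].Index and ents[-1].Term. With that in mind, read the annotated source.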
func (l *raftLog) maybeAppend(index, logTerm, committed uint64, ents ...pb.Entry) (lastnewi uint64, ok bool) {
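    // If the term of the entry preceding this batch does not match, the whole batch is rejected
    // and the append fails. There are two kinds of mismatch: the index does not exist locally, or
    // the terms differ. Either the preceding entry has not arrived yet, or the entry at that index
    // came from a different leader. The first case is easy to understand: a message was lost, so
    // the new batch cannot be appended. The second case, as far as the author can tell, happens
    // when an old leader rejoins the cluster after a network partition and re-election and sends
    // entries that are already out of date.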
    if l.matchTerm(index, logTerm) {
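        // Compute the index of the last new entry.
        lastnewi = index + uint64(len(ents))
        // Find the conflicting entry; a conflict again means a term mismatch. See findConflict() below.
        ci := l.findConflict(ents)
        // There are three possible outcomes:
        switch {
        // No conflict at all: this node already has every entry, so the message is a retransmission.
        case ci == 0:
        // The conflict starts at or before the commit index. This should never happen, so panic.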
        case ci <= l.committed:
            l.logger.Panicf("entry %d conflict with committed entry [committed(%d)]", ci, l.committed)
        // Some entries conflict or are missing: append again from the first such entry onward.
        default:
            offset := index + 1
            l.append(ents[ci-offset:]...)
        }
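        // Update the commit index. Why take the minimum? committed comes from the leader and is
        // cluster-wide state, but this node may lag behind it, hence the minimum. Readers may
        // wonder: lastnewi is this node's newest index, not its highest stable index. If the node
        // crashes now, could entries up to the commit index have been applied while some of them
        // were never persisted? To explain: once raft updates the commit index, it hands the
        // committed entries to the user to apply and, at the same time, hands the unstable entries
        // to the user to persist. The user must therefore persist entries before applying them;
        // otherwise the problem just described can occur.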
        l.commitTo(min(committed, lastnewi))
        return lastnewi, true
    }
    return 0, false
}
// Append entries.
func (l *raftLog) append(ents ...pb.Entry) uint64 {
    // Nothing to append.
    if len(ents) == 0 {
        return l.lastIndex()
    }
    // The entries overlap already-committed entries, which is unacceptable.
    if after := ents[0].Index - 1; after < l.committed {
        l.logger.Panicf("after(%d) is out of range [committed(%d)]", after, l.committed)
    }
    // Append to unstable.
    l.unstable.truncateAndAppend(ents)
    return l.lastIndex()
}
// Find the index at which the conflict starts.
func (l *raftLog) findConflict(ents []pb.Entry) uint64 {
    // Iterate over the entries.
    for _, ne := range ents {
        // Compare terms.
        if !l.matchTerm(ne.Index, ne.Term) {
            if ne.Index <= l.lastIndex() {
                l.logger.Infof("found conflict at index %d [existing term: %d, conflicting term: %d]",
          ne.Index, l.zeroTermOnErrCompacted(l.term(ne.Index)), ne.Term)
            }
            // Return the index of the first entry whose term does not match.
            return ne.Index
        }
    }
    return 0
}

// Return the unstable entries, that is, everything in unstable; used to hand entries to the user for persistence.
func (l *raftLog) unstableEntries() []pb.Entry {
    if len(l.unstable.entries) == 0 {
        return nil
    }
    return l.unstable.entries
}

// Return the entries between the applied index and the commit index; used to hand entries to the user to apply.
func (l *raftLog) nextEnts() (ents []pb.Entry) {
    off := max(l.applied+1, l.firstIndex())
    if l.committed+1 > off {
        ents, err := l.slice(off, l.committed+1, l.maxNextEntsSize)
        if err != nil {
            l.logger.Panicf("unexpected error when getting unapplied entries (%v)", err)
        }
        return ents
    }
    return nil
}

// Report whether there are entries ready to be applied.
func (l *raftLog) hasNextEnts() bool {
    off := max(l.applied+1, l.firstIndex())
    return l.committed+1 > off
}

// Return the snapshot.
func (l *raftLog) snapshot() (pb.Snapshot, error) {
    if l.unstable.snapshot != nil {
        return *l.unstable.snapshot, nil
    }
    return l.storage.Snapshot()
}

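// Return the first index. Readers may ask: shouldn't the first log index be 0 or 1 (depending on
// the initial value)? But snapshots are taken periodically and the entries before a snapshot are
// no longer needed, so the first log index is not necessarily 0.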
func (l *raftLog) firstIndex() uint64 {
    if i, ok := l.unstable.maybeFirstIndex(); ok {
        return i
    }
    index, err := l.storage.FirstIndex()
    if err != nil {
        panic(err) // TODO(bdarnell)
    }
    return index
}
// Return the index of the last entry.
func (l *raftLog) lastIndex() uint64 {
    if i, ok := l.unstable.maybeLastIndex(); ok {
        return i
    }
    i, err := l.storage.LastIndex()
    if err != nil {
        panic(err) // TODO(bdarnell)
    }    
    return i
}
// Update the commit index.
func (l *raftLog) commitTo(tocommit uint64) {
    // never decrease commit
    if l.committed < tocommit {
        if l.lastIndex() < tocommit {
            l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex())
        }
        l.committed = tocommit
    }
}
// Update the applied index.
func (l *raftLog) appliedTo(i uint64) {
    if i == 0 {
        return
    }
    if l.committed < i || i < l.applied {
        l.logger.Panicf("applied(%d) is out of range [prevApplied(%d), committed(%d)]", i, l.applied, l.committed)
    }
    l.applied = i
}
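// The user tells raftLog up to which index entries have been persisted.
func (l *raftLog) stableTo(i, t uint64) { l.unstable.stableTo(i, t) }
// The user tells raftLog that the snapshot at index i has been persisted.
func (l *raftLog) stableSnapTo(i uint64) { l.unstable.stableSnapTo(i) }
// Return the term of the last entry.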
func (l *raftLog) lastTerm() uint64 {
    t, err := l.term(l.lastIndex())
    if err != nil {
        l.logger.Panicf("unexpected error when getting the last term (%v)", err)
    }
    return t
}
// Return the term of the entry at index i.
func (l *raftLog) term(i uint64) (uint64, error) {
    // If the index lies outside everything raftLog holds, return 0 to mean "not found".
    dummyIndex := l.firstIndex() - 1
    if i < dummyIndex || i > l.lastIndex() {
        // TODO: return an error instead?
        return 0, nil
    }
    // Look in unstable first.
    if t, ok := l.unstable.maybeTerm(i); ok {
        return t, nil
    }
    // Not in unstable, so look in storage.
    t, err := l.storage.Term(i)
    if err == nil {
        return t, nil
    }
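    // If neither storage nor unstable has it, that also counts as not found. This can happen
    // because storage may be compacted, for example by deleting the entries before the applied
    // index, which are no longer needed; doing so saves memory.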
    if err == ErrCompacted || err == ErrUnavailable {
        return 0, err
    }
    panic(err) // TODO(bdarnell)
}
// Return the entries starting at index i (inclusive), with total size limited to maxsize.
func (l *raftLog) entries(i, maxsize uint64) ([]pb.Entry, error) {
    if i > l.lastIndex() {
        return nil, nil
    }
    return l.slice(i, l.lastIndex()+1, maxsize)
}

// Return all entries.
func (l *raftLog) allEntries() []pb.Entry {
    ents, err := l.entries(l.firstIndex(), noLimit)
    if err == nil {
        return ents
    }
    if err == ErrCompacted { // try again if there was a racing compaction
        return l.allEntries()
    }
    // TODO (xiangli): handle error?
    panic(err)
}

// Report whether the given last index and term are at least as up to date as those in raftLog.
func (l *raftLog) isUpToDate(lasti, term uint64) bool {
    return term > l.lastTerm() || (term == l.lastTerm() && lasti >= l.lastIndex())
}
// Check whether the entry at index i has the given term.
func (l *raftLog) matchTerm(i, term uint64) bool {
    // Get the term of the entry; if the lookup fails, the match fails.
    t, err := l.term(i)
    if err != nil {
        return false
    }
    // If the entry exists, compare the terms.
    return t == term
}
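// Try to advance the commit index to maxIndex; this succeeds only if the entry at maxIndex has the given term.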
func (l *raftLog) maybeCommit(maxIndex, term uint64) bool {
    if maxIndex > l.committed && l.zeroTermOnErrCompacted(l.term(maxIndex)) == term {
        l.commitTo(maxIndex)
        return true
    }
    return false
}

func (l *raftLog) restore(s pb.Snapshot) {
    l.logger.Infof("log [%s] starts to restore snapshot [index: %d, term: %d]", l, s.Metadata.Index, s.Metadata.Term)
    l.committed = s.Metadata.Index
    l.unstable.restore(s)
}

// Return the entries in [lo, hi), with total size limited to maxSize.
func (l *raftLog) slice(lo, hi, maxSize uint64) ([]pb.Entry, error) {
    // Validate lo and hi.
    err := l.mustCheckOutOfBounds(lo, hi)
    if err != nil {
        return nil, err
    }
    if lo == hi {
        return nil, nil
    }
    // Part of the range lies in storage.
    var ents []pb.Entry
    if lo < l.unstable.offset {
        storedEnts, err := l.storage.Entries(lo, min(hi, l.unstable.offset), maxSize)
        if err == ErrCompacted {
            return nil, err
        } else if err == ErrUnavailable {
            l.logger.Panicf("entries[%d:%d) is unavailable from storage", lo, min(hi, l.unstable.offset))
        } else if err != nil {
            panic(err) // TODO(bdarnell)
        }

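        // An interesting check: the range was validated above, so if storage returned fewer
        // entries than requested, the maxSize limit has already been reached and there is no
        // need to look in unstable.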
        if uint64(len(storedEnts)) < min(hi, l.unstable.offset)-lo {
            return storedEnts, nil
        }

        ents = storedEnts
    }
    // Part of the range lies in unstable.
    if hi > l.unstable.offset {
        unstable := l.unstable.slice(max(lo, l.unstable.offset), hi)
        if len(ents) > 0 {
            combined := make([]pb.Entry, len(ents)+len(unstable))
            n := copy(combined, ents)
            copy(combined[n:], unstable)
            ents = combined
        } else {
            ents = unstable
        }
    }
    return limitSize(ents, maxSize), nil
}
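
The interplay of matchTerm(), findConflict(), and maybeAppend() is easier to follow on a toy log. The sketch below assumes a log with no snapshot and no storage layer, where the entry with index i sits at position i-1; `toyLog` and `minU64` are names invented for the example.

```go
package main

import "fmt"

type entry struct {
	Index, Term uint64
}

// toyLog holds entries with Index == position+1. Index 0 is the empty prefix
// and is treated as having term 0.
type toyLog struct {
	ents      []entry
	committed uint64
}

func minU64(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func (l *toyLog) lastIndex() uint64 { return uint64(len(l.ents)) }

func (l *toyLog) matchTerm(i, term uint64) bool {
	if i == 0 {
		return term == 0
	}
	return i <= l.lastIndex() && l.ents[i-1].Term == term
}

// maybeAppend accepts ents only if the entry preceding them matches locally.
func (l *toyLog) maybeAppend(prevIndex, prevTerm, leaderCommit uint64, ents ...entry) (uint64, bool) {
	if !l.matchTerm(prevIndex, prevTerm) {
		return 0, false
	}
	lastNew := prevIndex + uint64(len(ents))
	for k, e := range ents {
		if !l.matchTerm(e.Index, e.Term) { // first missing or conflicting entry
			if e.Index <= l.committed {
				panic("conflict with a committed entry")
			}
			// Drop everything from e.Index onward, then append the rest.
			l.ents = append(l.ents[:e.Index-1:e.Index-1], ents[k:]...)
			break
		}
	}
	if c := minU64(leaderCommit, lastNew); c > l.committed {
		l.committed = c
	}
	return lastNew, true
}

func main() {
	l := &toyLog{ents: []entry{{1, 1}, {2, 1}, {3, 1}}, committed: 1}
	// A leader of term 2 overwrites the uncommitted entry at index 3.
	last, ok := l.maybeAppend(2, 1, 3, entry{3, 2}, entry{4, 2})
	fmt.Println(last, ok, l.ents, l.committed)
	// A request whose preceding entry is unknown locally is rejected.
	_, ok = l.maybeAppend(5, 2, 3, entry{6, 2})
	fmt.Println(ok)
}
```

Note that a pure retransmission (every entry already present with the same term) changes nothing, which corresponds to the `ci == 0` branch above.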

Finally, the main functions of raftLog can be represented by the following figure:

The other modules in the figure are explained in detail in other articles; this article covers only how the log module works. For now, assume that x and y each represent a routine. The figure can be read as follows:

  1. After receiving an append request from the leader, x calls raftLog.maybeAppend(); if the request carries a snapshot, it calls restore() instead;
  2. x also receives commit notifications from the leader (for example, the commit index can be carried in heartbeat messages) and calls raftLog.commitTo() to update the commit index;
  3. y obtains the entries in (applied, committed] through nextEnts() for the user to apply, obtains the entries that are not yet persisted through unstableEntries() for the user to persist, and obtains unstable.snapshot for the user to persist. y then calls appliedTo(), stableTo(), and stableSnapTo() to update the state of raftLog.
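
What routine y does in one round can be sketched as below. Everything here is hypothetical (`toyLog`, `processReady`, the `disk` slice); the point is only the ordering that the maybeAppend() comment insists on: persist first, then apply.

```go
package main

import "fmt"

type entry struct {
	Index uint64
	Data  string
}

// toyLog tracks only the cursors that routine y advances.
type toyLog struct {
	ents      []entry // ents[k].Index == k+1
	stable    uint64  // highest persisted index
	committed uint64
	applied   uint64
}

// unstableEntries returns the entries that have not been persisted yet.
func (l *toyLog) unstableEntries() []entry { return l.ents[l.stable:] }

// nextEnts returns the entries in (applied, committed].
func (l *toyLog) nextEnts() []entry { return l.ents[l.applied:l.committed] }

// processReady performs one round of routine y.
func processReady(l *toyLog, disk *[]entry, apply func(entry)) {
	// 1. Persist the unstable entries, then report them as stable.
	if ents := l.unstableEntries(); len(ents) > 0 {
		*disk = append(*disk, ents...)
		l.stable = ents[len(ents)-1].Index
	}
	// 2. Only then apply the committed entries and advance applied.
	for _, e := range l.nextEnts() {
		apply(e)
		l.applied = e.Index
	}
}

func main() {
	l := &toyLog{ents: []entry{{1, "a"}, {2, "b"}, {3, "c"}}, committed: 2}
	var disk []entry
	var applied []string
	processReady(l, &disk, func(e entry) { applied = append(applied, e.Data) })
	fmt.Println(len(disk), applied, l.stable, l.applied)
}
```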

3. Summary

Each log entry goes through five stages: unstable, stable, committed, applied, and compacted. The state transitions are summarized below:

  1. An entry that has just been received is stored in unstable. If an election happens before the entry is persisted, the entry may be overwritten by a new entry with the same index. The relevant logic is in raftLog.maybeAppend() and unstable.truncateAndAppend().
  2. Entries stored in unstable are written to persistent storage (files) by the user, and once persisted they are moved from unstable to MemoryStorage. Readers may object that MemoryStorage is not persistent storage. In fact the log is written twice: the file and MemoryStorage each hold a copy, and the raft package can only access the copy in MemoryStorage. The purpose of this design is to use memory as a buffer for the log in the file, which performs better when the log is accessed frequently. Note that an entry being in MemoryStorage only means that it is reliable; it says nothing about whether it has been committed or applied.
  3. The leader collects the replication status of all peers. Once an entry has been received by more than half of the peers, it is committed. When a peer receives a message from the leader, it updates its own commit index; every entry whose index is less than or equal to that value is committed.
  4. Committed entries are obtained by the user and applied one by one, which changes the user's data state.
  5. An applied entry means that the user has persisted the resulting state in its own storage, so the entry can be deleted; this avoids unbounded storage growth as entries keep being appended. Remember that all entries are also kept in MemoryStorage, so keeping applied entries around wastes memory. Entries removed this way are the compacted ones.
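
As a compact recap, the five stages can be expressed as a classification over a handful of cursors. The `cursors` struct and `stage` function are hypothetical; in the real code the values are spread over raftLog, unstable, and Storage.

```go
package main

import "fmt"

// cursors holds the positions that divide the log into lifecycle stages.
type cursors struct {
	firstIndex     uint64 // first index still kept in storage
	applied        uint64 // highest applied index
	committed      uint64 // highest committed index
	unstableOffset uint64 // first index not yet persisted
}

// stage reports the most advanced lifecycle stage reached by index i.
func stage(c cursors, i uint64) string {
	switch {
	case i < c.firstIndex:
		return "compacted"
	case i <= c.applied:
		return "applied"
	case i <= c.committed:
		return "committed"
	case i < c.unstableOffset:
		return "stable"
	default:
		return "unstable"
	}
}

func main() {
	c := cursors{firstIndex: 3, applied: 5, committed: 7, unstableOffset: 9}
	for _, i := range []uint64{2, 4, 6, 8, 10} {
		fmt.Println(i, stage(c, i))
	}
}
```

The order of the cases matters: on a follower an entry can already be committed while it is still unstable locally, in which case this sketch reports "committed".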

Origin blog.csdn.net/weixin_42663840/article/details/100005978