A First Look at the Concepts Behind etcd, a Distributed Database

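Lease:

A lease is essentially a timer. You first request a lease with TTL=N (the timer), then pass that lease when creating a key, which gives you a key that lives on a timer.

A program can renew the lease periodically, which resets the TTL to N each time. When the lease expires, every key attached to it is deleted automatically.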
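
To make the timer analogy concrete, here is a minimal Go sketch. It is a toy in-memory model and not the etcd client API: leaseStore, grant, put, keepAlive and expire are names made up for illustration, and time is passed in as plain seconds so the behavior is easy to follow.

```go
package main

import "fmt"

// leaseStore is a hypothetical toy model of lease-bound keys.
type leaseStore struct {
	nextID   int64
	ttl      map[int64]int64    // lease ID -> TTL in seconds
	deadline map[int64]int64    // lease ID -> expiry time in seconds
	attached map[int64][]string // lease ID -> keys bound to the lease
	data     map[string]string  // the key-value data itself
}

func newLeaseStore() *leaseStore {
	return &leaseStore{
		ttl:      map[int64]int64{},
		deadline: map[int64]int64{},
		attached: map[int64][]string{},
		data:     map[string]string{},
	}
}

// grant creates a lease that expires ttl seconds after now.
func (s *leaseStore) grant(now, ttl int64) int64 {
	s.nextID++
	id := s.nextID
	s.ttl[id] = ttl
	s.deadline[id] = now + ttl
	return id
}

// put writes a key and binds it to an existing lease.
func (s *leaseStore) put(key, value string, lease int64) bool {
	if _, ok := s.ttl[lease]; !ok {
		return false
	}
	s.data[key] = value
	s.attached[lease] = append(s.attached[lease], key)
	return true
}

// keepAlive resets the remaining time to the full TTL.
// It fails once the lease has already expired.
func (s *leaseStore) keepAlive(now, lease int64) bool {
	ttl, ok := s.ttl[lease]
	if !ok {
		return false
	}
	s.deadline[lease] = now + ttl
	return true
}

// expire drops every lease whose deadline has passed, together with its keys.
func (s *leaseStore) expire(now int64) {
	for id, d := range s.deadline {
		if now < d {
			continue
		}
		for _, k := range s.attached[id] {
			delete(s.data, k)
		}
		delete(s.ttl, id)
		delete(s.deadline, id)
		delete(s.attached, id)
	}
}

func main() {
	s := newLeaseStore()
	id := s.grant(0, 10)
	s.put("/workers/a", "alive", id)
	s.keepAlive(8, id) // deadline moves from 10 to 18
	s.expire(12)
	_, ok := s.data["/workers/a"]
	fmt.Println("present at t=12:", ok)
	s.expire(18)
	_, ok = s.data["/workers/a"]
	fmt.Println("present at t=18:", ok)
}
```

A worker that stops renewing simply disappears from the store once its deadline passes, which is why leases are a natural fit for liveness registration.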

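Raft protocol:

etcd uses the Raft protocol to replicate its data (the K-V pairs) across a cluster made up of multiple nodes.
Raft is not really much easier to understand than Paxos, since both are hard, so here is only a brief description: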

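Every write is performed inside a transaction (tx).
A transaction (tx) can contain several put operations (each writes one K-V pair).
An etcd cluster has one leader, and all write requests are submitted to it.
The leader first records the data as log entries and sends those entries to the other nodes to be stored.
Once more than half of the nodes have stored the entries, the leader commits the tx by advancing its commit position in the log.
After the leader commits a tx, it sends the commit information to the other nodes with its next heartbeat, and they commit it too.
If the leader crashes, the remaining nodes elect a new leader. A candidate needs votes from more than half of the nodes, and a node refuses to vote for a candidate whose log is less up-to-date than its own, so the new leader always holds every tx that was already committed (i.e. stored on more than half of the nodes).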
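
The "more than half" rule can be sketched in a few lines of Go. This is a simplified illustration rather than etcd's Raft implementation: quorum and committedIndex are hypothetical helpers, and real Raft additionally requires the entry to belong to the leader's current term before it counts as committed.

```go
package main

import (
	"fmt"
	"sort"
)

// quorum returns the smallest node count that is more than half of n.
func quorum(n int) int {
	return n/2 + 1
}

// committedIndex returns the highest log index that is stored on a majority.
// stored[i] is the highest index node i has persisted (leader included).
func committedIndex(stored []int) int {
	if len(stored) == 0 {
		return 0
	}
	sorted := append([]int(nil), stored...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// The quorum-th largest value is held by at least quorum nodes.
	return sorted[quorum(len(sorted))-1]
}

func main() {
	// Five nodes: the leader and one follower hold index 7, the rest lag behind.
	stored := []int{7, 7, 5, 4, 3}
	fmt.Println("quorum:", quorum(len(stored)))
	fmt.Println("committed up to index:", committedIndex(stored))
}
```

With five nodes the quorum is three, so only entries that three nodes have stored can be committed, no matter how far ahead the leader's own log is.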

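The most important points to take away are:

In Raft, a transaction committed later always has a larger ID than one committed earlier, and every transaction ID is unique.

No matter which etcd node a client submits to, the cluster as a whole eventually presents the same view of the data.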

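K-V storage

At its core etcd is a K-V store. It maintains a btree (B-tree) in memory which, like a MySQL index, is ordered.
In this btree the key is the original key passed in by the user, but the value is not the user's value (what it actually is comes later). The index looks roughly like this: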

type treeIndex struct {
    sync.RWMutex
    tree *btree.BTree
}

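When a large number of K-V pairs are stored, the user values are usually fairly large, and keeping them all in the in-memory btree would cost too much memory, so etcd stores the user values on disk.
In short, etcd keeps a purely in-memory index and persists the data on disk; overall the model is not complicated. On disk, etcd uses a pure K-V storage engine called bbolt (comparable to leveldb in that both are embedded stores ordered by key, although bbolt is built on a B+tree rather than an LSM tree). So what are the keys and values in bbolt?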

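MVCC (multi-version)

If only a plain K-V model were maintained, successive updates would keep only the last value and the history could not be traced. Multi-versioning solves this problem. How are the versions maintained? A few preliminaries first: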

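Every tx has a unique transaction ID, called the main ID in etcd, which increases globally and never repeats.
A tx can contain several modifications (put and delete). Each modification is one revision, and all revisions of a tx share the same main ID.
The modifications inside one tx are numbered consecutively starting from 0; this number is called the sub ID.
Each revision is uniquely identified by (main ID, sub ID).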
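
As an illustration of these rules, the sketch below hands out (main, sub) pairs the way the list describes. The store and txn types are hypothetical stand-ins, not etcd's types, and the counter starts at 0 purely for simplicity.

```go
package main

import "fmt"

// revision identifies one modification: main is shared by the whole tx,
// sub numbers the modifications inside the tx starting from 0.
type revision struct {
	main int64
	sub  int64
}

// store tracks only the latest main ID (hypothetical toy type).
type store struct {
	currentRev int64
}

// txn collects the revisions assigned to the changes of one transaction.
type txn struct {
	s       *store
	changes []revision
}

func (s *store) begin() *txn {
	return &txn{s: s}
}

// change assigns the next revision of this tx to one put or delete.
func (t *txn) change() revision {
	r := revision{main: t.s.currentRev + 1, sub: int64(len(t.changes))}
	t.changes = append(t.changes, r)
	return r
}

// end publishes the tx: the main ID advances only if something was modified.
func (t *txn) end() {
	if len(t.changes) > 0 {
		t.s.currentRev++
	}
}

func main() {
	s := &store{}
	t1 := s.begin()
	fmt.Println(t1.change(), t1.change()) // two changes share one main ID
	t1.end()
	t2 := s.begin()
	fmt.Println(t2.change()) // the next tx gets the next main ID
	t2.end()
}
```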

Here is the definition of revision:

// A revision indicates modification of the key-value space.
// The set of changes that share same main revision changes the key-value space atomically.
type revision struct {
    // main is the main revision of a set of changes that happen atomically.
    main int64

    // sub is the sub revision of a change in a set of changes that happen
    // atomically. Each change has different increasing sub revision in that
    // set.
    sub int64
}

In the in-memory index, every original user key is associated with a keyIndex structure that holds the multi-version information:

type keyIndex struct {
    key         []byte
    modified    revision // the main rev of the last modification
    generations []generation
}

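The key field is the user's original key, and the modified field records the revision of the last modification to this key.
The versions (the modification history) are kept in the generations slice, whose element type is defined as: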

// generation contains multiple revisions of a key.
type generation struct {
    ver     int64
    created revision // when the generation is created (put in first revision).
    revs    []revision
}

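I will call generations[i] the i-th generation. When a key comes into existence, generations[0] is created and its created field records the revision that created the key.
As the user keeps updating the key, each update appends its revision (main, sub) to generations[0].revs.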
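
The bookkeeping can be sketched as follows. This is a simplified model that reuses the field names from the definitions above; the put and tombstone methods are my own reduced versions, not the etcd implementations. The tombstone case (a delete, marked (t) in the compaction listing later in the text) is included to show when a new generation starts.

```go
package main

import "fmt"

type revision struct {
	main int64
	sub  int64
}

type generation struct {
	ver     int64    // number of modifications in this generation
	created revision // revision of the first put in this generation
	revs    []revision
}

type keyIndex struct {
	key         string
	modified    revision
	generations []generation
}

// put appends rev to the current generation, creating generations[0]
// when the key comes into existence.
func (ki *keyIndex) put(rev revision) {
	if len(ki.generations) == 0 {
		ki.generations = append(ki.generations, generation{})
	}
	g := &ki.generations[len(ki.generations)-1]
	if len(g.revs) == 0 {
		g.created = rev
	}
	g.revs = append(g.revs, rev)
	g.ver++
	ki.modified = rev
}

// tombstone records a delete as the last revision of the current
// generation and opens a new, empty generation. It fails when the key
// does not currently exist.
func (ki *keyIndex) tombstone(rev revision) bool {
	n := len(ki.generations)
	if n == 0 || len(ki.generations[n-1].revs) == 0 {
		return false
	}
	ki.put(rev)
	ki.generations = append(ki.generations, generation{})
	return true
}

func main() {
	ki := &keyIndex{key: "foo"}
	ki.put(revision{main: 1})
	ki.put(revision{main: 2})
	ki.tombstone(revision{main: 3})
	ki.put(revision{main: 4})
	for i, g := range ki.generations {
		fmt.Printf("generation %d: created=%v ver=%d revs=%v\n", i, g.created, g.ver, g.revs)
	}
}
```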
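With multi-versioning, every operation is recorded separately. So where is the user's value stored? In bbolt.
In bbolt each revision serves as the key, i.e. the key is the serialized (revision.main, revision.sub). To read a key's current value, we therefore look the key up in the in-memory btree, take the last revision in the revs of its current generation (generations[0] as long as the key has never been deleted), and read the corresponding record from bbolt.
Likewise, etcd supports queries by key prefix, which simply walk the btree and fetch each user value from bbolt by its revision.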
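
The two-step lookup can be sketched with plain Go maps. Everything here is a stand-in chosen for illustration: toyStore is hypothetical, a map of slices plays the role of the btree plus keyIndex, another map plays the role of bbolt, and each put is treated as its own tx.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type revision struct {
	main int64
	sub  int64
}

// toyStore separates the index (user key -> revisions) from the
// backend (revision -> user value), mirroring the split described above.
type toyStore struct {
	rev     int64
	index   map[string][]revision
	backend map[revision]string
}

func newToyStore() *toyStore {
	return &toyStore{index: map[string][]revision{}, backend: map[revision]string{}}
}

// put stores the value under a fresh revision and records it in the index.
func (s *toyStore) put(key, value string) revision {
	s.rev++
	r := revision{main: s.rev}
	s.index[key] = append(s.index[key], r)
	s.backend[r] = value
	return r
}

// get returns the value of key as of atRev; atRev == 0 means "latest".
func (s *toyStore) get(key string, atRev int64) (string, bool) {
	revs := s.index[key]
	for i := len(revs) - 1; i >= 0; i-- {
		if atRev == 0 || revs[i].main <= atRev {
			return s.backend[revs[i]], true
		}
	}
	return "", false
}

// prefix returns "key=value" for the latest value of every key that
// starts with p, in key order (a real btree would give this order directly).
func (s *toyStore) prefix(p string) []string {
	keys := []string{}
	for k := range s.index {
		if strings.HasPrefix(k, p) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		v, _ := s.get(k, 0)
		out = append(out, k+"="+v)
	}
	return out
}

func main() {
	s := newToyStore()
	s.put("/a/1", "x")
	s.put("/a/2", "y")
	s.put("/a/1", "z")
	fmt.Println(s.get("/a/1", 0))
	fmt.Println(s.get("/a/1", 2))
	fmt.Println(s.prefix("/a/"))
}
```

Because old revisions stay in the backend until they are compacted, reading a key as of an earlier revision costs no more than reading its latest value.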
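If we keep updating the same key, generations[0].revs keeps growing. What can be done about that? Multi-version systems generally use compaction to trim the history: old versions are deleted and only the more recent ones are kept.
Below is how the generations slice of a keyIndex changes during compaction: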

// For example: put(1.0);put(2.0);tombstone(3.0);put(4.0);tombstone(5.0) on key "foo"
// generate a keyIndex:
// key:     "foo"
// rev: 5
// generations:
//    {empty}
//    {4.0, 5.0(t)}
//    {1.0, 2.0, 3.0(t)}
//
// Compact a keyIndex removes the versions with smaller or equal to
// rev except the largest one. If the generation becomes empty
// during compaction, it will be removed. if all the generations get
// removed, the keyIndex should be removed.

// For example:
// compact(2) on the previous example
// generations:
//    {empty}
//    {4.0, 5.0(t)}
//    {2.0, 3.0(t)}
//
// compact(4)
// generations:
//    {empty}
//    {4.0, 5.0(t)}
//
// compact(5):
// generations:
//    {empty} -> key SHOULD be removed.
//
// compact(6):
// generations:
//    {empty} -> key SHOULD be removed.

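Tombstone means deleting the key. A delete ends the current generation and starts a new one; the (t) in the listing marks a tombstone.
compact(n) removes the historical versions with revision.main <= n, except that within a generation that is still alive at n the newest such version is kept, so the key can still be read at revision n. Follow the example above step by step to see which entries are removed.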
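
The listing can be reproduced with a small Go function. This is a simplified sketch of the rule stated in that comment, not the etcd implementation: generations are stored oldest first, closed generations are assumed to be non-empty and to end with their tombstone, and compact, exampleGenerations and removable are hypothetical helpers.

```go
package main

import "fmt"

type revision struct {
	main int64
	sub  int64
}

// generation holds the revisions of one lifetime of a key, oldest first.
// In a closed generation the last revision is the tombstone.
type generation struct {
	revs []revision
}

// compact returns the generations that remain after compacting at atRev.
// gens is ordered oldest first; the last element is the current generation,
// which may be empty. The input is not modified.
func compact(gens []generation, atRev int64) []generation {
	if len(gens) == 0 {
		return gens
	}
	// Skip closed generations whose tombstone is at or below atRev.
	genIdx := 0
	for genIdx < len(gens)-1 {
		g := gens[genIdx]
		if g.revs[len(g.revs)-1].main > atRev {
			break
		}
		genIdx++
	}
	// In the first surviving generation keep the largest revision <= atRev
	// and everything after it.
	g := gens[genIdx]
	for i := len(g.revs) - 1; i >= 0; i-- {
		if g.revs[i].main <= atRev {
			g.revs = g.revs[i:]
			break
		}
	}
	return append([]generation{g}, gens[genIdx+1:]...)
}

// removable reports whether only an empty generation is left, in which
// case the whole keyIndex can be dropped.
func removable(gens []generation) bool {
	return len(gens) == 1 && len(gens[0].revs) == 0
}

// exampleGenerations builds put(1.0);put(2.0);tombstone(3.0);put(4.0);tombstone(5.0).
func exampleGenerations() []generation {
	return []generation{
		{revs: []revision{{main: 1}, {main: 2}, {main: 3}}},
		{revs: []revision{{main: 4}, {main: 5}}},
		{},
	}
}

func main() {
	for _, at := range []int64{2, 4, 5, 6} {
		out := compact(exampleGenerations(), at)
		fmt.Printf("compact(%d): %v removable=%v\n", at, out, removable(out))
	}
}
```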
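To sum up multi-versioning: the in-memory btree maps user key => keyIndex, the keyIndex holds the revisions of every version, and each revision maps to a user value in bbolt on disk.
Finally, the value stored in bbolt is the structure below, serialized with protobuf. It contains the revision at which the key was created (the created of the current generation), the revision of this update, the key's version counter (Version, taken from the generation's ver), and the lease ID: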

kv := mvccpb.KeyValue{
    Key:            key,
    Value:          value,
    CreateRevision: c,
    ModRevision:    rev,
    Version:        ver,
    Lease:          int64(leaseID),
}

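The watch mechanism

etcd's event notification mechanism is built on MVCC.
A client can supply a revision.main as the starting ID of a watch. Whenever etcd's current global transaction ID is greater than the watch's starting ID, etcd pushes the historical revision data that MVCC stored in bbolt, from the starting ID onward, to the client one by one in order.
This is clearly different from ZooKeeper, where a client always reads the latest data and then sets a one-shot watch for subsequent changes. etcd lets a client subscribe to events starting from any historical version (as long as it has not been compacted) and pushes the data as it was at that time.
So roughly how does etcd implement watch on top of MVCC?
etcd keeps every watch request sent by clients. A watch request can cover a single key or a key prefix (a range).
A goroutine in etcd continuously iterates over the watch requests; each watch object records the revision up to which events for its keys have been pushed.
etcd takes that revision.main and scans forward in bbolt from there. bbolt, like leveldb, is a K-V engine ordered by key, and the bbolt keys are made of revision.main + revision.sub, so the scan passes through every transaction (tx) record in the order it happened.
For each k-v pair it passes, etcd deserializes the value, i.e. the mvccpb.KeyValue, and checks whether its Key is one the watch request cares about; if so, the event is sent to the client.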

// syncWatchersLoop syncs the watcher in the unsynced map every 100ms.
func (s *watchableStore) syncWatchersLoop() {
    defer s.wg.Done()

    for {
        s.mu.RLock()
        st := time.Now()
        lastUnsyncedWatchers := s.unsynced.size()
        s.mu.RUnlock()

        unsyncedWatchers := 0
        if lastUnsyncedWatchers > 0 {
            unsyncedWatchers = s.syncWatchers()
        }
        syncDuration := time.Since(st)

        waitDuration := 100 * time.Millisecond
        // more work pending?
        if unsyncedWatchers != 0 && lastUnsyncedWatchers > unsyncedWatchers {
            // be fair to other store operations by yielding time taken
            waitDuration = syncDuration
        }

        select {
        case <-time.After(waitDuration):
        case <-s.stopc:
            return
        }
    }
}

The code above is a loop that calls syncWatchers whenever there are unsynced watchers:

// syncWatchers syncs unsynced watchers by:
//  1. choose a set of watchers from the unsynced watcher group
//  2. iterate over the set to get the minimum revision and remove compacted watchers
//  3. use minimum revision to get all key-value pairs and send those events to watchers
//  4. remove synced watchers in set from unsynced group and move to synced group
func (s *watchableStore) syncWatchers() int {
    s.mu.Lock()
    defer s.mu.Unlock()

    if s.unsynced.size() == 0 {
        return 0
    }

    s.store.revMu.RLock()
    defer s.store.revMu.RUnlock()

    // in order to find key-value pairs from unsynced watchers, we need to
    // find min revision index, and these revisions can be used to
    // query the backend store of key-value pairs
    curRev := s.store.currentRev
    compactionRev := s.store.compactMainRev

    wg, minRev := s.unsynced.choose(maxWatchersPerSync, curRev, compactionRev)
    minBytes, maxBytes := newRevBytes(), newRevBytes()
    revToBytes(revision{main: minRev}, minBytes)
    revToBytes(revision{main: curRev + 1}, maxBytes)

    // UnsafeRange returns keys and values. And in boltdb, keys are revisions.
    // values are actual key-value pairs in backend.
    tx := s.store.b.ReadTx()
    tx.Lock()
    revs, vs := tx.UnsafeRange(keyBucketName, minBytes, maxBytes, 0)
    evs := kvsToEvents(wg, revs, vs)
    tx.Unlock()

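The function is long, so it is not pasted in full. Each round it picks a batch of watchers from the unsynced watchers (they form a group called a watcherGroup) and uses the smallest revision.main watched within the batch as the starting position of the bbolt scan. This is an optimization.
Think about it: if every watcher scanned bbolt separately and picked out only the keys it cares about, performance would be terrible. Serving many watchers with a single scan clearly reduces the number of scans.
You may think performance would still be poor with a large number of watchers, but typical clients start watching from the latest revision and rarely need very old revisions. That is the key point.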
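
The batching idea can be sketched as follows. It is a toy under stated assumptions and not etcd's code: event, watcher and syncBatch are hypothetical, a slice ordered by revision stands in for bbolt, and a linear filter stands in for the range seek that a real ordered store would perform.

```go
package main

import (
	"fmt"
	"strings"
)

// event is one record of the revision-ordered log (stand-in for bbolt).
type event struct {
	rev int64
	key string
	del bool
}

// watcher wants events for keys with the given prefix, starting at minRev.
type watcher struct {
	prefix string
	minRev int64
	got    []event
}

// syncBatch scans the log once, starting at the smallest minRev in the
// batch, and delivers each event to every watcher it concerns.
// It returns how many log records the single scan visited.
func syncBatch(log []event, batch []*watcher, curRev int64) int {
	if len(batch) == 0 {
		return 0
	}
	start := batch[0].minRev
	for _, w := range batch[1:] {
		if w.minRev < start {
			start = w.minRev
		}
	}
	scanned := 0
	for _, ev := range log {
		if ev.rev < start || ev.rev > curRev {
			continue
		}
		scanned++
		for _, w := range batch {
			// A watcher that started later must not see older events.
			if ev.rev >= w.minRev && strings.HasPrefix(ev.key, w.prefix) {
				w.got = append(w.got, ev)
			}
		}
	}
	// Every watcher in the batch is now caught up to curRev.
	for _, w := range batch {
		w.minRev = curRev + 1
	}
	return scanned
}

func main() {
	log := []event{
		{rev: 1, key: "/a/x"},
		{rev: 2, key: "/b/y"},
		{rev: 3, key: "/a/x", del: true},
		{rev: 4, key: "/b/z"},
	}
	w1 := &watcher{prefix: "/a/", minRev: 3}
	w2 := &watcher{prefix: "/b/", minRev: 2}
	n := syncBatch(log, []*watcher{w1, w2}, 4)
	fmt.Println("records scanned:", n)
	fmt.Println("w1:", w1.got)
	fmt.Println("w2:", w2.got)
}
```

The cost of the scan is driven by the oldest starting revision in the batch, which is why watchers that start near the latest revision are cheap to serve.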
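During the bbolt scan, each mvccpb.KeyValue is deserialized (it is a protobuf message) and its key is checked against the keys the watcherGroup cares about. This is done by the kvsToEvents function: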

// kvsToEvents gets all events for the watchers from all key-value pairs
func kvsToEvents(wg *watcherGroup, revs, vals [][]byte) (evs []mvccpb.Event) {
    for i, v := range vals {
        var kv mvccpb.KeyValue
        if err := kv.Unmarshal(v); err != nil {
            plog.Panicf("cannot unmarshal event: %v", err)
        }

        if !wg.contains(string(kv.Key)) {
            continue
        }

        ty := mvccpb.PUT
        if isTombstone(revs[i]) {
            ty = mvccpb.DELETE
            // patch in mod revision so watchers won't skip
            kv.ModRevision = bytesToRev(revs[i]).main
        }
        evs = append(evs, mvccpb.Event{Kv: &kv, Type: ty})
    }
    return evs
}

As you can see, the revision of a delete is also stored in bbolt; only its bbolt key is special:
For a put, the key consists of main+sub:

ibytes := newRevBytes()
idxRev := revision{main: rev, sub: int64(len(tw.changes))}
revToBytes(idxRev, ibytes)

For a delete, the key consists of main+sub+"t":

idxRev := revision{main: tw.beginRev + 1, sub: int64(len(tw.changes))}
revToBytes(idxRev, ibytes)
ibytes = appendMarkTombstone(ibytes)


// appendMarkTombstone appends tombstone mark to normal revision bytes.
func appendMarkTombstone(b []byte) []byte {
    if len(b) != revBytesLen {
        plog.Panicf("cannot append mark to non normal revision bytes")
    }
    return append(b, markTombstone)
}

// isTombstone checks whether the revision bytes is a tombstone.
func isTombstone(b []byte) bool {
    return len(b) == markedRevBytesLen && b[markBytePosition] == markTombstone
}