ZooKeeper Study Notes -- Overview (2)

Preface

The previous note only covered half of the material; since the original document is long and covers a lot, I am splitting it into several notes.

Main Text

Data model and the hierarchical namespace

The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper’s name space is identified by a path.

The namespace ZooKeeper provides is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's namespace is identified by a path. So a node's name simply is its path?

(Figure: ZooKeeper's Hierarchical Namespace)
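
Roughly, such a namespace is a tree of slash-separated paths, for example (an illustrative sketch, not the exact figure):

    /
        /app1
            /app1/p_1
            /app1/p_2
            /app1/p_3
        /app2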

Nodes and ephemeral nodes

("Ephemeral" can mean "born in the morning and gone by evening"; the word feels rather fancy.)

Unlike is standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

Unlike standard file systems (the original seems to have an extra "is" here), each node in a ZooKeeper namespace can have data associated with it as well as children. It is like a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, and so on, so the data stored at each node is usually small, in the byte-to-kilobyte range.) The term znode is used to make it clear that we are talking about ZooKeeper data nodes.
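
To make this concrete, here is a minimal sketch with the standard ZooKeeper Java client (the address localhost:2181, the paths and the payloads are invented for the example): a znode holds a small payload and can still have children, like a file that is also a directory.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeDemo {
        public static void main(String[] args) throws Exception {
            // Connect: 3000 ms session timeout, no default watcher.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

            // /app1 carries its own small payload ...
            zk.create("/app1", "app1 settings".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // ... and can still have children, like a file that is also a directory.
            zk.create("/app1/p_1", "worker 1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            System.out.println(zk.getChildren("/app1", false));   // [p_1]
            zk.close();
        }
    }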

Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode’s data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.

Znodes maintain a stat structure that includes version numbers for data changes and ACL changes, plus timestamps, which allow cache validation and coordinated updates. Every time a znode's data changes, the version number increases. This part reminds me of Git's design.
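
The version number is what enables optimistic, coordinated updates. A small sketch with the Java client (the /app1/counter path is hypothetical): read the data together with its Stat, then pass the expected version to setData; if another client updated the znode in the meantime, the write fails instead of silently overwriting.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class VersionedUpdate {
        // zk is an already-connected ZooKeeper handle (see the earlier sketch).
        static void incrementCounter(ZooKeeper zk) throws Exception {
            Stat stat = new Stat();
            byte[] raw = zk.getData("/app1/counter", false, stat);  // also fills in the current version
            int next = Integer.parseInt(new String(raw)) + 1;
            try {
                // Succeeds only if the znode is still at the version we read.
                zk.setData("/app1/counter", String.valueOf(next).getBytes(), stat.getVersion());
            } catch (KeeperException.BadVersionException e) {
                // Another client got there first; re-read and retry if needed.
            }
        }
    }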

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

The data stored at each znode in a namespace is read and written atomically. A read (why does the original write "Reads" with an "s"? it is just the plural, referring to read operations in general) gets all the data bytes associated with a znode, and a write replaces all of the data.
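
On the ACL side, a short sketch with the Java client (the path is made up): the ACL is chosen when the znode is created and can be read back later.

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.ACL;
    import org.apache.zookeeper.data.Stat;

    class AclDemo {
        static void demo(ZooKeeper zk) throws Exception {
            // READ_ACL_UNSAFE: anyone may read this znode, but ordinary clients,
            // including the creator, can no longer write to it. OPEN_ACL_UNSAFE,
            // used in the other sketches, grants everything to everyone.
            zk.create("/app1/readonly", "v1".getBytes(),
                      ZooDefs.Ids.READ_ACL_UNSAFE, CreateMode.PERSISTENT);

            // The ACL attached to a znode can be read back (and changed with setACL).
            Stat stat = new Stat();
            List<ACL> acl = zk.getACL("/app1/readonly", stat);
            System.out.println(acl);
        }
    }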

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].

An ephemeral node stays alive as long as the session that created it is active; when the session ends, the znode is deleted.
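
A small sketch of how ephemeral nodes are typically used (Java client, hypothetical /workers path): each worker registers itself under a group znode, and when its session dies the registration disappears by itself. Combined with a watch on /workers, this is a common building block for group membership.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class EphemeralDemo {
        static void register(ZooKeeper zk, String workerId) throws Exception {
            // This znode lives only as long as the creating client's session;
            // when the session ends (close or timeout), ZooKeeper deletes it.
            zk.create("/workers/" + workerId, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Note: ephemeral znodes cannot have children of their own.
        }
    }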

Conditional updates and watches

ZooKeeper supports the concept of watches. Clients can set a watch on a znodes. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the Zoo Keeper servers is broken, the client will receive a local notification. These can be used to [tbd].

A client can set a watch on a znode (the original writes "a znodes", with another extra "s"). When the data changes, the watch is triggered and then removed (so a watch only fires once?).
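
A minimal watch sketch with the Java client (the /app1/config path is made up). Because a watch fires only once, a client that wants continuous notifications has to re-register inside the callback:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    class WatchDemo {
        static void watchConfig(ZooKeeper zk) throws Exception {
            Watcher watcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // A watch fires at most once per registration.
                    System.out.println(event.getType() + " on " + event.getPath());
                    try {
                        watchConfig(zk);   // re-register to keep being notified
                    } catch (Exception ignored) { }
                }
            };
            // Reads the current data and registers the watch in one call.
            zk.getData("/app1/config", watcher, null);
        }
    }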

Guarantees

ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

  • Sequential Consistency - Updates from a client will be applied in the order that they were sent.

  • Atomicity - Updates either succeed or fail. No partial results.

  • Single System Image - A client will see the same view of the service regardless of the server that it connects to.

  • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.

  • Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

For more information on these, and how they can be used, see [tbd]

Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

create

creates a node at a location in the tree

delete

deletes a node

exists

tests if a node exists at a location

get data

reads the data from a node

set data

writes data to a node

get children

retrieves a list of children of a node

sync

waits for data to be propagated

For a more in-depth discussion on these, and how they can be used to implement higher level operations, please refer to [tbd]
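
To make the list above concrete, here is a sketch that touches each of the operations through the standard Java client (server address, paths and data are invented for the example; error handling is omitted):

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ApiTour {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

            // create: adds a node at a location in the tree
            zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/demo/child", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // exists: returns the node's Stat, or null if there is no such node
            Stat stat = zk.exists("/demo", false);

            // get data / set data: reads all the bytes, replaces all the bytes
            byte[] data = zk.getData("/demo", false, stat);
            zk.setData("/demo", "v2".getBytes(), stat.getVersion());

            // get children
            List<String> children = zk.getChildren("/demo", false);
            System.out.println(new String(data) + " -> children: " + children);

            // sync: flushes this client's view so that a following read reflects
            // writes the ensemble has already agreed on
            zk.sync("/demo", (rc, path, ctx) -> {}, null);

            // delete: children first, since a znode with children cannot be deleted;
            // version -1 means "any version"
            zk.delete("/demo/child", -1);
            zk.delete("/demo", -1);

            zk.close();
        }
    }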

Implementation

ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.

The "ZooKeeper Components" figure shows the service's high-level components; apart from the request processor, every server that makes up the ZooKeeper service keeps its own replica of each component.

The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

The in-memory database holds all of the data, and it is recoverable because updates are persisted (logged to disk) before being applied.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
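
Since a client talks to exactly one server at a time, the usual practice is to pass the whole ensemble in the connect string and let the client pick one server, failing over to another if its server goes away. A sketch with hypothetical host names:

    import org.apache.zookeeper.ZooKeeper;

    class ConnectDemo {
        static ZooKeeper connect() throws Exception {
            // The client picks one server from the list to connect to; if that server
            // becomes unreachable it fails over to another entry from the same list.
            return new ZooKeeper(
                "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
                3000, event -> {});
        }
    }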

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

One server, called the leader, receives all forwarded write requests; the remaining servers, called followers, receive message proposals from the leader and agree on message delivery. The messaging layer takes care of replacing the leader when it fails and of keeping the followers in sync with the leader.

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.

Uses

The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]

What follows is an overview of performance, or perhaps a bit of promotion; for now it does not seem to need a close read, so I will skim it.

Performance

ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper’s development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)

The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2Ghz Xeon and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. “Servers” indicate the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.

Note

In version 3.2 r/w performance improved by ~2x compared to the previous 3.1 release.

Benchmarks also indicate that it is reliable, too. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:

  1. Failure and recovery of a follower

  2. Failure and recovery of a different follower

  3. Failure of the leader

  4. Failure and recovery of two followers

  5. Failure of another leader

(Figure: Reliability in the Presence of Errors)

To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.

There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. But maybe more importantly, the leader election algorithm allows for the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.

The ZooKeeper Project

ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.

All users and developers are encouraged to join the community and contribute their expertise. See the Zookeeper Project on Apache for more information.

Summary

After reading this piece, the ZooKeeper model is basically clear. Remaining questions:

  1. Concretely, how does ZooKeeper implement the primitives it provides?
  2. How is the leader election carried out?

References

https://zookeeper.apache.org/doc/current/zookeeperOver.html
