In-depth understanding of zookeeper


Overview of zookeeper

Background & Issues

In production environments, to improve service availability and serve more users, distributed applications are deployed on multiple nodes across different IDCs. We are then likely to run into the following problems:

  • How can applications spread across many machines and IDCs efficiently read and modify their configuration?
  • When the configuration changes, how can each application node quickly detect the change and respond in time?
  • How do the application's nodes elect one node as the leader to perform coordination work? When the leader dies, how do the other nodes re-run the leader election? How do we avoid split-brain, and how do we handle network partitions?
  • When a node hangs or dies abnormally, how do we find out in time?

Points 1 and 2 can be solved with configuration stored in mysql plus periodic polling when the number of nodes is small and the performance requirements are modest. If the performance requirements are high, we need to build a configuration system out of components such as a cache, an agent, configuration-change notifications, and mysql, such as Taobao's diamond.

For points 3 and 4, in a complex distributed environment we run into a series of problems: high network latency, network jitter, disk failures, machine downtime, half-dead machines, machine-room power outages, and network partitions. We must avoid data inconsistency and split-brain while also pursuing high throughput and low latency, and minimize the time the service is unavailable during leader election. Worse, distributed-systems theory works against us: FLP (in an asynchronous system, consensus is impossible with even one faulty process) and CAP (consistency, availability, partition tolerance: you cannot have all three) tell us that trade-offs are unavoidable in the design. If business applications had to handle all of this themselves, it would add considerable complexity.

Such complex problems are therefore not suitable for each application to solve on its own. The application needs a god, a trustworthy oracle, and the service this oracle provides should be as simple as possible: easy to understand, high performance, and easy to scale.

To solve these problems, Yahoo engineers designed and implemented zookeeper. Why the name? Many distributed systems at yahoo were named after animals, and the complexity and confusion of a distributed environment resemble a zoo; the zookeeper's job is to maintain and manage order in the whole zoo. That is the origin of the name.

ZooKeeper is a distributed management service that provides configuration management, naming, state synchronization, cluster management and other functions for applications. Our own application scenarios are mainly configuration management and distributed locks.

  • Why does apache define it as a distributed coordination service rather than a storage service? What are its design goals?
  • Can it be used as a storage service? What is its data model? How is data persisted?
  • How are the zookeeper server's read and write paths implemented?
  • How is the zookeeper c api implemented?
  • How do we implement distributed locks and leader election with zookeeper?
  • What problems have we hit in production? How do we tune zookeeper performance and monitor zookeeper?

This article combines the zookeeper source code (3.4.6) with our production experience to analyze these questions, build a deep understanding of zookeeper, and share the problems and lessons we encountered in practice.

First, let's look at zookeeper as a whole and understand its overall architecture and design goals.

zookeeper architecture

(Figure: zookeeper architecture)

A zookeeper cluster is generally composed of an odd number of nodes, with the roles of leader and followers. All write requests are forwarded to the leader, and a write succeeds only after more than half of the cluster nodes acknowledge it. Read requests can be processed on any node. As long as more than half of the nodes in the cluster are alive and can communicate with each other, the cluster can keep serving, so zookeeper is highly available. More nodes means higher availability but lower write performance, so 5 nodes are typically deployed in production. The interface zookeeper provides resembles a nosql system; the commonly used operations are get/set/create/getchildren and so on, and the interface is simple and easy to use.
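
As an illustration of how simple the client interface is, here is a minimal usage sketch of the standard Java client (org.apache.zookeeper.ZooKeeper); the connect string and paths are placeholders, not values from this article:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkApiSketch {
        public static void main(String[] args) throws Exception {
            // Connect with a 30s session timeout; this watcher ignores events.
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {});

            // create: parent first, then a persistent config node with an open ACL.
            zk.create("/app", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app/config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // get: read the value and its Stat (version, zxids, ...).
            Stat stat = new Stat();
            byte[] data = zk.getData("/app/config", false, stat);

            // set: a conditional update using the version we just read.
            zk.setData("/app/config", "v2".getBytes(), stat.getVersion());

            // getchildren: list the child znodes of a path.
            System.out.println(zk.getChildren("/app", false));
            zk.close();
        }
    }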

zookeeper design goals

Simplicity

zookeeper's data model is simple and easy to understand: a hierarchical, file-system-like tree. The data is not sharded across machines; instead, every server stores the complete node data for all paths in the tree, entirely in memory. This lets zookeeper provide high-throughput, low-latency service, but it also means zookeeper is not suitable for storing large values per node.

High availability and high-performance reads

Since each server holds all the data, zookeeper's availability would be extremely low without a mechanism to replicate data between servers. A key design goal is therefore replication: the nodes elect a leader and synchronize data through the zookeeper atomic broadcast algorithm. Follower nodes must forward all write requests to the leader, while read requests can be handled on any follower node.

Ordering

zookeeper provides ordering guarantees through mechanisms such as tcp connections and having the leader process all write requests. On top of these ordering guarantees, zookeeper can offer synchronization primitives and implement mechanisms such as distributed locks.
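
Built on these ordering guarantees, the classic lock recipe creates ephemeral sequential znodes under a lock path and grants the lock to the lowest sequence number. The sketch below is a minimal illustration, assuming the lock root (here /locks/mylock) already exists; the class and names are ours, not zookeeper's:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkLockSketch {
        private static final String LOCK_ROOT = "/locks/mylock"; // assumed to exist
        private final ZooKeeper zk;

        public ZkLockSketch(ZooKeeper zk) { this.zk = zk; }

        public String lock() throws Exception {
            // Ephemeral sequential node: the sequence number is our place in line,
            // and losing the session automatically releases the lock.
            String me = zk.create(LOCK_ROOT + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = me.substring(me.lastIndexOf('/') + 1);
            while (true) {
                List<String> children = zk.getChildren(LOCK_ROOT, false);
                Collections.sort(children);
                int idx = children.indexOf(myName);
                if (idx == 0) return me; // lowest sequence number owns the lock
                // Watch only our predecessor to avoid waking every waiter (herd effect).
                String prev = LOCK_ROOT + "/" + children.get(idx - 1);
                CountDownLatch latch = new CountDownLatch(1);
                Stat st = zk.exists(prev, event -> latch.countDown());
                if (st != null) latch.await(); // re-check the children once it is gone
            }
        }

        public void unlock(String me) throws Exception {
            zk.delete(me, -1); // deleting our node passes the lock to the next waiter
        }
    }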

zookeeper data model

Common data models for storage systems include the relational model, the hierarchical model, the flat model, the network model, and the object-oriented model.

(Figure: the five common storage models)

zookeeper's data model is hierarchical and resembles a file system, but zookeeper is positioned as a simple, highly reliable, high-throughput, low-latency in-memory store, so its values, unlike a file system's, are not suited to large payloads; the official recommendation is to keep each value under 1M. The interface it provides resembles a nosql store, with the path as the key.

zookeeper hierarchical model

So what data structure implements zookeeper's hierarchical model, and what are the time complexities of get, set, and getchildren? Reading the zookeeper server source code shows that it is built on a ConcurrentHashMap: the path is the key and the value is a DataNode, which holds the value, children, stat, and other information.

The call chain of the zookeeper database model:

    ZKDatabase
      └─ DataTree
           └─ ConcurrentHashMap<String, DataNode> nodes = new ConcurrentHashMap<String, DataNode>();
                └─ DataNode (data, acl, stat, children)

    class Stat {
        long czxid;          // created zxid
        long mzxid;          // last modified zxid
        long ctime;          // created
        long mtime;          // last modified
        int version;         // version
        int cversion;        // child version
        int aversion;        // acl version
        long ephemeralOwner; // owner id if ephemeral, 0 otw
        int dataLength;      // length of the data in the node
        int numChildren;     // number of children of this node
        long pzxid;          // last modified children
    }

ConcurrentHashMap is a thread-safe hash table that uses lock striping to reduce lock contention and improve performance. As the figure below shows, it is composed of two parts, Segment and HashEntry. The lock granularity is the Segment; each Segment object covers several buckets of the overall hash map, and hash collisions are resolved by chaining.

(Figure: ConcurrentHashMap structure)

Given this ConcurrentHashMap-based implementation, the expected time complexity of each zookeeper interface is as follows:

  • get: O(1)
  • create/set: O(1)
  • getchildren: O(1)
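
To see why these bounds hold, here is a toy sketch (not the real DataTree class, with root and error handling reduced to the bare minimum): children are stored directly on the parent node, so no tree traversal is ever needed.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy model of the path-keyed map: get/create/getChildren each need only
    // a constant number of hash lookups.
    class MiniDataTree {
        static class Node {
            byte[] data;
            final Set<String> children = new HashSet<String>();
        }

        private final ConcurrentHashMap<String, Node> nodes =
                new ConcurrentHashMap<String, Node>();

        MiniDataTree() {
            nodes.put("/", new Node()); // root; error handling omitted throughout
        }

        byte[] get(String path) {               // one hash lookup: O(1)
            return nodes.get(path).data;
        }

        void create(String path, byte[] data) { // two hash lookups: O(1)
            Node n = new Node();
            n.data = data;
            nodes.put(path, n);
            int cut = path.lastIndexOf('/');
            String parent = cut == 0 ? "/" : path.substring(0, cut);
            nodes.get(parent).children.add(path.substring(cut + 1));
        }

        Set<String> getChildren(String path) {  // one hash lookup: O(1)
            return nodes.get(path).children;
        }
    }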

zookeeper persistent storage

From the data model we know that zookeeper loads all of its data into memory, building a DataTree on top of a ConcurrentHashMap. To guarantee that data survives a machine restart, zookeeper needs persistent storage, which it implements with snapshots and txnlogs. A snapshot is a complete image of zookeeper's in-memory data, generated periodically while it runs; the txnlog is the transaction log since the snapshot point. On restart, zookeeper rebuilds the DataTree from the snapshot plus the txnlog. The figure below shows the data files generated by a running zookeeper cluster.

(Figure: data files of a running zookeeper cluster)
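
The restart path is therefore: load the newest valid snapshot, then replay the txnlog entries with higher zxids. Below is a hedged sketch of that flow, simplified from the 3.4.x FileTxnSnapLog logic; the missing-snapshot case and error handling are omitted:

    import java.io.File;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.zookeeper.server.DataTree;
    import org.apache.zookeeper.server.persistence.FileSnap;
    import org.apache.zookeeper.server.persistence.FileTxnLog;
    import org.apache.zookeeper.server.persistence.TxnLog.TxnIterator;

    public class RestoreSketch {
        // Rebuild the DataTree: snapshot first, then replay newer transactions.
        public static long restore(File snapDir, File logDir, DataTree dt)
                throws IOException {
            Map<Long, Integer> sessions = new HashMap<Long, Integer>();
            long snapZxid = new FileSnap(snapDir).deserialize(dt, sessions);
            TxnIterator itr = new FileTxnLog(logDir).read(snapZxid + 1);
            long highestZxid = snapZxid;
            while (itr.getHeader() != null) {     // iterate txns after the snapshot
                highestZxid = itr.getHeader().getZxid();
                dt.processTxn(itr.getHeader(), itr.getTxn()); // apply one transaction
                if (!itr.next()) break;
            }
            return highestZxid; // the DataTree is now current up to this zxid
        }
    }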

Where are the snapshot and log files stored? How many of them are retained, and when are obsolete snapshot and log files cleaned up? All of this is specified in zookeeper's zoo.cfg configuration file: dataDir sets the snapshot path and dataLogDir sets the transaction log path. The transaction log has a very large impact on zk throughput and latency, so it is recommended to put dataDir and dataLogDir on different devices to avoid contention for disk IO, which would hurt the performance and stability of the whole system. autopurge.snapRetainCount sets how many snapshots to retain, and autopurge.purgeInterval sets the cleanup interval in hours.
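
For reference, a minimal zoo.cfg sketch using these options; the paths and server list are placeholders:

    tickTime=2000
    initLimit=10
    syncLimit=5
    # snapshots and transaction logs on separate devices
    dataDir=/data/zookeeper/snapshot
    dataLogDir=/ssd/zookeeper/txnlog
    clientPort=2181
    # keep the 5 most recent snapshots; run the purge task every 24 hours
    autopurge.snapRetainCount=5
    autopurge.purgeInterval=24
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888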

Snapshot generation and log writing are implemented in the SyncRequestProcessor class; the transaction log is handled by TxnLog and snapshots by FileSnap. Transaction entries are appended to the txnlog and flushed to disk once more than 1000 entries are pending. When the number of logged entries exceeds snapCount/2 + randRoll (where randRoll = r.nextInt(snapCount/2)), a thread is started to dump the DataTree to disk. The implementation logic is as follows:

    if (zks.getZKDatabase().append(si)) {
        logCount++;
        if (logCount > (snapCount / 2 + randRoll)) {
            randRoll = r.nextInt(snapCount / 2);
            // roll the log
            zks.getZKDatabase().rollLog();
            // take a snapshot
            if (snapInProcess != null && snapInProcess.isAlive()) {
                LOG.warn("Too busy to snap, skipping");
            } else {
                snapInProcess = new Thread("Snapshot Thread") {
                    public void run() {
                        try {
                            zks.takeSnapshot();
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                        }
                    }
                };
                snapInProcess.start();
            }
            logCount = 0;
        }
    }
    toFlush.add(si);
    if (toFlush.size() > 1000) {
        flush(toFlush);
    }

From this basic implementation of zookeeper persistence we can see that a heavy write load generates snapshots frequently, and because toFlush is flushed to disk synchronously, this hurts throughput and latency. That is why high-performance storage hardware (such as SSDs) is recommended for the txnlog.

zookeeper core roles and concepts

leader

follower

observer

session

watcher

access control

zookeeper server read/write flow analysis

In the zookeeper server implementation, the common characteristics of leader, follower, and observer are abstracted out, and the handling of read and write requests is split by function into stages (a pipeline). Each processor handles one stage, following the chain-of-responsibility design pattern: when one processor finishes, the request is dispatched through a queue to the next. The processors are like parts in a factory; leader, follower, and observer differ only in which parts they use and how they assemble them, but they can highly reuse the same parts, simplifying the implementation and reducing code duplication.
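
A simplified sketch of this wiring, with the interface modeled on org.apache.zookeeper.server.RequestProcessor; the Request and stage classes below are illustrative, not real zookeeper classes:

    class Request { /* command type, session id, txn payload, ... */ }

    interface RequestProcessor {
        void processRequest(Request request);
        void shutdown();
    }

    // Each stage does its own work and then hands the request to the next
    // processor in the chain; leader, follower and observer differ only in
    // how the stages are assembled.
    class StageSketch implements RequestProcessor {
        private final String name;
        private final RequestProcessor next; // null for the final stage
        StageSketch(String name, RequestProcessor next) {
            this.name = name;
            this.next = next;
        }
        public void processRequest(Request request) {
            System.out.println(name + " handling request"); // stage-specific work
            if (next != null) next.processRequest(request); // hand off downstream
        }
        public void shutdown() {
            if (next != null) next.shutdown();
        }
    }

    public class PipelineSketch {
        public static void main(String[] args) {
            // Wiring resembling a follower's pipeline:
            RequestProcessor chain = new StageSketch("FollowerRequestProcessor",
                    new StageSketch("CommitProcessor",
                            new StageSketch("FinalRequestProcessor", null)));
            chain.processRequest(new Request());
        }
    }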

Introduction to the chain-of-responsibility processors

PrepRequestProcessor

Based on the request command (create, set, and so on), this processor builds the transactional request data structure (Request), counts in-flight transactions, and so on.

FollowerRequestProcessor

This processor forwards write requests to the leader.

CommitProcessor

SyncRequestProcessor

As described in the persistence section above, this processor handles persistent storage: flushing batched transaction logs to disk and generating snapshots periodically.

SendAckRequestProcessor

After receiving a write-request proposal, this processor replies with an ACK to the leader.

ProposalRequestProcessor

This processor forwards all write requests, as proposals, to the follower nodes.

ToBeAppliedRequestProcessor

FinalRequestProcessor

As its name suggests, this processor is the last stage of the request pipeline. It handles query requests (reading data from zkdatabase's DataTree) and applies write transactions.

zookeeper read flow

zookeeper write flow

zookeeper c api

Summary

References
