ZooKeeper Cluster High Availability

Today I set up a ZooKeeper cluster following the documentation on the official ZooKeeper site. The link is:

http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_systemReq

Following the official recommendation, I set up a three-node test environment. After all three nodes had fully started, I checked their status: two were followers and one was the leader.

./bin/zkServer.sh status
JMX enabled by default
Using config: /opt/yunfei/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: follower

./bin/zkServer.sh status
JMX enabled by default
Using config: /opt/yunfei/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: leader
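
For reference, the zoo.cfg behind a three-node ensemble like this one looks roughly like the sketch below. The data directory and the zk1/zk2/zk3 hostnames are placeholders for illustration, not the actual values from my environment.

# Basic timing: one tick is 2000 ms; followers get initLimit ticks
# to connect and sync, and syncLimit ticks to stay in sync.
tickTime=2000
initLimit=10
syncLimit=5

# Snapshot/log directory and client port (placeholder path).
dataDir=/var/lib/zookeeper
clientPort=2181

# The three ensemble members: 2888 is the quorum port, 3888 the
# leader-election port. Each host also needs a myid file in dataDir
# containing its own id (1, 2, or 3).
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888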

Next, I started testing high availability.

1. Kill one of the three ZooKeeper processes with kill -9. The cluster keeps running normally.

2. Kill two of the three ZooKeeper processes with kill -9. The whole cluster stops serving.

Checking the status now shows:

./bin/zkServer.sh status
JMX enabled by default
Using config: /opt/yunfei/zookeeper-3.4.6/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

In other words, with three servers the cluster keeps working normally if one of them goes down, but not if two go down.
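
Instead of logging into each box and running zkServer.sh status, the four-letter command srvr can check every node in one pass. A rough sketch, assuming the three nodes are reachable as zk1, zk2 and zk3 (placeholder hostnames) on client port 2181:

# Print each node's role; a node that is down or has lost quorum
# will not report a Mode line.
for h in zk1 zk2 zk3; do
    echo "== $h =="
    echo srvr | nc -w 2 "$h" 2181 | grep Mode || echo "no Mode reported (down or no quorum)"
done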

After going back to the official documentation, I found the following passage:

Cross Machine Requirements
For the ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority. For this reason, ZooKeeper deployments are usually made up of an odd number of machines.
To achieve the highest probability of tolerating a failure you should try to make machine failures independent. For example, if most of the machines share the same switch, failure of that switch could cause a correlated failure and bring down the service. The same holds true of shared power circuits, cooling systems, etc.

To summarize, the key points are:

1. A ZooKeeper ensemble should be deployed with 2xF+1 servers (i.e., an odd number), which allows F servers to fail (crash or otherwise) while the cluster keeps working. For example, with three servers, at most one may go down and the cluster continues to work normally; a five-node sketch follows this list.

2. When designing a ZooKeeper ensemble, maximize reliability and fault tolerance by placing the servers in different machine rooms, or at least connecting them to different switches within the same room, so that the failure of a single switch cannot take down the whole service. The same applies to shared power circuits, cooling systems, and so on.
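
As a concrete instance of the 2xF+1 rule with F=2, the server list in zoo.cfg for a five-node ensemble would simply name five members (placeholder hostnames); such an ensemble keeps serving as long as any three of the five can still reach each other.

# Five voting members: the ensemble survives the loss of any two.
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888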


Reposted from yunjianfei.iteye.com/blog/2164210