Some experience in Zookeeper operation and maintenance

Zookeeper is a distributed coordination framework with good performance that has been proven in production by many companies, so it is used in many scenarios. People generally use Zookeeper to implement service discovery (similar to DNS), configuration management, distributed locks, leader election, and so on. In these scenarios Zookeeper becomes a core dependency, and its stability deserves special attention.

  Qunar also relies on Zookeeper in many scenarios, so we have been exploring how to operate and maintain a stable Zookeeper cluster. We have hit some pitfalls over the past few years, and have even had outages caused by Zookeeper. Here we share some of our experience operating and maintaining Zookeeper clusters, and we welcome better suggestions.

  Before we get into operating and maintaining a Zookeeper cluster, let's first go over some of Zookeeper's basic principles.

  There are three roles in a cluster: Leader, Follower, and Observer. Leaders and Followers participate in voting; an Observer only "listens" to the voting results and does not vote.

  The number of nodes in the voting cluster is required to be odd. The relationship between cluster size and fault tolerance is N = 2F + 1, where N is the number of voting nodes and F is the number of nodes that can fail at the same time without losing the cluster. For example, a three-node cluster can survive one failed node, and a five-node cluster can survive two.

  A write operation requires acknowledgement from more than half of the voting nodes, so the more voting nodes a cluster has, the more failures it can withstand (more reliable), but the worse its write throughput.

  Zookeeper keeps all nodes and node data in memory, organized as a tree, and regularly dumps snapshots of this tree to disk.

  A Zookeeper client maintains a long-lived connection to the server, with heartbeats. The client negotiates a session timeout with the server: the server is configured with a minimum and a maximum, and if the client's requested value falls between the two it is used as-is; if it is below the minimum, the minimum is used, and if it is above the maximum, the maximum is used. If no heartbeat is received within the session timeout, the session expires.
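
  As a concrete illustration, here is a minimal zoo.cfg sketch of the server-side settings that bound the negotiated session timeout (the values are examples; by default minSessionTimeout is 2×tickTime and maxSessionTimeout is 20×tickTime):

  tickTime=2000
  minSessionTimeout=4000    # a client asking for less is clamped up to 4000ms
  maxSessionTimeout=40000   # a client asking for more is clamped down to 40000ms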

  A client can watch a node or its data in Zookeeper's tree, and will be notified when it changes.
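
  For example, using the zkCli.sh shipped in the Zookeeper distribution (3.4.x syntax; the path /demo is made up for illustration):

  $ bin/zkCli.sh -server zk1:2181
  [zk: zk1:2181(CONNECTED) 0] get /demo watch
  # a one-shot watch is now set; when /demo changes, the client
  # prints a WatchedEvent notification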

  With these basics in mind, we have a baseline to work from.

  1. Minimum production cluster

  To keep Zookeeper running stably, we must ensure that voting can proceed normally, and ideally losing a single node should not leave the cluster unable to tolerate any further failure, so we generally require at least 5 voting nodes in a production deployment.
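
  A minimal sketch of the voting-member list for such a cluster (hostnames zk1 through zk5 are placeholders; the ports follow the convention used later in this article):

  server.1=zk1:2801:3801
  server.2=zk2:2801:3801
  server.3=zk3:2801:3801
  server.4=zk4:2801:3801
  server.5=zk5:2801:3801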

  2. Network

  Beyond node count, we also need to consider whether the failure of a single physical machine, cabinet, or switch could take down the whole cluster, so the network placement of the nodes matters too. This may be stricter than the requirements for many application servers.

  3. Divide into groups to protect core groups

  Keeping the whole Zookeeper cluster reliable comes down to keeping the voting cluster reliable. Here we divide a Zookeeper cluster into several small groups. We call the Leader plus the Followers the core group, and we generally do not serve clients from the core group; instead we add Observers for different businesses. For example, if one Zookeeper cluster serves three different components (service discovery, messaging, and scheduled tasks), we set up three Observer Groups for those components, and each client connects only to its assigned Observer Group, never to the core group.

  This way the core group does not serve long-lived client connections or handle their heartbeats, which greatly reduces its load: in a real environment a Zookeeper cluster may serve tens of thousands of machines, and maintaining that many long-lived connections and heartbeats consumes real resources. Because Observers do not participate in voting, adding Observers does not reduce overall throughput, and an Observer failure does not affect the health of the cluster.
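
  A sketch of how an Observer is declared, following the syntax in the Zookeeper Observers guide (hostnames are placeholders; the core group is server.1 through server.5 as above). Every server's config lists the observer with the :observer suffix, and the observer's own config additionally sets peerType:

  # in the observer's own zoo.cfg
  peerType=observer
  # in every zoo.cfg in the ensemble (core group and observers alike)
  server.6=obs-discovery-1:2801:3801:observer
  # clients of the service-discovery component then connect only to
  # obs-discovery-1:2181 (and its sibling observers), not to the core group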

  Note, however, that Observer Groups only solve part of the problem, because all writes must still be handled by the core group. So for applications with a particularly heavy write load, cluster-level isolation is still necessary: Storm and Kafka, for example, put a lot of pressure on Zookeeper, and you can't co-locate them with a service discovery cluster.

  4. Memory

  Because Zookeeper keeps all data in memory, the JVM heap and the machine's memory should be planned in advance. If the machine starts swapping, the performance of the Zookeeper cluster suffers badly. This is one reason I generally do not recommend using Zookeeper as a general-purpose configuration management service: configuration data tends to be fairly large, and keeping all of it in memory is not very controllable.

  5. Log cleanup

  Zookeeper frequently writes the txlog (a sequential log of every write) and regularly dumps memory snapshots to disk, so disk usage keeps growing. Zookeeper provides a mechanism to clean up these files, but it is not very well designed: you can only set a cleanup interval, not a specific time window, so cleanup may run during peak hours. We therefore recommend turning it off with autopurge.purgeInterval=0 and using something like crontab to clean up during the business's off-peak hours.
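
  For example (the path and schedule are illustrative; zkCleanup.sh ships with the distribution and wraps org.apache.zookeeper.server.PurgeTxnLog, and in the 3.4 branch takes -n for the number of snapshots to keep):

  # zoo.cfg: disable the built-in purge task
  autopurge.purgeInterval=0
  # crontab: purge at 04:00 daily, keeping the 5 most recent snapshots
  0 4 * * * /opt/zookeeper/bin/zkCleanup.sh -n 5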

  6. Logging and JVM configuration

  Starting Zookeeper straight from the package downloaded from the official website, with its default configuration, is a bad idea. With the default configuration, logs are not rotated and are written directly to the terminal. We did not understand this at first, and after running for a while we found a huge zookeeper.out log file. The default configuration also sets no JVM parameters (so the heap size is whatever the JVM default is), which is also undesirable. Some people then suggest modifying Zookeeper's startup scripts; it is best not to. Zookeeper loads a script named zookeeper-env.sh from the conf directory, so you can put your customizations there instead of editing the scripts that ship with Zookeeper, for example:

  #!/usr/bin/env bash
  # conf/zookeeper-env.sh -- example values, adjust to your environment
  JAVA_HOME=/usr/java/default                  # example java home
  ZOO_LOG_DIR=/data/logs/zookeeper             # example path for log files
  ZOO_LOG4J_PROP="INFO,ROLLINGFILE"            # enable log rotation
  # example JVM settings: heap size, GC logging, etc.
  JVMFLAGS="-Xms4g -Xmx4g -Xloggc:${ZOO_LOG_DIR}/gc.log"

  7. Address

  In a real environment we may need to migrate a Zookeeper cluster for all kinds of reasons, such as machines going out of warranty or hardware failures, so Zookeeper addresses are a headache. There are two kinds of address. The first is the address given to clients: we recommend delivering it through configuration rather than having users hardcode it directly (we did not do this well early on). The other is the addresses in the cluster configuration, which the servers need in order to talk to each other. Our approach is to set hosts entries:

  192.168.1.20 zk1
  192.168.1.21 zk2
  192.168.1.22 zk3

  in the configuration:

  server.1=zk1:2801:3801
  server.2=zk2:2801:3801
  server.3=zk3:2801:3801

  This way, when we need to migrate, we stop the old node and start the new one, changing nothing but the hosts mapping. For example, if server.3 needs to be migrated, we map zk3 to the new IP address in hosts. There is a catch with Java, though: by default Java can cache DNS lookups for a very long time, so even after remapping zk3 to another IP, server.1 and server.2 will not resolve the new IP unless they are restarted. For this you need to set networkaddress.cache.ttl=60 (or some other small number) in the $JAVA_HOME/jre/lib/security/java.security file.
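
  A sketch of migrating server.3 under this scheme (IPs and paths are illustrative):

  # 1. stop the old zk3 and provision the new machine with the same myid (3)
  # 2. on every node, repoint the hosts entry, e.g. in /etc/hosts:
  #      192.168.1.30 zk3    (was 192.168.1.22)
  # 3. start Zookeeper on the new machine; server.1 and server.2 will pick up
  #    the new address, provided the JVM DNS cache is bounded:
  #      $JAVA_HOME/jre/lib/security/java.security -> networkaddress.cache.ttl=60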

  We also ran into a rather embarrassing situation with this migration problem, which is covered in the pitfalls section at the end.

  8. Log Location

  Zookeeper mainly generates three kinds of IO: the txlog (every write operation, including creating a session, is logged), snapshots, and the application's own logs. It is generally recommended to spread these three across three different disks. We have never actually run this experiment, though; our Zookeeper runs on virtual machines (whose IO is generally considered poor).
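
  If you do want to split them, zoo.cfg supports separate locations for snapshots and the transaction log (the paths are examples):

  dataDir=/disk1/zookeeper/snap       # snapshots (and the myid file)
  dataLogDir=/disk2/zookeeper/txlog   # txlog on its own disk
  # the application log location is controlled by ZOO_LOG_DIR (see section 6)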

  9. Monitoring

  We have done some monitoring on Zookeeper:

  a. Whether the cluster is writable. A scheduled task periodically creates and deletes nodes. Note that Zookeeper is a cluster, but when monitoring we want to monitor each individual node, so these operations should connect directly to a single node rather than to the whole cluster; see the sketch after this list.

  b. Monitor the number of watchers and connections. Large fluctuations in these two numbers in particular can reveal misuse by users.

  c. Record network traffic and client IPs in the monitoring system, so that the "black sheep" can be found quickly.
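
  A sketch of such checks using zkCli.sh and the four-letter admin commands (the host zk1 and the /monitor_probe path are made up; each check deliberately targets a single server; the one-shot zkCli.sh command form is as in the 3.4 branch, and newer versions require the four-letter commands to be whitelisted via 4lw.commands.whitelist):

  ZK=zk1
  # a. writability probe: create and then delete a throwaway node
  bin/zkCli.sh -server $ZK:2181 create /monitor_probe ok
  bin/zkCli.sh -server $ZK:2181 delete /monitor_probe
  # b. watcher and connection counts
  echo wchs | nc $ZK 2181    # summary of watches
  echo cons | nc $ZK 2181    # per-connection details, including client IPs
  # c. traffic counters (received/sent), latency, server mode
  echo stat | nc $ZK 2181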

  10. Some usage suggestions

  a. Don't depend on Zookeeper too heavily; that is, the business should still be able to run when Zookeeper has problems. Zookeeper is a distributed coordination framework whose main job is consistency in a distributed environment. That is a very demanding goal, so its stability is affected by many factors. For example, we often use Zookeeper for service discovery, but service discovery does not actually require strict consistency: we can cache the server list, so that when Zookeeper has problems, things keep working. In this respect etcd does better: when Zookeeper is partitioned, the minority side cannot serve anything, not even reads, whereas the minority side of etcd can still serve reads, which is good for service discovery.

  b. Don't stuff large amounts of data into Zookeeper, as mentioned above.

  c. Don't use Zookeeper for fine-grained locks. For example, many businesses take distributed locks at the granularity of a single order, which means frequent interaction with Zookeeper and a lot of pressure on it, and once something goes wrong the impact is wide. Coarse-grained locks are fine (leader election is, in effect, a kind of lock).

  d. The second reason general-purpose configuration management is not recommended: such a configuration service serves many systems, and some public configurations are used by nearly all of them. Once such a configuration changes, Zookeeper broadcasts it to all watchers, and then all clients pull it, producing a huge burst of network traffic in an instant, the so-called "thundering herd". A purpose-built configuration system generally uses queuing or batched notifications for this kind of change.

  11. Some pitfalls

  a. The 3.4.5 Zookeeper client has a problem with its ping-interval algorithm: after a ping fails due to network jitter or similar, the connection is dropped. 3.4.6 resolves this issue (ZOOKEEPER-1751).

  b. If the Zookeeper client is disconnected due to network jitter and later reconnects, it automatically re-registers all of its previously registered watchers and so on. Zookeeper has a default 1MB limit on a single packet, which this re-registration often exceeds, resulting in constant retries. This has been fixed in newer versions (ZOOKEEPER-706).

  c. An UnresolvedAddressException caused Zookeeper's election thread to exit, so the cluster could no longer elect a leader and was on the verge of collapse. What happened: OPS migrated some machines and the old machines were recycled, so the old IPs and hostnames no longer existed, and eventually an UnresolvedAddressException was thrown. Zookeeper's election thread (the Listener in the QuorumCnxManager class) only catches IOException, so the thread exited. Once that thread is gone, as soon as the current leader has a problem and a re-election is needed, no new leader can be elected and the whole cluster collapses. ZOOKEEPER-2319 (PS: I reported this bug).
