Kafka Learning Road (3): Kafka's High Availability

1. The origin of high availability

1.1 Why do you need Replication

  In Kafka versions prior to 0.8 there was no Replication. Once a Broker went down, none of the Partition data on it could be consumed, which conflicts with Kafka's design goals of data durability and Delivery Guarantee. At the same time, the Producer could no longer store data in those Partitions.

  If the Producer uses synchronous mode, it throws an Exception after retrying message.send.max.retries times (the default is 3). The user can then choose to stop sending subsequent data or to continue sending. The former blocks the data flow, while the latter loses the data that should have been sent to the failed Broker.

  If the Producer uses asynchronous mode, it retries message.send.max.retries times (the default is 3), then logs the exception and continues sending subsequent data. This causes data loss, and the user can only discover the problem from the logs. Moreover, Kafka's Producer at that time provided no callback interface for asynchronous mode.

  Clearly, without Replication, once a machine goes down or a Broker stops working, the availability of the whole system drops. As the cluster grows, the probability of such failures occurring somewhere in the cluster rises sharply, so introducing a Replication mechanism is essential for a production system.

1.2 Leader Election

  With Replication, the same Partition may have multiple Replicas, so a Leader must be elected among them. The Producer and Consumer interact only with the Leader, while the other Replicas act as Followers that copy data from it.

  A Leader is needed to guarantee consistency among the multiple Replicas of a Partition (after one of them goes down, another Replica must be able to continue serving without duplicating or losing data). Without a Leader, all Replicas could read and write data at the same time and would have to synchronize with each other (N×N paths); guaranteeing consistency and ordering would be very difficult, greatly increasing the complexity of the Replication implementation as well as the chance of anomalies. With a Leader, only the Leader handles reads and writes, and the Followers simply fetch data from the Leader in order (N paths), making the system simpler and more efficient.

2. Analysis of Kafka HA design

2.1 How to distribute all Replicas evenly across the cluster

For better load balancing, Kafka tries to distribute all Partitions evenly across the whole cluster. A typical deployment has more Partitions per Topic than Brokers. To improve fault tolerance, the Replicas of the same Partition should also be spread across different machines as much as possible: if all Replicas of a Partition sit on the same Broker, then once that Broker goes down all of those Replicas stop working and HA is not achieved. In addition, when a Broker goes down, its load should be spread evenly over all the surviving Brokers.

Kafka allocates Replicas with the following algorithm (a code sketch follows the list):

1. Sort all Brokers (assume there are n of them) and the Partitions to be allocated

2. Assign the i-th Partition to the (i mod n)-th Broker

3. Assign the j-th Replica of the i-th Partition to the ((i + j) mod n)-th Broker
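
The following is a minimal sketch of that allocation rule in Java, assuming the sorted Brokers are simply indexed 0..n-1; it is not Kafka's actual implementation, which adds further refinements such as a randomized starting Broker.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaAssignmentSketch {

    /**
     * For each of the numPartitions Partitions, returns the indices of the
     * Brokers (0..numBrokers-1, assumed already sorted) that hold its Replicas:
     * the j-th Replica of Partition i goes to Broker (i + j) mod n.
     */
    static List<List<Integer>> assign(int numPartitions, int numBrokers, int replicationFactor) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            List<Integer> replicas = new ArrayList<>();
            for (int j = 0; j < replicationFactor; j++) {
                replicas.add((i + j) % numBrokers); // j = 0 is the first (preferred) Replica on Broker i mod n
            }
            assignment.add(replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 6 Partitions spread over 3 Brokers with replication factor 2
        System.out.println(assign(6, 3, 2));
        // -> [[0, 1], [1, 2], [2, 0], [0, 1], [1, 2], [2, 0]]
    }
}
```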

2.2 Data Replication (replication strategy)

The guarantee of Kafka's high reliability comes from its robust replication strategy.

2.2.1 Message synchronization strategy

When the Producer publishes a message to a Partition, it first finds the Partition's Leader through ZooKeeper, and then, no matter what the Topic's Replication Factor is, it sends the message only to that Leader. The Leader writes the message to its local Log, and each Follower pulls data from the Leader, so the order of data stored on the Followers is consistent with the Leader's. After a Follower receives the message and writes it to its Log, it sends an ACK to the Leader. Once the Leader has received ACKs from all Replicas in the ISR, the message is considered committed; the Leader advances the HW (high watermark) and sends an ACK to the Producer.

To improve performance, each Follower sends an ACK to the Leader as soon as it receives the data, rather than after the data is written to its Log. Therefore, for a committed message, Kafka can only guarantee that it is stored in the memory of multiple Replicas; it cannot guarantee that the message has been persisted to disk, so it cannot fully guarantee that the message will still be consumable by the Consumer after a failure.

The Consumer reads messages from the Leader, and only committed messages are exposed to the Consumer.

The data flow of Kafka Replication is shown in the following figure:

2.2.2 How many replicas must be guaranteed before the ACK

For Kafka, a Broker is considered "alive" only if it meets two conditions:

  • It must maintain a session with ZooKeeper (via ZooKeeper's heartbeat mechanism).
  • As a Follower, it must be able to copy the Leader's messages in a timely manner and not fall "too far behind".

The Leader tracks the list of Replicas that are keeping up with it; this list is called the ISR (in-sync Replicas). If a Follower goes down or falls too far behind, the Leader removes it from the ISR. "Too far behind" here means either that the Follower's replicated messages lag the Leader by more than a threshold (configured via replica.lag.max.messages in $KAFKA_HOME/config/server.properties, default 4000), or that the Follower has not sent a fetch request to the Leader for longer than a timeout (configured via replica.lag.time.max.ms in the same file, default 10000).

Kafka's replication mechanism is neither fully synchronous nor purely asynchronous. Fully synchronous replication requires all working Followers to have copied a message before it is considered committed, which severely limits throughput (and high throughput is one of Kafka's key features). With asynchronous replication, Followers copy data from the Leader asynchronously and a message is considered committed as soon as the Leader has written it to its log; in that case, if the Followers all lag behind the Leader and the Leader suddenly goes down, data is lost. Kafka's ISR approach strikes a good balance between not losing data and throughput. Followers can copy data from the Leader in batches, which greatly improves replication performance (batched writes to disk) and greatly reduces the gap between Follower and Leader.

Note that Kafka only handles fail/recover faults, not "Byzantine" faults. A message is considered committed only after all Followers in the ISR have copied it from the Leader. This avoids the case where data written only to the Leader is lost when the Leader crashes before any Follower has replicated it (the Consumer can never consume such data). The Producer can choose whether to wait for a message to be committed, which is controlled by request.required.acks. This mechanism ensures that as long as the ISR contains one or more Followers, a committed message is not lost.
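
As an illustration, here is a minimal sketch using the legacy 0.8-era producer API that this article describes (broker addresses and the topic name are placeholders): request.required.acks=-1 waits until the message is committed, 1 waits only for the Leader's write, and 0 does not wait at all. Current clients expose the same choice through the acks setting.

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AckDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder addresses
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // -1: wait until the message is committed (acknowledged by the ISR)
        //  1: wait only for the Leader to write the message
        //  0: do not wait for any acknowledgement
        props.put("request.required.acks", "-1");

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<>("my-topic", "key", "a committed-or-nothing message"));
        producer.close();
    }
}
```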

2.2.3 Leader Election Algorithm

Leader election is essentially a distributed lock. There are two ways to implement a ZooKeeper-based distributed lock:

  • Node name uniqueness: multiple clients try to create the same node, and only the client that creates it successfully obtains the lock (sketched in code after this list)
  • Ephemeral sequential nodes: all clients create their own ephemeral sequential node under a directory, and only the one with the smallest sequence number acquires the lock
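
Below is a minimal sketch of the first approach using the plain ZooKeeper client; the znode path, ids, and timeouts are illustrative, not Kafka's actual election code. Whichever client manages to create the ephemeral znode holds the leadership until its session ends.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NodeUniquenessElection {

    /** Tries to become leader by creating an ephemeral znode; returns true on success. */
    static boolean tryBecomeLeader(ZooKeeper zk, String electionPath, String myId)
            throws KeeperException, InterruptedException {
        try {
            // Only one client can create this node; it is removed automatically
            // when the creator's session ends, which frees the "lock".
            zk.create(electionPath, myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false; // someone else is already the leader
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        boolean leader = tryBecomeLeader(zk, "/demo-leader", "replica-1");
        System.out.println(leader ? "I am the leader" : "Following the existing leader");
    }
}
```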

A very common way of electing a leader is "Majority Vote", but Kafka does not use this approach. In this mode, if there are 2f+1 Replicas (Leader plus Followers), then f+1 Replicas must have copied a message before the commit, and to guarantee that a new Leader can be elected correctly, no more than f Replicas may fail; any remaining f+1 Replicas are then guaranteed to include at least one that contains all the latest messages. This approach has a big advantage: the latency of the system depends only on the fastest Brokers, not the slowest one. Majority Vote also has disadvantages: to keep Leader Election workable, the number of Follower failures it can tolerate is relatively small. Tolerating the failure of 1 Follower requires at least 3 Replicas, and tolerating 2 failures requires at least 5. In other words, guaranteeing a high degree of fault tolerance in production requires a large number of Replicas, and a large number of Replicas causes a sharp drop in performance under large data volumes. This is why the algorithm is commonly used in systems that store shared cluster configuration, such as ZooKeeper, and rarely in systems that need to store large amounts of data. For example, HDFS's HA feature is based on a majority-vote journal, but its data storage does not use this approach.

Kafka dynamically maintains an ISR (in-sync Replicas) set in ZooKeeper; every Replica in the ISR has caught up with the Leader, and only members of the ISR can be elected Leader. In this mode, with f+1 Replicas, a Partition can tolerate the failure of f Replicas without losing any committed message. In most usage scenarios this is very advantageous: to tolerate f failures, Majority Vote and the ISR approach wait for the same number of Replicas before a commit, but the total number of Replicas the ISR approach requires is almost half of what Majority Vote requires.

Although Majority Vote, compared with the ISR approach, has the advantage of not waiting for the slowest Broker, the Kafka authors believe this can be mitigated by letting the Producer choose whether to block on the commit, and that the Replicas and disk space saved make the ISR mode worthwhile.

2.2.4 How to handle the failure of all Replicas

As long as at least one Follower remains in the ISR, Kafka guarantees that committed data is not lost, but if all Replicas of a Partition go down there is no such guarantee. There are two possible schemes in this case:

1. Wait for a Replica in the ISR to come back to life and choose it as the Leader

2. Choose the first Replica that comes back to life (not necessarily in the ISR) as the Leader

This is a simple trade-off between availability and consistency. If we must wait for a Replica in the ISR to come back, the unavailable time may be relatively long; and if all Replicas in the ISR can never be revived, or their data is lost, the Partition is unavailable forever. If instead the first Replica to come back is chosen as the Leader, then even though it is not in the ISR and may not contain all the committed messages, it becomes the Leader and serves as the data source for Consumers (as explained earlier, all reads and writes go through the Leader). Kafka 0.8.* uses the second approach. According to Kafka's documentation, a future version will let users choose between the two approaches through configuration (later releases expose this choice as the unclean.leader.election.enable setting), trading high availability against strong consistency according to the use case.

2.2.5 Election of Leader

The simplest and most intuitive scheme is for all Followers to set a Watch on ZooKeeper: once the Leader goes down, its ephemeral znode is deleted automatically, and all Followers then try to create that node; the one that succeeds (ZooKeeper guarantees that only one can succeed) becomes the new Leader, and the other Replicas remain Followers.

But there are 3 problems with this method:

1. Split-brain: this is caused by the nature of ZooKeeper. Although ZooKeeper guarantees that all Watches are triggered in order, it cannot guarantee that all Replicas "see" the same state at the same moment, which may lead to inconsistent behavior across Replicas

2. Herd effect: if the failed Broker holds many Partitions, many Watches are triggered at once, causing a storm of adjustments across the cluster

3. ZooKeeper overload: every Replica must register a Watch on ZooKeeper for this scheme. When the cluster grows to several thousand Partitions, the load on ZooKeeper becomes too heavy.

The Leader Election scheme in Kafka 0.8.* solves these problems. It elects a Controller among all the Brokers, and the Controller decides the Leader Election of all Partitions. The Controller notifies the Brokers affected by a Leader change directly via RPC (more efficient than a ZooKeeper-queue approach). The Controller is also responsible for adding and deleting Topics and for reassigning Replicas.

3. HA-related ZooKeeper structures

3.1 admin

The znodes under this directory exist only while a related operation is in progress and are deleted when the operation finishes.

/admin/reassign_partitions is used to move certain Partitions to a different set of Brokers. For each Partition to be reassigned, Kafka stores all of its Replicas and the corresponding Broker ids on this znode. The znode is created by the administration process and is removed automatically once the reassignment succeeds.

3.2 broker

Broker registration information (i.e. /brokers/ids/[brokerId]) stores information about the "alive" Brokers.

Topic registration information (/brokers/topics/[topic]) stores, for every Partition of the Topic, the Broker ids of all its Replicas. The first Replica is the preferred Replica, and since a given Partition has at most one Replica on any Broker, the Broker id can be used as the Replica id.

3.3 controller

/controller -> int (broker id of the controller) stores the information of the current controller

/controller_epoch -> int (epoch) stores the controller epoch directly as an integer instead of as a JSON string like other znodes.

4. The Producer publishes messages

4.1 Writing method

The Producer publishes messages to the Broker in push mode; each message is appended to its Partition, which is a sequential write to disk (sequential disk writes are more efficient than random writes to memory, which helps guarantee Kafka's throughput).

4.2 Message Routing

When the Producer sends a message to the Broker, it chooses which Partition to store it in according to the partitioning algorithm. The routing rules are as follows (a code sketch follows the list):

1. If a Partition is specified, use it directly;
2. If no Partition is specified but a key is given, choose a Partition by hashing the key's value;
3. If neither a Partition nor a key is specified, choose a Partition by round-robin.
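
Here is a minimal sketch of those three rules; it is not Kafka's actual partitioner class (the Producer's pluggable partitioner and its defaults vary by version).

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RoutingSketch {
    private final AtomicInteger roundRobin = new AtomicInteger();

    /** explicitPartition may be null; key may be null; numPartitions > 0. */
    int choosePartition(Integer explicitPartition, String key, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition;                             // rule 1: Partition given explicitly
        }
        if (key != null) {
            return (key.hashCode() & 0x7fffffff) % numPartitions; // rule 2: hash of the key
        }
        return roundRobin.getAndIncrement() % numPartitions;      // rule 3: round-robin
    }

    public static void main(String[] args) {
        RoutingSketch r = new RoutingSketch();
        System.out.println(r.choosePartition(2, null, 4));          // -> 2
        System.out.println(r.choosePartition(null, "user-42", 4));  // same key always maps to the same Partition
        System.out.println(r.choosePartition(null, null, 4));       // cycles 0, 1, 2, 3, 0, ...
    }
}
```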

4.3 Writing Process

The sequence diagram for the Producer writing a message is as follows:

Flow Description:

1. The Producer first finds the Partition's Leader from the "/brokers/.../state" node in ZooKeeper
2. The Producer sends the message to the Leader
3. The Leader writes the message to its local log
4. The Followers pull the message from the Leader and send an ACK to the Leader after writing it to their local logs
5. After the Leader receives ACKs from all Replicas in the ISR, it advances the HW (high watermark, the offset of the last committed message) and sends an ACK to the Producer

5. The broker saves the message

5.1 Storage method

Physically, a Topic is divided into one or more Partitions (corresponding to the num.partitions=3 setting in server.properties). Each Partition corresponds to a folder on disk, named <topic>-<partitionId>, which stores all of that Partition's messages and index files.

5.2 Storage Policy

Kafka persists all messages whether they are consumed or not. There are two strategies for deleting old data:

1. Based on time: log.retention.hours=168 (i.e. 7 days)
2. Based on size: log.retention.bytes=1073741824 (i.e. 1 GiB)

6. Topic creation and deletion

6.1 Create topic

The sequence diagram for creating a topic is as follows:

Flow Description:

1. The Controller registers a watcher on ZooKeeper's /brokers/topics node. When a Topic is created, the Controller obtains the Topic's partition/replica assignment through that watch.
2. The Controller reads the list of all currently available Brokers from /brokers/ids, and for each Partition in set_p:
     2.1. selects an available Broker from the Replicas assigned to that Partition (the AR) as the new Leader, and sets the AR as the new ISR
     2.2. writes the new Leader and ISR to /brokers/topics/[topic]/partitions/[partition]/state
3. The Controller sends a LeaderAndIsrRequest to the relevant Brokers via RPC.

6.2 Delete topic

The sequence diagram for deleting a topic is as follows:

Flow Description:

1. The Controller registers a watcher on ZooKeeper's /brokers/topics node. When a Topic is deleted, the Controller obtains the Topic's partition/replica assignment through that watch.
2. If delete.topic.enable=false, the process ends here; otherwise, the watch the Controller registered on /admin/delete_topics fires, and the Controller sends a StopReplicaRequest to the corresponding Brokers through its callback.

7. Broker failover

The sequence diagram of kafka broker failover is as follows:

Flow Description:

1. The Controller registers a Watcher on ZooKeeper's /brokers/ids/[brokerId] node, and ZooKeeper fires the watch when the Broker goes down
2. The Controller reads the available Brokers from the /brokers/ids node
3. The Controller determines set_p, the set of all Partitions on the failed Broker
4. For each Partition in set_p
    4.1. read the ISR from the /brokers/topics/[topic]/partitions/[partition]/state node
    4.2. decide on the new Leader
    4.3. write the new Leader, ISR, controller_epoch and leader_epoch information to the state node
5. Send a LeaderAndIsrRequest to the relevant Brokers via RPC

8. Controller failover

A Controller failover is triggered when the Controller goes down. Every Broker registers a watcher on ZooKeeper's "/controller" node. When the Controller goes down, its ephemeral znode disappears, all surviving Brokers are notified, and each tries to create a new /controller node; only one can succeed, and it is elected as the new Controller.

When a new Controller is elected, the KafkaController.onControllerFailover method is triggered, which performs the following operations:

1. Read and increment the Controller Epoch.
2. Register a watcher on the reassignedPartitions Path (/admin/reassign_partitions).
3. Register a watcher on the preferredReplicaElection Path (/admin/preferred_replica_election).
4. Register a watcher on the broker Topics Path (/brokers/topics) through partitionStateMachine.
5. If delete.topic.enable=true (the default is false), partitionStateMachine registers a watcher on the Delete Topic Path (/admin/delete_topics).
6. Register a Watch on the Broker Ids Path (/brokers/ids) through replicaStateMachine.
7. Initialize the ControllerContext object, setting the current list of Topics, the list of "live" Brokers, the Leaders and ISRs of all Partitions, and so on.
8. Start replicaStateMachine and partitionStateMachine.
9. Set the brokerState to RunningAsController.
10. Send the leadership information of each Partition to all "live" Brokers.
11. If auto.leader.rebalance.enable=true (the default), start the partition-rebalance thread.
12. If delete.topic.enable=true and the Delete Topic Path (/admin/delete_topics) contains entries, delete the corresponding Topics.

 
