0x00 Summary
Building on the previous article in this series, this article explains Kafka's HA mechanism in more depth. It walks through the main HA-related scenarios: Broker failover, Controller failover, Topic creation / deletion, Broker startup, and how a Follower fetches data from the Leader. It also describes the Replication-related tools that Kafka provides, such as the Partition reassignment tool.
0x01 Broker Failover process
1.1 How the Controller handles Broker failure
1. The Controller registers a Watch on the /brokers/ids node in Zookeeper. Once a Broker goes down (here "down" means any scenario in which Kafka considers the Broker dead, including but not limited to machine power-off, network unavailability, a GC-induced Stop The World pause, or a process crash), its Znode in Zookeeper is deleted automatically, Zookeeper fires the Watch registered by the Controller, and the Controller obtains the latest list of surviving Brokers.
2. The Controller determines set_p, the set of all Partitions hosted on the failed Brokers.
3. For each Partition in set_p:
  3.1 Read the Partition's current ISR from /brokers/topics/[topic]/partitions/[partition]/state.
  3.2 Determine the new Leader of the Partition. If at least one Replica in the current ISR is still alive, choose one of them as the new Leader; the new ISR contains all surviving Replicas of the current ISR. Otherwise choose any surviving Replica of the Partition as the new Leader and ISR (data loss is possible in this scenario). If all Replicas of the Partition are down, the new Leader is set to -1.
  3.3 Write the new Leader, the new ISR, the new leader_epoch, and controller_epoch to /brokers/topics/[topic]/partitions/[partition]/state. Note that this write is performed only if the version has not changed between 3.1 and 3.3; otherwise go back to 3.1.
4. Send LeaderAndIsrRequest commands directly to the Brokers related to set_p via RPC. The Controller can send multiple commands in one RPC operation to improve efficiency.
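The Leader selection rule in step 3.2 can be sketched as follows. This is a minimal illustration, not Kafka's code: the function name elect_leader is hypothetical, picking the first candidate is an assumption (the text only says "one of them"), and the empty ISR returned when every Replica is down is also an assumption, since the text only specifies the Leader value -1 for that case.

```python
from typing import List, Set, Tuple

NO_LEADER = -1  # Leader value the article gives when every Replica is down


def elect_leader(assigned: List[int], isr: List[int],
                 live: Set[int]) -> Tuple[int, List[int]]:
    """Return (new_leader, new_isr) for one Partition after Broker failures."""
    live_isr = [b for b in isr if b in live]
    if live_isr:
        # At least one in-sync Replica survived: pick one of them and
        # keep only the surviving members of the old ISR.
        return live_isr[0], live_isr
    live_assigned = [b for b in assigned if b in live]
    if live_assigned:
        # No in-sync Replica survived: fall back to any live Replica.
        # Data loss is possible in this branch.
        return live_assigned[0], [live_assigned[0]]
    # Every Replica is down (the empty ISR is an assumption of this sketch).
    return NO_LEADER, []


print(elect_leader([1, 2, 3], [1, 2], live={2, 3}))
```

With AR = [1, 2, 3], ISR = [1, 2] and Broker 1 down, the sketch keeps Broker 2 as the only ISR member and makes it the Leader.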
The Broker failover sequence is shown in the figure.
LeaderAndIsrRequest has the following structure.
LeaderAndIsrResponse has the following structure.
1.2 Creating / deleting a Topic
1. The Controller registers a Watch on the /brokers/topics node in Zookeeper. Once a Topic is created or deleted, the Controller obtains, through the Watch, the Partition / Replica assignment of the newly created / deleted Topic.
2. For a Topic deletion, the Topic tool stores the Topic name under /admin/delete_topics. If delete.topic.enable is true, the Watch that the Controller registered on /admin/delete_topics fires, and the Controller sends StopReplicaRequest to the corresponding Brokers through a callback. If it is false, the Controller does not register a Watch on /admin/delete_topics and does not respond to the event, so the deletion is only recorded and never executed.
3. For a Topic creation, the Controller reads the list of all currently available Brokers from /brokers/ids, and for each Partition in set_p:
  3.1 From all Replicas assigned to the Partition (called AR), choose any available Broker as the new Leader, and set AR as the new ISR. Because the Topic is newly created, none of the Replicas in AR holds any data, so all of them are considered in sync, i.e. all are in the ISR, and any of them can serve as Leader.
  3.2 Write the new Leader and ISR to /brokers/topics/[topic]/partitions/[partition].
4. Send LeaderAndIsrRequest directly to the relevant Brokers via RPC.
The Topic creation sequence is shown in the figure.
1.3 How a Broker responds to requests
A Broker accepts and responds to all kinds of requests through kafka.network.SocketServer and its related modules. The whole network communication module is built on Java NIO and uses the Reactor pattern: 1 Acceptor accepts client connections, N Processors read and write data, and M Handlers process the business logic.
The Acceptor's main job is to listen for and accept connection requests from clients (the request initiators, including but not limited to Producer, Consumer, Controller, and Admin Tool), set up a data channel with the client, and then assign a Processor to that client. At that point its work for this connection request is done, and it can go on to serve the next client's connection request. The core code is as follows.
The Processor is responsible for reading data from the client and returning the response to the client. It does not handle any business logic itself, and it maintains an internal queue that holds all SocketChannels assigned to it. The Processor's run method repeatedly takes new SocketChannels from this queue and registers their SelectionKey.OP_READ with the selector, then loops over the reads (requests) and writes (responses) that are ready. After reading the data, the Processor wraps it in a Request object and hands it to RequestChannel.
RequestChannel is where the Processor and KafkaRequestHandler exchange data. It contains a requestQueue, which holds the Requests added by Processors and from which KafkaRequestHandler takes Requests to process. It also contains a responseQueue, which holds the Responses that KafkaRequestHandler produces after processing a Request and that are to be returned to the client.
The Processor takes the Responses stored in RequestChannel's responseQueue out one by one through the processNewResponses method and registers the corresponding SelectionKey.OP_WRITE event with the selector. When the selector's select method returns and a writable channel is detected, the write method is called to send the Response back to the client.
KafkaRequestHandler loops, taking Requests from RequestChannel and handing them to kafka.server.KafkaApis, which handles the specific business logic.
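The hand-off between Processor and KafkaRequestHandler described above can be pictured with two in-memory queues. The sketch below is only an analogy in Python, not Kafka's implementation: request_queue and response_queue stand in for requestQueue and responseQueue, handler_loop stands in for KafkaRequestHandler, and the string it builds stands in for the work done by KafkaApis. All of these names are hypothetical.

```python
import queue
import threading
from typing import List

request_queue: "queue.Queue" = queue.Queue()    # filled by the Processor side
response_queue: "queue.Queue" = queue.Queue()   # filled by the handler side
STOP = object()  # sentinel that tells the handler thread to exit


def handler_loop() -> None:
    """Take requests one by one, run the business logic, queue the response."""
    while True:
        request = request_queue.get()
        if request is STOP:
            break
        response_queue.put(f"response to {request}")  # stand-in for KafkaApis


def run_demo(requests: List[str]) -> List[str]:
    worker = threading.Thread(target=handler_loop)
    worker.start()
    for request in requests:      # Processor side: enqueue parsed requests
        request_queue.put(request)
    request_queue.put(STOP)
    worker.join()
    # Processor side: drain the responses that are ready to be written back.
    return [response_queue.get() for _ in requests]


print(run_demo(["fetch", "produce"]))
```

With a single handler thread the responses come back in request order; with M handlers, as in the text, the ordering across requests would depend on scheduling.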
1.4 LeaderAndIsrRequest response process
When a Broker receives a LeaderAndIsrRequest, it handles it mainly through ReplicaManager's becomeLeaderOrFollower, as follows:
1. If the controllerEpoch in the request is smaller than the current latest controllerEpoch, return ErrorMapping.StaleControllerEpochCode directly.
2. For each element of partitionStateInfos in the request, i.e. ((topic, partitionId), partitionStateInfo):
  2.1 If the leader epoch in partitionStateInfo is greater than the leader epoch that ReplicaManager currently stores for the partition identified by (topic, partitionId), then:
    2.1.1 If the current broker id (or replica id) is in partitionStateInfo, put the partition and partitionStateInfo into a HashMap named partitionState.
    2.1.2 Otherwise the Broker is not in the Replica list assigned to the Partition, and this is recorded in the log.
  2.2 Otherwise put the corresponding error code (ErrorMapping.StaleLeaderEpochCode) into the Response.
3. Filter partitionState: all records whose Leader equals the current Broker ID are stored in partitionsTobeLeader, and the other records are stored in partitionsToBeFollower.
4. If partitionsTobeLeader is not empty, run the makeLeaders method on it.
5. If partitionsToBeFollower is not empty, run the makeFollowers method on it.
6. If the highwatermark thread has not started yet, start it and set hwThreadInitialized to true.
7. Shut down all Fetchers that are in the idle state.
The LeaderAndIsrRequest handling process is shown in the figure below.
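The epoch check and the leader / follower split from the steps above can be sketched as follows. This is an illustration under assumptions, not the ReplicaManager code: the PartitionStateInfo fields, the function name split_partitions, and the default epoch of -1 for a partition the Broker has not seen before are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

TopicPartition = Tuple[str, int]


class PartitionStateInfo(NamedTuple):
    leader: int
    leader_epoch: int
    replicas: List[int]


def split_partitions(broker_id: int,
                     current_epochs: Dict[TopicPartition, int],
                     request_states: Dict[TopicPartition, PartitionStateInfo]):
    """Return (to_be_leader, to_be_follower, stale) for one request."""
    partition_state: Dict[TopicPartition, PartitionStateInfo] = {}
    stale: List[TopicPartition] = []
    for tp, info in request_states.items():
        # Assumption: an unknown partition has epoch -1 on this Broker.
        if info.leader_epoch > current_epochs.get(tp, -1):
            if broker_id in info.replicas:
                partition_state[tp] = info
            # Otherwise this Broker is not assigned the Partition: log only.
        else:
            stale.append(tp)  # reported with the stale leader epoch error code
    to_be_leader = {tp: i for tp, i in partition_state.items()
                    if i.leader == broker_id}
    to_be_follower = {tp: i for tp, i in partition_state.items()
                      if i.leader != broker_id}
    return to_be_leader, to_be_follower, stale


states = {
    ("topic1", 0): PartitionStateInfo(leader=1, leader_epoch=5, replicas=[1, 2]),
    ("topic1", 1): PartitionStateInfo(leader=2, leader_epoch=5, replicas=[1, 2]),
    ("topic1", 2): PartitionStateInfo(leader=1, leader_epoch=3, replicas=[1, 2]),
}
leaders, followers, stale = split_partitions(1, {("topic1", 2): 4}, states)
print(sorted(leaders), sorted(followers), stale)
```

In a real Broker the two maps would then be passed to makeLeaders and makeFollowers respectively.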
1.5 Broker startup process
When a Broker starts, it first creates an ephemeral child znode (Ephemeral Node) under /brokers/ids in Zookeeper according to its Broker ID. Once the node is created, the Broker Change Watch registered by the Controller's ReplicaStateMachine fires, and the KafkaController.onBrokerStartup callback completes the following steps:
1. Send UpdateMetadataRequest to all newly started Brokers. It is defined as follows.
2. Set all Replicas on the newly started Brokers to the OnlineReplica state. These Brokers also start the high watermark thread for these Partitions.
3. Trigger OnlinePartitionStateChange through partitionStateMachine.
1.6 Controller Failover
The Controller also needs failover. Every Broker registers a Watch on the Controller Path (/controller). When the current Controller fails, the Controller Path disappears automatically (because it is an Ephemeral Node), the Watch fires, and all "live" Brokers compete to become the new Controller by trying to create a new Controller Path. Only one of them succeeds (Zookeeper guarantees this). The winner becomes the new Controller, and the losers re-register their Watch on the new Controller Path. Because a Zookeeper Watch is one-shot and becomes invalid once it has fired, it has to be registered again.
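The "only one creation succeeds" property that the election relies on can be illustrated with a small in-memory stand-in for Zookeeper. This is a sketch under assumptions: FakeZookeeper and elect_controller are hypothetical names, no real Zookeeper client is involved, and Watch handling is reduced to a comment.

```python
import threading
from typing import Dict, List, Optional


class FakeZookeeper:
    """In-memory stand-in: only the first create of a path succeeds."""

    def __init__(self) -> None:
        self._nodes: Dict[str, int] = {}
        self._lock = threading.Lock()

    def create_ephemeral(self, path: str, owner: int) -> bool:
        with self._lock:
            if path in self._nodes:
                return False          # somebody else already won
            self._nodes[path] = owner
            return True

    def delete(self, path: str) -> None:
        with self._lock:
            self._nodes.pop(path, None)

    def get(self, path: str) -> Optional[int]:
        with self._lock:
            return self._nodes.get(path)


def elect_controller(zk: FakeZookeeper, live_brokers: List[int]) -> Optional[int]:
    """Every live Broker tries to create /controller; exactly one succeeds."""
    for broker in live_brokers:
        if not zk.create_ephemeral("/controller", broker):
            pass  # a loser would re-register its Watch on /controller here
    return zk.get("/controller")


zk = FakeZookeeper()
first = elect_controller(zk, [2, 3, 1])
zk.delete("/controller")              # simulates the Controller failing
second = elect_controller(zk, [3, 1])
print(first, second)
```

Which Broker wins in a real cluster depends on timing; here the first Broker in the list always wins because the attempts are made sequentially.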
After a Broker wins the election and becomes the new Controller, it triggers the KafkaController.onControllerFailover method, which completes the following steps:
1. Read and increment the Controller Epoch.
2. Register a Watch on the ReassignedPartitions Path (/admin/reassign_partitions).
3. Register a Watch on the PreferredReplicaElection Path (/admin/preferred_replica_election).
4. Register a Watch on the Broker Topics Path (/brokers/topics) through partitionStateMachine.
5. If delete.topic.enable is set to true (the default is false), partitionStateMachine registers a Watch on the Delete Topic Path (/admin/delete_topics).
6. Register a Watch on the Broker Ids Path (/brokers/ids) through replicaStateMachine.
7. Initialize the ControllerContext object, which holds all current Topics, the list of "live" Brokers, the Leader and ISR of every Partition, and so on.
8. Start replicaStateMachine and partitionStateMachine.
9. Set brokerState to RunningAsController.
10. Send the leadership information of every Partition to all "live" Brokers.
11. If auto.leader.rebalance.enable is set to true (the default is true), start the partition-rebalance thread.
12. If delete.topic.enable is set to true and the Delete Topic Path (/admin/delete_topics) has a value, delete the corresponding Topics.
1.7 Partition reassignment
After the management tool issues a Partition reassignment request, the corresponding information is written to /admin/reassign_partitions. This triggers ReassignedPartitionsIsrChangeListener, which performs the following operations through the callback KafkaController.onPartitionReassignment:
1. Update AR (Current Assigned Replicas) in Zookeeper to OAR (Original list of replicas for partition) + RAR (Reassigned replicas).
2. Force an update of the leader epoch in Zookeeper, and send LeaderAndIsrRequest to every Replica in AR.
3. Set the Replicas in RAR - OAR to the NewReplica state.
4. Wait until all Replicas in RAR are in sync with their Leader.
5. Set all Replicas in RAR to the OnlineReplica state.
6. Set AR in the cache to RAR.
7. If the Leader is not in RAR, elect a new Leader from RAR and send LeaderAndIsrRequest. If no new Leader has to be elected from RAR, the leader epoch in Zookeeper is still incremented.
8. Set all Replicas in OAR - RAR to the OfflineReplica state. This has two parts. First, remove OAR - RAR from the ISR in Zookeeper and send LeaderAndIsrRequest to the Leader to notify it that these Replicas have been removed from the ISR. Second, send StopReplicaRequest to the Replicas in OAR - RAR to stop the Replicas that are no longer assigned to the Partition.
9. Set all Replicas in OAR - RAR to the NonExistentReplica state so that they are removed from disk.
10. Set AR in Zookeeper to RAR.
11. Delete /admin/reassign_partitions.
Note: AR in Zookeeper is updated only in the last step because Zookeeper is the only place where AR is stored persistently. If the Controller crashes before this step, the new Controller can still pick up and complete the process.
The following is an example of a Partition reassignment with OAR = {1,2,3} and RAR = {4,5,6}. The AR and Leader / ISR paths in Zookeeper during the reassignment are as follows.
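The set arithmetic behind the reassignment steps can be traced with a few lines of Python. This is a sketch derived from the steps listed above, not a reproduction of the original figure. It assumes that all of OAR is initially in the ISR, that the initial Leader is Broker 1, and that the new Leader is simply the first Replica of RAR. The function name trace_reassignment is hypothetical.

```python
from typing import List, Tuple

State = Tuple[str, List[int], int, List[int]]  # (step, AR, leader, ISR)


def trace_reassignment(oar: List[int], rar: List[int], leader: int) -> List[State]:
    """Trace AR, Leader and ISR through the reassignment steps."""
    ar, isr = list(oar), list(oar)   # assumption: every Replica in OAR is in sync
    trace: List[State] = [("initial", list(ar), leader, list(isr))]

    # AR in Zookeeper becomes OAR + RAR.
    ar = ar + [r for r in rar if r not in ar]
    trace.append(("AR = OAR + RAR", list(ar), leader, list(isr)))

    # Once every Replica in RAR has caught up, it is part of the ISR.
    isr = isr + [r for r in rar if r not in isr]
    trace.append(("RAR in sync", list(ar), leader, list(isr)))

    # If the Leader is not in RAR, elect one from RAR.
    # Assumption: take the first Replica of RAR.
    if leader not in rar:
        leader = rar[0]
    trace.append(("Leader from RAR", list(ar), leader, list(isr)))

    # Replicas in OAR - RAR leave the ISR.
    isr = [r for r in isr if r in rar]
    trace.append(("OAR - RAR out of ISR", list(ar), leader, list(isr)))

    # Finally AR in Zookeeper is set to RAR.
    ar = list(rar)
    trace.append(("AR = RAR", list(ar), leader, list(isr)))
    return trace


trace = trace_reassignment([1, 2, 3], [4, 5, 6], leader=1)
for step, ar, leader, isr in trace:
    print(f"{step:22} AR={ar} leader={leader} ISR={isr}")
```

Because the persistent AR changes to RAR only in the last state, a Controller that takes over at any earlier state still sees OAR + RAR in Zookeeper and can resume the process.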
1.8 Follower fetching data from the Leader
A Follower gets messages by sending a FetchRequest to the Leader. FetchRequest has the following structure.
As the structure of FetchRequest shows, every Fetch request must specify the maximum wait time and the minimum number of bytes to fetch, as well as a Map made of TopicAndPartition and PartitionFetchInfo. In fact, both a Follower fetching data from the Leader and a Consumer fetching data from a Broker are done through FetchRequest, which is why the FetchRequest structure has a clientID field whose default value is ConsumerConfig.DefaultClientId.
After the Leader receives a Fetch request, Kafka responds to it through KafkaApis.handleFetchRequest, as follows:
1. replicaManager reads the data according to the request and stores it in dataRead.
2. If the request comes from a Follower, update that Follower's LEO (log end offset) and the High Watermark of the corresponding Partition.
3. Compute the readable message length (in bytes) from dataRead and store it in bytesReadable.
4. If any one of the following four conditions is met, return the data immediately:
- The Fetch request does not want to wait, i.e. fetchRequest.maxWait <= 0
- The Fetch request does not require any messages, i.e. fetchRequest.numPartitions <= 0, meaning requestInfo is empty
- Enough data is available to return, i.e. bytesReadable >= fetchRequest.minBytes
- An exception occurred while reading the data
5. If none of the four conditions is met, the FetchRequest does not return immediately and the request is wrapped in a DelayedFetch. Check whether the DelayedFetch is satisfied: if it is, return the request; otherwise add the request to the Watch List.
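The four "return immediately" conditions above reduce to a single boolean test. The sketch below is a minimal Python rendering of that test; the FetchRequest dataclass and the function should_return_immediately are hypothetical stand-ins whose field names mirror maxWait, minBytes and requestInfo from the text.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FetchRequest:
    max_wait: int                                   # maximum wait time
    min_bytes: int                                  # minimum bytes to fetch
    request_info: Dict[str, int] = field(default_factory=dict)  # partition -> offset


def should_return_immediately(request: FetchRequest,
                              bytes_readable: int,
                              read_error: bool = False) -> bool:
    """True if any of the four conditions from the text holds."""
    return (request.max_wait <= 0                   # the request does not want to wait
            or len(request.request_info) <= 0       # nothing was actually requested
            or bytes_readable >= request.min_bytes  # enough data is available
            or read_error)                          # an exception occurred while reading


waiting = FetchRequest(max_wait=500, min_bytes=1024, request_info={"topic1-0": 42})
print(should_return_immediately(waiting, bytes_readable=100))
print(should_return_immediately(waiting, bytes_readable=2048))
```

When the function returns False, the request would be wrapped in a DelayedFetch as described in the last step.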
The Leader returns the messages to the Follower in the form of a FetchResponse. FetchResponse has the following structure.
0x02 Replication Tool
2.1 Topic Tool
$KAFKA_HOME/bin/kafka-topics.sh
This tool can create, delete, modify, and describe a Topic, and can also list all Topics. In addition, it can modify the following configurations.
unclean.leader.election.enable
delete.retention.ms
segment.jitter.ms
retention.ms
flush.ms
segment.bytes
flush.messages
segment.ms
retention.bytes
cleanup.policy
segment.index.bytes
min.cleanable.dirty.ratio
max.message.bytes
file.delete.delay.ms
min.insync.replicas
index.interval.bytes
2.2 Replica Verification Tool
$KAFKA_HOME/bin/kafka-replica-verification.sh
This tool verifies that all Replicas of every Partition of one or more specified Topics are in sync. The topic-white-list parameter specifies the Topics to verify and supports regular expressions.
2.3 Preferred Replica Leader Election Tool
Purpose
With the Replication mechanism in place, each Partition may have more than one copy. The list of Replicas of a Partition is called AR (Assigned Replicas), and the first Replica in AR is the "Preferred Replica". When a new Topic is created or Partitions are added to an existing Topic, Kafka ensures that Preferred Replicas are distributed evenly across all Brokers in the cluster. Ideally, the Preferred Replica is chosen as Leader. Together these two points ensure that the Leaders of all Partitions are distributed evenly across the cluster. This matters because all reads and writes go through the Leader: if Leaders are too concentrated, the cluster load becomes unbalanced. As the cluster runs, however, Broker downtime may break this balance, and this tool helps restore an even Leader distribution.
In fact, after a Broker recovers from a failure, it takes the Follower role by default, unless all Replicas of a Partition were down and this Broker is the first Replica in the Partition's AR to come back. Therefore, after the Leader (Preferred Replica) of a Partition goes down and recovers, it is most likely no longer the Leader of that Partition, but it is still the Preferred Replica.
Principle
1. Create the /admin/preferred_replica_election node in Zookeeper and store in it the information about the Partitions whose Preferred Replica needs to be adjusted.
2. The Controller keeps a Watch on this node. Once the node is created, the Controller is notified and obtains its content.
3. The Controller reads the Preferred Replica. If this Replica is currently not the Leader but is in the ISR of the Partition, the Controller sends LeaderAndIsrRequest to the Replica to make it the Leader. If the Replica is currently not the Leader and is not in the ISR, the Controller does not make it the Leader, so that no data is lost.
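The decision in the last step can be written as a small function. This is a minimal sketch of the rule as the text states it, not the Controller's code; the name elect_preferred is hypothetical.

```python
from typing import List


def elect_preferred(assigned: List[int], leader: int, isr: List[int]) -> int:
    """Return the Leader after a preferred replica election for one Partition."""
    preferred = assigned[0]  # the first Replica in AR is the Preferred Replica
    if preferred != leader and preferred in isr:
        return preferred     # safe to switch: the Preferred Replica is in sync
    return leader            # already Leader, or not in the ISR: leave unchanged


# Preferred Replica 1 is back in the ISR, so it takes over from Broker 3.
print(elect_preferred([1, 2, 3], leader=3, isr=[3, 1]))
# Preferred Replica 2 is not in the ISR, so Broker 5 stays Leader.
print(elect_preferred([2, 5, 6], leader=5, isr=[5, 6]))
```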
Usage
$KAFKA_HOME/bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181
On a Kafka cluster of 8 Brokers, create a Topic named topic1 with a replication-factor of 3 and 8 Partitions, then use the $KAFKA_HOME/bin/kafka-topics.sh --describe --topic topic1 --zookeeper localhost:2181 command to view its Partition / Replica distribution.
The result is shown in the figure below. As the figure shows, Kafka distributes all Replicas evenly across the whole cluster, and the Leaders are evenly distributed as well.
Manually stop some of the Brokers. The Partition / Replica distribution of topic1 is then as shown below. As the figure shows, because Brokers 1/2/4 are stopped, the Leader of Partition 0 changes from 1 to 3, the Leader of Partition 1 from 2 to 5, the Leader of Partition 2 from 3 to 6, and the Leader of Partition 3 from 4 to 7.
Restart the Broker with ID 1. The Partition / Replica distribution of topic1 is then as follows. Although Broker 1 has started (1 appears in the ISR of Partition 0 and Partition 5), it is not the Leader of any Partition, while Brokers 5/6/7 are each the Leader of 2 Partitions. The Leader distribution is therefore uneven: a Broker is the Leader of at most 2 Partitions and of at least 0.
After running the tool, the Partition / Replica distribution of topic1 is as shown below. As the figure shows, except for Partition 1 and Partition 3, whose Leaders are not their Preferred Replicas because Broker 2 and Broker 4 have not started yet, the Leader of every other Partition is its Preferred Replica. Compared with before the tool was run, the Leader distribution is more even: a Broker is the Leader of at most 2 Partitions and of at least 1.
Start Broker 2 and Broker 4. The Leader distribution does not change compared with the previous step, as shown in the figure below.
Run the tool again. The Leader of every Partition is now its Preferred Replica, and the Leaders are distributed more evenly: each Broker is the Leader of exactly one Partition.
Besides running the tool manually to even out the Leader distribution, Kafka also provides automatic Leader balancing, which is enabled by setting auto.leader.rebalance.enable to true. The Controller then periodically checks whether the Leader distribution is balanced, and if the imbalance exceeds a threshold, it automatically tries to set the Leader of each Partition to its Preferred Replica. The check interval is specified by leader.imbalance.check.interval.seconds and the imbalance threshold by leader.imbalance.per.broker.percentage.
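One plausible way to picture the imbalance check is sketched below. It is an assumption of this sketch, not something the text specifies, that imbalance is measured per Broker as the share of Partitions for which the Broker is the Preferred Replica but not the current Leader, and that this share is compared against the configured percentage. The function name imbalanced_brokers is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def imbalanced_brokers(assignment: Dict[int, List[int]],
                       leaders: Dict[int, int],
                       threshold_percent: float) -> List[int]:
    """Return Brokers whose share of 'preferred but not Leader' Partitions
    exceeds threshold_percent (the measure itself is an assumption)."""
    preferred_count: Dict[int, int] = defaultdict(int)
    not_led: Dict[int, int] = defaultdict(int)
    for partition, replicas in assignment.items():
        preferred = replicas[0]          # first Replica in AR
        preferred_count[preferred] += 1
        if leaders[partition] != preferred:
            not_led[preferred] += 1
    return sorted(b for b in preferred_count
                  if 100.0 * not_led[b] / preferred_count[b] > threshold_percent)


assignment = {0: [1, 2, 3], 1: [2, 3, 4], 2: [3, 4, 5]}
leaders = {0: 3, 1: 2, 2: 3}             # Partition 0 is led by Broker 3, not 1
print(imbalanced_brokers(assignment, leaders, threshold_percent=10))
```

For the Brokers returned, a preferred replica election like the one run manually above would be the natural follow-up.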
2.4 Kafka Reassign Partitions Tool
Purpose
The design goal of this tool is similar to that of the Preferred Replica Leader Election Tool: both aim to balance the load of a Kafka cluster. The difference is that the Preferred Replica Leader Election Tool can only adjust the Leader within a Partition's AR to even out the Leader distribution, while this tool can also change the AR of a Partition.
A Follower has to fetch data from the Leader to stay in sync with it, so balancing only the Leader distribution is not enough to balance the load of the whole cluster. In addition, in production you may need to expand the Kafka cluster as the load grows. Adding Brokers to a Kafka cluster is simple, but the Partitions of existing Topics are not migrated to the newly added Brokers automatically, and this tool can be used for that purpose. In some scenarios the actual load may be far lower than originally expected. The tool can then be used to move Partitions that are spread over the whole cluster onto a subset of the machines, after which the unneeded Brokers can be stopped to save resources.
Note that the tool can change not only which Brokers are in a Partition's AR but also the number of Replicas in AR, i.e. the replication factor of the Topic.
Principle
The tool is only responsible for writing the required information into the corresponding Zookeeper node and then exits. It does not carry out the actual operations; all adjustments are completed by the Controller.
1. Create the /admin/reassign_partitions node in Zookeeper and store in it the list of target Partitions and the target AR list of each.
2. The Watch that the Controller registered on /admin/reassign_partitions fires, and the Controller obtains the list.
3. For every Partition in the list, the Controller does the following:
- Start the Replicas in RAR - AR, i.e. the newly assigned Replicas (RAR = Reassigned Replicas, AR = Assigned Replicas)
- Wait for the new Replicas to sync with the Leader
- If the Leader is not in RAR, elect a new Leader from RAR
- Stop and delete the Replicas in AR - RAR, i.e. the Replicas that are no longer needed
- Delete the /admin/reassign_partitions node
Usage
The tool has three modes:
- generate mode: given the Topics to reassign, automatically generate a reassign plan (without executing it)
- execute mode: reassign Partitions according to the specified reassign plan
- verify mode: verify whether the Partition reassignment succeeded
The following example uses the tool to reassign all Partitions of topic1 to Brokers 4/5/6/7, in these steps:
1. Use generate mode to generate a reassign plan. Specify the Topic to reassign ({"topics": [{"topic": "topic1"}], "version": 1}), store it in the file /tmp/topics-to-move.json, and then run
$KAFKA_HOME/bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file /tmp/topics-to-move.json --broker-list "4,5,6,7" --generate
The result is shown in the figure below.
2. Use execute mode to execute the reassign plan. Save the reassign plan generated in the previous step to the file /tmp/reassign-plan.json and run
$KAFKA_HOME/bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file /tmp/reassign-plan.json --execute
At this point the /admin/reassign_partitions node is created in Zookeeper, and its value matches the content of the file /tmp/reassign-plan.json.
3. Use verify mode to verify that the reassignment is complete. Run the verify command
$KAFKA_HOME/bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --verify --reassignment-json-file /tmp/reassign-plan.json
The result is shown below. As the figure shows, all Partitions of topic1 were reassigned successfully.
Then verify again with the Topic Tool.
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic topic1
The result is shown below. As the figure shows, all Partitions of topic1 have been reassigned to Brokers 4/5/6/7, and the AR of each Partition matches the reassign plan.
Note that it is not necessary to generate the reassign plan with generate mode before using execute mode; generate mode is only a convenience. In some scenarios the plan produced by generate mode does not meet the requirements, and the user can then write their own reassign plan.
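A hand-written reassign plan is just a JSON document, so it can also be produced by a short script. The sketch below builds one for topic1 on Brokers 4/5/6/7. The round-robin placement is an assumption made for illustration and is not the algorithm used by generate mode, and build_plan is a hypothetical helper. The JSON layout (version, partitions, and topic / partition / replicas per entry) is the one the tool reads from --reassignment-json-file.

```python
import json
from typing import Dict, List


def build_plan(topic: str, num_partitions: int,
               brokers: List[int], replication_factor: int) -> Dict:
    """Build a reassign plan that spreads Replicas round-robin over brokers."""
    partitions = []
    for p in range(num_partitions):
        # Assumption: Replica i of Partition p goes to brokers[(p + i) % n].
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = build_plan("topic1", num_partitions=8, brokers=[4, 5, 6, 7],
                  replication_factor=3)
print(json.dumps(plan, indent=2))
```

Saving the printed JSON as /tmp/reassign-plan.json would let it be passed to execute mode in place of a generated plan.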
2.5 State Change Log Merge Tool
Purpose
This tool collects the state change logs from the Brokers of the whole cluster and produces one consolidated, formatted log to help diagnose failures related to state changes. Each Broker stores the state-change instructions it receives in a log file named state-change.log. In some cases the Leader Election of a Partition may go wrong, and a global view of the state changes across the whole cluster is then needed to diagnose and fix the problem. The tool merges the relevant state-change.log files of the cluster in chronological order. It also lets the user supply a time range and target Topics and Partitions as filters, and finally outputs the formatted result.
Usage
bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger --logs /opt/kafka_2.11-0.8.2.1/logs/state-change.log --topic topic1 --partitions 0,1,2,3,4,5,6,7
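The core idea of the merge, combining several already time-ordered logs into one chronological stream and then filtering, can be sketched in Python. This is not the StateChangeLogMerger implementation. It assumes that every line starts with a bracketed timestamp such as [2015-03-01 10:00:00,123], which sorts correctly as plain text, and the substring filter on the Topic name is a simplification. The function name merge_logs and the sample lines are hypothetical.

```python
import heapq
from typing import List


def merge_logs(logs: List[List[str]], topic: str) -> List[str]:
    """Merge per-Broker logs chronologically and keep lines mentioning topic.

    Each input list must already be in time order, as a log file is.
    """
    # Characters 1..23 hold the timestamp inside the leading brackets.
    merged = heapq.merge(*logs, key=lambda line: line[1:24])
    return [line for line in merged if topic in line]


broker1 = [
    "[2015-03-01 10:00:00,100] TRACE broker 1 state change for topic1 partition 0",
    "[2015-03-01 10:00:02,000] TRACE broker 1 state change for topic2 partition 0",
]
broker2 = [
    "[2015-03-01 10:00:01,500] TRACE broker 2 state change for topic1 partition 0",
]
for line in merge_logs([broker1, broker2], topic="topic1"):
    print(line)
```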