Kafka Study Notes: Kafka High Availability (Part 2)

0x00 Summary

  Building on the previous article, this post explains Kafka's HA mechanism in more depth. It walks through the main HA-related scenarios, such as Broker failover, Controller failover, Topic creation/deletion, and Broker startup, and describes in detail how a Follower fetches data from the Leader. It also introduces the replication-related tools Kafka ships with, such as the partition-reassignment tool.

 

0x01 Broker Failover Process

1.1 How the Controller Handles Broker Failure

    1. The Controller registers a Watch on ZooKeeper's /brokers/ids node. Once a Broker goes down ("down" here means any scenario in which Kafka considers the Broker dead, including but not limited to machine power-off, network unavailability, a GC-induced stop-the-world pause, a process crash, etc.), its corresponding ephemeral znode is deleted automatically, ZooKeeper fires the Watch registered by the Controller, and the Controller obtains the latest list of surviving Brokers.
    2. The Controller determines set_p, the set of all Partitions hosted on the failed Brokers.
    3. For each Partition in set_p:
        3.1 Read the Partition's current ISR from /brokers/topics/[topic]/partitions/[partition]/state.
        3.2 Determine the new Leader of the Partition. If at least one Replica of the current ISR is still alive, select one of them as the new Leader; the new ISR contains all surviving Replicas of the current ISR. Otherwise, select any surviving Replica of the Partition as the new Leader and ISR (data may be lost in this case). If all Replicas of the Partition are down, the new Leader is set to -1. A sketch of this selection rule follows the figure note below.
        3.3 Write the new Leader, ISR, leader_epoch and controller_epoch to /brokers/topics/[topic]/partitions/[partition]/state. Note that this write is performed only if the Controller version has not changed between steps 3.1 and 3.3; otherwise jump back to 3.1.
    4. Send a LeaderAndIsrRequest RPC directly to the Brokers involved in set_p. The Controller can batch multiple commands into a single RPC operation to improve efficiency.
        The Broker failover sequence is shown in the figure below.
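
The leader-selection rule in step 3.2 can be illustrated with a small, self-contained Python sketch. It only models the rule described above; it is not the Controller's actual Scala code, and all names are illustrative.

def elect_leader_on_broker_failure(current_isr, assigned_replicas, live_brokers):
    # Normal case: at least one member of the old ISR survived.
    surviving_isr = [r for r in current_isr if r in live_brokers]
    if surviving_isr:
        # New leader is any survivor of the ISR; new ISR is every surviving ISR member.
        return surviving_isr[0], surviving_isr
    # Unclean case: no ISR member survived, so pick any live replica of the partition
    # (messages acknowledged only by the old ISR may be lost).
    surviving_ar = [r for r in assigned_replicas if r in live_brokers]
    if surviving_ar:
        return surviving_ar[0], [surviving_ar[0]]
    # Every replica of the partition is down.
    return -1, []

# Brokers 1 and 2 just died; broker 3 is the only surviving ISR member and becomes leader.
print(elect_leader_on_broker_failure(current_isr=[1, 2, 3], assigned_replicas=[1, 2, 3], live_brokers={3, 4, 5}))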

 

 The structure of LeaderAndIsrRequest is as follows.

 

 The structure of LeaderAndIsrResponse is as follows.

 

 

1.2 Topic Creation / Deletion

    1. The Controller registers a Watch on ZooKeeper's /brokers/topics node. Once a Topic is created or deleted, the Controller learns of the newly created/deleted Topic's Partition/Replica assignment through that Watch.
    2. For a Topic deletion, the topic tool stores the Topic name under /admin/delete_topics. If delete.topic.enable is true, the Watch the Controller registered on /admin/delete_topics fires and the Controller sends a StopReplicaRequest to the corresponding Brokers in the callback; if it is false, the Controller does not register a Watch on /admin/delete_topics and does not react to the event, so the deletion is only recorded, not executed.
    3. For a Topic creation, the Controller reads the current list of all available Brokers from /brokers/ids, and for each Partition in set_p:
        3.1 From all Replicas assigned to the Partition (the AR), select any available Broker as the new Leader, and set the new ISR to the AR (because the Topic is newly created, none of the AR's Replicas holds any data yet; they are all considered in sync, i.e. all are in the ISR, and any of them can serve as Leader). A small sketch of this assignment follows the figure note below.
        3.2 Write the new Leader and ISR to /brokers/topics/[topic]/partitions/[partition].
    4. Send a LeaderAndIsrRequest directly to the relevant Brokers via RPC.
        The Topic creation sequence is shown in the figure below.
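
The initial assignment in step 3.1 can be sketched as follows. This is a minimal illustration under the assumption that the first live replica of the AR is picked as leader; it is not Kafka's actual code.

def initial_leader_and_isr(ar, live_brokers):
    # For a brand-new topic no replica has any data yet, so all of the AR is
    # treated as in sync and the ISR is simply the AR.
    live_ar = [b for b in ar if b in live_brokers]
    if not live_ar:
        return -1, []            # no live replica available yet
    return live_ar[0], list(ar)  # leader: any live replica (here the first); ISR: the whole AR

print(initial_leader_and_isr(ar=[2, 3, 5], live_brokers={1, 2, 3, 4, 5}))  # (2, [2, 3, 5])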

 

 

1.3 How the Broker Handles Requests

  The Broker accepts and responds to all kinds of requests through kafka.network.SocketServer and its associated modules. The whole network module is built on Java NIO and uses the Reactor pattern: it contains one Acceptor responsible for accepting client connections, N Processors responsible for reading and writing data, and M Handlers for the business logic.
  The Acceptor's main responsibility is to listen for and accept connection requests from clients (the request initiators, including but not limited to Producers, Consumers, the Controller and admin tools), establish the data channel with the client, and then hand the client over to a Processor. At that point the Acceptor is finished with that client's connection request and can respond to the next client's request. The core code is as follows.

 

   The Processor is responsible for reading data from clients and returning responses to them; it does not implement any business logic itself. Internally it maintains a queue holding all the SocketChannels assigned to it. In its run loop it takes SocketChannels from the queue, registers them with its selector for SelectionKey.OP_READ, and then keeps processing channels that are ready for reading (requests) or writing (responses). After reading the data, the Processor wraps it into a Request object and hands it to the RequestChannel.
  The RequestChannel is where Processors and KafkaRequestHandlers exchange data. It contains a requestQueue, into which Processors put the Requests they have read and from which KafkaRequestHandlers take Requests to process; it also contains a responseQueue, into which KafkaRequestHandlers put the Responses of processed Requests to be returned to clients.
  The Processor takes the Responses stored in the RequestChannel's responseQueue out via its processNewResponses method and registers the corresponding SelectionKey.OP_WRITE event with the selector. When the selector's select method reports a channel ready for writing, the write method is called to send the Response back to the client.
  The KafkaRequestHandler loops, taking Requests from the RequestChannel and handing them to kafka.server.KafkaApis for the actual business logic. A minimal sketch of this Acceptor / Processor / Handler structure follows.
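
Below is a minimal, self-contained Python sketch of the Acceptor / Processor / Handler split described above. It only models the structure (one acceptor, several selector-driven processor loops, a shared request queue and a pool of handlers); it is not Kafka's SocketServer code, and the response path is simplified (responses are written directly instead of via an OP_WRITE registration).

import queue
import selectors
import socket
import threading
import time

class Processor(threading.Thread):
    """Reads requests from its sockets and sends back responses (no business logic)."""
    def __init__(self, request_queue):
        super().__init__(daemon=True)
        self.new_connections = queue.Queue()  # sockets handed over by the Acceptor
        self.responses = queue.Queue()        # (sock, data) pairs produced by handlers
        self.request_queue = request_queue    # shared analogue of RequestChannel.requestQueue
        self.selector = selectors.DefaultSelector()

    def accept(self, sock):
        self.new_connections.put(sock)

    def run(self):
        while True:
            while not self.new_connections.empty():   # register new sockets for reads
                sock = self.new_connections.get()
                sock.setblocking(False)
                self.selector.register(sock, selectors.EVENT_READ)
            while not self.responses.empty():          # simplified "processNewResponses"
                sock, data = self.responses.get()
                sock.sendall(data)
            if not self.selector.get_map():            # nothing registered yet
                time.sleep(0.05)
                continue
            for key, _ in self.selector.select(timeout=0.1):
                sock = key.fileobj
                data = sock.recv(4096)
                if data:
                    self.request_queue.put((self, sock, data))  # hand request to handlers
                else:
                    self.selector.unregister(sock)
                    sock.close()

def handler_loop(request_queue):
    """KafkaRequestHandler analogue: take a request, run 'business logic', queue the response."""
    while True:
        processor, sock, data = request_queue.get()
        processor.responses.put((sock, b"echo: " + data))

def acceptor(server, processors):
    """Accept connections and assign them to processors round-robin."""
    i = 0
    while True:
        sock, _ = server.accept()
        processors[i % len(processors)].accept(sock)
        i += 1

if __name__ == "__main__":
    requests = queue.Queue()
    processors = [Processor(requests) for _ in range(3)]
    for p in processors:
        p.start()
    for _ in range(2):
        threading.Thread(target=handler_loop, args=(requests,), daemon=True).start()
    acceptor(socket.create_server(("127.0.0.1", 9099)), processors)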

 

1.4 LeaderAndIsrRequest Handling

  When a Broker receives a LeaderAndIsrRequest, it processes it mainly through ReplicaManager's becomeLeaderOrFollower method. The flow is as follows:

  1. If the controllerEpoch in the request is less than the current latest controllerEpoch, return ErrorMapping.StaleControllerEpochCode directly.
  2. For each element of the request's partitionStateInfos, i.e. ((topic, partitionId), partitionStateInfo):
      2.1 If the leader epoch in partitionStateInfo is greater than the leader epoch currently stored in the ReplicaManager for the corresponding (topic, partitionId) partition, then:
        2.1.1 If the current broker id (i.e. replica id) is in partitionStateInfo, put the partition and partitionStateInfo into a HashMap named partitionState.
        2.1.2 Otherwise, the Partition is not assigned to this Broker's Replica list; record this information in the log.
      2.2 Otherwise, put the corresponding error code (ErrorMapping.StaleLeaderEpochCode) into the Response.
  3. From partitionState, select all records whose Leader equals the current Broker ID and store them in partitionsTobeLeader; store the remaining records in partitionsToBeFollower.
  4. If partitionsTobeLeader is not empty, invoke the makeLeaders method on it.
  5. If partitionsToBeFollower is not empty, invoke the makeFollowers method on it.
  6. If the highwatermark checkpoint thread has not started yet, start it and set hwThreadInitialized to true.
  7. Shut down all Fetchers in the Idle state.

  The LeaderAndIsrRequest processing flow is shown in the figure below, followed by a sketch of steps 1 through 5.
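
As a complement to the figure, here is a small, self-contained Python sketch of steps 1 through 5. It models the control flow only; Kafka's real implementation is Scala code inside ReplicaManager, and every name below is illustrative.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]

@dataclass
class PartitionStateInfo:
    leader: int
    leader_epoch: int
    isr: List[int]
    replicas: List[int]

@dataclass
class LeaderAndIsrRequest:
    controller_epoch: int
    partition_state_infos: Dict[TopicPartition, PartitionStateInfo]

@dataclass
class ReplicaManagerStub:
    broker_id: int
    controller_epoch: int = 0
    leader_epochs: Dict[TopicPartition, int] = field(default_factory=dict)
    hw_thread_initialized: bool = False

    def become_leader_or_follower(self, req: LeaderAndIsrRequest) -> Dict:
        # 1. reject the whole request if it carries a stale controller epoch
        if req.controller_epoch < self.controller_epoch:
            return {"error": "StaleControllerEpochCode"}
        response: Dict = {}
        partition_state: Dict[TopicPartition, PartitionStateInfo] = {}
        # 2. keep only partitions with a newer leader epoch that involve this broker
        for tp, info in req.partition_state_infos.items():
            if info.leader_epoch > self.leader_epochs.get(tp, -1):
                if self.broker_id in info.replicas:
                    partition_state[tp] = info
                # else: partition not assigned to this broker, just log it
            else:
                response[tp] = "StaleLeaderEpochCode"
        # 3.-5. split into leaders / followers and apply the role changes
        leaders = {tp: s for tp, s in partition_state.items() if s.leader == self.broker_id}
        followers = {tp: s for tp, s in partition_state.items() if s.leader != self.broker_id}
        if leaders:
            print("makeLeaders:", sorted(leaders))
        if followers:
            print("makeFollowers:", sorted(followers))
        # 6. start the high-watermark checkpoint thread on the first call
        if not self.hw_thread_initialized:
            self.hw_thread_initialized = True
        return response

rm = ReplicaManagerStub(broker_id=1, controller_epoch=5)
req = LeaderAndIsrRequest(5, {("topic1", 0): PartitionStateInfo(leader=1, leader_epoch=3, isr=[1, 2], replicas=[1, 2, 3])})
print(rm.become_leader_or_follower(req))   # broker 1 becomes leader of topic1-0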

 

 

1.5 Broker Startup Process

  When a Broker starts, it first creates an ephemeral child node for its ID under ZooKeeper's /brokers/ids. Once the node is created successfully, the Broker Change Watch that the Controller's ReplicaStateMachine registered there fires, and the KafkaController.onBrokerStartup callback completes the following steps:

  1. Send an UpdateMetadataRequest to all newly started Brokers; its structure is defined as follows.

 

  2. Set all Replicas on the newly started Brokers to the OnlineReplica state, and these Brokers start the high watermark thread for the corresponding Partitions.

  3. Trigger OnlinePartitionStateChange through the partitionStateMachine.

 

1.6 Controller Failover

  The Controller also needs failover. Every Broker registers a Watch on the Controller Path (/controller). When the current Controller fails, the Controller Path disappears automatically (it is an ephemeral node), the Watch fires, and all "live" Brokers compete to become the new Controller by creating a new Controller Path, but only one of them will succeed (this is guaranteed by ZooKeeper). The winner becomes the new Controller, and the losers re-register their Watch on the new Controller Path; because a ZooKeeper Watch is one-shot (it is consumed once it fires), re-registration is necessary. A sketch of this election follows.
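
Below is a minimal sketch of this election mechanism using the Python kazoo ZooKeeper client. It illustrates the idea only (ephemeral /controller node, one-shot watch, re-registration); the node payload, broker id and error handling are assumptions, and this is not Kafka's own code.

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

BROKER_ID = 1
CONTROLLER_PATH = "/controller"

zk = KazooClient(hosts="localhost:2181")
zk.start()

def on_controller_change(event):
    # The watch is one-shot: it fires once when /controller changes or disappears,
    # so we try to become controller again and then re-register the watch.
    elect()

def elect():
    try:
        # Ephemeral node: it disappears automatically if this broker's session dies.
        zk.create(CONTROLLER_PATH, str(BROKER_ID).encode(), ephemeral=True)
        print(f"broker {BROKER_ID} is now the controller")
        # here the real controller would run onControllerFailover()
    except NodeExistsError:
        print(f"another broker won the election; broker {BROKER_ID} stays a follower")
    # Re-register the (one-shot) watch on the controller path in either case.
    zk.exists(CONTROLLER_PATH, watch=on_controller_change)

elect()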

After winning the election, the Broker triggers the KafkaController.onControllerFailover method, which completes the following steps (a sketch of the sequence follows the list):

  1. Read and increment the Controller Epoch.
  2. Register a Watch on the ReassignedPartitions Path (/admin/reassign_partitions).
  3. Register a Watch on the PreferredReplicaElection Path (/admin/preferred_replica_election).
  4. Through the partitionStateMachine, register a Watch on the Broker Topics Path (/brokers/topics).
  5. If delete.topic.enable is set to true (the default is false), the partitionStateMachine registers a Watch on the Delete Topic Path (/admin/delete_topics).
  6. Through the replicaStateMachine, register a Watch on the Broker Ids Path (/brokers/ids).
  7. Initialize the ControllerContext object, populating it with the current list of all Topics, the "live" Brokers, the Leader and ISR of every Partition, and so on.
  8. Start the replicaStateMachine and the partitionStateMachine.
  9. Set brokerState to RunningAsController.
  10. Send the leadership information of every Partition to all "live" Brokers.
  11. If auto.leader.rebalance.enable is configured to true (the default is true), start the partition-rebalance thread.
  12. If delete.topic.enable is set to true and the Delete Topic Path (/admin/delete_topics) contains entries, delete the corresponding Topics.
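
The following self-contained Python sketch replays the steps above in order. It is purely illustrative: the real logic lives in KafkaController (Scala), and every method name here is an assumption.

class ControllerStub:
    def __init__(self, config):
        self.config = config
        self.epoch = 0
        self.watches = []
        self.broker_state = "RunningAsBroker"

    def register_watch(self, path):
        self.watches.append(path)

    def on_controller_failover(self):
        self.epoch += 1                                           # 1. read + increment controller epoch
        self.register_watch("/admin/reassign_partitions")         # 2.
        self.register_watch("/admin/preferred_replica_election")  # 3.
        self.register_watch("/brokers/topics")                    # 4.
        if self.config.get("delete.topic.enable", False):         # 5. only if topic deletion is enabled
            self.register_watch("/admin/delete_topics")
        self.register_watch("/brokers/ids")                       # 6.
        print("7.  initialize ControllerContext (topics, live brokers, leaders/ISR)")
        print("8.  start replicaStateMachine and partitionStateMachine")
        self.broker_state = "RunningAsController"                 # 9.
        print("10. send leadership info for every partition to live brokers")
        if self.config.get("auto.leader.rebalance.enable", True):
            print("11. start partition-rebalance thread")
        if self.config.get("delete.topic.enable", False):
            print("12. resume pending topic deletions, if any")

ControllerStub({"auto.leader.rebalance.enable": True}).on_controller_failover()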

 

1.7 Partition Reassignment

After the administrator issues a partition-reassignment request through the management tool, the tool writes the corresponding information to /admin/reassign_partitions, which triggers the ReassignedPartitionsIsrChangeListener, and the KafkaController.onPartitionReassignment callback then performs the following operations:

    1. Update the AR (Current Assigned Replicas) in ZooKeeper to OAR (Original list of replicas for the partition) + RAR (Reassigned replicas).
    2. Increase the leader epoch in ZooKeeper and send a LeaderAndIsrRequest to every Replica in the AR.
    3. Set the Replicas in RAR - OAR to the NewReplica state.
    4. Wait until all Replicas in RAR are in sync with their Leader.
    5. Set all Replicas in RAR to the OnlineReplica state.
    6. Set the AR in the in-memory cache to RAR.
    7. If the Leader is not in RAR, re-elect a new Leader from RAR and send a LeaderAndIsrRequest. If the new Leader was not elected from RAR, also increase the leader epoch in ZooKeeper.
    8. Set all Replicas in OAR - RAR to the OfflineReplica state. This consists of two parts: first, remove OAR - RAR from the ISR in ZooKeeper and send a LeaderAndIsrRequest to the Leader to notify it that these Replicas have been removed from the ISR; second, send a StopReplicaRequest to the Replicas in OAR - RAR so that they stop replicating Partitions no longer assigned to them.
    9. Set all Replicas in OAR - RAR to the NonExistentReplica state, which removes them from disk.
    10. Set the AR in ZooKeeper to RAR.
    11. Delete /admin/reassign_partitions.
        
      Note: the AR in ZooKeeper is updated only near the very end, because ZooKeeper is the only place where the AR is stored persistently; if the Controller crashes before that step, the new Controller can still resume and complete the process.
        The following example reassigns a Partition with OAR = {1,2,3} and RAR = {4,5,6}; the AR and the Leader/ISR path in ZooKeeper change during the reassignment as shown below. A sketch of the step sequence also follows.
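
The step sequence can be traced with a short, self-contained Python sketch that tracks only the replica sets (ZooKeeper writes and RPCs are replaced by prints). It illustrates the order of operations described above, not the Controller's real code.

def on_partition_reassignment(oar, rar, leader):
    ar = sorted(set(oar) | set(rar))
    print("1-2. ZK AR =", ar, "(OAR+RAR); bump leader epoch; LeaderAndIsrRequest to AR")
    print("3.  ", sorted(set(rar) - set(oar)), "(RAR-OAR) -> NewReplica")
    print("4-5. wait for RAR to catch up with the leader, then RAR -> OnlineReplica")
    print("6.   cached AR =", sorted(rar))
    if leader not in rar:
        leader = sorted(rar)[0]   # any in-sync replica from RAR
        print("7.   leader not in RAR, new leader =", leader, "(send LeaderAndIsrRequest)")
    removed = sorted(set(oar) - set(rar))
    print("8.  ", removed, "(OAR-RAR) -> OfflineReplica (shrink ISR, StopReplicaRequest)")
    print("9.  ", removed, "(OAR-RAR) -> NonExistentReplica (deleted from disk)")
    print("10.  ZK AR =", sorted(rar))
    print("11.  delete /admin/reassign_partitions")
    return leader

on_partition_reassignment(oar=[1, 2, 3], rar=[4, 5, 6], leader=1)   # matches the example above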

 

 

1.8 How a Follower Fetches Data from the Leader

  A Follower fetches messages from the Leader by sending a FetchRequest, whose structure is as follows.

 

   As the structure shows, each fetch request specifies a maximum wait time and a minimum number of bytes to fetch, together with a Map from TopicAndPartition to PartitionFetchInfo. In fact, both a Follower fetching data from the Leader and a Consumer fetching data from a Broker go through FetchRequest, which is why the structure contains a clientId field, whose default value is ConsumerConfig.DefaultClientId.

   After the Leader receives a fetch request, Kafka handles it in KafkaApis.handleFetchRequest. The handling process is as follows:

  1. Read the requested data through the replicaManager and store it in dataRead.
  2. If the request comes from a Follower, update its LEO (log end offset) and the corresponding Partition's High Watermark.
  3. Compute the length (in bytes) of the readable messages in dataRead and store it in bytesReadable.
  4. If any one of the following four conditions holds, return the data immediately:
  • the fetch request does not want to wait, i.e. fetchRequest.maxWait <= 0
  • the fetch request does not ask for any data, i.e. fetchRequest.numPartitions <= 0 (requestInfo is empty)
  • there is already enough data to return, i.e. bytesReadable >= fetchRequest.minBytes
  • an exception occurred while reading the data
  5. If none of the four conditions holds, the FetchRequest does not return immediately; instead it is wrapped into a DelayedFetch. The DelayedFetch is then checked: if it is already satisfied, the request is answered; otherwise it is added to the watch list. A sketch of this decision follows.
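
The "return now or delay" decision in steps 4 and 5 can be modeled with a tiny self-contained Python function. It captures only the four conditions listed above; the field names are simplified and this is not the real KafkaApis code.

from dataclasses import dataclass

@dataclass
class FetchRequest:
    max_wait_ms: int
    min_bytes: int
    num_partitions: int

def should_respond_immediately(req: FetchRequest, bytes_readable: int, read_error: bool) -> bool:
    return (
        req.max_wait_ms <= 0                 # the request does not want to wait
        or req.num_partitions <= 0           # the request asks for nothing (empty requestInfo)
        or bytes_readable >= req.min_bytes   # enough data has accumulated already
        or read_error                        # an error occurred while reading
    )

# Not enough bytes yet and the request is willing to wait -> it becomes a DelayedFetch.
print(should_respond_immediately(FetchRequest(500, 1024, 3), bytes_readable=100, read_error=False))   # False
print(should_respond_immediately(FetchRequest(500, 1024, 3), bytes_readable=2048, read_error=False))  # True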

  The Leader returns messages to the Follower in the form of a FetchResponse, whose structure is as follows.

 

 

0x02 Replication Tools

2.1 Topic Tool

  $KAFKA_HOME/bin/kafka-topics.sh can be used to create, delete, modify and describe a Topic, or to list all Topics. In addition, the tool can modify the following per-topic configurations.

unclean.leader.election.enable
delete.retention.ms
segment.jitter.ms
retention.ms
flush.ms
segment.bytes
flush.messages
segment.ms
retention.bytes
cleanup.policy
segment.index.bytes
min.cleanable.dirty.ratio
max.message.bytes
file.delete.delay.ms
min.insync.replicas
index.interval.bytes

 

2.2 Replica Verification Tool

$KAFKA_HOME/bin/kafka-replica-verification.sh is used to verify that all Replicas of every Partition of one or more specified Topics are in sync. The topic-white-list parameter specifies the Topics to verify and supports regular expressions.

 

2.3 Preferred Replica Leader Election Tool

Purpose
  With the replication mechanism, each Partition may have multiple replicas. The list of Replicas of a Partition is called the AR (Assigned Replicas), and the first Replica in the AR is the "Preferred Replica". When a new Topic is created or Partitions are added to an existing Topic, Kafka guarantees that the Preferred Replicas are evenly distributed across all Brokers in the cluster. Ideally, the Preferred Replica is elected as the Leader. These two points together ensure that the Leaders of all Partitions are evenly distributed across the cluster, which is very important because all reads and writes go through the Leader: if Leaders are concentrated on a few Brokers, the cluster load becomes unbalanced. However, as the cluster runs, this balance may be broken by Broker failures; this tool is used to help restore the balance of the Leader distribution.
  In fact, after recovering from a failure a Replica is given the Follower role by default, unless all Replicas of a Partition went down and the Broker in the current AR is the first Replica of that Partition to come back. Therefore, after the Leader (i.e. the Preferred Replica) of a Partition goes down and recovers, it will most likely no longer be that Partition's Leader, although it is still the Preferred Replica.
  
Principle

  1. The tool creates the /admin/preferred_replica_election node on ZooKeeper and stores in it the information about the Partitions whose Preferred Replica needs to be adjusted.
  2. The Controller keeps a Watch on that node; once the node is created, the Controller is notified and obtains its content.
  3. The Controller reads the Preferred Replica. If it finds that this Replica is not the current Leader but is in the Partition's ISR, the Controller sends a LeaderAndIsrRequest to that Replica so that it becomes the Leader. If the Replica is neither the current Leader nor in the ISR, the Controller will not make it the Leader, so as to guarantee that no data is lost. A sketch of this check follows.
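
Step 3's rule can be illustrated with a tiny self-contained Python sketch (the preferred replica is the first replica in the AR and is promoted only if it is in the ISR). This models the rule only, not the Controller's implementation.

def elect_preferred_replica(ar, isr, current_leader):
    preferred = ar[0]
    if preferred == current_leader:
        return current_leader   # already the leader, nothing to do
    if preferred in isr:
        return preferred        # safe: in sync, so send a LeaderAndIsrRequest to promote it
    return current_leader       # not in the ISR: keep the current leader to avoid data loss

print(elect_preferred_replica(ar=[1, 5, 6], isr=[5, 6], current_leader=5))     # 5: replica 1 not in ISR
print(elect_preferred_replica(ar=[1, 5, 6], isr=[1, 5, 6], current_leader=5))  # 1: preferred replica wins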

Usage

$KAFKA_HOME/bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181

  Create a Topic named topic1 on a Kafka cluster consisting of 8 Brokers, with a replication-factor of 3 and 8 Partitions, then use the command $KAFKA_HOME/bin/kafka-topics.sh --describe --topic topic1 --zookeeper localhost:2181 to inspect its Partition/Replica distribution.

  The result is shown below: all Replicas are evenly distributed across the cluster, and the Leaders are evenly distributed as well.

 

   Manually stop some of the Brokers; topic1's Partition/Replica distribution is shown below. Because Brokers 1, 2 and 4 were stopped, the Leader of Partition 0 changed from 1 to 3, the Leader of Partition 1 from 2 to 5, the Leader of Partition 2 from 3 to 6, and the Leader of Partition 3 from 4 to 7.

 

   Restart the Broker whose ID is 1; topic1's Partition/Replica distribution is now as follows. Although Broker 1 has been started (it is in the ISR of Partition 0 and Partition 5), it is not the Leader of any Partition, while Brokers 5, 6 and 7 are each the Leader of two Partitions, i.e. the Leader distribution is uneven: one Broker is the Leader of at most 2 Partitions while another is the Leader of none.

 

   After running the tool, topic1's Partition/Replica distribution is shown below. Apart from Partition 1 and Partition 3, whose Preferred Replicas are on Broker 2 and Broker 4, which have not been started yet and therefore are not the Leaders, every other Partition's Leader is its Preferred Replica. Compared with before running the tool, the Leader distribution is also more even: one Broker is the Leader of at most 2 Partitions and at least 1 Partition.

 

   Start Broker 2 and Broker 4; the Leader distribution does not change compared with the previous step, as shown in the figure below.

 

   Run the tool again: now every Partition's Leader is taken by its Preferred Replica, and the Leader distribution is even — each Broker is the Leader of exactly one Partition.
  
  Besides running the tool manually to even out the Leader distribution, Kafka also provides automatic Leader rebalancing, which is enabled by setting auto.leader.rebalance.enable to true. The Controller then periodically checks whether the Leader distribution is balanced; if the imbalance of some Broker exceeds a threshold, the Controller tries to set the Leader of each affected Partition back to its Preferred Replica. The check period is specified by leader.imbalance.check.interval.seconds and the imbalance threshold by leader.imbalance.per.broker.percentage. A sketch of the imbalance check follows.
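
A small self-contained sketch of the per-broker check is shown below. The imbalance ratio here is computed as the fraction of partitions whose preferred replica lives on the broker but whose current leader is some other broker; this formula and all names are assumptions for illustration only.

def imbalance_ratio(broker_id, assignments, leaders):
    """assignments: {partition: AR list}; leaders: {partition: current leader broker}."""
    preferred_here = [p for p, ar in assignments.items() if ar[0] == broker_id]
    if not preferred_here:
        return 0.0
    not_leading = [p for p in preferred_here if leaders[p] != broker_id]
    return len(not_leading) / len(preferred_here)

assignments = {0: [1, 5, 6], 1: [1, 6, 7]}   # broker 1 is the preferred replica of both partitions
leaders = {0: 5, 1: 1}                       # but currently leads only partition 1
ratio = imbalance_ratio(1, assignments, leaders)
threshold = 0.10                             # e.g. leader.imbalance.per.broker.percentage = 10 (percent)
print(ratio, "rebalance" if ratio > threshold else "balanced")   # 0.5 rebalance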

 

2.4 Kafka Reassign Partitions Tool

Purpose
  The design goal of this tool is somewhat similar to that of the Preferred Replica Leader Election Tool: both aim to make load balancing of a Kafka cluster easier. The difference is that the Preferred Replica Leader Election Tool can only adjust a Partition's Leader within its AR, while this tool can also adjust a Partition's AR itself.
  A Follower needs to fetch data from the Leader to stay in sync with it, so merely balancing the Leader distribution is not enough to balance the load of the whole cluster. Moreover, in a production environment, as the load grows the Kafka cluster may need to be expanded. Adding Brokers to a Kafka cluster is very simple, but Partitions of existing Topics are not automatically migrated to the newly added Brokers; this tool can be used for that. In some other scenarios the actual load turns out to be much smaller than originally anticipated; the tool can then be used to move Partitions spread over the whole cluster onto a few specific machines, after which the unneeded Brokers can be stopped to save resources.
  Incidentally, the tool can adjust not only the placement of a Partition's AR but also its size, i.e. change the replication factor of the Topic.
  
Principle
  This tool is only responsible for writing the required information into the corresponding ZooKeeper node and then exits; it does not perform the actual operations itself, which are all completed by the Controller.

  1. The tool creates the /admin/reassign_partitions node on ZooKeeper and stores in it the list of target Partitions and their corresponding target AR lists.
  2. The Watch the Controller registered on /admin/reassign_partitions fires, and the Controller obtains the list.
  3. For each Partition in the list, the Controller does the following:
  • start the Replicas in RAR - AR, i.e. the newly allocated Replicas (RAR = Reassigned Replicas, AR = Assigned Replicas)
  • wait for the new Replicas to sync up with the Leader
  • if the Leader is not in RAR, elect a new Leader from RAR
  • stop and delete the Replicas in AR - RAR, i.e. the Replicas that are no longer needed
  • delete the /admin/reassign_partitions node

Usage
  The tool has three modes of use:

  • generate mode: given the Topics to be reassigned, automatically generate a reassign plan (nothing is executed)
  • execute mode: reassign Partitions according to the specified reassign plan
  • verify mode: verify whether the Partition reassignment succeeded

  The following example uses the tool to reassign all Partitions of a Topic to Brokers 4/5/6/7, in the following steps:

    1. Use generate mode to generate the reassign plan. Specify the Topics to be reassigned ({ "topics": [ { "topic": "topic1"}], "version": 1}), store this in the file /tmp/topics-to-move.json, and then run
$KAFKA_HOME/bin/kafka-reassign-partitions.sh 
    --zookeeper localhost:2181 
    --topics-to-move-json-file /tmp/topics-to-move.json  
    --broker-list "4,5,6,7" --generate

The result is shown in the figure below.

 

     2. Use execute mode to execute the reassign plan.
      Save the reassignment plan generated in the previous step into the file /tmp/reassign-plan.json and run

$KAFKA_HOME/bin/kafka-reassign-partitions.sh 
--zookeeper localhost:2181     
--reassignment-json-file /tmp/reassign-plan.json --execute

 

 At this point, the /admin/reassign_partitions node on ZooKeeper has been created, and its value matches the contents of the /tmp/reassign-plan.json file.

 

     3. Use verify mode to verify that the reassignment has completed. Run the verify command

$KAFKA_HOME/bin/kafka-reassign-partitions.sh 
--zookeeper localhost:2181 --verify
--reassignment-json-file /tmp/reassign-plan.json

The result is shown below; it can be seen that all Partitions of topic1 were reassigned successfully.

 

   Then verify again with the Topic Tool.

bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic topic1

  The result is shown below: all Partitions of topic1 have been reassigned to Brokers 4/5/6/7, and the AR of each Partition is consistent with the reassign plan.

 

   Note that before using execute mode it is not mandatory to generate the reassign plan with generate mode; generate mode exists only for convenience. In fact, in some scenarios the plan produced by generate mode does not meet the requirements, and the user can then write their own reassign plan.

 

2.5 State Change Log Merge Tool

Purpose
  This tool collects the state-change logs from all Brokers of the cluster and produces a single merged, formatted log to help diagnose failures related to state changes. Each Broker stores the state-change instructions it receives in a log file named state-change.log. In some cases there may be a problem with a Partition's Leader election, and we then need a global view of the state changes across the whole cluster to diagnose and resolve the problem. The tool merges the relevant state-change.log files of the cluster in chronological order, supports a user-specified time range and target Topic/Partition as filters, and finally outputs the formatted result.
  
Usage

bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger
--logs /opt/kafka_2.11-0.8.2.1/logs/state-change.log
--topic topic1 --partitions 0,1,2,3,4,5,6,7

 

0x03 Reprint

 


Source: www.cnblogs.com/JetpropelledSnake/p/11611404.html