Controller mechanism

1 Introduction

The controller is a broker elected from the Kafka cluster. It is responsible for managing the state of topic partitions and their replicas, and for performing administrative tasks such as partition reassignment.

 

 

The first broker to start becomes the controller by creating an ephemeral node /controller in ZooKeeper. Brokers that start later also try to create this node, but fail with a "node already exists" error. Each of these brokers then sets a watch on the /controller node so that it is notified when the node's state changes (for example, when it is deleted). On such a notification the broker again tries to create the node. This guarantees that the cluster always has a controller.

 

When the broker hosting the controller crashes or loses its connection to ZooKeeper, the ephemeral node it created is automatically deleted. The other brokers watching the node are notified, and each tries to create the ephemeral /controller node; only one of them (the first to create it) succeeds, while the rest receive a "node already exists" error and re-install their watches on the node to keep monitoring its state. Each time a new controller is elected it is given a larger epoch number (implemented with ZooKeeper conditional updates), so the other nodes always know which controller is current.
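The epoch bump mentioned above (each new controller receives a strictly larger number via a ZooKeeper conditional update) can be sketched with an in-memory stand-in for a versioned znode. The class and function names here are illustrative, not Kafka's or ZooKeeper's actual API:

```python
class ZNode:
    """A stand-in for a ZooKeeper node with a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def conditional_update(self, new_value, expected_version):
        # Succeeds only if nobody else wrote since we read; this is how
        # each newly elected controller obtains a strictly larger epoch.
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True

def bump_epoch(node):
    """Read-modify-write the controller epoch until the CAS succeeds."""
    while True:
        current, version = node.value, node.version
        if node.conditional_update(current + 1, version):
            return current + 1

epoch_node = ZNode(value=0)
print(bump_epoch(epoch_node))  # first elected controller gets epoch 1
print(bump_epoch(epoch_node))  # the next controller gets epoch 2
```

Because the update is conditional on the version read, two concurrent controllers can never end up with the same epoch.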

 

When a broker leaves the Kafka cluster, the controller is notified (via its ZooKeeper path watches). If that broker was the leader for some partitions, the controller re-elects those partition leaders: it checks all partitions without a leader, decides who the new leader is (the simplest approach: pick the next replica in the partition's replica list), and sends requests to all brokers.

 

The request tells each new partition leader that it should begin serving produce and fetch requests from clients, and tells the followers that they should start replicating messages from the new leader.
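The simple strategy described above (elect the next replica in the partition's replica list) can be sketched as follows; the function name and signature are illustrative, not Kafka's actual API:

```python
def pick_new_leader(replicas, failed_broker, live_brokers):
    """Pick the next live replica after the failed leader.

    replicas: the partition's replica list, with the old leader first.
    live_brokers: the set of brokers currently alive in the cluster.
    """
    for broker in replicas:
        if broker != failed_broker and broker in live_brokers:
            return broker
    return None  # no live replica is left: the partition goes offline

# A partition replicated on brokers 1, 2, 3 whose leader (broker 1) died:
print(pick_new_leader([1, 2, 3], failed_broker=1, live_brokers={2, 3}))  # 2
```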

When a new broker joins the cluster, the controller checks whether that broker hosts any partition replicas. If it does, the controller notifies both the new and the existing brokers of the change, and the new broker starts replicating messages from the partition leaders.

The following sections describe the relevant internals of the controller.

2 Controller startup

2.1 Controller election

In KafkaServer.startup(), the KafkaController object is constructed; after KafkaApis and replicaManager are started, KafkaController.startup() is called.

 

 

The startup() function itself is very short.

 

 

Setting aside the isRunning assignment and the logging of state, two calls are worth a closer look. registerSessionExpirationListener() registers a listener that, after the ZooKeeper session expires and reconnects, unregisters the various listeners previously registered on ZooKeeper; and controllerElector.startup starts the election, which takes place in the ZookeeperLeaderElector class.

 

 

Every broker in the Kafka cluster calls startup(), but only one broker can become the controller. So which broker becomes the controller?

The KafkaController election is implemented directly with ZooKeeper: each broker tries to create the ephemeral path /controller/ and store its own brokerId there. If creating the path does not throw a ZkNodeExistsException, the broker has successfully become the controller. Besides calling elect, controllerElector.startup registers a listener on the /controller/ path, watching dataChange and dataDelete events. A data change under /controller/ means the controller has changed; and because the data under /controller/ is ephemeral, it is deleted when the controller fails, which triggers a dataDelete event, after which a new controller must be elected.
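The create-or-watch election can be sketched with an in-memory stand-in for ZooKeeper. Everything here (FakeZooKeeper, the callback-based watch) is a simulation of the mechanism, not Kafka's or ZooKeeper's real API:

```python
class ZkNodeExistsError(Exception):
    pass

class FakeZooKeeper:
    """Minimal stand-in: ephemeral nodes plus delete notifications."""
    def __init__(self):
        self.nodes = {}
        self.watchers = {}

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            raise ZkNodeExistsError(path)
        self.nodes[path] = data

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def delete(self, path):
        del self.nodes[path]
        for cb in self.watchers.pop(path, []):
            cb()

class Broker:
    def __init__(self, broker_id, zk):
        self.broker_id = broker_id
        self.zk = zk
        self.is_controller = False

    def elect(self):
        try:
            self.zk.create_ephemeral("/controller", self.broker_id)
            self.is_controller = True
        except ZkNodeExistsError:
            # Someone else won; watch the node and retry when it vanishes.
            self.is_controller = False
            self.zk.watch("/controller", self.elect)

zk = FakeZooKeeper()
brokers = [Broker(i, zk) for i in range(3)]
for b in brokers:
    b.elect()
print([b.broker_id for b in brokers if b.is_controller])  # [0]

# Simulate the controller crashing: its session ends, the ephemeral node
# is removed, and the watching brokers re-run the election.
brokers[0].is_controller = False
zk.delete("/controller")
print([b.broker_id for b in brokers if b.is_controller])  # [1]
```

Note how the losing brokers always re-install their watch inside elect(), which is exactly what keeps the cluster from ever being without a controller for long.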

 

 

2.2 Registering listeners

 

 

A very important task after becoming the controller is to register listeners on each critical ZooKeeper path. The controller-related paths are summarized below ([] denotes a value that varies with the actual situation):

  • /controller/ {"brokerid":"1"}: stores the brokerId of the current controller, which determines who the controller is for this term. The data is ephemeral and is deleted when the session expires. The LeaderChangeListener class watches dataChange and dataDelete events on this path.
  • /brokers/topics: the subdirectories here form the list of all topics. The TopicChangeListener class watches for changes to this list; when a new topic appears, onNewTopicCreation is called to create it.
  • /brokers/topics/[topic]/: stores the AR of each partition under the topic, in the format {"partitions":{"partitionId1":[broker1,broker2], …}}. The AddPartitionsListener class watches data changes on this path and calls Controller.onNewPartitionCreate when a partition is added, i.e., creates the new partition.
  • /brokers/topics/[topic]/partitions/[partitionId]/state/: stores each partition's leaderAndIsr information, i.e., the partition's current leaderId and ISR. The ReassignedPartitionsIsrChangeListener class watches data changes on this path; when replicas are reassigned to a partition, the controller must wait for the new replicas to catch up with the leader before performing the remaining steps.
  • /brokers/ids/[brokerId]/brokerInfoString: stores broker information; brokerInfoString includes the broker's IP, port, and so on. BrokerChangeListener watches subdirectory changes under /brokers/ids/ and notifies the controller when brokers come online or go offline. brokerInfoString is one of the conditions the controller uses to judge whether a broker is alive: liveBrokers in controllerContext requires that brokerInfo can be read from the corresponding path.
  • /admin/reassign_partitions: the path that directs AR reassignment; changing the AR with the command-line tool writes to this path. The PartitionsReassignedListener class watches content changes here and calls initiateReassignReplicasForTopicPartition to carry out the reassignment.
  • /admin/preferred_replica_election: lists partitions that should re-elect their first replica as leader, the so-called preferred replica. The PreferredReplicaElectionListener class watches this path and re-elects the preferred replica as leader for the partitions listed under it.

Terminology

  • AR: the currently assigned replica list
  • RAR: the replica list after reassignment
  • OAR: the replica list before reassignment
  • Partition leader: the node that serves client reads and writes for a given partition
  • ISR: the "in-sync" replicas, the set of replicas that can keep up with the leader (the leader itself is also in the ISR)

 

What does a broker do after it becomes the KafkaController?

  1. Increment the controller epoch and write the new epoch to ZooKeeper; the new epoch marks the leader of the new term, and it is validated when sending commands to other brokers;
  2. Watch the ZooKeeper path /admin/reassign_partitions;
  3. Watch the ZooKeeper path /admin/preferred_replica_election;
  4. Register the partition state machine's listener, which watches subdirectory changes under /brokers/topics, ready to create topics at any time;
  5. Register the replica state machine's listener, which watches subdirectories under /brokers/ids/ so that newly joined brokers are detected;
  6. Initialize ControllerContext, mainly by reading data from ZooKeeper into context variables such as liveBrokers, allTopics, AR, and LeadershipInfo;
  7. Initialize ReplicaStateMachine, moving every replica that resides on a live broker to the OnlineReplica state;
  8. Initialize PartitionStateMachine, setting every partition whose leader is on a live broker to OnlinePartition and every other partition to OfflinePartition. Whether a partition is online is determined by whether its leader is alive. A transition from OfflinePartition and NewPartition to OnlinePartition is then triggered, because those partitions may have failed an earlier leader election and therefore never became OnlinePartition; once conditions change the election must be retried;
  9. Add an AddPartitionsListener on the ZooKeeper path /brokers/topics/[topic]/ of every topic, watching for partition changes;
  10. After KafkaController has started, trigger one preferred leader election, if needed;
  11. After KafkaController has started, if automatic leader balancing is enabled, start the auto-leader-balancing thread, which runs periodically according to its configuration.

 

Once listeners are in place on all these ZooKeeper paths, changes to ZooKeeper content drive the controller's operations, handling events such as topic creation, topic deletion, broker failure, and broker recovery.

2.3 Controller Failover

registerSessionExpirationListener() in startup() above registers a session listener: when the ZooKeeper session expires and then reconnects successfully, onControllerResignation() is called and the election is run again. In addition, when the controller's session expires, the ephemeral data created under /controller/ is deleted; the LeaderChangeListener in the ZookeeperLeaderElector class on the other brokers detects the deletion and re-runs the election.

onControllerResignation() is what the controller executes when reverting to an ordinary broker: it unregisters all the previously registered listeners and stops watching ZooKeeper for changes.

2.4 initializeControllerContext: initializing the controller context

In KafkaController there are:

  • two state machines: the partition state machine and the replica state machine;
  • one manager: the channel manager, which handles communication with all brokers;
  • related caches: partition information, topic information, broker ids, and so on;
  • four leader election mechanisms, triggered respectively by a leader going offline, a broker going down, partition reassignment, and preferred leader election.

 

 

 

initializeControllerContext(), the method that initializes the KafkaController context, mainly does the following:

  • fetch the list of all live brokers from ZooKeeper and record it in liveBrokers;
  • fetch the list of all topics from ZooKeeper and record it in allTopics;
  • fetch the replica information of every partition from ZooKeeper and update partitionReplicaAssignment;
  • fetch the LeaderAndIsr information of every partition from ZooKeeper and update partitionLeadershipInfo;
  • call startChannelManager() to start the controller's channel manager;
  • initialize, via initializePreferredReplicaElection(), the list of partitions that need preferred leader election, recorded in partitionsUndergoingPreferredReplicaElection;
  • initialize, via initializePartitionReassignment(), the list of partitions whose replicas are being reassigned, recorded in partitionsBeingReassigned;
  • initialize, via initializeTopicDeletion(), the list of topics to delete and the TopicDeletionManager object.

Preferred leader election simply elects the first replica in a partition's replica assignment as leader. Why is it called "preferred"? When Kafka assigns replicas to partitions, it spreads the partitions' first replicas evenly across all brokers; as long as each first replica is elected leader, read and write traffic is distributed evenly across the brokers. This assumes every partition carries roughly the same traffic, which is rarely true in production, so for large clusters automatic leader balancing is generally not recommended; instead, preferred leader election can be triggered manually based on an external calculation.
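Why the first replica spreads leadership evenly can be seen from a round-robin style assignment sketch. This only illustrates the idea; it is not Kafka's actual replica assignment algorithm:

```python
from collections import Counter

def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin the first replica across brokers so that preferred
    leaders (the first replica of each partition) are spread evenly."""
    assignment = {}
    n = len(brokers)
    for p in range(num_partitions):
        first = p % n
        assignment[p] = [brokers[(first + i) % n]
                         for i in range(replication_factor)]
    return assignment

assignment = assign_replicas(num_partitions=6, brokers=[0, 1, 2],
                             replication_factor=2)
# Count how many partitions each broker leads if every partition
# elects its first (preferred) replica as leader.
leaders = Counter(replicas[0] for replicas in assignment.values())
print(dict(leaders))  # {0: 2, 1: 2, 2: 2}
```

With the first replicas balanced, electing the preferred replica everywhere balances the leadership load, which is exactly what automatic leader balancing tries to restore after failures.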

2.5 Controller Channel Manager

initializeControllerContext() initializes the ControllerChannelManager object via startChannelManager().

 

 

When ControllerChannelManager is initialized, it creates a ControllerBrokerStateInfo object for every node in the cluster. The object has four parts:

  • NetworkClient: the network connection object;
  • Node: the node information;
  • BlockingQueue: the request queue;
  • RequestSendThread: the thread that sends the requests.


 

 

With the above logic clear, consider how the KafkaController side sends requests to a broker.

 

 

KafkaController actually sends requests to a broker by calling ControllerChannelManager's sendRequest() method.

 

 

sendRequest() merely appends the request to the message queue of the target broker; it does not send the request itself. The actual sending happens in the RequestSendThread associated with each broker.
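The queue-per-broker pattern can be sketched as follows. This is a simplified stand-in (the real RequestSendThread also handles blocking network I/O and retries), and the names other than sendRequest's role are mine:

```python
import queue
import threading

class BrokerChannel:
    """One request queue plus one sender thread per broker, mirroring
    the BlockingQueue/RequestSendThread pair described above."""
    def __init__(self, broker_id, send_fn):
        self.queue = queue.Queue()
        self.thread = threading.Thread(
            target=self._run, args=(broker_id, send_fn), daemon=True)
        self.thread.start()

    def send_request(self, request):
        # Like ControllerChannelManager.sendRequest: just enqueue; the
        # sender thread performs the actual network call.
        self.queue.put(request)

    def _run(self, broker_id, send_fn):
        while True:
            request = self.queue.get()
            if request is None:  # shutdown sentinel
                break
            send_fn(broker_id, request)

sent = []
channel = BrokerChannel(1, lambda b, r: sent.append((b, r)))
channel.send_request("LeaderAndIsrRequest")
channel.send_request("UpdateMetadataRequest")
channel.queue.put(None)
channel.thread.join()
print(sent)  # [(1, 'LeaderAndIsrRequest'), (1, 'UpdateMetadataRequest')]
```

Decoupling enqueue from send keeps the controller's main logic from ever blocking on a slow or unreachable broker.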

2.6 The controller's four built-in leader election mechanisms

The four leader selector implementation classes and their trigger conditions are:

  • OfflinePartitionLeaderSelector: triggered when a partition's leader goes offline;
  • ReassignedPartitionLeaderSelector: triggered after a partition reassignment, once the reassigned replicas have finished syncing;
  • PreferredReplicaPartitionLeaderSelector: preferred leader election, triggered manually or by the automatic leader balancing schedule;
  • ControlledShutdownLeaderSelector: triggered when a broker sends a ControlledShutdown request to shut down gracefully.

OfflinePartitionLeaderSelector

Its election logic is:

  • if at least one replica in the ISR is alive, elect the first live replica in the ISR as the new leader, with the live ISR as the new ISR;
  • otherwise, if unclean election is disabled, throw a NoReplicaOnlineException;
  • otherwise (unclean election is allowed), elect one of the live assigned replicas (replicas not in the ISR) as the new leader, which alone forms the new ISR;
  • otherwise, i.e., none of the partition's assigned replicas is alive, throw a NoReplicaOnlineException.

Once the new leader is successfully registered in ZooKeeper, it is updated into the allLeaders cache in KafkaController.
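The branch logic above can be sketched as follows; the helper name and signature are illustrative, though the exception name follows the text:

```python
class NoReplicaOnlineException(Exception):
    pass

def offline_partition_leader_select(assigned, isr, live, unclean_allowed):
    """Pick a new leader after the old one goes offline.

    assigned: the partition's assigned replicas (AR)
    isr: the in-sync replicas before the failure
    live: the set of currently live brokers
    """
    live_isr = [r for r in isr if r in live]
    if live_isr:
        return live_isr[0], live_isr          # clean election
    if not unclean_allowed:
        raise NoReplicaOnlineException(
            "ISR is empty and unclean election is disabled")
    live_assigned = [r for r in assigned if r in live]
    if live_assigned:
        # Unclean: a replica outside the ISR becomes leader (may lose data).
        return live_assigned[0], [live_assigned[0]]
    raise NoReplicaOnlineException("no assigned replica is alive")

print(offline_partition_leader_select([1, 2, 3], [1, 2], {2, 3}, False))
# (2, [2])
```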

 

 

ReassignedPartitionLeaderSelector

ReassignedPartitionLeaderSelector is triggered after a partition's replicas have been reassigned and synchronization has finished (all of RAR is in the ISR; RAR is the partition's newly assigned replica list). Its election logic is:

  • the leader is the first live replica in RAR (all of RAR is in the ISR at this point);
  • the new ISR is the list of all live RAR replicas.

 

 

PreferredReplicaPartitionLeaderSelector

PreferredReplicaPartitionLeaderSelector implements preferred leader election: it elects the first replica of the AR (assigned replicas) as leader, provided that replica is alive and in the ISR; otherwise it throws a StateChangeFailedException.

 

 

 

ControlledShutdownLeaderSelector

ControlledShutdownLeaderSelector is the election method used when handling a broker's graceful shutdown: it elects the first replica in the ISR that is not shutting down as leader, and throws a StateChangeFailedException if there is none.
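This selection rule can be sketched as follows (an illustrative helper, not Kafka's actual implementation; the exception name follows the text):

```python
class StateChangeFailedException(Exception):
    pass

def controlled_shutdown_leader_select(isr, shutting_down):
    """Elect the first ISR replica that is not currently shutting down."""
    for replica in isr:
        if replica not in shutting_down:
            return replica
    raise StateChangeFailedException(
        "every replica in the ISR is shutting down")

# Broker 1 is shutting down, so leadership moves to the next ISR member.
print(controlled_shutdown_leader_select([1, 2, 3], shutting_down={1}))  # 2
```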

 

 

References:

https://blog.csdn.net/zg_hover/article/details/81672997

https://blog.csdn.net/c395318621/article/details/52463854

https://github.com/wangzzu/awesome/issues/7

 


Origin www.cnblogs.com/zhy-heaven/p/10994144.html