Kafka Source Code Reading: Controller (Part 1)

Source code version: 0.10.2.x

Quoting Kafka Controller Internals:

In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of
partitions and replicas and for performing administrative tasks like reassigning partitions.

Each broker in the cluster tries to create an ephemeral node at the '/controller' path in ZooKeeper; the first broker to create it successfully becomes the Controller.

[Figure: Kafka architecture diagram]

The main points about the Kafka Controller:

  • When the cluster starts, KafkaServerStartable starts KafkaServer, KafkaServer starts KafkaController, and Controller election begins immediately.
  • After the election, the winning Controller is activated: it registers ZooKeeper watches and starts listening on the relevant nodes so that Controller failover can be handled at any time. It then handles various state changes, including:
  • creation of new topics and partition expansion of existing topics, reassignment of partition replicas, and election of partition leaders;
  • broker startup and shutdown: the Controller monitors the liveness of brokers, elects new leaders on broker failure, and communicates the new leaders to the brokers;
  • managing the replica state machine (replicaStateMachine) and the partition state machine (partitionStateMachine); their state sets are sketched below.

Clearly, the Controller plays a pivotal role in the cluster.
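
For reference, the state sets that these two state machines manage look roughly as follows. This is a condensed sketch paraphrased from the 0.10.2 ReplicaStateMachine and PartitionStateMachine sources; the real case objects also carry a state byte and the set of valid previous states.

  // Condensed sketch of the controller's state sets (paraphrased from the 0.10.2
  // sources; the real definitions also include a state byte and validPreviousStates).
  sealed trait ReplicaState
  case object NewReplica extends ReplicaState
  case object OnlineReplica extends ReplicaState
  case object OfflineReplica extends ReplicaState
  case object ReplicaDeletionStarted extends ReplicaState
  case object ReplicaDeletionSuccessful extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState
  case object NonExistentReplica extends ReplicaState

  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState
  case object NonExistentPartition extends PartitionState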

1. Controller startup

KafkaServerStartable manages a single KafkaServer instance and is responsible for the KafkaServer's startup, shutdown, setServerState, and awaitShutdown.
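
A condensed sketch of that delegation, paraphrased from the 0.10.2 KafkaServerStartable (error handling and logging omitted):

  // Condensed sketch: KafkaServerStartable simply wraps a single KafkaServer
  // (paraphrased from the 0.10.2 source; error handling omitted).
  class KafkaServerStartable(val serverConfig: KafkaConfig) {
    private val server = new KafkaServer(serverConfig)

    def startup(): Unit = server.startup()        // KafkaServer.startup() constructs and starts KafkaController
    def shutdown(): Unit = server.shutdown()
    def awaitShutdown(): Unit = server.awaitShutdown()
    def setServerState(newState: Byte): Unit = server.brokerState.newState(newState)
  }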

[Sequence diagram: KafkaServerStartable → KafkaServer → KafkaController → ZookeeperLeaderElector → ZKCheckedEphemeral. KafkaServer.startup() constructs and starts KafkaController; KafkaController.startup() registers a SessionExpirationListener and starts the controllerElector; the elector's startup() subscribes to data changes on "/controller", registering the LeaderChangeListener that handles automatic re-election, and then runs elect / amILeader.]

The ZKLeaderElector in the diagram is the ZookeeperLeaderElector object, which handles all leader elections for the path given by its electionPath: String parameter.
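
A condensed sketch of KafkaController.startup, the step in the flow above where the elector is started (paraphrased from the 0.10.2 source):

  // Condensed sketch of KafkaController.startup (paraphrased from the 0.10.2
  // source): register the session-expiration listener, then start the elector.
  def startup() = {
    inLock(controllerContext.controllerLock) {
      info("Controller starting up")
      registerSessionExpirationListener()
      isRunning = true
      controllerElector.startup
      info("Controller startup complete")
    }
  }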

2. Controller election

When the Controller starts, it initializes and starts a ZookeeperLeaderElector. The ZookeeperLeaderElector registers a leaderChangeListener on the Controller's electionPath, i.e. "/controller", so that whenever the data of that node changes, the leaderChangeListener is notified and handles the change.
As the ZookeeperLeaderElector class's doc comment puts it: "This class handles zookeeper based leader election based on an ephemeral path."
When the current Controller fails, the corresponding controller path disappears automatically, and all live brokers then compete to become the new Controller by trying to create a new controller path.

  // ZookeeperLeaderElector
  def startup {
    inLock(controllerContext.controllerLock) {
      controllerContext.zkUtils.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
      elect
    }
  }
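
The leaderChangeListener subscribed above is an IZkDataListener. A condensed sketch of its two callbacks, paraphrased from the 0.10.2 source (logging trimmed): when the data of "/controller" changes it updates the cached leaderId and resigns if this broker just lost leadership; when the node is deleted it re-enters elect.

  // Condensed sketch of ZookeeperLeaderElector.LeaderChangeListener
  // (paraphrased from the 0.10.2 source; logging omitted).
  class LeaderChangeListener extends IZkDataListener {
    // "/controller" data changed: some broker (possibly another one) is now the leader.
    def handleDataChange(dataPath: String, data: Object): Unit = {
      val shouldResign = inLock(controllerContext.controllerLock) {
        val amILeaderBeforeDataChange = amILeader
        leaderId = KafkaController.parseControllerId(data.toString)
        // true only if this broker was the leader and no longer is
        amILeaderBeforeDataChange && !amILeader
      }
      if (shouldResign)
        onResigningAsLeader()
    }

    // "/controller" deleted: the previous controller is gone, try to get elected.
    def handleDataDeleted(dataPath: String): Unit = {
      val shouldResign = inLock(controllerContext.controllerLock) {
        amILeader
      }
      if (shouldResign)
        onResigningAsLeader()
      inLock(controllerContext.controllerLock) {
        elect
      }
    }
  }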

Next, let's look at elect:

The Controller election works by creating an ephemeral node at the "/controller" path in ZooKeeper and writing the current broker's information into it:

  {"version":1,"brokerid":1,"timestamp":"1512018424988"}

Because of ZooKeeper's strong consistency, the node can be created successfully by only one client, and the broker that creates it becomes the Controller.
If a Controller has already been elected, a broker running elect returns amILeader right away; this check prevents the broker that is already the Controller from entering an infinite loop in the node-creation code.

  def amILeader : Boolean = leaderId == brokerId

  // ZookeeperLeaderElector
  def elect: Boolean = {
    val timestamp = time.milliseconds.toString
    val electString = Json.encode(Map("version" -> 1, "brokerid" -> brokerId, "timestamp" -> timestamp))
   
    leaderId = getControllerID
    /* 
     * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition, 
     * it's possible that the controller has already been elected when we get here. This check will prevent the following 
     * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
     */
    if(leaderId != -1) {
       debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
       return amILeader
    }

    try {
      val zkCheckedEphemeral = new ZKCheckedEphemeral(electionPath,
                                                      electString,
                                                      controllerContext.zkUtils.zkConnection.getZookeeper,
                                                      JaasUtils.isZkSecurityEnabled())
      zkCheckedEphemeral.create()
      info(brokerId + " successfully elected as leader")
      leaderId = brokerId // record this broker's id as the leader id: this broker is now the controller
      onBecomingLeader()
    } catch {
      case _: ZkNodeExistsException =>
        // If someone else has written the path, then
        leaderId = getControllerID 

        if (leaderId != -1)
          debug("Broker %d was elected as leader instead of broker %d".format(leaderId, brokerId))
        else
          warn("A leader has been elected but just resigned, this will result in another round of election")

      case e2: Throwable =>
        error("Error while electing or becoming leader on broker %d".format(brokerId), e2)
        resign()
    }
    amILeader
  }
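
elect relies on getControllerID to find out whether some broker already holds "/controller". A condensed sketch, paraphrased from the same class: it reads the node, if present, and extracts the brokerid from the JSON payload shown earlier, returning -1 when no controller exists.

  // Condensed sketch of getControllerID (paraphrased from the 0.10.2 source):
  // read "/controller" and parse the brokerid from its JSON payload.
  private def getControllerID: Int = {
    controllerContext.zkUtils.readDataMaybeNull(electionPath)._1 match {
      case Some(controller) => KafkaController.parseControllerId(controller)
      case None             => -1
    }
  }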

After the node at the election path has been created successfully, elect records the leader id and calls onBecomingLeader(), which is the third constructor parameter of the ZookeeperLeaderElector class:

class ZookeeperLeaderElector(controllerContext: ControllerContext,
                             electionPath: String,
                             onBecomingLeader: () => Unit,
                             onResigningAsLeader: () => Unit,
                             brokerId: Int,
                             time: Time)

What the Controller actually passes in for onBecomingLeader when it is initialized is onControllerFailover:

  private val controllerElector = new ZookeeperLeaderElector(controllerContext, ZkUtils.ControllerPath, 
    onControllerFailover,
    onControllerResignation, config.brokerId, time)

In other words, as soon as a broker becomes the Controller it must be ready to handle failover at any time; this is what keeps the Controller highly available.

3. Controller failover

The first call to onControllerFailover happens when a broker becomes the controller. Later, whenever the controller node is deleted, a new election takes place and elect calls back onControllerFailover to perform the failover. This function:

  1. registers a listener for controller epoch changes;
  2. increments the controller epoch;
  3. initializes the ControllerContext object, which holds the current topics, live brokers, and the leaders of all existing partitions;
  4. starts the controller's ChannelManager;
  5. starts the replica state machine and the partition state machine;
  6. may trigger partition reassignment, preferred-replica election, and so on.

If the Controller encounters any exception or error while running, it resigns as the current controller. This ensures that another controller election is triggered and that there is always exactly one controller serving the cluster.
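
Resigning is implemented in ZookeeperLeaderElector by clearing the cached leader id and deleting the election path, so the LeaderChangeListener on every live broker fires and a new round of election starts (condensed sketch, paraphrased from the 0.10.2 source):

  // Condensed sketch of ZookeeperLeaderElector.resign (paraphrased from the
  // 0.10.2 source): dropping the ephemeral node triggers a fresh election.
  def resign(): Boolean = {
    leaderId = -1
    controllerContext.zkUtils.deletePath(electionPath)
  }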

  def onControllerFailover() {
    if(isRunning) {
      info("Broker %d starting become controller state transition".format(config.brokerId))
      readControllerEpochFromZookeeper()
      incrementControllerEpoch(zkUtils.zkClient)

      // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
      registerReassignedPartitionsListener()
      registerIsrChangeNotificationListener()
      registerPreferredReplicaElectionListener()
      partitionStateMachine.registerListeners()
      replicaStateMachine.registerListeners()

      // update the ControllerContext
      initializeControllerContext()

      // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
      // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
      // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
      // partitionStateMachine.startup().
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)

      replicaStateMachine.startup()
      partitionStateMachine.startup()

      // register the partition change listeners for all existing topics on failover
      controllerContext.allTopics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic))
      info("Broker %d is ready to serve as the new controller with epoch %d".format(config.brokerId, epoch))
      maybeTriggerPartitionReassignment()
      maybeTriggerPreferredReplicaElection()
      if (config.autoLeaderRebalanceEnable) {
        info("starting the partition rebalance scheduler")
        autoRebalanceScheduler.startup()
        autoRebalanceScheduler.schedule("partition-rebalance-thread", checkAndTriggerPartitionRebalance,
          5, config.leaderImbalanceCheckIntervalSeconds.toLong, TimeUnit.SECONDS)
      }
      deleteTopicManager.start()
    }
    else
      info("Controller has been shut down, aborting startup/failover")
  }
[Sequence diagram: Controller, zkUtils, ControllerContext, PartitionStateMachine, ReplicaStateMachine, ControllerBrokerRequestBatch. Read the epoch from "/controller_epoch" if it exists; conditionalUpdatePersistentPath to increment the epoch; update epoch and epochZkVersion; register listeners on "$AdminPath/reassign_partitions", "/isr_change_notification" and "$AdminPath/preferred_replica_election"; register the TopicChangeListener, DeleteTopicListener and BrokerChangeListener; update the context (liveBrokers, allTopics, partition replica assignment, etc.); recreate and start the ControllerChannelManager; send an update-metadata request for the live brokers; start the state machines (initialize replica states, set the started flag, move all online replicas to OnlineReplica); register the PartitionModificationsListener.]

Once a Controller failover is triggered, the Controller re-initializes the ControllerContext from the contents of the ZooKeeper nodes and refreshes its cached controller and leader/ISR information. updateLeaderAndIsrCache() reads the /brokers/topics/[topic]/partitions/[partition]/state path of every known partition and loads the leaderIsrAndControllerEpoch information into the ControllerContext's in-memory cache. The Controller then sends update-metadata requests to the brokers.
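
A condensed sketch of what updateLeaderAndIsrCache does, paraphrased from the 0.10.2 source (helper names may differ slightly): it reads the state znode of every known partition and caches the resulting LeaderIsrAndControllerEpoch in the context.

  // Condensed sketch (paraphrased from the 0.10.2 source): read
  // /brokers/topics/[topic]/partitions/[partition]/state for every known partition
  // and cache the leader/ISR/controller-epoch information in ControllerContext.
  private def updateLeaderAndIsrCache() {
    val leaderAndIsrInfo = zkUtils.getPartitionLeaderAndIsrForTopics(
      zkUtils.zkClient, controllerContext.partitionReplicaAssignment.keySet)
    for ((topicPartition, leaderIsrAndControllerEpoch) <- leaderAndIsrInfo)
      controllerContext.partitionLeadershipInfo.put(topicPartition, leaderIsrAndControllerEpoch)
  }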


Reprinted from blog.csdn.net/rover2002/article/details/106576411