Kafka Source Code Reading - Controller (Part 4): The Replica State Machine

The partition state machine has four states; the replica state machine has seven, which makes it the more complex of the two.
The ReplicaStateMachine class implements the state machine for replicas. It defines the states a replica can be in, together with the legal transitions that move a replica from one state to another. The states are:

  1. NewReplica: the controller can create new replicas during partition reassignment. In this state, a replica can only receive a become-follower request. Valid previous state: NonExistentReplica.
  2. OnlineReplica: once a replica is started and is part of the assigned replica set for its partition, it enters this state. Here it can receive either become-leader or become-follower requests. Valid previous states: NewReplica, OnlineReplica, or OfflineReplica.
  3. OfflineReplica: a replica moves to this state when it dies, which happens when the broker hosting the replica goes down. Valid previous states: NewReplica, OnlineReplica.
  4. ReplicaDeletionStarted: a replica enters this state when its deletion is started. Valid previous state: OfflineReplica.
  5. ReplicaDeletionSuccessful: a replica enters this state when it responds to its delete request with no errors. Valid previous state: ReplicaDeletionStarted.
  6. ReplicaDeletionIneligible: a replica enters this state when its deletion fails. Valid previous state: ReplicaDeletionStarted.
  7. NonExistentReplica: a replica enters this state after it has been successfully deleted. Valid previous state: ReplicaDeletionSuccessful.
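
To make the transition rules easier to scan, here is a minimal sketch of the seven states and their valid previous states as a lookup table. This is illustration only, not the Kafka source: the real code defines the states as case objects and asserts each transition inline via assertValidPreviousStates(). Note that the handleStateChange() snippet quoted in section 1 accepts a slightly broader set for OnlineReplica (it also allows ReplicaDeletionIneligible), which the table reflects:

  // Illustration only -- a table form of the transition rules described above.
  sealed trait ReplicaState
  case object NewReplica extends ReplicaState
  case object OnlineReplica extends ReplicaState
  case object OfflineReplica extends ReplicaState
  case object ReplicaDeletionStarted extends ReplicaState
  case object ReplicaDeletionSuccessful extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState
  case object NonExistentReplica extends ReplicaState

  val validPreviousStates: Map[ReplicaState, Set[ReplicaState]] = Map(
    NewReplica                -> Set(NonExistentReplica),
    OnlineReplica             -> Set(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible),
    OfflineReplica            -> Set(NewReplica, OnlineReplica),
    ReplicaDeletionStarted    -> Set(OfflineReplica),
    ReplicaDeletionSuccessful -> Set(ReplicaDeletionStarted),
    ReplicaDeletionIneligible -> Set(ReplicaDeletionStarted),
    NonExistentReplica        -> Set(ReplicaDeletionSuccessful)
  )

  // A move current -> target is legal iff:
  def isValidTransition(current: ReplicaState, target: ReplicaState): Boolean =
    validPreviousStates(target).contains(current)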

1. Replica startup and coming online

As before, we start from onControllerFailover(): it first calls registerListeners() on partitionStateMachine and replicaStateMachine, and then, after building the initial context, starts the replica state machine and the partition state machine.

def onControllerFailover() {
    if(isRunning) {
      ......
      partitionStateMachine.registerListeners()
      replicaStateMachine.registerListeners()
      // Read the various paths from ZooKeeper to build the initial context:
      // liveBrokers/allTopics/partitionReplicaAssignment/partitionLeadershipInfo,
      // plus each partition's leaderAndIsrInfo, etc.
      initializeControllerContext()
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)

      replicaStateMachine.startup()
      partitionStateMachine.startup()
      ......

After the replica state machine registers its BrokerChangeListener, initializing the controller context supplies the replicas' initial state, and the replica state machine is then started.

Starting the replica state machine works as follows:

  def startup() {
    // initialize replica state
    initializeReplicaState()
    // set started flag
    hasStarted.set(true)
    // move all Online replicas to Online
    handleStateChanges(controllerContext.allLiveReplicas(), OnlineReplica)

    info("Started replica state machine with initial state -> " + replicaState.toString())
  }
  
  private def initializeReplicaState() {
    for((topicPartition, assignedReplicas) <- controllerContext.partitionReplicaAssignment) {
      val topic = topicPartition.topic
      val partition = topicPartition.partition
      assignedReplicas.foreach { replicaId =>
        val partitionAndReplica = PartitionAndReplica(topic, partition, replicaId)
        // if the broker hosting the replica is alive
        if (controllerContext.liveBrokerIds.contains(replicaId))
          replicaState.put(partitionAndReplica, OnlineReplica)
        else
          // mark replicas on dead brokers as failed for topic deletion, if they belong to a topic to be deleted.
          // This is required during controller failover since during controller failover a broker can go down,
          // so the replicas on that broker should be moved to ReplicaDeletionIneligible to be on the safer side.
          replicaState.put(partitionAndReplica, ReplicaDeletionIneligible)
      }
    }
  }

initializeReplicaState():

When initializing replica state, for every partition that already exists in ZooKeeper, the assigned replicas are read from the freshly initialized context.
If the broker hosting a replica is alive, the replica is marked OnlineReplica in the state machine's cache; otherwise it is marked ReplicaDeletionIneligible. This only marks the state in the in-memory cache; it does not trigger any state-change actions.

So after initialization, a replica's initial state can be:

  • OnlineReplica
  • ReplicaDeletionIneligible

Once initialization in startup() is done, it fires handleStateChanges() with the OnlineReplica target state to bring every live replica online:

  // the replicas parameter is a set of PartitionAndReplica
  def handleStateChanges(replicas: Set[PartitionAndReplica], targetState: ReplicaState,
                         callbacks: Callbacks = (new CallbackBuilder).build) {
    if(replicas.nonEmpty) {
      info("Invoking state change to %s for replicas %s".format(targetState, replicas.mkString(",")))
      try {
        brokerRequestBatch.newBatch()
        replicas.foreach(r => handleStateChange(r, targetState, callbacks))
        brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
      }catch {
        case e: Throwable => error("Error while moving some replicas to %s state".format(targetState), e)
      }
    }
  }

Triggering the OnlineReplica target state on a replica:

The fragment of handleStateChange() that handles the transition to OnlineReplica is as follows:

        case OnlineReplica =>
          assertValidPreviousStates(partitionAndReplica,
            List(NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible), targetState)
          replicaState(partitionAndReplica) match {
            // previous state was NewReplica
            case NewReplica =>
              // add this replica to the assigned replicas list for its partition
              val currentAssignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
              if(!currentAssignedReplicas.contains(replicaId))
                controllerContext.partitionReplicaAssignment.put(topicAndPartition, currentAssignedReplicas :+ replicaId)
              stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
                                        .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState,
                                                targetState))
            case _ =>
              // check whether a leader has ever existed for this partition
              controllerContext.partitionLeadershipInfo.get(topicAndPartition) match {
                case Some(leaderIsrAndControllerEpoch) =>
                  brokerRequestBatch.addLeaderAndIsrRequestForBrokers(List(replicaId), topic, partition, leaderIsrAndControllerEpoch,
                    replicaAssignment)
                  // enter the OnlineReplica state: the replica goes online immediately
                  replicaState.put(partitionAndReplica, OnlineReplica)
                  stateChangeLogger.trace("Controller %d epoch %d changed state of replica %d for partition %s from %s to %s"
                    .format(controllerId, controller.epoch, replicaId, topicAndPartition, currState, targetState))
                case None => // that means the partition was never in OnlinePartition state, this means the broker never
                  // started a log for that partition and does not have a high watermark value for this partition
              }
          }
          replicaState.put(partitionAndReplica, OnlineReplica)

controllerContext.partitionReplicaAssignment is a map keyed by TopicAndPartition whose value is the list of the partition's replicas. A partition has multiple replicas; when the previous state was NewReplica, the current replica may already be present in the assigned-replica list, and if it is not, the code appends it to the map's value here.
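
For illustration, here is that append-if-absent update in isolation (a standalone sketch; the TopicAndPartition case class and the values are stand-ins, not real controller state):

  import scala.collection.mutable

  // Stand-in type and data, just to show the map-update pattern.
  case class TopicAndPartition(topic: String, partition: Int)

  val partitionReplicaAssignment = mutable.Map(
    TopicAndPartition("t", 0) -> Seq(1, 2)        // partition t-0 assigned to brokers 1 and 2
  )

  val tp = TopicAndPartition("t", 0)
  val replicaId = 3
  val currentAssignedReplicas = partitionReplicaAssignment(tp)
  if (!currentAssignedReplicas.contains(replicaId))
    partitionReplicaAssignment.put(tp, currentAssignedReplicas :+ replicaId)  // append broker 3

  // partitionReplicaAssignment(tp) is now Seq(1, 2, 3)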


If the previous state was not NewReplica, we must first check whether a leader exists for this partition. If it does, the replica is marked online; otherwise the partition has never been in the OnlinePartition state, which means the broker never started a log for that partition and has no high watermark value for it, so nothing is done.

Finally, the current partition-and-replica is marked as OnlineReplica.

During startup, the possible state transitions are:

  • OnlineReplica -> OnlineReplica
  • ReplicaDeletionIneligible -> OnlineReplica

2. Bringing a NewReplica online

When does a replica enter the NewReplica state?

The first case is when a new topic is created: new partitions are created for the topic and brought online, and each partition's replicas enter NewReplica and are then brought online (OnlineReplica).
The second is when a partition's replicas are reassigned (for example via an admin command); a sketch of this case follows at the end of this section.

The first case:

  // KafkaController.scala
  def onNewPartitionCreation(newPartitions: Set[TopicAndPartition]) {
    info("New partition creation callback for %s".format(newPartitions.mkString(",")))
    // move the new partitions to NewPartition and their replicas to NewReplica
    partitionStateMachine.handleStateChanges(newPartitions, NewPartition)
    replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), NewReplica)
    // bring the partitions online, then their replicas
    partitionStateMachine.handleStateChanges(newPartitions, OnlinePartition, offlinePartitionSelector)
    replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), OnlineReplica)
  }

This explains the following state transition:

  • NewReplica -> OnlineReplica
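
For the second case (reassignment), the idea is that replicas newly added by the reassignment (RAR minus OAR) are first created as NewReplica and only brought online once they catch up. A minimal sketch of the selection step, with illustrative values; the real logic lives in onPartitionReassignment() and its helpers:

  // Illustrative values: OAR = {1, 2}, RAR = {2, 3}
  val originalReplicas: Set[Int]   = Set(1, 2)   // OAR: originally assigned replicas
  val reassignedReplicas: Set[Int] = Set(2, 3)   // RAR: reassigned replicas

  // Replicas in RAR but not in OAR enter NewReplica first ...
  val newReplicas = reassignedReplicas -- originalReplicas   // Set(3)
  // ... via replicaStateMachine.handleStateChanges(
  //       newReplicas.map(r => PartitionAndReplica(topic, partition, r)), NewReplica)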

3. Taking replicas offline

When does a replica enter the OfflineReplica state?
In the controller, it happens in the following cases:

  1. In shutdownBroker(): if the broker being shut down is a follower, all partitions and replicas on it are taken offline.
  2. In onBrokerFailure(): all replicas on the dead broker, except those of topics about to be deleted, become dead replicas and are taken offline.
  3. In onPartitionReassignment(): when a partition moves from its originally assigned replicas (OAR) to the reassigned replicas (RAR), the old replicas are forcibly deleted, walking the full chain OfflineReplica -> ReplicaDeletionStarted -> ReplicaDeletionSuccessful -> NonExistentReplica (sketched after this list).
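
As a hedged sketch of how the controller walks that chain through the replica state machine (the real calls are spread across KafkaController and TopicDeletionManager; only the order of the target states is the point here):

  // Illustration only: the order of handleStateChanges() calls that carries
  // replicas from offline to fully removed; replicas is a Set[PartitionAndReplica].
  replicaStateMachine.handleStateChanges(replicas, OfflineReplica)          // stop fetchers, shrink ISR
  replicaStateMachine.handleStateChanges(replicas, ReplicaDeletionStarted)  // send StopReplica (delete = true)
  // ... after the StopReplica responses come back without errors:
  replicaStateMachine.handleStateChanges(replicas, ReplicaDeletionSuccessful)
  replicaStateMachine.handleStateChanges(replicas, NonExistentReplica)      // forget the replicas entirely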

4. Replica deletion

As in case 3 of the previous section, replicas enter the ReplicaDeletionStarted state when KafkaController reassigns a partition and its replicas. The other entry point is TopicDeletionManager:

When TopicDeletionManager.start() runs, it starts the DeleteTopicsThread (class DeleteTopicsThread() extends ShutdownableThread()). Its deletion call chain is:

  onTopicDeletion()
    -> onPartitionDeletion()
      -> startReplicaDeletion(), which passes a deleteTopicStopReplicaCallback() into the replica state machine.

When deleteTopicStopReplicaCallback fires, it runs failReplicaDeletion() for the replicas whose StopReplica response carried an error, then completeReplicaDeletion() for the ones that succeeded (sketched below).
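
A hedged sketch of that callback's logic (the signature here is simplified to a success/failure map; the real callback parses a StopReplicaResponse and partitions its per-partition results by error code):

  // Sketch only: split the StopReplica results into failures and successes.
  def deleteTopicStopReplicaCallback(results: Map[PartitionAndReplica, Boolean]): Unit = {
    val (succeeded, failed) = results.partition { case (_, ok) => ok }
    failReplicaDeletion(failed.keySet)            // failures -> ReplicaDeletionIneligible, retried later
    if (succeeded.nonEmpty)
      completeReplicaDeletion(succeeded.keySet)   // successes -> ReplicaDeletionSuccessful
  }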

When does TopicDeletionManager start?

It is started as the last step of onControllerFailover(). TopicDeletionManager performs deletion only when isDeleteTopicEnabled is true; whether topics may be deleted comes from the configuration file:

  val isDeleteTopicEnabled = controller.config.deleteTopicEnable
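
This flag maps to the broker configuration delete.topic.enable (false by default in the Kafka versions this source is taken from), so topic deletion is a no-op unless it is switched on in server.properties:

  # server.properties
  delete.topic.enable=true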

So does the controller, right after starting the partition and replica state machines and bringing everything online, delete all topics?
No. The deletion thread only acts on topics that have been explicitly enqueued: topicsToBeDeleted, the set of topics requested for deletion, is populated by enqueueTopicsForDeletion() and starts out empty (the listener callback also calls markTopicIneligibleForDeletion() for topics that cannot be deleted yet).
When do topics get enqueued for deletion?
When the DeleteTopicsListener fires. The partition state machine registers this DeleteTopicsListener before it is started by the controller; a sketch of the enqueue step follows.
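
A hedged sketch of that enqueue step (simplified from TopicDeletionManager; the partitionsToBeDeleted bookkeeping and other checks are omitted):

  // Sketch: record the topics and wake the deletion thread.
  def enqueueTopicsForDeletion(topics: Set[String]) {
    if (isDeleteTopicEnabled) {
      topicsToBeDeleted ++= topics
      resumeTopicDeletionThread()   // signals the condition variable the DeleteTopicsThread waits on
    }
  }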

[Sequence diagram: triggering topic deletion. During onControllerFailover(), KafkaController calls registerListeners() on the partition state machine, which registers the DeleteTopicsListener on zkClient via subscribeChildChanges(); the replica state machine registers its listeners, both state machines start up, and DeleteTopicManager.start() runs. When the watched ZK node changes, doHandleChildChange() marks in-flight topics via markTopicIneligibleForDeletion(), adds the topics to topicsToBeDeleted via enqueueTopicsForDeletion(), and signals the deletion thread's condition variable via resumeTopicDeletionThread().]

The sequence diagram above shows how topic deletion is triggered. A change on the ZK node fires the DeleteTopicsListener; its callback marks the topics that cannot yet be deleted as ineligible and resumes the deletion thread. The DeleteTopicsThread, started long before, now finds the deletion condition satisfied, takes the topics to delete from topicsToBeDeleted, and performs the deletion. The main logic of the thread function is:
        topicsQueuedForDeletion.foreach { topic =>
          // if all replicas are marked as successfully deleted, the topic deletion is complete
          if(controller.replicaStateMachine.areAllReplicasForTopicDeleted(topic)) {
            // clear all state for this topic from the controller caches and from ZK
            completeDeleteTopic(topic)
            info("Deletion of topic %s successfully completed".format(topic))
          } else {
            if(controller.replicaStateMachine.isAtLeastOneReplicaInDeletionStartedState(topic)) {
              // deletion of this topic is already in progress; just log which replicas are still being deleted
              val replicasInDeletionStartedState = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionStarted)
              val replicaIds = replicasInDeletionStartedState.map(_.replica)
              val partitions = replicasInDeletionStartedState.map(r => TopicAndPartition(r.topic, r.partition))
              info("Deletion for replicas %s for partition %s of topic %s in progress".format(replicaIds.mkString(","),
                partitions.mkString(","), topic))
            } else {
              // if we get here, no replica is in TopicDeletionStarted state and not all replicas are in TopicDeletionSuccessful.
              // That means either deletion of this topic has never been initiated, or at least one replica failed (so deletion must be retried)
              if(controller.replicaStateMachine.isAnyReplicaInState(topic, ReplicaDeletionIneligible)) {
                // mark topic for deletion retry
                markTopicForDeletionRetry(topic)
              }
            }
          }
          if(isTopicEligibleForDeletion(topic)) {
            info("Deletion of topic %s (re)started".format(topic))
            // topic deletion will be kicked off
            onTopicDeletion(Set(topic))
          } else if(isTopicIneligibleForDeletion(topic)) {
            info("Not retrying deletion of topic %s at this time since it is marked ineligible for deletion".format(topic))
          }
      }


Reposted from blog.csdn.net/rover2002/article/details/106714963