Flink Source Code Walkthrough (Standalone Mode): TaskManager Startup

1. The flink-daemon.sh script shows that the TaskManager entry class is org.apache.flink.runtime.taskmanager.TaskManager.
2. In its main method, the key step is starting the TaskManager:

try {
      SecurityUtils.getInstalledContext.runSecured(new Callable[Unit] {
        override def call(): Unit = {
          // Run the TaskManager. Note classOf[TaskManager]: this is the class of the
          // TaskManager actor, and its lifecycle methods are defined there.
          selectNetworkInterfaceAndRunTaskManager(configuration, resourceId, classOf[TaskManager])
        }
      })
    }

3. selectNetworkInterfaceAndRunTaskManager does three main things:
a. Creates the high-availability services
b. Selects the hostname and port range for the TaskManager
c. Starts the TaskManager

  def selectNetworkInterfaceAndRunTaskManager(
      configuration: Configuration,
      resourceID: ResourceID,
      taskManagerClass: Class[_ <: TaskManager])
    : Unit = {

    val highAvailabilityServices = HighAvailabilityServicesUtils.createHighAvailabilityServices(
      configuration,
      Executors.directExecutor(),
      AddressResolution.TRY_ADDRESS_RESOLUTION)
    // Select the network interface and the port range
    val (taskManagerHostname, actorSystemPortRange) = selectNetworkInterfaceAndPortRange(
      configuration,
      highAvailabilityServices)

    try {
      // Start the TaskManager
      runTaskManager(
        taskManagerHostname,
        resourceID,
        actorSystemPortRange,
        configuration,
        highAvailabilityServices,
        taskManagerClass)
    } finally {
      try {
        highAvailabilityServices.close()
      } catch {
        case t: Throwable => LOG.warn("Could not properly stop the high availability services.", t)
      }
    }
  }

4. runTaskManager takes the port range selected above, finds a free port in it for the TaskManager's communication, and then calls the overloaded runTaskManager method to start the TaskManager:

def runTaskManager(
    taskManagerHostname: String,
    resourceID: ResourceID,
    actorSystemPortRange: java.util.Iterator[Integer],
    configuration: Configuration,
    highAvailabilityServices: HighAvailabilityServices,
    taskManagerClass: Class[_ <: TaskManager])
    : Unit = {
    // Find a free port by trying to bind a server socket
    val result = AkkaUtils.retryOnBindException({
      // Try all ports in the range until successful
      val socket = NetUtils.createSocketFromPorts(
        actorSystemPortRange,
        new NetUtils.SocketFactory {
          override def createSocket(port: Int): ServerSocket = new ServerSocket(
            // Use the correct listening address, bound ports will only be
            // detected later by Akka.
            port, 0, InetAddress.getByName(NetUtils.getWildcardIPAddress))
        })

      val port =
        if (socket == null) {
          throw new BindException(s"Unable to allocate port for TaskManager.")
        } else {
          try {
            socket.getLocalPort()
          } finally {
            socket.close()
          }
        }

      runTaskManager(
        taskManagerHostname,
        resourceID,
        port,
        configuration,
        highAvailabilityServices,
        taskManagerClass)
    }, { !actorSystemPortRange.hasNext }, 5000)

    result match {
      case scala.util.Failure(f) => throw f
      case _ =>
    }
  }
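The port probing in step 4 can be sketched in isolation. This is a minimal illustration, not Flink's implementation: PortProbe and findFreePort are hypothetical names, and the port numbers in main are example values only.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper (not part of Flink): probes candidate ports by binding a ServerSocket.
final class PortProbe {

    /** Returns the first candidate port that can be bound, or -1 if none can. */
    static int findFreePort(Iterator<Integer> candidates) {
        while (candidates.hasNext()) {
            int port = candidates.next();
            // A null bind address means the wildcard address (all local interfaces).
            try (ServerSocket socket = new ServerSocket(port, 0, (InetAddress) null)) {
                // Port 0 asks the OS for an ephemeral port, so report the port actually bound.
                return socket.getLocalPort();
            } catch (IOException e) {
                // The port is taken or cannot be bound: try the next candidate.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Example candidates only; 0 is a fallback that lets the OS choose.
        Iterator<Integer> range = Arrays.asList(6122, 6123, 6124, 0).iterator();
        System.out.println("Free port: " + findFreePort(range));
    }
}
```

The socket is closed before the port is handed on, so another process can grab the port in between. That gap is presumably why the original wraps the whole attempt in AkkaUtils.retryOnBindException.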

5. Inside the overloaded runTaskManager
5.1. Create the TaskManager actor system:

    val taskManagerSystem = BootstrapTools.startActorSystem(
      configuration,
      taskManagerHostname,
      actorSystemPort,
      LOG.logger)

5.2. Create a MetricRegistry and start its query service:

val metricRegistry = new MetricRegistryImpl(
      MetricRegistryConfiguration.fromConfiguration(configuration))

    metricRegistry.startQueryService(taskManagerSystem, resourceID)

5.3. Start the TaskManager components and the TaskManager actor:

val taskManager = startTaskManagerComponentsAndActor(
        configuration,
        resourceID,
        taskManagerSystem,
        highAvailabilityServices,
        metricRegistry,
        taskManagerHostname,
        Some(TaskExecutor.TASK_MANAGER_NAME),
        localTaskManagerCommunication = false,
        taskManagerClass)

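5.3.1. Once the TaskManager actor has started, its lifecycle method preStart runs. Its main job is to start the retrieval service that looks up the leader JobManager. In standalone mode the leader address is fixed, so the service notifies the listener of it immediately: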

leaderRetrievalService.start(this)
// See the start method of StandaloneLeaderRetrievalService:
public void start(LeaderRetrievalListener listener) {
		checkNotNull(listener, "Listener must not be null.");

		synchronized (startStopLock) {
			checkState(!started, "StandaloneLeaderRetrievalService can only be started once.");
			started = true;

			// Notify the listener right away with the leader JobManager address
			listener.notifyLeaderAddress(leaderAddress, leaderId);
		}
	}
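The standalone behaviour above (a fixed leader that is announced as soon as the service starts) can be reproduced with a few lines of plain Java. The sketch below uses simplified stand-ins: LeaderListener and FixedLeaderRetrieval are illustrative names rather than Flink classes, and the address in main is made up.

```java
import java.util.Objects;
import java.util.UUID;

// Simplified stand-ins for the listener and the standalone retrieval service (illustrative names).
final class StandaloneRetrievalSketch {

    interface LeaderListener {
        void notifyLeaderAddress(String leaderAddress, UUID leaderSessionId);
    }

    static final class FixedLeaderRetrieval {
        private final Object lock = new Object();
        private final String leaderAddress;
        private final UUID leaderId;
        private boolean started;

        FixedLeaderRetrieval(String leaderAddress, UUID leaderId) {
            this.leaderAddress = Objects.requireNonNull(leaderAddress);
            this.leaderId = leaderId;
        }

        void start(LeaderListener listener) {
            Objects.requireNonNull(listener, "Listener must not be null.");
            synchronized (lock) {
                if (started) {
                    throw new IllegalStateException("The service can only be started once.");
                }
                started = true;
                // No election takes place: the configured leader is reported immediately.
                listener.notifyLeaderAddress(leaderAddress, leaderId);
            }
        }
    }

    public static void main(String[] args) {
        // The address below is an example value, not taken from a real cluster.
        FixedLeaderRetrieval service = new FixedLeaderRetrieval(
                "akka.tcp://flink@jobmanager-host:6123/user/jobmanager", new UUID(0L, 0L));
        service.start((address, id) ->
                System.out.println("Leader: " + address + ", session: " + id));
    }
}
```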

5.3.2. The TaskManager's notifyLeaderAddress method sends a JobManagerLeaderAddress message to the TaskManager actor itself:

override def notifyLeaderAddress(leaderAddress: String, leaderSessionID: UUID): Unit = {
    self ! JobManagerLeaderAddress(leaderAddress, leaderSessionID)
  }

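5.3.3. In the TaskManager actor's handleMessage method, the JobManagerLeaderAddress case is handled as follows:
1. If the TaskManager already holds a leader JobManager (that is, it is currently connected to one), it first disconnects from the old leader.
2. It then triggers registration of the TaskManager with the JobManager.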

case JobManagerLeaderAddress(address, newLeaderSessionID) =>
      handleJobManagerLeaderAddress(address, newLeaderSessionID)

private def handleJobManagerLeaderAddress(
      newJobManagerAkkaURL: String,
      leaderSessionID: UUID)
    : Unit = {

    currentJobManager match {
      case Some(jm) =>
        Option(newJobManagerAkkaURL) match {
          case Some(newJMAkkaURL) =>
            // Disconnect from the old leader JobManager
            handleJobManagerDisconnect(s"JobManager $newJMAkkaURL was elected as leader.")
          case None =>
            handleJobManagerDisconnect(s"Old JobManager lost its leadership.")
        }
      case None =>
    }

    this.jobManagerAkkaURL = Option(newJobManagerAkkaURL)
    this.leaderSessionID = Option(leaderSessionID)

    if (this.leaderSessionID.isDefined) {
      // Trigger the TaskManager registration
      triggerTaskManagerRegistration()
    }
  }
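The leader-change handling can be modelled without Akka as a small state holder. This is a hedged sketch: LeaderChangeSketch and its event strings are illustrative, and it assumes that disconnecting clears the current JobManager, which the excerpt above does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

// Illustrative model of the leader-change logic; not Flink code.
final class LeaderChangeSketch {
    private Optional<String> currentJobManager = Optional.empty();
    private Optional<String> jobManagerUrl = Optional.empty();
    private Optional<UUID> leaderSessionId = Optional.empty();

    /** Records what the handler would do, in order. */
    final List<String> events = new ArrayList<>();

    void onLeaderAddress(String newUrl, UUID newSessionId) {
        if (currentJobManager.isPresent()) {
            // Already connected: drop the old leader before anything else.
            events.add(newUrl != null
                    ? "disconnect: " + newUrl + " was elected as leader"
                    : "disconnect: old JobManager lost its leadership");
            currentJobManager = Optional.empty(); // assumption: disconnect clears the connection
        }
        jobManagerUrl = Optional.ofNullable(newUrl);
        leaderSessionId = Optional.ofNullable(newSessionId);
        if (leaderSessionId.isPresent()) {
            // A leader with a session ID is known, so start registering with it.
            events.add("register at " + jobManagerUrl.orElse("<unknown>"));
        }
    }

    /** Called once the (hypothetical) registration has been acknowledged. */
    void onRegistered() {
        currentJobManager = jobManagerUrl;
    }

    public static void main(String[] args) {
        LeaderChangeSketch tm = new LeaderChangeSketch();
        tm.onLeaderAddress("jm-1", UUID.randomUUID());
        tm.onRegistered();
        tm.onLeaderAddress("jm-2", UUID.randomUUID());
        tm.events.forEach(System.out::println);
    }
}
```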

5.3.4. triggerTaskManagerRegistration sends a TriggerTaskManagerRegistration message to the TaskManager actor itself:

      self ! decorateMessage(
        TriggerTaskManagerRegistration(
          jobManagerAkkaURL.get,
          new FiniteDuration(
            config.getInitialRegistrationPause().getSize(),
            config.getInitialRegistrationPause().getUnit()),
          deadline,
          1,
          currentRegistrationRun)
      )

5.3.5. Registration logic:

case message: RegistrationMessage => handleRegistrationMessage(message)


5.3.5.1. If the TaskManager is already registered, only log a message:
if (isConnected) {
            // this may be the case, if we queue another attempt and
            // in the meantime, the registration is acknowledged
            log.debug(
              "TaskManager was triggered to register at JobManager, but is already registered")
          } 

5.3.5.2. If registration has not succeeded within the allowed time, give up and shut down:

 else if (deadline.exists(_.isOverdue())) {
            // we failed to register in time. that means we should quit
            log.error("Failed to register at the JobManager within the defined maximum " +
                        "connect time. Shutting down ...")

            // terminate ourselves (hasta la vista)
            self ! decorateMessage(PoisonPill)
          }

5.3.5.3. Otherwise, send a registration message to the JobManager actor:

val jobManager = context.actorSelection(jobManagerURL)

            jobManager ! decorateMessage(
              RegisterTaskManager(
                resourceID,
                location,
                resources,
                numberOfSlots)
            )

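5.3.5.3.1. The JobManager actor receives the registration message in JobManager.handleMessage. If a ResourceManager is already registered with the JobManager, the JobManager notifies it (NotifyResourceStarted) that a TaskManager has started for the given resource ID. The notification uses the ask pattern, so it returns a Future rather than blocking. If the ask fails or times out, the JobManager sends itself a ReconnectResourceManager message: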

currentResourceManager match {
        case Some(rm) =>
          val future = (rm ? decorateMessage(new NotifyResourceStarted(msg.resourceId)))(timeout)
          future.onFailure {
            case t: Throwable =>
              t match {
                case _: TimeoutException =>
                  log.info("Attempt to register resource at ResourceManager timed out. Retrying")
                case _ =>
                  log.warn("Failure while asking ResourceManager for RegisterResource. Retrying", t)
              }
              self ! decorateMessage(
                new ReconnectResourceManager(
                  rm,
                  currentResourceManagerConnectionId))
          }(context.dispatcher)

        case None =>
          log.info("Task Manager Registration but not connected to ResourceManager")
      }

5.3.5.3.2. If the TaskManager is already registered, reply to the TaskManager actor with AlreadyRegistered:

if (instanceManager.isRegistered(resourceId)) {
        val instanceID = instanceManager.getRegisteredInstance(resourceId).getId

        taskManager ! decorateMessage(
          AlreadyRegistered(
            instanceID,
            blobServer.getPort))
      }

5.3.5.3.3. If it is not yet registered, register it and reply to the TaskManager actor with AcknowledgeRegistration:

taskManager ! decorateMessage(
            AcknowledgeRegistration(instanceID, blobServer.getPort))
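The two reply paths on the JobManager side (AlreadyRegistered versus AcknowledgeRegistration) amount to an idempotent registry lookup. A minimal sketch, assuming a plain in-memory map in place of Flink's InstanceManager; RegistrationRegistrySketch, Kind and Reply are illustrative names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative registry; stands in for the InstanceManager lookup, not Flink code.
final class RegistrationRegistrySketch {

    enum Kind { ACKNOWLEDGED, ALREADY_REGISTERED }

    static final class Reply {
        final Kind kind;
        final UUID instanceId;

        Reply(Kind kind, UUID instanceId) {
            this.kind = kind;
            this.instanceId = instanceId;
        }
    }

    private final Map<String, UUID> instancesByResourceId = new HashMap<>();

    /** Registers a resource once; repeated requests get the existing instance ID back. */
    Reply register(String resourceId) {
        UUID existing = instancesByResourceId.get(resourceId);
        if (existing != null) {
            return new Reply(Kind.ALREADY_REGISTERED, existing);
        }
        UUID fresh = UUID.randomUUID();
        instancesByResourceId.put(resourceId, fresh);
        return new Reply(Kind.ACKNOWLEDGED, fresh);
    }

    public static void main(String[] args) {
        RegistrationRegistrySketch registry = new RegistrationRegistrySketch();
        System.out.println(registry.register("tm-1").kind);
        System.out.println(registry.register("tm-1").kind);
    }
}
```

Returning the same instance ID on a repeated request matters because the TaskManager retries registration on a timer, so the JobManager may see the same request more than once.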

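5.3.5.3.3.1. After receiving the reply, the TaskManager actor mainly does the following:
1. Starts the BLOB cache
2. Watches the JobManager so that it learns promptly if the JobManager dies
3. Starts the heartbeat mechanism between the TaskManager and the JobManager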

5.3.5.3.4. Watch the newly registered TaskManager actor so that the JobManager learns promptly if the TaskManager dies:

context.watch(taskManager)

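5.3.5.4. Schedule a one-off message that triggers another registration attempt once the current timeout elapses, in case the registration was lost (for example, to a network problem). Each attempt doubles the timeout, capped at the maximum registration pause, and the cycle repeats until registration succeeds or the registration deadline passes: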

            val nextTimeout = (timeout * 2).min(new FiniteDuration(
              config.getMaxRegistrationPause().toMilliseconds,
              TimeUnit.MILLISECONDS))

            // schedule a check to trigger a new registration attempt if not registered
            // by the timeout
            scheduledTaskManagerRegistration = Option(context.system.scheduler.scheduleOnce(
              timeout,
              self,
              decorateMessage(TriggerTaskManagerRegistration(
                jobManagerURL,
                nextTimeout,
                deadline,
                attempt + 1,
                registrationRun)
              ))(context.dispatcher))
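The retry schedule is a capped exponential backoff bounded by a deadline. The sketch below computes only the sequence of pauses; it is a simplification that treats the deadline as a budget on cumulative waiting time, whereas the actor checks the deadline when each message is handled. RegistrationBackoffSketch and the durations in main are illustrative values, not Flink defaults.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Illustrative computation of the retry pauses; not Flink code.
final class RegistrationBackoffSketch {

    /** Doubles the current timeout, capped at maxPause (mirrors (timeout * 2).min(max)). */
    static Duration nextTimeout(Duration current, Duration maxPause) {
        Duration doubled = current.multipliedBy(2);
        return doubled.compareTo(maxPause) > 0 ? maxPause : doubled;
    }

    /** Lists the pauses that fit before the cumulative waiting time exceeds maxWait. */
    static List<Duration> pausesUntil(Duration initial, Duration maxPause, Duration maxWait) {
        if (initial.isZero() || initial.isNegative()) {
            throw new IllegalArgumentException("initial pause must be positive");
        }
        List<Duration> pauses = new ArrayList<>();
        Duration elapsed = Duration.ZERO;
        Duration pause = initial;
        while (elapsed.plus(pause).compareTo(maxWait) <= 0) {
            pauses.add(pause);
            elapsed = elapsed.plus(pause);
            pause = nextTimeout(pause, maxPause);
        }
        return pauses;
    }

    public static void main(String[] args) {
        // Example values only: 500 ms initial pause, 4 s cap, 20 s total budget.
        List<Duration> pauses = pausesUntil(
                Duration.ofMillis(500), Duration.ofSeconds(4), Duration.ofSeconds(20));
        pauses.forEach(p -> System.out.println(p.toMillis() + " ms"));
    }
}
```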

5.4. Start a ProcessReaper actor that watches the TaskManager actor and kills the JVM process if the TaskManager actor dies:

taskManagerSystem.actorOf(
        Props(classOf[ProcessReaper], taskManager, LOG.logger, RUNTIME_FAILURE_RETURN_CODE),
        "TaskManager_Process_Reaper")
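The reaper idea (watch one component, and end the process with a fixed exit code when it terminates) can be sketched without Akka. This is an analogy, not the ProcessReaper implementation: a CompletableFuture stands in for actor death watch, and the exit action is injected so the example does not really call System.exit. The exit code 2 is an example value.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;

// Illustrative watchdog; a CompletableFuture stands in for Akka death watch.
final class ProcessReaperSketch {

    /** Runs exitHandler with exitCode once the watched component terminates, normally or not. */
    static void watch(CompletableFuture<?> termination, int exitCode, IntConsumer exitHandler) {
        termination.whenComplete((ignored, error) -> exitHandler.accept(exitCode));
    }

    public static void main(String[] args) {
        CompletableFuture<Void> taskManagerTerminated = new CompletableFuture<>();
        // A real reaper would call System.exit(code) here; printing keeps the demo harmless.
        watch(taskManagerTerminated, 2, code -> System.out.println("Would exit with code " + code));
        taskManagerTerminated.complete(null); // simulate the TaskManager actor dying
    }
}
```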


Reposted from blog.csdn.net/a376554764/article/details/84143432