Flink job scheduling process analysis

1. Overview

When a user submits a job to a Flink cluster, from the user's perspective all that matters is that the processing logic is correct and the job produces correct results; the user does not care when the job is scheduled, how resources are requested and allocated, or when the job will end. However, understanding the concrete runtime behavior of a job is a great help for deeply understanding Flink's internals, and it guides us toward writing more reasonable job logic. This article therefore analyzes in detail the scheduling and resource allocation of a job, as well as its life cycle.

2. Process Analysis

The analysis is based on the community master branch (1.11-SNAPSHOT), commit 12f7873db54cfbc5bf853d66ccd4093f9b749c9a, with HA implemented on ZooKeeper.

Figure: Flink job submission flow
The figure summarizes the basic end-to-end process of submitting a job from the Client to a Flink cluster [1].

When the user runs the ./flink run script, the job is submitted to the Dispatcher. The Dispatcher pulls up a JobManagerRunner, and the JobManagerRunner then registers with ZooKeeper to compete for leadership. Readers interested in the steps before this point can refer to "In-depth understanding of Flink-On-Yarn mode".

When the JobManagerRunner wins the leader election, JobManagerRunnerImpl#grantLeadership is called and processing of the job begins; the JobMaster is started through the code call path below.

  • JobManagerRunnerImpl#grantLeadership
  • JobManagerRunnerImpl#verifyJobSchedulingStatusAndStartJobManager
  • JobManagerRunnerImpl#startJobMaster
    The startJobMaster method first writes the job's ID to the corresponding ZooKeeper job directory and sets its state to RUNNING. This directory serves two purposes: when the Dispatcher receives a job, it uses it to decide whether the job is a duplicate submission, and during job recovery the job's state is pulled from ZooKeeper to decide whether scheduling is needed at all (a job in DONE state is not scheduled). When the JobMaster starts, it first starts its RPC Endpoint, which is used for RPC calls with the other components; the JobMaster then begins executing the job via JobMaster#startJobExecution. Before execution, some pre-checks are performed (for example, ensuring the code runs on the main thread); some JobMaster services (components) are started, such as the heartbeats with the TaskManagers and the ResourceManager; the SlotPool and Scheduler are started; the JobMaster reconnects to the ResourceManager; and a Leader Retriever is registered in ZooKeeper to monitor changes of the ResourceManager leader; and so on.
    Once the JobMaster's various services (components) have finished initializing, scheduling starts, following the code call path below (a simplified sketch of this startup flow follows the list):
  • JobMaster#start
  • JobMaster#startJobExecution
  • JobMaster#resetAndStartScheduler
  • JobMaster#startScheduling
  • SchedulerBase#startScheduling
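
To make the startup flow above concrete, here is a minimal, hypothetical sketch in Java: a runner wins the leader election, checks the job's scheduling status in a registry (ZooKeeper in the real system), and only then marks the job RUNNING and starts the JobMaster. All class and method names are illustrative stand-ins that merely echo the Flink call path; this is not Flink's actual API.

```java
import java.util.concurrent.CompletableFuture;

public class LeadershipStartupSketch {

    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /** Stand-in for the ZooKeeper-backed running-jobs registry (hypothetical). */
    static class RunningJobsRegistry {
        private JobSchedulingStatus status = JobSchedulingStatus.PENDING;
        JobSchedulingStatus getJobSchedulingStatus(String jobId) { return status; }
        void setJobRunning(String jobId) { status = JobSchedulingStatus.RUNNING; }
    }

    /** Stand-in for JobManagerRunnerImpl (hypothetical). */
    static class JobManagerRunner {
        private final RunningJobsRegistry registry;

        JobManagerRunner(RunningJobsRegistry registry) { this.registry = registry; }

        // Called once this runner wins the ZooKeeper leader election.
        CompletableFuture<Void> grantLeadership(String jobId) {
            return verifyJobSchedulingStatusAndStartJobManager(jobId);
        }

        private CompletableFuture<Void> verifyJobSchedulingStatusAndStartJobManager(String jobId) {
            // A job already marked DONE in the registry is not scheduled again.
            if (registry.getJobSchedulingStatus(jobId) == JobSchedulingStatus.DONE) {
                System.out.println("Job " + jobId + " already done, skipping scheduling.");
                return CompletableFuture.completedFuture(null);
            }
            return startJobMaster(jobId);
        }

        private CompletableFuture<Void> startJobMaster(String jobId) {
            // 1. Record the job as RUNNING (written to ZK in the real system).
            registry.setJobRunning(jobId);
            // 2. Start the RPC endpoint and services (heartbeats, SlotPool, Scheduler),
            //    then kick off scheduling -- all collapsed into one async step here.
            return CompletableFuture.runAsync(
                    () -> System.out.println("JobMaster started, scheduling job " + jobId));
        }
    }

    public static void main(String[] args) {
        new JobManagerRunner(new RunningJobsRegistry()).grantLeadership("job-1").join();
    }
}
```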

We know that the user's job is submitted to the Dispatcher in the form of a JobGraph, but during actual scheduling the JobGraph is converted into an ExecutionGraph. This JobGraph-to-ExecutionGraph conversion is done when the SchedulerBase object is initialized. The figure below shows a typical conversion (the correspondence between JobVertex and ExecutionJobVertex); for the specific conversion logic, refer to "How the ExecutionGraph is generated and physically executed".

Figure: JobGraph -> ExecutionGraph
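
As a rough illustration of this conversion, the toy sketch below expands each JobVertex into parallelism-many ExecutionVertices (subtasks), which are the units that actually get scheduled and deployed. The types are invented for illustration and deliberately ignore edges, intermediate results, and everything else the real ExecutionGraph tracks.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of expanding a JobGraph into the vertices of an ExecutionGraph. */
public class GraphExpansionSketch {

    record JobVertex(String name, int parallelism) {}
    record ExecutionVertex(String name, int subtaskIndex) {}

    // Each JobVertex becomes one ExecutionJobVertex, which owns `parallelism`
    // ExecutionVertices (subtasks) -- the units that are scheduled and deployed.
    static List<ExecutionVertex> expand(List<JobVertex> jobGraph) {
        List<ExecutionVertex> executionVertices = new ArrayList<>();
        for (JobVertex v : jobGraph) {
            for (int i = 0; i < v.parallelism(); i++) {
                executionVertices.add(new ExecutionVertex(v.name(), i));
            }
        }
        return executionVertices;
    }

    public static void main(String[] args) {
        List<JobVertex> jobGraph = List.of(
                new JobVertex("Source", 2), new JobVertex("Map", 2), new JobVertex("Sink", 1));
        expand(jobGraph).forEach(System.out::println);
    }
}
```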

After the ExecutionGraph has been generated during SchedulerBase initialization, scheduling proceeds based on that ExecutionGraph. The default implementation of the scheduling base class SchedulerBase is DefaultScheduler, which continues scheduling the job in DefaultScheduler#startSchedulingInternal. There the job's state (the ExecutionGraph's state) is changed from CREATED to RUNNING; at this point the job shown in the Flink web UI is already in RUNNING state. Note, however, that at this point none of the job's vertices has actually been scheduled to start: each vertex is still in CREATED state. The job state and the vertex states are not fully coupled; for the evolution of their respective life cycles, see Flink job scheduling [2]. Scheduling then proceeds according to one of two strategies: EagerSchedulingStrategy (mainly for streaming jobs; all vertices (ExecutionVertex) start scheduling simultaneously) and LazyFromSourcesSchedulingStrategy (mainly for batch jobs; scheduling starts from the Sources, and the other vertices are scheduled lazily).
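
The difference between the two strategies can be caricatured in a few lines: eager scheduling selects every vertex at once, while lazy-from-sources initially selects only the source vertices. This is a conceptual sketch with made-up types, not Flink's real SchedulingStrategy interface.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Toy contrast of the two scheduling strategies described above. */
public class SchedulingStrategySketch {

    record Vertex(String name, boolean isSource) {}

    interface SchedulingStrategy {
        List<Vertex> verticesToScheduleFirst(List<Vertex> all);
    }

    // Streaming: every vertex starts scheduling at once.
    static class Eager implements SchedulingStrategy {
        public List<Vertex> verticesToScheduleFirst(List<Vertex> all) { return all; }
    }

    // Batch: only sources start; downstream vertices are scheduled later,
    // as their inputs become available.
    static class LazyFromSources implements SchedulingStrategy {
        public List<Vertex> verticesToScheduleFirst(List<Vertex> all) {
            return all.stream().filter(Vertex::isSource).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        List<Vertex> vertices = List.of(
                new Vertex("Source", true), new Vertex("Map", false), new Vertex("Sink", false));
        System.out.println("Eager: " + new Eager().verticesToScheduleFirst(vertices));
        System.out.println("Lazy:  " + new LazyFromSources().verticesToScheduleFirst(vertices));
    }
}
```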

When a streaming job is submitted (i.e., with EagerSchedulingStrategy), the code call path is as follows:

  • EagerSchedulingStrategy#startScheduling
  • EagerSchedulingStrategy#allocateSlotsAndDeploy: before deployment, an ExecutionVertexDeploymentOption is generated for each ExecutionVertex to be deployed, and then DefaultScheduler#allocateSlotsAndDeploy is called to start the deployment. Again, some pre-checks are required before deployment (the state of the Execution corresponding to each ExecutionVertex must be CREATED); the Execution state of each ExecutionVertex to be deployed is then changed to SCHEDULED, and slot allocation for the ExecutionVertices begins, with the following code call path:
  • DefaultScheduler#allocateSlots (this step converts each ExecutionVertex into ExecutionVertexSchedulingRequirements, which encapsulate location information, sharing information, resource information, and so on)
  • DefaultExecutionSlotAllocator#allocateSlotsFor: this method asynchronously allocates a slot for each ExecutionVertex one by one, according to different slot-provisioning strategies. The request is forwarded layer by layer through the call path SlotProviderStrategy#allocateSlot -> SlotProvider#allocateSlot (the default SlotProvider implementation is SchedulerImpl) -> SchedulerImpl#allocateSlotInternal -> SchedulerImpl#internalAllocateSlot (this method allocates a singleSlot/SharedSlot depending on whether the vertex shares its slot). We take singleSlot as the example.
    When allocating a slot, allocation is first attempted from the SlotPool inside the JobMaster: all slots are fetched from the SlotPool, and the most suitable one is chosen. Two selection strategies exist here: prefer slots by location, and prefer slots that were allocated previously. If no slot can be allocated from the SlotPool, a slot is requested from the ResourceManager via RPC; if the ResourceManager is not connected at that moment, the request is cached and issued once the connection is established. A simplified sketch of this selection logic follows this list.
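
The selection logic described in the last item can be sketched as follows: try the pool with a locality preference first, fall back to a previously allocated slot, and only when both fail would the real system fire an RPC to the ResourceManager. This is a minimal, hypothetical sketch; Flink's actual slot-selection strategies are considerably richer.

```java
import java.util.List;
import java.util.Optional;

/** Toy version of choosing the "best" slot from the pool before asking the RM. */
public class SlotSelectionSketch {

    record Slot(String id, String host) {}

    // Strategy 1: prefer a free slot on the preferred host (locality).
    static Optional<Slot> byLocation(List<Slot> free, String preferredHost) {
        return free.stream().filter(s -> s.host().equals(preferredHost)).findFirst();
    }

    // Strategy 2: prefer the slot that was allocated previously (e.g., on recovery).
    static Optional<Slot> byPreviousAllocation(List<Slot> free, String previousSlotId) {
        return free.stream().filter(s -> s.id().equals(previousSlotId)).findFirst();
    }

    public static void main(String[] args) {
        List<Slot> pool = List.of(new Slot("s1", "hostA"), new Slot("s2", "hostB"));
        // Try the pool first; an empty result would trigger an RPC to the ResourceManager.
        Slot chosen = byLocation(pool, "hostB")
                .or(() -> byPreviousAllocation(pool, "s1"))
                .orElseThrow(() -> new IllegalStateException("would request a slot from the RM"));
        System.out.println("Allocated " + chosen);
    }
}
```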

When the ResourceManager receives a slot request, it throws an exception directly if it finds the requesting JobManager is not registered; otherwise it forwards the request to the SlotManager. The SlotManager maintains all of the cluster's free slots (each TaskManager reports its information to the ResourceManager, where the SlotManager keeps the mapping between Slots and TaskManagers), finds a slot that meets the requirements, and then sends an RPC request to the corresponding TaskManager to claim that slot.
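
A toy version of this SlotManager bookkeeping might look like the following: TaskManagers register their free slots, and a slot request either fails fast for an unregistered JobManager or picks a free slot that, in the real system, would then be claimed from the owning TaskManager via RPC. Names and structure are simplified assumptions, not Flink's actual SlotManager.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy SlotManager: tracks free slots per TaskManager and matches requests. */
public class SlotManagerSketch {

    // slotId -> TaskManager address, for all currently free slots.
    private final Map<String, String> freeSlots = new HashMap<>();

    // TaskManagers report their slots to the ResourceManager on registration.
    void registerTaskManagerSlot(String slotId, String taskManagerAddress) {
        freeSlots.put(slotId, taskManagerAddress);
    }

    // A JobManager asks for a slot; unregistered JobManagers are rejected.
    String requestSlot(String jobManagerId, boolean jobManagerRegistered) {
        if (!jobManagerRegistered) {
            throw new IllegalStateException("JobManager " + jobManagerId + " is not registered");
        }
        var it = freeSlots.entrySet().iterator();
        if (!it.hasNext()) {
            return null; // no free slot available
        }
        var entry = it.next();
        it.remove();
        // In the real flow, this is an RPC to the TaskManager to claim the slot.
        System.out.println("Claiming slot " + entry.getKey() + " from TM " + entry.getValue());
        return entry.getKey();
    }

    public static void main(String[] args) {
        SlotManagerSketch sm = new SlotManagerSketch();
        sm.registerTaskManagerSlot("slot-0", "tm-1:6122");
        sm.requestSlot("jm-1", true);
    }
}
```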

After all slot requests have completed, the Execution corresponding to each ExecutionVertex is assigned to its Slot, i.e., the corresponding resources in the Slot are allocated to the Execution. Once allocation is complete, the job can be deployed.
The deployment code call path is as follows (a simplified deployment sketch follows the list):

  • DefaultScheduler#waitForAllSlotsAndDeploy
  • DefaultScheduler#deployAll
  • DefaultScheduler#deployOrHandleError
  • DefaultScheduler#deployTaskSafe
  • DefaultExecutionVertexOperations#deploy
  • ExecutionVertex#deploy
  • Execution#deploy (every time an ExecutionVertex is scheduled there is an Execution; at this stage the Execution's state is changed to DEPLOYING, the deployment descriptor for the ExecutionVertex is generated, and the TaskManagerGateway is obtained from the assigned slot so that the Task can be submitted to the corresponding TaskManager)
  • RpcTaskManagerGateway#submitTask (at this point the Task has been submitted to the TaskManager via RPC).
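
The deployment step in Execution#deploy can be compressed into a sketch: flip the Execution's state to DEPLOYING, build a deployment descriptor, and hand it to the slot's TaskManagerGateway. The types below are simplified stand-ins for Flink's TaskDeploymentDescriptor and gateway interfaces, not the real ones.

```java
import java.util.concurrent.CompletableFuture;

/** Toy deployment step: state change, descriptor, RPC submit via the slot's gateway. */
public class DeploySketch {

    enum ExecutionState { CREATED, SCHEDULED, DEPLOYING, RUNNING }

    record TaskDeploymentDescriptor(String vertexName, int subtaskIndex) {}

    /** Stand-in for the RPC gateway to the TaskManager that owns the slot. */
    interface TaskManagerGateway {
        CompletableFuture<Void> submitTask(TaskDeploymentDescriptor tdd);
    }

    static class Execution {
        ExecutionState state = ExecutionState.SCHEDULED;

        CompletableFuture<Void> deploy(String vertexName, int subtaskIndex,
                                       TaskManagerGateway gateway) {
            state = ExecutionState.DEPLOYING;                           // state change first
            TaskDeploymentDescriptor tdd =
                    new TaskDeploymentDescriptor(vertexName, subtaskIndex); // then the descriptor
            return gateway.submitTask(tdd);                             // then the RPC to the TM
        }
    }

    public static void main(String[] args) {
        TaskManagerGateway gateway = tdd -> {
            System.out.println("TM received " + tdd);
            return CompletableFuture.completedFuture(null);
        };
        new Execution().deploy("Map", 0, gateway).join();
    }
}
```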

After the TaskManager (TaskExecutor) receives the Task submission request, it performs some initialization (such as pulling files from the BlobServer, deserializing the job and Task information, the LibraryCacheManager, and so on); this initialized information is then used to build the Task (a Runnable object), and the Task is started. The call path is Task#startTaskThread (start the Task thread) -> Task#run (change the ExecutionVertex's state to RUNNING, at which point the vertex state shown in the Flink web UI changes to RUNNING, and create an AbstractInvokable object, which is the key hook through which Flink executes user code), followed by the calls below:

  • AbstractInvokable#invoke (AbstractInvokable has several key subclass implementations: BatchTask/BoundedStreamTask/DataSinkTask/DataSourceTask/StreamTask/SourceStreamTask. For a streaming-type Source, StreamTask#invoke is called)
  • StreamTask#invoke
  • StreamTask#beforeInvoke
  • StreamTask#initializeStateAndOpen (initialize state and perform initialization; the user's open method is called here (e.g., a custom source implementation)) -> StreamTask#runMailboxLoop, which then begins processing the data consumed from the Source and feeds it into the downstream operators for processing (a simplified model of the mailbox loop follows).
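
The mailbox loop at the end of this chain can be approximated as follows: drain any pending control "mails" (e.g., checkpoint triggers or timers), then run a default action such as processing one record from the source. This is a heavily simplified, hypothetical model, not Flink's actual MailboxProcessor.

```java
import java.util.concurrent.LinkedBlockingQueue;

/** Toy mailbox loop: run pending control "mails" first, then a default action. */
public class MailboxLoopSketch {

    private final LinkedBlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private volatile boolean running = true;

    void runMailboxLoop(Runnable defaultAction) {
        while (running) {
            Runnable mail;
            while ((mail = mailbox.poll()) != null) {
                mail.run();          // e.g., checkpoint triggers, timers
            }
            if (!running) break;     // a mail may have requested shutdown
            defaultAction.run();     // e.g., process one record from the source
        }
    }

    public static void main(String[] args) {
        MailboxLoopSketch loop = new MailboxLoopSketch();
        loop.mailbox.add(() -> System.out.println("mail: trigger checkpoint"));
        final int[] processed = {0};
        loop.runMailboxLoop(() -> {
            System.out.println("default action: process record " + processed[0]);
            if (++processed[0] == 3) loop.running = false; // stop the demo after 3 records
        });
    }
}
```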

At this point, the overall flow from job submission through resource allocation to scheduling and execution has been analyzed. A streaming job, under normal circumstances, keeps running and does not terminate.

3. Summary

After a job is run, it is submitted to the Dispatcher, and the Dispatcher pulls up a JobManagerRunner. When the JobManagerRunner becomes Leader, it starts processing the job: the job first enters RUNNING state, the corresponding ExecutionGraph is generated from the JobGraph, and scheduling then begins. Next, a slot is requested for each ExecutionVertex, which involves communication among the JM, RM, and TM. Once slot allocation on the TM is complete, the Tasks can be submitted to the TaskManager, which then spawns a separate thread for each submitted Task to run it.

References

  1. https://www.infoq.cn/article/RWTM9o0SHHV3Xr8o8giT
  2. https://flink.sojb.cn/internals/job_scheduling.html
