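Summary:
1. The user runs the submit script; org.apache.flink.client.cli.CliFrontend invokes the main method of the user code via reflection
2. The execute method first generates the StreamGraph
3. The StreamGraph is converted into a JobGraph
4. The application is submitted to YARN's ResourceManager
5. The YARN ResourceManager starts the ApplicationMaster on one of its NodeManagers
5.1 The ApplicationMaster starts the Dispatcher
5.2 The ApplicationMaster starts the Flink ResourceManager
5.3 The Dispatcher starts the JobMaster
5.4 The JobMaster creates the SlotPool and generates the ExecutionGraph
5.5 The JobMaster's SlotPool registers a slot request with the SlotManager of the Flink ResourceManager
5.6 The SlotManager requests a new worker from the YARN ResourceManager
6. The YARN ResourceManager starts a TaskManager (YarnTaskExecutorRunner) on a NodeManager
7. YarnTaskExecutorRunner starts the TaskExecutor, which holds the slots
8. The TaskExecutor registers its slots with the SlotManager in the Flink ResourceManager
9. The SlotManager allocates a slot on the TaskExecutor (it tells the TaskManager which JobMaster the slot is assigned to)
10. The TaskExecutor offers the slot to the JobMaster
11. The JobMaster submits tasks to the TaskExecutor for execution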
1. The bin/flink script enters org.apache.flink.client.cli.CliFrontend
CliFrontend
main -> cli.parseAndRun -> run -> executeProgram -> ClientUtils.executeProgram
-> program.invokeInteractiveModeForExecution() -> callMainMethod
-> mainMethod.invoke(null, (Object) args);
The user's program starts executing at this point
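The reflective call at the end of the chain above can be sketched in plain Java. This is a minimal illustration, not Flink's code: ReflectiveMainInvoker and UserJob are hypothetical names, and only the lookup-and-invoke step is modeled.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Hypothetical launcher that mimics the callMainMethod step; not Flink code. */
class ReflectiveMainInvoker {

    /** Looks up public static void main(String[]) on entryClass and invokes it. */
    static void callMainMethod(Class<?> entryClass, String[] args) {
        try {
            Method mainMethod = entryClass.getMethod("main", String[].class);
            if (!Modifier.isStatic(mainMethod.getModifiers())) {
                throw new IllegalStateException("main must be static");
            }
            // The (Object) cast passes the array as ONE argument instead of
            // letting varargs spread its elements.
            mainMethod.invoke(null, (Object) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not invoke main of " + entryClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        callMainMethod(UserJob.class, new String[] {"--input", "a.txt"});
        System.out.println("user main received: " + UserJob.lastArgs);
    }
}

/** Stand-in for the user's job class (hypothetical). */
class UserJob {
    static String lastArgs;

    public static void main(String[] args) {
        // A real job would build its pipeline here and call env.execute().
        lastArgs = String.join(",", args);
    }
}
```

The first argument to invoke is null because main is static, which matches the mainMethod.invoke(null, (Object) args) call in the trace.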
When execution reaches env.execute(), control enters StreamExecutionEnvironment.execute()
StreamExecutionEnvironment
execute -> execute(getStreamGraph(jobName))-> executeAsync
CompletableFuture<JobClient> jobClientFuture = executorFactory
.getExecutor(configuration)
.execute(streamGraph, configuration, userClassloader);
AbstractJobClusterExecutor
public CompletableFuture<JobClient> execute(@Nonnull final Pipeline pipeline, @Nonnull final Configuration configuration, @Nonnull final ClassLoader userCodeClassloader) throws Exception {
/*TODO Convert the StreamGraph into a JobGraph*/
final JobGraph jobGraph = PipelineExecutorUtils.getJobGraph(pipeline, configuration);
/*TODO Cluster descriptor: creates and starts the YarnClient; holds YARN and Flink configuration and environment information*/
try (final ClusterDescriptor<ClusterID> clusterDescriptor = clusterClientFactory.createClusterDescriptor(configuration)) {
final ExecutionConfigAccessor configAccessor = ExecutionConfigAccessor.fromConfiguration(configuration);
/*TODO Cluster-specific resource configuration: JobManager memory, TaskManager memory, number of slots per TaskManager*/
final ClusterSpecification clusterSpecification = clusterClientFactory.getClusterSpecification(configuration);
// TODO deployJobCluster
final ClusterClientProvider<ClusterID> clusterClientProvider = clusterDescriptor
.deployJobCluster(clusterSpecification, jobGraph, configAccessor.getDetachedMode());
LOG.info("Job has been submitted with JobID " + jobGraph.getJobID());
return CompletableFuture.completedFuture(
new ClusterClientJobClientAdapter<>(clusterClientProvider, jobGraph.getJobID(), userCodeClassloader));
}
}
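The try-with-resources block in the method above means the descriptor, and the client it owns, is closed as soon as deployment returns. A minimal sketch of that lifecycle, assuming a hypothetical FakeClusterDescriptor that only records events (it is not Flink's ClusterDescriptor):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the descriptor lifecycle; not Flink code. */
class DescriptorLifecycleSketch {

    /** Records lifecycle events so their order can be inspected. */
    static final List<String> events = new ArrayList<>();

    /** Stand-in for a cluster descriptor that owns a client. */
    static class FakeClusterDescriptor implements AutoCloseable {
        FakeClusterDescriptor() {
            events.add("client started");
        }

        String deployJobCluster(String jobName) {
            events.add("deployed " + jobName);
            return "client-for-" + jobName;
        }

        @Override
        public void close() {
            events.add("client stopped");
        }
    }

    static String submit(String jobName) {
        try (FakeClusterDescriptor descriptor = new FakeClusterDescriptor()) {
            return descriptor.deployJobCluster(jobName);
        } // close() runs here, even if deployJobCluster throws
    }

    public static void main(String[] args) {
        System.out.println(submit("wordcount"));
        System.out.println(events);
    }
}
```

The value handed back to the caller outlives the descriptor, which is why the real method returns a provider for cluster clients rather than the descriptor itself.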
---> clusterClientFactory.createClusterDescriptor -> YarnClusterClientFactory.createClusterDescriptor -> getClusterDescriptor
private YarnClusterDescriptor getClusterDescriptor(Configuration configuration) {
/*TODO Create the YarnClient*/
final YarnClient yarnClient = YarnClient.createYarnClient();
final YarnConfiguration yarnConfiguration = new YarnConfiguration();
/*TODO Initialize and start the YarnClient*/
yarnClient.init(yarnConfiguration);
yarnClient.start();
return new YarnClusterDescriptor(
configuration,
yarnConfiguration,
yarnClient,
YarnClientYarnClusterInformationRetriever.create(yarnClient),
false);
}
---> deployJobCluster-> YarnClusterDescriptor.deployJobCluster
-> deployInternal
-> startAppMaster (starts the AM); note the difference between the AppMaster (ApplicationMaster) and the JobMaster
-> setupApplicationMasterContainer (builds the command that launches YarnJobClusterEntrypoint:
$JAVA_HOME/bin/java $jvmHeapMem $javaOpts $logging YarnJobClusterEntrypoint)
-> yarnClient.submitApplication(appContext) submits the application to YARN
-> YarnJobClusterEntrypoint.main
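The launch command built in setupApplicationMasterContainer has the shape shown above. A rough sketch of assembling such a command follows; the class, method, and parameter names are illustrative assumptions, and Flink's actual assembly logic is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of assembling a container start command; not Flink code. */
class AmStartCommandSketch {

    /** Joins the non-empty parts with single spaces. All parameter names are illustrative. */
    static String buildCommand(String jvmHeapMem, String javaOpts, String logging, String entrypointClass) {
        List<String> parts = new ArrayList<>();
        parts.add("$JAVA_HOME/bin/java");
        for (String part : new String[] {jvmHeapMem, javaOpts, logging, entrypointClass}) {
            // Skip options that were not configured so no double spaces appear.
            if (part != null && !part.trim().isEmpty()) {
                parts.add(part.trim());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("-Xmx1g", "", "-Dlog.file=jm.log", "YarnJobClusterEntrypoint"));
    }
}
```

YARN runs this command inside the ApplicationMaster container, which is how control reaches YarnJobClusterEntrypoint.main in the next section.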
2. Start YarnJobClusterEntrypoint (analogous to StandaloneSessionClusterEntrypoint; both are cluster entrypoints)
-> YarnJobClusterEntrypoint.main
ClusterEntrypoint.runClusterEntrypoint(yarnJobClusterEntrypoint);
-> clusterEntrypoint.startCluster();
-> runCluster
/*TODO Create and start the components inside the JobManager: Dispatcher, ResourceManager, JobMaster*/
clusterComponent = dispatcherResourceManagerComponentFactory.create(
/*TODO Create the ResourceManager: the YARN-mode ResourceManager*/
// ActiveResourceManagerFactory -> ResourceManager
-> resourceManager = resourceManagerFactory.createResourceManager(
2.1 Create the ResourceManager
The SlotManager is created inside it
resourceManager = resourceManagerFactory.createResourceManager(
-> ActiveResourceManagerFactory.createResourceManager -> super.createResourceManager
-> ResourceManagerFactory.createResourceManagerRuntimeServices
-> ResourceManagerRuntimeServices.fromConfiguration
-> ResourceManagerRuntimeServices.createSlotManager (creates the SlotManager)
-> ResourceManagerFactory.createResourceManager -> ActiveResourceManagerFactory.createResourceManager
-> new ActiveResourceManager -> super -> public ResourceManager(); the ResourceManager is now created
2.2 Start the ResourceManager
After the ResourceManager constructor completes, its onStart method runs and starts the SlotManager
ResourceManager.onStart -> startResourceManagerServices
private void startResourceManagerServices() throws Exception {
try {
leaderElectionService = highAvailabilityServices.getResourceManagerLeaderElectionService();
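/*TODO Create the clients for YARN's ResourceManager and NodeManager, then initialize and start the ActiveResourceManager*/
initialize();
/*TODO Start the ResourceManager through the election service -> on a successful election, grantLeadership is called*/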
leaderElectionService.start(this);
jobLeaderIdService.start(new JobLeaderIdActionsImpl());
registerTaskExecutorMetrics();
} catch (Exception e) {
handleStartResourceManagerServicesException(e);
}
}
(1).initialize -> ActiveResourceManager.initialize-> AbstractResourceManagerDriver.initialize
-> YarnResourceManagerDriver.initializeInternal
protected void initializeInternal() throws Exception {
final YarnContainerEventHandler yarnContainerEventHandler = new YarnContainerEventHandler();
try {
/*TODO Create the client for YARN's ResourceManager, then initialize and start it*/
resourceManagerClient = yarnResourceManagerClientFactory.createResourceManagerClient(
yarnHeartbeatIntervalMillis,
yarnContainerEventHandler);
resourceManagerClient.init(yarnConfig);
resourceManagerClient.start();
// TODO Register the ApplicationMaster with YARN's ResourceManager
final RegisterApplicationMasterResponse registerApplicationMasterResponse = registerApplicationMaster();
getContainersFromPreviousAttempts(registerApplicationMasterResponse);
taskExecutorProcessSpecContainerResourcePriorityAdapter =
new TaskExecutorProcessSpecContainerResourcePriorityAdapter(
registerApplicationMasterResponse.getMaximumResourceCapability(),
ExternalResourceUtils.getExternalResources(flinkConfig, YarnConfigOptions.EXTERNAL_RESOURCE_YARN_CONFIG_KEY_SUFFIX));
} catch (Exception e) {
throw new ResourceManagerException("Could not start resource manager client.", e);
}
/*TODO Create the client for YARN's NodeManager, then initialize and start it*/
nodeManagerClient = yarnNodeManagerClientFactory.createNodeManagerClient(yarnContainerEventHandler);
nodeManagerClient.init(yarnConfig);
nodeManagerClient.start();
}
(2).leaderElectionService.start(this) -> grantLeadership -> tryAcceptLeadership -> startServicesOnLeadership
-> slotManager.start (starts the SlotManager)
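The start sequence above follows a callback pattern: a component hands itself to the election service and only starts its leader-only services once grantLeadership is called. A minimal sketch of that pattern, assuming a hypothetical election service with a single contender that grants leadership immediately (all class names here are illustrative, not Flink's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch of the leader-election callback pattern; not Flink code. */
class LeaderElectionSketch {

    interface LeaderContender {
        void grantLeadership(UUID leaderSessionId);
    }

    /** Election service with no competitors: leadership is granted on start. */
    static class SingleContenderElectionService {
        void start(LeaderContender contender) {
            contender.grantLeadership(UUID.randomUUID());
        }
    }

    static class FakeResourceManager implements LeaderContender {
        final List<String> log = new ArrayList<>();
        UUID fencingToken;

        @Override
        public void grantLeadership(UUID leaderSessionId) {
            // The session id identifies this leadership term.
            fencingToken = leaderSessionId;
            log.add("grantLeadership");
            startServicesOnLeadership();
        }

        private void startServicesOnLeadership() {
            log.add("slotManager.start");
        }
    }

    static List<String> run() {
        FakeResourceManager rm = new FakeResourceManager();
        rm.log.add("initialize");
        new SingleContenderElectionService().start(rm);
        return rm.log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With a real high-availability backend the callback would arrive asynchronously and possibly never; the ordering of initialize, grantLeadership, and the service start is the part this sketch is meant to show.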
2.3 Create and start the Dispatcher
The Dispatcher creates and starts the JobMaster
dispatcherRunner = dispatcherRunnerFactory.createDispatcherRunner(
-> DefaultDispatcherRunner.create
-> DispatcherRunnerLeaderElectionLifecycleManager.createFor
-> new DispatcherRunnerLeaderElectionLifecycleManager
-> leaderElectionService.start(dispatcherRunner); on a successful election, the grantLeadership method of the corresponding component is called
-> DefaultDispatcherRunner.grantLeadership
-> startNewDispatcherLeaderProcess
-> newDispatcherLeaderProcess::start -> AbstractDispatcherLeaderProcess.start-> startInternal
-> JobDispatcherLeaderProcess.onStart -> DefaultDispatcherGatewayServiceFactory.create()
public AbstractDispatcherLeaderProcess.DispatcherGatewayService create(
DispatcherId fencingToken,
Collection<JobGraph> recoveredJobs,
JobGraphWriter jobGraphWriter) {
final Dispatcher dispatcher;
try {
/*TODO Create the Dispatcher*/
dispatcher = dispatcherFactory.createDispatcher(
rpcService,
fencingToken,
recoveredJobs,
(dispatcherGateway, scheduledExecutor, errorHandler) -> new NoOpDispatcherBootstrap(),
PartialDispatcherServicesWithJobGraphStore.from(partialDispatcherServices, jobGraphWriter));
} catch (Exception e) {
throw new FlinkRuntimeException("Could not create the Dispatcher rpc endpoint.", e);
}
-> Dispatcher dispatcher = dispatcherFactory.createDispatcher -> MiniDispatcher createDispatcher
-> new MiniDispatcher -> super -> public Dispatcher() -> onStart()
/*TODO Start the Dispatcher*/
dispatcher.start();
return DefaultDispatcherGatewayService.from(dispatcher);
}
2.3.1 Create and start the JobMaster
-> Dispatcher.onStart() -> startRecoveredJobs (this is where the JobMaster is started)
-> runRecoveredJob -> runJob -> createJobManagerRunner
CompletableFuture<JobManagerRunner> createJobManagerRunner(JobGraph jobGraph, long initializationTimestamp) {
final RpcService rpcService = getRpcService();
return CompletableFuture.supplyAsync(
() -> {
try {
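/*TODO Create the JobMaster via DefaultJobManagerRunnerFactory*/
/**
* Strictly speaking this should be called the JobMaster.
* The JobManager is the process;
* it starts the ResourceManager and the Dispatcher,
* and the Dispatcher creates and starts the JobMaster.
*/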
JobManagerRunner runner = jobManagerRunnerFactory.createJobManagerRunner(
jobGraph,
configuration,
rpcService,
highAvailabilityServices,
heartbeatServices,
jobManagerSharedServices,
new DefaultJobManagerJobMetricGroupFactory(jobManagerMetricGroup),
fatalErrorHandler,
initializationTimestamp);
/*TODO Start the JobMaster*/
runner.start();
return runner;
} catch (Exception e) {
throw new CompletionException(new JobInitializationException(jobGraph.getJobID(), "Could not instantiate JobManager.", e));
}
},
ioExecutor); // do not use main thread executor. Otherwise, Dispatcher is blocked on JobManager creation
}
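The method above creates the runner on a separate I/O executor and converts any failure into a CompletionException. The same pattern can be sketched with JDK classes only; Runner and the method names below are hypothetical stand-ins, not Flink types.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of asynchronous runner creation; not Flink code. */
class AsyncRunnerCreationSketch {

    /** Stand-in runner whose constructor may fail and whose start() flips a flag. */
    static class Runner {
        final String jobName;
        volatile boolean started;

        Runner(String jobName) {
            if (jobName.isEmpty()) {
                throw new IllegalArgumentException("empty job name");
            }
            this.jobName = jobName;
        }

        void start() {
            started = true;
        }
    }

    static CompletableFuture<Runner> createRunner(String jobName, ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Runner runner = new Runner(jobName);
                runner.start();
                return runner;
            } catch (Exception e) {
                // Wrap the failure so the future completes exceptionally with context.
                throw new CompletionException("Could not instantiate runner for '" + jobName + "'", e);
            }
        }, ioExecutor); // runs off the caller's thread so the caller is not blocked
    }

    /** Returns "started", "not started", or "failed: <cause class>". */
    static String describe(String jobName) {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            Runner runner = createRunner(jobName, ioExecutor).join();
            return runner.started ? "started" : "not started";
        } catch (CompletionException e) {
            return "failed: " + e.getCause().getClass().getSimpleName();
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("wordcount"));
        System.out.println(describe(""));
    }
}
```

Passing an explicit executor is the point of the original comment: potentially slow construction stays off the Dispatcher's main thread.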
-> Create the JobMaster
jobManagerRunnerFactory.createJobManagerRunner -> DefaultJobManagerRunnerFactory.createJobManagerRunner
-> new JobManagerRunnerImpl -> jobMasterFactory.createJobMasterService-> new JobMaster
-> Start the JobMaster
runner.start()-> JobManagerRunnerImpl.start -> grantLeadership
-> verifyJobSchedulingStatusAndStartJobManager -> startJobMaster
-> jobMasterService.start -> JobMaster.startJobExecution
private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {
/*TODO Actually start the JobMaster services*/
startJobMasterServices();
/*TODO Reset and start the scheduler*/
resetAndStartScheduler();
}
------------------------------JobMaster--------------------------------
JobMaster.startJobMasterServices
private void startJobMasterServices() throws Exception {
/*TODO Start the heartbeat services: TaskManager, ResourceManager*/
startHeartbeatServices();
// start the slot pool make sure the slot pool now accepts messages for this leader
/*TODO Start the slot pool*/
slotPool.start(getFencingToken(), getAddress(), getMainThreadExecutor());
//TODO: Remove once the ZooKeeperLeaderRetrieval returns the stored address upon start
// try to reconnect to previously known leader
reconnectToResourceManager(new FlinkException("Starting JobMaster component."));
// job is ready to go, try to establish connection with resource manager
// - activate leader retrieval for the resource manager
// - on notification of the leader, the connection will be established and
// the slot pool will start requesting slots
/**
* TODO Key point: once started, the slot pool begins requesting slots from the slot manager
* TODO Connect to the ResourceManager; the slot pool then requests resources (slots) from the slot manager
* -> StandaloneLeaderRetrievalService
* -> ResourceManagerLeaderListener.notifyLeaderAddress
*/
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
}
2.3.2 The SlotPool starts requesting resources (slots) from the ResourceManager
JobMaster.startJobMasterServices
reconnectToResourceManager
-> tryConnectToResourceManager -> connectToResourceManager -> resourceManagerConnection.start();
RegisteredRpcConnection.start() {
checkState(!closed, "The RPC connection is already closed");
checkState(!isConnected() && pendingRegistration == null, "The RPC connection is already started");
/*TODO Create the registration object*/
final RetryingRegistration<F, G, S> newRegistration = createNewRegistration();
if (REGISTRATION_UPDATER.compareAndSet(this, null, newRegistration)) {
/**
* TODO Start the registration; the matching callback runs once it succeeds:
* when the JobMaster registers with the ResourceManager: JobMaster.ResourceManagerConnection.onRegistrationSuccess()
* when the TaskExecutor registers with the ResourceManager: TaskExecutorToResourceManagerConnection.onRegistrationSuccess
* -> TaskExecutor.ResourceManagerRegistrationListener.onRegistrationSuccess
* Whichever component finishes registering has its own onRegistrationSuccess called
*/
newRegistration.startRegistration();
} else {
// concurrent start operation
newRegistration.cancel();
}
}
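The compareAndSet guard in start() ensures that only one registration attempt becomes the pending one; a losing attempt is cancelled. A minimal sketch of that guard, using hypothetical Connection and Registration classes and omitting the checkState preconditions so that a second call can stand in for a concurrent start:

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

/** Hypothetical sketch of the compare-and-set registration guard; not Flink code. */
class RegistrationGuardSketch {

    static class Registration {
        volatile boolean started;
        volatile boolean cancelled;

        void startRegistration() {
            started = true;
        }

        void cancel() {
            cancelled = true;
        }
    }

    static class Connection {
        private static final AtomicReferenceFieldUpdater<Connection, Registration> REGISTRATION_UPDATER =
                AtomicReferenceFieldUpdater.newUpdater(Connection.class, Registration.class, "pendingRegistration");

        // Must be volatile for the field updater to work.
        private volatile Registration pendingRegistration;

        /** Returns the registration object created by this call. */
        Registration start() {
            Registration newRegistration = new Registration();
            if (REGISTRATION_UPDATER.compareAndSet(this, null, newRegistration)) {
                // This call won: its registration is now the pending one.
                newRegistration.startRegistration();
            } else {
                // Another start got there first; discard this attempt.
                newRegistration.cancel();
            }
            return newRegistration;
        }
    }

    public static void main(String[] args) {
        Connection connection = new Connection();
        Registration first = connection.start();
        Registration second = connection.start();
        System.out.println("first started: " + first.started);
        System.out.println("second cancelled: " + second.cancelled);
    }
}
```

Because the swap is atomic, the outcome is the same whether the two calls happen back to back or race on different threads.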
--> RegisteredRpcConnection.createNewRegistration (creates the registration object) -> generateRegistration
-> JobMaster.ResourceManagerConnection.generateRegistration
-> registers with the ResourceManager through the ResourceManagerGateway: ResourceManager.registerJobManager
--> RegisteredRpcConnection.newRegistration.startRegistration (starts the registration)
-> register -> invokeRegistration
After registration completes, JobMaster.ResourceManagerConnection.onRegistrationSuccess() is called
-> establishResourceManagerConnection ->
/*TODO The slot pool connects to the ResourceManager and requests resources*/
-> slotPool.connectToResourceManager(resourceManagerGateway);
-> SlotPoolImpl.connectToResourceManager -> requestSlotFromResourceManager -> resourceManagerGateway.requestSlot (an RPC call through resourceManagerGateway to the ResourceManager's requestSlot method) -> ResourceManager.requestSlot
-> slotManager.registerSlotRequest(slotRequest) /** The SlotManager inside the Flink ResourceManager requests resources from the YARN ResourceManager */
-> SlotManagerImpl.registerSlotRequest()-> internalRequestSlot -> fulfillPendingSlotRequestWithPendingTaskManagerSlot
-> allocateResource -> resourceActions.allocateResource -> ResourceManager.ResourceActionsImpl.allocateResource
-> ActiveResourceManager.startNewWorker -> requestNewWorker -> resourceManagerDriver.requestResource(
-> YarnResourceManagerDriver.requestResource
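The decision at the heart of this chain is: serve the request from a free slot if one exists, otherwise park the request and ask for a new worker. A simplified sketch under stated assumptions: SlotRequestSketch is a hypothetical class, slots and requests are plain strings, and one worker is requested per unserved request, which is cruder than the real SlotManager's bookkeeping.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the slot request decision; not Flink code. */
class SlotRequestSketch {

    final Deque<String> freeSlots = new ArrayDeque<>();
    final List<String> pendingRequests = new ArrayList<>();
    int workersRequested = 0;

    /** Returns the slot id if served immediately, or null if the request is parked. */
    String registerSlotRequest(String requestId) {
        String slot = freeSlots.poll();
        if (slot != null) {
            return slot;
        }
        // No free slot: remember the request and ask the cluster for a worker.
        pendingRequests.add(requestId);
        workersRequested++; // stands in for the call that reaches requestResource
        return null;
    }

    public static void main(String[] args) {
        SlotRequestSketch slotManager = new SlotRequestSketch();
        slotManager.freeSlots.add("slot-0");
        System.out.println(slotManager.registerSlotRequest("request-1"));
        System.out.println(slotManager.registerSlotRequest("request-2"));
        System.out.println("workers requested: " + slotManager.workersRequested);
    }
}
```

In a fresh per-job cluster no TaskManager exists yet, so the first request always takes the second branch, which is what triggers the container launch described next.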
2.4 TaskManager
The JobMaster requests resources from the ResourceManager; when the ResourceManager has no free slot, it asks YARN for a container in which to start a TaskManager
(1) Entry point: YarnTaskExecutorRunner.main
runTaskManagerSecurely->TaskManagerRunner.runTaskManagerSecurely -> runTaskManager
-> TaskManagerRunner taskManagerRunner = new TaskManagerRunner
->TaskManagerRunner::createTaskExecutorService
-> startTaskManager -> new TaskExecutor -> public TaskExecutor() -> onStart
-> startTaskExecutorServices
(2) Register with the ResourceManager: startTaskExecutorServices
-> resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
-> ResourceManagerLeaderListener.notifyLeaderAddress -> notifyOfNewResourceManagerLeader
-> reconnectToResourceManager->tryConnectToResourceManager -> connectToResourceManager
-> resourceManagerConnection.start();
--> createNewRegistration -> generateRegistration
-> TaskExecutorToResourceManagerConnection.RetryingRegistration.generateRegistration
-> new TaskExecutorToResourceManagerConnection.ResourceManagerRegistration
--> newRegistration.startRegistration (starts the registration);
-> TaskExecutorToResourceManagerConnection.ResourceManagerRegistration.invokeRegistration
-> resourceManager.registerTaskExecutor -> registerTaskExecutorInternal
After registration completes, TaskExecutorToResourceManagerConnection.onRegistrationSuccess is called
-> registrationListener.onRegistrationSuccess(
-> TaskExecutor.ResourceManagerRegistrationListener.onRegistrationSuccess
-> establishResourceManagerConnection
(3) Register the slots with the ResourceManager
-> resourceManagerGateway.sendSlotReport() -> slotManager.registerTaskManager -> SlotManagerImpl.registerSlot
(4) The ResourceManager allocates the slot
->SlotManagerImpl.registerSlot -> allocateSlot -> gateway.requestSlot
(5) The TaskManager provides the slot
->TaskExecutor.requestSlot -> allocateSlot
(6) Connect to the job and offer the slot to the JobMaster
TaskExecutor.requestSlot -> offerSlotsToJobManager(jobId); -> internalOfferSlotsToJobManager
-> jobMasterGateway.offerSlots -> JobMaster.offerSlots-> slotPool.offerSlots -> offerSlot
-> tryFulfillSlotRequestOrMakeAvailable
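Steps (3) to (6) form a three-party handshake: the slot is registered with the SlotManager, the SlotManager tells the TaskExecutor which JobMaster it belongs to, and the TaskExecutor offers it to that JobMaster. A minimal sketch with hypothetical classes, in which plain synchronous method calls stand in for the RPC calls of the real system:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the slot handshake; not Flink code. */
class SlotHandshakeSketch {

    static class JobMaster {
        final List<String> slotPool = new ArrayList<>();

        void offerSlots(String slotId, List<String> trace) {
            trace.add("JobMaster.offerSlots(" + slotId + ")");
            slotPool.add(slotId);
        }
    }

    static class TaskExecutor {
        void requestSlot(String slotId, JobMaster target, List<String> trace) {
            trace.add("TaskExecutor.requestSlot(" + slotId + ")");
            // The slot is now reserved for the target, so offer it.
            target.offerSlots(slotId, trace);
        }
    }

    static class SlotManager {
        final Deque<JobMaster> pendingRequests = new ArrayDeque<>();

        void registerSlotRequest(JobMaster requester) {
            pendingRequests.add(requester);
        }

        void registerSlot(TaskExecutor owner, String slotId, List<String> trace) {
            trace.add("SlotManager.registerSlot(" + slotId + ")");
            JobMaster requester = pendingRequests.poll();
            if (requester != null) {
                // A request was waiting: tell the owner whom the slot is for.
                owner.requestSlot(slotId, requester, trace);
            }
        }
    }

    static List<String> run() {
        List<String> trace = new ArrayList<>();
        JobMaster jobMaster = new JobMaster();
        SlotManager slotManager = new SlotManager();
        slotManager.registerSlotRequest(jobMaster); // request arrives before any slot exists
        slotManager.registerSlot(new TaskExecutor(), "slot-0", trace);
        trace.add("slotPool=" + jobMaster.slotPool);
        return trace;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

The SlotManager never hands the slot to the JobMaster directly; it only brokers the match, and the slot itself travels from the TaskExecutor to the JobMaster.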
Next, see task scheduling and execution
Summary