Reading Spark's Runtime Message Communication Source Code (Part 2)

Overview

(Spark version: 2.1.1)

Application: a user program built on Spark, consisting of one Driver Program and multiple Executors in the cluster.

Driver Program: runs the Application's main() function and creates the SparkContext; the SparkContext is commonly used to stand for the Driver Program.

Executor: a process launched for an Application on a Worker node. It runs Tasks and keeps data in memory or on disk; each Application has its own independent set of Executors.

Cluster Manager: an external service that acquires resources on the cluster (e.g. Standalone, Mesos, or YARN).

Operation: the operations applied to RDDs, divided into Transformations and Actions.

Roles

Taking standalone mode as an example, Spark's runtime can be split into three parts:

1. the client; 2. the Master; 3. the Worker

Breaking these down further:

1. The client breaks down into: the Driver and the SparkContext


2. The Master is just the Master

3. The Worker breaks down into: the Executor

And one level further:

1. The client's SparkContext breaks down into: the DAGScheduler and the TaskScheduler

2. The Master is still the Master

3. The Worker breaks down into: a thread pool and TaskRunners

A diagram of these roles: (figure omitted)

The detailed flow when a user submits an application:

1. A Spark job is submitted: spark-submit submits the application.

2. In Standalone mode, spark-submit creates and constructs a Driver process.

3. The Driver executes the user code; when main() creates the SparkContext, it builds the Spark Application's runtime environment.

4. During initialization, the two most important things the SparkContext object does are constructing the DAGScheduler and the TaskScheduler.

5. The TaskScheduler, through its corresponding backend process, connects to the Master and registers the Application with it.

6. On receiving the Application registration request, the Master returns a registration result to the Client, which marks the Application as registered. The Master then connects to the Workers and, using its resource scheduling algorithm, launches multiple Executors (StandaloneExecutorBackend) for this Application across the cluster's Workers.

7. The Master tells the Workers to launch Executors.

8. Each Worker launches an Executor for the Application.

9. Once started, each Executor (a process) registers itself back with the TaskScheduler inside this Application's SparkContext, so the TaskScheduler now knows which Executors serve the Application. The Executor also starts sending periodic heartbeats (to the Driver, as section g) below shows) and waits for Tasks.

10. Once all Executors have registered back with the Driver, the Driver finishes SparkContext initialization and goes on executing the user code.

11. Every time an action is reached, a job is created.

12. The job is submitted to the DAGScheduler.

13. The DAGScheduler splits the job into multiple stages (the stage-splitting algorithm) and creates a TaskSet for each stage.

14. Each TaskSet is submitted to the TaskScheduler.

15. The TaskScheduler submits every task in the TaskSet to an executor (the task-placement algorithm). Because the executors registered with this TaskScheduler earlier, the TaskScheduler knows exactly which executors it can send tasks to.

16. The Executor process has a thread pool; for every task it receives, it wraps the task in a TaskRunner, takes a thread from the pool, and runs the task on it.

17. The TaskRunner copies and deserializes the user code, i.e. the operators and functions to execute, and then runs the task. (There are two kinds of Task, ShuffleMapTask and ResultTask; only the final stage consists of ResultTasks, all earlier stages consist of ShuffleMapTasks.)

18. So in the end, running a Spark application means submitting stages, batch by batch, as TaskSets to the executors. Each task works on one partition of an RDD, applying the operators and functions we defined. Applying them to the initial RDD yields a new RDD; if the next operators belong to the same stage, the same batch of tasks applies them to that second RDD, and so on. When one stage finishes, the next stage runs, and so on through the job, until every operation has been executed.

This flow can be simplified as follows (a minimal driver program illustrating it appears after the list):

1. main() initializes the SparkContext. The SparkContext (the client) sends an application-registration message to the Master (the resource manager; in this standalone setup that is the Master, on YARN it would be the ResourceManager) and requests resources to run Executors. The Master returns a registration result to the Client, which marks the Application as registered.

2. Based on the application's resource requirements, the Master assigns Executor resources to selected Workers and starts StandaloneExecutorBackend instances on them.

3. Once an Executor has started, it sends a registration-success message back to the SparkContext (the client), reports its running state through periodic heartbeats, and waits for Tasks.

4. When an action on an RDD is triggered, the SparkContext builds the RDD DAG; the DAGScheduler splits it into stages, converts each stage into a TaskSet, and hands the TaskSets to the TaskScheduler.

5. The TaskScheduler sends the Tasks to the registered Executors, while the SparkContext ships the application code to them. On receiving a task message, an Executor starts and runs the task (i.e. tasks execute inside Executors).

6. Finally, when all tasks have run, the Driver handles the results and reclaims the resources (the Driver both requests and reclaims resources).
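As a concrete illustration of steps 1 and 4, here is a minimal driver program (a sketch; the master URL, app name, and input path are placeholders, not from this post). Constructing the SparkContext triggers the registration of step 1, and the count() action triggers job creation:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Step 1: constructing the SparkContext registers the application
    // with the Master and requests Executor resources.
    val conf = new SparkConf()
      .setMaster("spark://master:7077") // placeholder standalone Master URL
      .setAppName("WordCount")
    val sc = new SparkContext(conf)

    // Transformations only build up the RDD DAG; nothing runs yet.
    val counts = sc.textFile("hdfs:///tmp/input.txt") // placeholder path
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Step 4: the first action triggers a job, which the DAGScheduler splits
    // into stages (here a ShuffleMapTask stage and a ResultTask stage).
    println(counts.count())

    // Step 6: release the resources.
    sc.stop()
  }
}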

The code flow diagram: (figure omitted)

The runtime flow diagram: (figure omitted)

Characteristics of the Spark runtime architecture:

  • Each Application gets its own dedicated Executor processes, which stay up for the Application's whole lifetime and run tasks in multiple threads. This per-Application isolation is an advantage both for scheduling (each Driver schedules its own tasks) and for execution (Tasks from different Applications run in different JVMs). It also means a Spark Application cannot share data across applications except by writing the data to an external storage system.
  • Spark is agnostic to the resource manager; all it needs is to acquire Executor processes and keep communicating with them.
  • The Client that submits the SparkContext should be close to the Worker nodes (the nodes running the Executors), ideally in the same rack, because a running Spark Application exchanges a great deal of data between the SparkContext and the Executors. To run against a remote cluster, prefer submitting the SparkContext via RPC so it runs near the Workers, rather than running the SparkContext far away from them.
  • Tasks are optimized with data locality and speculative execution.
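The last point maps onto configuration switches. A hedged sketch (the values are illustrative: spark.locality.wait defaults to 3s, and speculation is off by default):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // How long the scheduler waits for a data-local slot
  // before falling back to a less local one.
  .set("spark.locality.wait", "3s")
  // Re-launch straggling tasks speculatively on another executor.
  .set("spark.speculation", "true")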

Code walkthrough:

a) The client creates a thread pool for registering the Application with the Master

Class: StandaloneAppClient. In its private class ClientEndpoint, the tryRegisterAllMasters method creates the registration thread pool registerMasterThreadPool, starts registration threads in that pool, and sends the Master a RegisterApplication message to register the application. The code:

/**
 * Register with all masters asynchronously and returns an array `Future`s for cancellation.
 */
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  // In an HA setup there may be several Masters, so iterate over all of them.
  for (masterAddress <- masterRpcAddresses) yield {
    // Submit a registration thread to the pool; the thread exits once the
    // application-registered flag (registered) has been set to true.
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = try {
        if (registered.get) {
          return
        }
        logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
        // Get a reference to the Master endpoint and send it the registration message.
        val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
        masterRef.send(RegisterApplication(appDescription, self))
      } catch {
        case ie: InterruptedException => // Cancelled
        case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
      }
    })
  }
}

b) The Master receives the Application registration, completes it, replies to the Client, and asks the Workers to launch Executors

When the Master receives the application-registration message, its registerApplication method records the application's information and adds the application to the list of waiting applications. After registering it, the Master sends the success message RegisteredApplication back to the ClientEndpoint and calls startExecutorsOnWorkers to run the application. Before anything runs, it must pick the Workers that will host the application and then send each of them a LaunchExecutor message telling it to start an Executor. Master.startExecutorsOnWorkers:

/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  // That is, applications run in registration order: first registered, first run.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor:
    // keep only ALIVE workers with enough free memory and cores.
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide which Workers run the app and how many cores each contributes.
    // There are two placement strategies: spread the app across as many
    // Workers as possible, or pack it onto as few Workers as possible.
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      // Sends LaunchExecutor messages to the Workers, telling them to start Executors.
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
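The spreadOutApps flag chooses between those two placement strategies. Below is a much-simplified sketch of the spread-out case only, to show the idea; the real scheduleExecutorsOnWorkers also honors per-executor core counts, memory limits, and executor limits:

// Round-robin one core at a time across the usable workers until the app's
// remaining demand is met or no worker has a free core left.
def spreadOutAssign(coresToAssign: Int, coresFree: Array[Int]): Array[Int] = {
  val assigned = Array.fill(coresFree.length)(0)
  var left = coresToAssign
  var progress = true
  while (left > 0 && progress) {
    progress = false
    var pos = 0
    while (pos < coresFree.length && left > 0) {
      if (coresFree(pos) - assigned(pos) > 0) {
        assigned(pos) += 1
        left -= 1
        progress = true
      }
      pos += 1
    }
  }
  assigned
}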

c) The client receives the registration-success message from the Master and finishes registering the Application

When AppClient.ClientEndpoint receives the RegisteredApplication message from the Master, it sets the registration flag registered to true. The registration threads observe the state change, and the Application's registration completes. The StandaloneAppClient handler for RegisteredApplication:

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    // FIXME How to handle the following cases?
    // 1. A master receives multiple registrations and sends back multiple
    // RegisteredApplications due to an unstable network.
    // 2. Receive multiple RegisteredApplication from different masters because the master is
    // changing.
    appId.set(appId_)
    registered.set(true)
    master = Some(masterRef)
    listener.connected(appId.get)

d) How the Worker launches the Executor

In step b), when the Master's startExecutorsOnWorkers method allocates resources for the application, it calls allocateWorkerResourceToExecutors, which leads to Executors being started on the Workers. When a Worker receives the LaunchExecutor message from the Master, it first instantiates an ExecutorRunner. As the ExecutorRunner starts, it creates a ProcessBuilder, which uses the application's command to spawn a CoarseGrainedExecutorBackend object, the container in which the Executor runs. Finally, the Worker sends an ExecutorStateChanged message to the Master to report that the Executor has been created.

The code the Worker executes on receiving the launch-Executor message:

case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  if (masterUrl != activeMasterUrl) {
    logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
  } else {
    try {
      logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

      // Create the executor's working directory
      val executorDir = new File(workDir, appId + "/" + execId)
      if (!executorDir.mkdirs()) {
        throw new IOException("Failed to create directory " + executorDir)
      }

      // Create local dirs for the executor. These are passed to the executor via the
      // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
      // application finishes.
      val appLocalDirs = appDirectories.getOrElse(appId, {
        val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
        val dirs = localRootDirs.flatMap { dir =>
          try {
            // Create the per-executor directory...
            val appDir = Utils.createDirectory(dir, namePrefix = "executor")
            // ...and restrict its permissions to the owner.
            Utils.chmod700(appDir)
            Some(appDir.getAbsolutePath())
          } catch {
            case e: IOException =>
              logWarning(s"${e.getMessage}. Ignoring this directory.")
              None
          }
        }.toSeq
        if (dirs.isEmpty) {
          throw new IOException("No subfolder can be created in " +
            s"${localRootDirs.mkString(",")}.")
        }
        dirs
      })
      appDirectories(appId) = appLocalDirs

      // The ExecutorRunner will create the CoarseGrainedExecutorBackend using the
      // command in the application description; that command was built in the
      // start method of SparkDeploySchedulerBackend.
      val manager = new ExecutorRunner(
        appId,
        execId,
        appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
        cores_,
        memory_,
        self,
        workerId,
        host,
        webUi.boundPort,
        publicAddress,
        sparkHome,
        executorDir,
        workerUri,
        conf,
        appLocalDirs, ExecutorState.RUNNING)
      executors(appId + "/" + execId) = manager
      manager.start()
      coresUsed += cores_
      memoryUsed += memory_
      // Tell the Master that the Executor's state is now ExecutorState.RUNNING.
      sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
    } catch {
      case e: Exception =>
        logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
        if (executors.contains(appId + "/" + execId)) {
          executors(appId + "/" + execId).kill()
          executors -= appId + "/" + execId
        }
        sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
          Some(e.toString), None))
    }
  }

The work happens in ExecutorRunner's fetchAndRunExecutor method. The command it executes is defined in SparkDeploySchedulerBackend and names CoarseGrainedExecutorBackend, the container the Executor runs in. The creation proceeds as follows (ExecutorRunner.fetchAndRunExecutor):

private def fetchAndRunExecutor() {
  try {
    // Launch the process
    // Build a ProcessBuilder from the application description and environment config.
    val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
      memory, sparkHome.getAbsolutePath, substituteVariables)
    val command = builder.command()
    val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
    logInfo(s"Launch command: $formattedCommand")

    // Add the working directory and executor-local dirs to the builder.
    builder.directory(executorDir)
    builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

    // Add webUI log urls: the log-page addresses shown on the monitoring UI.
    val baseUrl =
      if (conf.getBoolean("spark.ui.reverseProxy", false)) {
        s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
      } else {
        s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
      }
    builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
    builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

    // Start the process: this spawns the CoarseGrainedExecutorBackend instance.
    process = builder.start()
    val header = "Spark Executor Command: %s\n%s\n\n".format(
      formattedCommand, "=" * 40)

    // Redirect its stdout and stderr to files.
    val stdout = new File(executorDir, "stdout")
    stdoutAppender = FileAppender(process.getInputStream, stdout, conf)

    val stderr = new File(executorDir, "stderr")
    Files.write(header, stderr, StandardCharsets.UTF_8)
    stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

    // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
    // or with nonzero exit code. Either way, report the exit status to the Worker.
    val exitCode = process.waitFor()
    state = ExecutorState.EXITED
    val message = "Command exited with code " + exitCode
    worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
  } catch {
    case interrupted: InterruptedException =>
      logInfo("Runner thread for executor " + fullId + " interrupted")
      state = ExecutorState.KILLED
      killProcess(None)
    case e: Exception =>
      logError("Error running executor", e)
      state = ExecutorState.FAILED
      killProcess(Some(e.toString))
  }
}

e) The Master receives the Worker's report that the Executor has started

The Master receives the Worker's ExecutorStateChanged message and acts according to the ExecutorState it carries.

Class Master:

case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) =>
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state

      if (state == ExecutorState.RUNNING) {
        assert(oldState == ExecutorState.LAUNCHING,
          s"executor $execId state transfer from $oldState to RUNNING is illegal")
        appInfo.resetRetryCount()
      }

      // Forward the change to the driver as an ExecutorUpdated message.
      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))

      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // If an application has already finished, preserve its
        // state to display its information properly on the UI
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)

        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        // Important note: this code path is not exercised by tests, so be very careful when
        // changing this `if` condition.
        if (!normalExit
            && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
            && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
          val execs = appInfo.executors.values
          if (!execs.exists(_.state == ExecutorState.RUNNING)) {
            logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
              s"${appInfo.retryCount} times; removing it")
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }
      }
      schedule()
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }

f) Once started, the Executor sends its information to the Driver; the Driver acknowledges and sends LaunchTask messages to run tasks.

In CoarseGrainedExecutorBackend's onStart method, a RegisterExecutor message is sent to the DriverEndpoint. On the Driver side, it first checks whether that Executor is already registered; if so, it replies with a RegisterExecutorFailed message. Otherwise the Driver records the Executor's information, replies with a RegisteredExecutor success message, allocates task resources in makeOffers(), and finally sends LaunchTask messages to run tasks.

The Driver-side registration of an Executor:

Class: CoarseGrainedSchedulerBackend

case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
  if (executorDataMap.contains(executorId)) {
    executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
    context.reply(true)
  } else {
    // If the executor's rpc env is not listening for incoming connections, `hostPort`
    // will be null, and the client connection should be used to contact the executor.
    val executorAddress = if (executorRef.address != null) {
      executorRef.address
    } else {
      context.senderAddress
    }
    logInfo(s"Registered executor $executorRef ($executorAddress) with ID $executorId")
    // Record the executor.
    addressToExecutorId(executorAddress) = executorId
    totalCoreCount.addAndGet(cores)
    totalRegisteredExecutors.addAndGet(1)
    val data = new ExecutorData(executorRef, executorRef.address, hostname,
      cores, cores, logUrls)
    // This must be synchronized because variables mutated
    // in this block are read when requesting executors.
    // executorDataMap maps each executor ID to its details.
    CoarseGrainedSchedulerBackend.this.synchronized {
      executorDataMap.put(executorId, data)
      if (currentExecutorIdCounter < executorId.toInt) {
        currentExecutorIdCounter = executorId.toInt
      }
      if (numPendingExecutors > 0) {
        numPendingExecutors -= 1
        logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
      }
    }
    // Acknowledge the registration and post an executor-added event on the listener bus.
    executorRef.send(RegisteredExecutor)
    // Note: some tests expect the reply to come after we put the executor in the map
    context.reply(true)
    listenerBus.post(
      SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
    // Allocate task resources and send LaunchTask messages to run tasks.
    makeOffers()
  }

g) After learning that its registration succeeded, the Executor heartbeats the Driver and waits for tasks

When CoarseGrainedExecutorBackend receives the RegisteredExecutor message confirming registration, it instantiates the Executor object inside the CoarseGrainedExecutorBackend container. Once up, the Executor sends heartbeats to the Driver on a timer and waits for task-execution messages from the Driver.

Class CoarseGrainedExecutorBackend:

case RegisteredExecutor =>
  logInfo("Successfully registered with driver")
  try {
    // Start the Executor using the environment parameters; in Spark this is
    // the component that actually runs tasks.
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
  } catch {
    case NonFatal(e) =>
      exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
  }

Inside the Executor class, a scheduled task periodically sends heartbeats to the Driver while the Executor waits for the Driver to hand out tasks:

// Executor for the heartbeat task.
private val heartbeater = ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-heartbeater")

/**
 * Schedules a task to report heartbeat and partial metrics for active tasks to driver.
 */
private def startDriverHeartbeater(): Unit = {
  // The interval defaults to 10 seconds.
  val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")

  // Wait a random interval so the heartbeats don't end up in sync
  val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]

  val heartbeatTask = new Runnable() {
    override def run(): Unit = Utils.logUncaughtExceptions(reportHeartBeat())
  }
  // Schedule the heartbeat to the Driver at a fixed rate.
  heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
}
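That 10-second default comes from spark.executor.heartbeatInterval. A hedged tuning sketch (the value is illustrative; as a rule of thumb the interval should stay well below spark.network.timeout so that missed heartbeats are detected sensibly):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "20s") // illustrative value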

h) Running a task

After the Executor inside CoarseGrainedExecutorBackend has started, it receives LaunchTask messages from the Driver. Task execution is implemented in the Executor's launchTask method: it creates a TaskRunner (a Runnable executed on the Executor's thread pool), which processes the task and, when done, reports back to CoarseGrainedExecutorBackend via statusUpdate.

CoarseGrainedExecutorBackend hands the task to the Executor's launchTask:

case LaunchTask(data) =>
  if (executor == null) {
    // The executor failed to start: log the error and exit.
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    // Hand the task to the Executor, which runs it on a TaskRunner.
    executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
      taskDesc.name, taskDesc.serializedTask)
  }

The Executor's launchTask method wraps the task in a TaskRunner, records it in runningTasks, and submits it to the thread pool, which the Executor schedules centrally:

def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}

The TaskRunner then runs the task and collects its result.
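In outline, TaskRunner.run deserializes the shipped task, runs it over its partition, serializes the result, and reports back. A much-simplified sketch of that shape (illustration only; the real method also handles metrics, memory cleanup, and many failure modes):

// Simplified shape of TaskRunner.run, as a member of TaskRunner.
override def run(): Unit = {
  // Tell the backend (and so the Driver) that the task is running.
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  try {
    // Deserialize the shipped operators/functions...
    val task = ser.deserialize[Task[Any]](
      serializedTask, Thread.currentThread.getContextClassLoader)
    // ...and run them against this task's partition.
    val result = task.run(taskAttemptId = taskId, attemptNumber = attemptNumber,
      metricsSystem = env.metricsSystem)
    // Serialize the result and report success; the backend forwards
    // a StatusUpdate message to the Driver.
    execBackend.statusUpdate(taskId, TaskState.FINISHED, ser.serialize(result))
  } catch {
    case t: Throwable =>
      execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(t.toString))
  }
}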

i) After the task completes

When the TaskRunner finishes a task, it sends a state-change message to the Driver. On receiving it, the Driver calls TaskSchedulerImpl's statusUpdate method, handles the different task outcomes, and then assigns the Executor more work. The Driver-side handling of the state change:

Class CoarseGrainedSchedulerBackend:

case StatusUpdate(executorId, taskId, state, data) =>
  // Delegate to TaskSchedulerImpl.statusUpdate, which handles the various task outcomes.
  scheduler.statusUpdate(taskId, state, data.value)
  if (TaskState.isFinished(state)) {
    executorDataMap.get(executorId) match {
      // The task is done: return its CPU cores to the Executor,
      // then offer new tasks as appropriate.
      case Some(executorInfo) =>
        executorInfo.freeCores += scheduler.CPUS_PER_TASK
        makeOffers(executorId)
      case None =>
        // Ignoring the update since we don't know about the executor.
        logWarning(s"Ignored task status update ($taskId state $state) " +
          s"from unknown executor with ID $executorId")
    }
  }

Looking closely at the code above, you can see that the Driver-side handlers are in fact methods of the CoarseGrainedSchedulerBackend class.
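To pull the whole exchange together, here is the message sequence this post has traced, written as the endpoints and message case classes seen in the code above:

// Client (StandaloneAppClient)           -> Master   : RegisterApplication
// Master                                 -> Client   : RegisteredApplication
// Master                                 -> Worker   : LaunchExecutor
// Worker                                 -> Master   : ExecutorStateChanged
// Master                                 -> Driver   : ExecutorUpdated
// Executor (CoarseGrainedExecutorBackend) -> Driver  : RegisterExecutor
// Driver (CoarseGrainedSchedulerBackend) -> Executor : RegisteredExecutor
// Driver                                 -> Executor : LaunchTask
// Executor                               -> Driver   : StatusUpdate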

The execution flow across classes and methods: (figure omitted)

Reposted from blog.csdn.net/u012133048/article/details/85268482