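Overview
(Spark version 2.1.1)
Application: a user program built on Spark, consisting of one Driver Program and multiple Executors in the cluster;
Driver Program: runs the Application's main() function and creates the SparkContext; the SparkContext is commonly used to stand for the Driver Program;
Executor: a process started on a Worker Node for one particular Application. It runs Tasks and keeps data in memory or on disk. Each Application has its own independent set of Executors;
Cluster Manager: an external service that acquires resources on the cluster (for example Standalone, Mesos, or YARN);
Operation: the operations applied to an RDD, which fall into two kinds, Transformation and Action;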
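Roles
A Spark run can be divided into three parts. Taking standalone mode as an example:
1. Client; 2. Master; 3. Worker
Breaking these down one level:
1. The client consists of the Driver and the SparkContext
2. The Master is just the Master
3. The Worker contains Executors
And one level further:
1. The SparkContext in the client contains the DAGScheduler and the TaskScheduler
2. The Master is still just the Master
3. The Worker (through its Executors) breaks down into a thread pool and TaskRunners
In diagram form, it looks like this:
Detailed process when a user submits an application: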
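1. Submit the Spark job: spark-submit submits the application.
2. With spark-submit in Standalone mode, a Driver process is created and initialized (its RPC endpoint was named DriverActor in Akka-based Spark 1.x and is DriverEndpoint in 2.x).
3. The Driver runs the user code. When main() reaches the creation of the SparkContext, the runtime environment of the Spark Application is built.
4. While it initializes, the SparkContext object does its two most important jobs: constructing the DAGScheduler and the TaskScheduler.
5. The TaskScheduler, through its associated background component (the scheduler backend), connects to the Master and registers the Application with it.
6. On receiving the registration request, the Master returns a registration result to the Client, and the Client marks the Application as registered. The Master then connects to the Workers and applies its resource scheduling algorithm to start multiple Executors for this Application across the Workers of the cluster (each Executor runs inside an executor backend process: StandaloneExecutorBackend in early Spark versions, CoarseGrainedExecutorBackend in 2.1.1).
7. The Master tells the Workers to start Executors.
8. Each Worker starts Executors for the Application.
9. Once started, each Executor (a process) registers itself back with the TaskScheduler inside the SparkContext of its Application, so the TaskScheduler knows which Executors serve this Application. In addition, the Executor sends heartbeats to the Driver and waits for Tasks.
10. After all Executors have registered back with the Driver, the Driver finishes initializing the SparkContext and goes on executing the user code.
11. Each action that is reached creates a job.
12. The job is submitted to the DAGScheduler.
13. The DAGScheduler splits the job into multiple stages (using the stage division algorithm) and creates one TaskSet per stage.
14. Each TaskSet is submitted to the TaskScheduler.
15. The TaskScheduler submits every task in the TaskSet to an Executor for execution (using the task assignment algorithm). The candidates are exactly the Executors that registered with this TaskScheduler earlier.
16. Each Executor (a process) owns a thread pool. For every task it receives, it wraps the task in a TaskRunner and takes a thread from the pool to run it.
17. The TaskRunner copies and deserializes the code we wrote (the operators and functions to execute) and then runs the task. There are two kinds of Task, ShuffleMapTask and ResultTask: only the last stage consists of ResultTasks, and all earlier stages consist of ShuffleMapTasks.
18. So the whole Spark application executes as stages submitted batch by batch, as TaskSets, to the Executors. Each task handles one partition of an RDD and applies the operators and functions we defined. Applying them to the initial RDD yields a new RDD; within the same stage, the tasks continue with the operators and functions defined on that second RDD, and so on. When one stage finishes the next one runs, job after job, until all operations are done.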
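The stage and task rules in steps 13 to 17 can be illustrated with a small, self-contained Scala sketch. It is not Spark's DAGScheduler: Op, splitIntoStages, taskKinds, and demoJob are hypothetical names, and the job is simplified to a linear chain of operators in which every operator that needs a shuffle opens a new stage.

```scala
object StageSketch {
  // Hypothetical, simplified model: a job is a linear chain of operators.
  final case class Op(name: String, needsShuffle: Boolean)

  // Cut the chain into stages: an operator that needs a shuffle starts a new stage,
  // every other operator is appended to the current (last) stage.
  def splitIntoStages(ops: List[Op]): List[List[Op]] =
    ops.foldLeft(List.empty[List[Op]]) {
      case (Nil, op)                       => List(List(op))
      case (stages, op) if op.needsShuffle => stages :+ List(op)
      case (stages, op)                    => stages.init :+ (stages.last :+ op)
    }

  // One task per partition; only the final stage is made of ResultTasks.
  def taskKinds(stages: List[List[Op]], partitions: Int): List[List[String]] =
    stages.indices.toList.map { i =>
      val kind = if (i == stages.size - 1) "ResultTask" else "ShuffleMapTask"
      List.fill(partitions)(kind)
    }

  // A word-count-like chain, used only for illustration.
  val demoJob: List[Op] = List(
    Op("textFile", needsShuffle = false),
    Op("flatMap", needsShuffle = false),
    Op("map", needsShuffle = false),
    Op("reduceByKey", needsShuffle = true),
    Op("collect", needsShuffle = false))

  def main(args: Array[String]): Unit =
    splitIntoStages(demoJob).zipWithIndex.foreach { case (stage, i) =>
      println(s"stage $i: ${stage.map(_.name).mkString(" -> ")}")
    }
}
```

In this toy model the chain is cut once, in front of reduceByKey, so the first stage would run as ShuffleMapTasks and the second as ResultTasks, one task per partition.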
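The process above can be condensed into the following steps:
1. main() initializes the SparkContext. The SparkContext (the client) sends an application registration message to the Master (which acts as the resource manager in a standalone deployment; on YARN this would be the ResourceManager) and requests Executor resources. The Master returns a registration result to the Client, and the Client marks the Application as registered.
2. Based on the application's resource requirements, the Master selects Workers, allocates Executor resources on them, and starts the executor backend (StandaloneExecutorBackend in early Spark versions, CoarseGrainedExecutorBackend in 2.1.1);
3. After an Executor starts, it registers with the SparkContext (the client), reports its status to the Driver through periodic heartbeats, and waits for Tasks;
4. When an action is triggered on an RDD of the SparkContext, the DAG of RDDs is built, the DAGScheduler splits it into stages and turns each stage into a TaskSet, and the TaskSets are sent to the TaskScheduler;
5. The TaskScheduler sends Tasks to the registered Executors, and the SparkContext ships the application code to the Executors. On receiving a task message, an Executor starts and runs the task (that is, tasks execute inside Executors);
6. Finally, when all tasks have finished, the Driver processes the results and releases the resources. (The Driver both requests and releases resources.)
The code flow diagram is as follows:
The runtime flow diagram is as follows: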
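Characteristics of the Spark runtime architecture:
- Each Application gets its own dedicated executor processes, which stay alive for the lifetime of the Application and run tasks in multiple threads. This isolation is an advantage both for scheduling (each Driver schedules its own tasks) and for execution (Tasks from different Applications run in different JVMs). It also means that Spark Applications cannot share data with one another unless the data is written to an external storage system.
- Spark is agnostic to the resource manager: all it needs is to obtain executor processes and keep them able to communicate with each other.
- The Client that submits the SparkContext should run close to the Worker nodes (the nodes running Executors), preferably in the same rack, because the SparkContext and the Executors exchange a large amount of information while the Application runs. To run against a remote cluster, it is better to submit the SparkContext to the cluster over RPC than to run the SparkContext far from the Workers.
- Tasks benefit from two optimizations: data locality and speculative execution.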
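Code walkthrough:
a) The client creates a thread pool for registering the Application with the Master
Class: StandaloneAppClient. In the tryRegisterAllMasters method of ClientEndpoint (a private inner class of StandaloneAppClient), registration tasks are submitted to the thread pool registerMasterThreadPool, and each task sends a RegisterApplication message to one Master. The code is shown below: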
/**
 * Register with all masters asynchronously and returns an array `Future`s for cancellation.
 */
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  // With HA there may be several Masters, so send the message to every one of them
  for (masterAddress <- masterRpcAddresses) yield {
    // Submit a registration task to the pool; it returns at once if it sees registered == true
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = try {
        if (registered.get) {
          return
        }
        logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
        // Obtain a reference to the Master endpoint and send the RegisterApplication message
        val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
        masterRef.send(RegisterApplication(appDescription, self))
      } catch {
        case ie: InterruptedException => // Cancelled
        case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
      }
    })
  }
}
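The pattern in tryRegisterAllMasters (one task per Master, with a shared flag that lets later tasks skip their work) can be tried out without Spark. The sketch below is only an approximation under stated assumptions: RegisterSketch and registerWithAll are hypothetical names, and the `send` function stands in for the RPC call and returns the acknowledgement synchronously, whereas in Spark the Master answers asynchronously with RegisteredApplication.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

object RegisterSketch {
  // Tries every master on a thread pool. `send` is a hypothetical stand-in for the
  // RPC call and returns true when that master acknowledges the registration.
  // Returns (registered?, number of send attempts actually made).
  def registerWithAll(masters: Seq[String], send: String => Boolean): (Boolean, Int) = {
    val registered = new AtomicBoolean(false)
    val attempts = new AtomicInteger(0)
    val pool = Executors.newFixedThreadPool(math.max(1, masters.size))
    val futures = masters.map { master =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          if (registered.get) return // another task already succeeded
          attempts.incrementAndGet()
          if (send(master)) registered.set(true)
        }
      })
    }
    // Wait for all registration tasks, then release the pool threads.
    try futures.foreach(_.get(5, TimeUnit.SECONDS))
    finally pool.shutdown()
    (registered.get, attempts.get)
  }

  def main(args: Array[String]): Unit =
    println(registerWithAll(Seq("master-a", "master-b"), _ == "master-b"))
}
```

The number of attempts in the successful case depends on thread timing, which is exactly why the real code checks the flag first: a task that starts after registration has succeeded does nothing.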
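b) The Master receives the registration, completes it, replies to the Client, and asks Workers to start Executors
When the Master receives the RegisterApplication message, its registerApplication method records the application and adds it to the list of applications waiting to run. After registration it sends the success message RegisteredApplication to the ClientEndpoint and then runs the application by calling startExecutorsOnWorkers (through schedule()). Before launching, it selects the Workers that will run the application and sends each of them a LaunchExecutor message telling it to start an Executor. Master.startExecutorsOnWorkers is shown below: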
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  // Apps are taken from the waiting list in FIFO order: the app registered first runs first.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    // (a usable worker is ALIVE and has enough free memory and cores)
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide which Workers to use and how many cores to take on each. There are two
    // strategies: spread the app over as many Workers as possible, or pack it onto
    // as few Workers as possible (selected by spreadOutApps).
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      // Send LaunchExecutor to the Worker, telling it to start Executors
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
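The two allocation strategies mentioned in the comments (spread out over many Workers, or pack onto few) are easier to see on plain numbers. The following is a simplified sketch, not the real scheduleExecutorsOnWorkers: AllocationSketch and assignCores are hypothetical names, cores are handed out one at a time, memory limits and per-executor core counts are ignored, and workers are visited in the order given.

```scala
object AllocationSketch {
  // coresFree(i): free cores on worker i; coresWanted: cores the app still needs.
  // Returns how many cores are assigned on each worker.
  def assignCores(coresFree: Array[Int], coresWanted: Int, spreadOut: Boolean): Array[Int] = {
    val assigned = Array.fill(coresFree.length)(0)
    var remaining = math.min(coresWanted, coresFree.sum)
    def canUse(i: Int): Boolean = coresFree(i) - assigned(i) > 0

    var candidates = coresFree.indices.filter(canUse)
    while (remaining > 0 && candidates.nonEmpty) {
      candidates.foreach { i =>
        var keepGoing = true
        while (keepGoing && remaining > 0 && canUse(i)) {
          assigned(i) += 1
          remaining -= 1
          // Spreading out: take a single core, then move on to the next worker.
          // Packing: stay on this worker until it has no free core left.
          if (spreadOut) keepGoing = false
        }
      }
      candidates = candidates.filter(canUse)
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    val free = Array(4, 4, 4)
    println(assignCores(free, 6, spreadOut = true).mkString(","))
    println(assignCores(free, 6, spreadOut = false).mkString(","))
  }
}
```

With three workers of four free cores each and six cores requested, spreading out gives every worker two cores, while packing fills the first worker and puts the remaining two cores on the second.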
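c) The client receives the Master's registration success message and completes registering the Application
When StandaloneAppClient.ClientEndpoint receives the RegisteredApplication message from the Master, it sets the registration flag registered to true. The registration tasks observe the changed flag and stop, which completes registration of the Application. The RegisteredApplication handler in StandaloneAppClient is shown below: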
override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    // FIXME How to handle the following cases?
    // 1. A master receives multiple registrations and sends back multiple
    // RegisteredApplications due to an unstable network.
    // 2. Receive multiple RegisteredApplication from different masters because the master is
    // changing.
    appId.set(appId_)
    registered.set(true)
    master = Some(masterRef)
    listener.connected(appId.get)
  // ... (remaining cases omitted in this excerpt)
}
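d) How a Worker starts an Executor
In step b), when Master.startExecutorsOnWorkers allocates resources for the application, it calls allocateWorkerResourceToExecutors to start Executors on the Workers. When a Worker receives the LaunchExecutor message from the Master, it first instantiates an ExecutorRunner. When the ExecutorRunner starts, it creates a ProcessBuilder, which uses the command to launch a CoarseGrainedExecutorBackend process, the container in which the Executor runs. Finally the Worker sends an ExecutorStateChanged message to the Master to report that the Executor has been created.
The Worker handles the LaunchExecutor message with the following code: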
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  if (masterUrl != activeMasterUrl) {
    logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
  } else {
    try {
      logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
      // Create the executor's working directory
      val executorDir = new File(workDir, appId + "/" + execId)
      if (!executorDir.mkdirs()) {
        throw new IOException("Failed to create directory " + executorDir)
      }
      // Create local dirs for the executor. These are passed to the executor via the
      // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
      // application finishes.
      val appLocalDirs = appDirectories.getOrElse(appId, {
        val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
        val dirs = localRootDirs.flatMap { dir =>
          try {
            // Create the application's local directory
            val appDir = Utils.createDirectory(dir, namePrefix = "executor")
            // Restrict its permissions to the owner
            Utils.chmod700(appDir)
            Some(appDir.getAbsolutePath())
          } catch {
            case e: IOException =>
              logWarning(s"${e.getMessage}. Ignoring this directory.")
              None
          }
        }.toSeq
        if (dirs.isEmpty) {
          throw new IOException("No subfolder can be created in " +
            s"${localRootDirs.mkString(",")}.")
        }
        dirs
      })
      appDirectories(appId) = appLocalDirs
      // ExecutorRunner launches the CoarseGrainedExecutorBackend process using the command
      // from the application description; that command is built in the start method of
      // StandaloneSchedulerBackend (SparkDeploySchedulerBackend before Spark 2.0)
      val manager = new ExecutorRunner(
        appId,
        execId,
        appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
        cores_,
        memory_,
        self,
        workerId,
        host,
        webUi.boundPort,
        publicAddress,
        sparkHome,
        executorDir,
        workerUri,
        conf,
        appLocalDirs, ExecutorState.RUNNING)
      executors(appId + "/" + execId) = manager
      manager.start()
      coresUsed += cores_
      memoryUsed += memory_
      // Tell the Master that the Executor state is now ExecutorState.RUNNING
      sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
    } catch {
      case e: Exception =>
        logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
        if (executors.contains(appId + "/" + execId)) {
          executors(appId + "/" + execId).kill()
          executors -= appId + "/" + execId
        }
        sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
          Some(e.toString), None))
    }
  }
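ExecutorRunner does this work in its fetchAndRunExecutor method. The command it runs is defined in StandaloneSchedulerBackend (SparkDeploySchedulerBackend before Spark 2.0) and names CoarseGrainedExecutorBackend as the container process for the Executor. The launch proceeds as shown in ExecutorRunner.fetchAndRunExecutor():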
private def fetchAndRunExecutor() {
  try {
    // Launch the process
    // Build the ProcessBuilder from the application description and the environment settings
    val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
      memory, sparkHome.getAbsolutePath, substituteVariables)
    val command = builder.command()
    val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
    logInfo(s"Launch command: $formattedCommand")
    // Add the working directory and related settings to the builder
    builder.directory(executorDir)
    builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")
    // Add webUI log urls
    val baseUrl =
      if (conf.getBoolean("spark.ui.reverseProxy", false)) {
        s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
      } else {
        s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
      }
    builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
    builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")
    // Start the builder, which launches the CoarseGrainedExecutorBackend process
    process = builder.start()
    val header = "Spark Executor Command: %s\n%s\n\n".format(
      formattedCommand, "=" * 40)
    // Capture the output of the CoarseGrainedExecutorBackend process
    // Redirect its stdout and stderr to files
    val stdout = new File(executorDir, "stdout")
    stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
    // Standard error
    val stderr = new File(executorDir, "stderr")
    Files.write(header, stderr, StandardCharsets.UTF_8)
    stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
    // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
    // or with nonzero exit code. The exit status is then reported to the Worker.
    val exitCode = process.waitFor()
    state = ExecutorState.EXITED
    val message = "Command exited with code " + exitCode
    worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
  } catch {
    case interrupted: InterruptedException =>
      logInfo("Runner thread for executor " + fullId + " interrupted")
      state = ExecutorState.KILLED
      killProcess(None)
    case e: Exception =>
      logError("Error running executor", e)
      state = ExecutorState.FAILED
      killProcess(Some(e.toString))
  }
}
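e) The Master receives the Worker's notification that the Executor has started
The Master receives the ExecutorStateChanged message from the Worker and acts on the reported ExecutorState: it forwards an ExecutorUpdated message to the Driver and, if the Executor has finished, removes it and reschedules.
Class: Master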
case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) =>
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state
      if (state == ExecutorState.RUNNING) {
        assert(oldState == ExecutorState.LAUNCHING,
          s"executor $execId state transfer from $oldState to RUNNING is illegal")
        appInfo.resetRetryCount()
      }
      // Send an ExecutorUpdated message to the driver
      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))
      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // If an application has already finished, preserve its
        // state to display its information properly on the UI
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)
        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        // Important note: this code path is not exercised by tests, so be very careful when
        // changing this `if` condition.
        if (!normalExit
            && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
            && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
          val execs = appInfo.executors.values
          if (!execs.exists(_.state == ExecutorState.RUNNING)) {
            logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
              s"${appInfo.retryCount} times; removing it")
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }
      }
      schedule()
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
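f) After starting, the Executor sends its information to the Driver; the Driver replies with a confirmation and sends LaunchTask messages to run tasks.
In its onStart method, CoarseGrainedExecutorBackend sends a RegisterExecutor message to the DriverEndpoint. On the Driver side, the handler first checks whether the Executor is already registered. If it is, the Driver replies with RegisterExecutorFailed; otherwise it records the Executor, replies with the success message RegisteredExecutor, allocates resources for tasks in makeOffers(), and finally sends LaunchTask messages to run them.
The Driver-side registration of an Executor looks like this:
Class: CoarseGrainedSchedulerBackend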
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
  if (executorDataMap.contains(executorId)) {
    executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
    context.reply(true)
  } else {
    // If the executor's rpc env is not listening for incoming connections, `hostPort`
    // will be null, and the client connection should be used to contact the executor.
    val executorAddress = if (executorRef.address != null) {
      executorRef.address
    } else {
      context.senderAddress
    }
    logInfo(s"Registered executor $executorRef ($executorAddress) with ID $executorId")
    // Record the Executor
    addressToExecutorId(executorAddress) = executorId
    totalCoreCount.addAndGet(cores)
    totalRegisteredExecutors.addAndGet(1)
    val data = new ExecutorData(executorRef, executorRef.address, hostname,
      cores, cores, logUrls)
    // This must be synchronized because variables mutated
    // in this block are read when requesting executors
    // (the map below associates each Executor ID with its details)
    CoarseGrainedSchedulerBackend.this.synchronized {
      executorDataMap.put(executorId, data)
      if (currentExecutorIdCounter < executorId.toInt) {
        currentExecutorIdCounter = executorId.toInt
      }
      if (numPendingExecutors > 0) {
        numPendingExecutors -= 1
        logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
      }
    }
    // Reply to the Executor that registration is complete, then post an
    // executor-added event on the listener bus
    executorRef.send(RegisteredExecutor)
    // Note: some tests expect the reply to come after we put the executor in the map
    context.reply(true)
    listenerBus.post(
      SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
    // Allocate resources for tasks and send LaunchTask messages to run them
    makeOffers()
  }
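g) After learning that it has registered successfully, the Executor sends heartbeats to the Driver and waits for tasks
When CoarseGrainedExecutorBackend receives the RegisteredExecutor message, it instantiates the Executor object inside the CoarseGrainedExecutorBackend container. Once started, the Executor sends periodic heartbeats to the Driver and waits for task messages from the Driver.
Class: CoarseGrainedExecutorBackend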
case RegisteredExecutor =>
  logInfo("Successfully registered with driver")
  try {
    // Start the Executor from the environment parameters; in Spark it is what actually runs tasks
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
  } catch {
    case NonFatal(e) =>
      exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
  }
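Inside the Executor class instantiated by new Executor, heartbeats are sent to the Driver on a schedule while the Executor waits for the Driver to hand out tasks: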
// Executor for the heartbeat task.
private val heartbeater = ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-heartbeater")
/**
 * Schedules a task to report heartbeat and partial metrics for active tasks to driver.
 */
private def startDriverHeartbeater(): Unit = {
  // Heartbeat interval, 10s by default
  val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")
  // Wait a random interval so the heartbeats don't end up in sync
  val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]
  val heartbeatTask = new Runnable() {
    override def run(): Unit = Utils.logUncaughtExceptions(reportHeartBeat())
  }
  // Send heartbeats to the Driver at a fixed rate
  heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
}
}
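The scheduling idea in startDriverHeartbeater (a fixed-rate task whose first run is delayed by a random amount) can be reproduced with the JDK alone. This is a minimal sketch under assumptions: HeartbeatSketch, initialDelay, and sendHeartbeats are hypothetical names, the `beat` callback replaces reportHeartBeat(), and a latch stops the loop after a fixed number of beats so that the example terminates.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object HeartbeatSketch {
  // Random initial delay in [intervalMs, 2 * intervalMs), as in the excerpt above.
  // `rnd` is expected to lie in [0, 1); it is a parameter so the value can be fixed.
  def initialDelay(intervalMs: Long, rnd: Double = scala.util.Random.nextDouble()): Long =
    intervalMs + (rnd * intervalMs).toLong

  // Runs `beat` at a fixed rate until `count` heartbeats were sent.
  // Returns true if all of them happened before the timeout.
  def sendHeartbeats(intervalMs: Long, count: Int)(beat: () => Unit): Boolean = {
    val latch = new CountDownLatch(count)
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit = { beat(); latch.countDown() }
    }
    scheduler.scheduleAtFixedRate(
      task, initialDelay(intervalMs), intervalMs, TimeUnit.MILLISECONDS)
    // Generous timeout: the expected duration plus one second of slack.
    try latch.await(intervalMs * (count + 2) + 1000, TimeUnit.MILLISECONDS)
    finally scheduler.shutdownNow()
  }

  def main(args: Array[String]): Unit = {
    val ok = sendHeartbeats(intervalMs = 50, count = 3)(() => println("heartbeat"))
    println(s"all heartbeats sent: $ok")
  }
}
```

The random offset matters when many Executors start at the same moment: without it, all of them would report to the Driver on the same ticks.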
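h) Running a task
After the Executor inside CoarseGrainedExecutorBackend has started, it receives LaunchTask messages from the Driver. Task execution is implemented in Executor.launchTask, which creates a TaskRunner (a Runnable executed on a thread of the Executor's pool, not a separate process). The TaskRunner processes the task and, when done, reports back to CoarseGrainedExecutorBackend through statusUpdate.
CoarseGrainedExecutorBackend calls launchTask on the Executor: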
case LaunchTask(data) =>
  if (executor == null) {
    // The executor was never started successfully: log the error and exit
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    // Create a TaskRunner and run the task on the Executor's thread pool
    executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
      taskDesc.name, taskDesc.serializedTask)
  }
Executor.launchTask creates a TaskRunner and hands it to threadPool, so the Executor schedules all of its tasks in one place:
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
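The launchTask pattern (wrap the task in a Runnable, record it as running, hand it to a thread pool) can be sketched without Spark. Everything below is a hypothetical miniature: MiniExecutor, its results map, and shutdownAndWait are illustration-only names, and the task body is a plain function instead of a serialized Task.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

object TaskRunnerSketch {
  // Hypothetical miniature of an Executor: a thread pool plus a map of running tasks.
  class MiniExecutor(threads: Int) {
    private val threadPool = Executors.newFixedThreadPool(threads)
    val runningTasks = TrieMap.empty[Long, Runnable]
    val results = TrieMap.empty[Long, Int]

    // Wrap the task body in a Runnable (the TaskRunner role), register it as
    // running, and hand it to the pool. It unregisters itself when it finishes.
    def launchTask(taskId: Long, body: () => Int): Unit = {
      val tr = new Runnable {
        override def run(): Unit =
          try { results.put(taskId, body()); () }
          finally { runningTasks.remove(taskId); () }
      }
      runningTasks.put(taskId, tr)
      threadPool.execute(tr)
    }

    // Stop accepting tasks and wait for the submitted ones to finish.
    def shutdownAndWait(): Boolean = {
      threadPool.shutdown()
      threadPool.awaitTermination(5, TimeUnit.SECONDS)
    }
  }

  def main(args: Array[String]): Unit = {
    val executor = new MiniExecutor(threads = 2)
    (1 to 4).foreach(i => executor.launchTask(i.toLong, () => i * i))
    executor.shutdownAndWait()
    println(executor.results.toSeq.sortBy(_._1).mkString(", "))
  }
}
```

Registering the task before calling execute mirrors the order in the excerpt: the entry is guaranteed to exist by the time the runnable removes it.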
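Task execution and retrieval of results.
i) After a task finishes
When a TaskRunner finishes a task, a status update message is sent to the Driver. On receiving it, the Driver calls TaskSchedulerImpl.statusUpdate, which handles the task according to its outcome, and then assigns further tasks to that Executor. The Driver-side handling of the status update is shown below:
Class: CoarseGrainedSchedulerBackend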
case StatusUpdate(executorId, taskId, state, data) =>
  // Call TaskSchedulerImpl.statusUpdate(), which handles the task according to its outcome
  scheduler.statusUpdate(taskId, state, data.value)
  if (TaskState.isFinished(state)) {
    executorDataMap.get(executorId) match {
      // Once the task has reached a finished state, reclaim the CPU cores it used on
      // this Executor and offer the Executor again so further tasks can be assigned
      case Some(executorInfo) =>
        executorInfo.freeCores += scheduler.CPUS_PER_TASK
        makeOffers(executorId)
      case None =>
        // Ignoring the update since we don't know about the executor.
        logWarning(s"Ignored task status update ($taskId state $state) " +
          s"from unknown executor with ID $executorId")
    }
  }
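The bookkeeping in this handler (a finished task returns its cores, then the Executor is offered again) can be modelled in a few lines. This is a toy model under assumptions, not the scheduler's real logic: OfferSketch, MiniDriver, submit, and the launched buffer are hypothetical, every task needs CpusPerTask = 1 core, and locality and task-set priorities are ignored.

```scala
import scala.collection.mutable

object OfferSketch {
  val CpusPerTask = 1 // assumed cost of one task, in cores

  class MiniDriver(initialFreeCores: Map[String, Int]) {
    private val freeCores = mutable.Map(initialFreeCores.toSeq: _*)
    private val pending = mutable.Queue.empty[Long]
    // (taskId, executorId) pairs in launch order
    val launched = mutable.ArrayBuffer.empty[(Long, String)]

    // Queue the tasks, then offer every known executor once (in sorted order).
    def submit(taskIds: Seq[Long]): Unit = {
      pending ++= taskIds
      freeCores.keys.toSeq.sorted.foreach(makeOffers)
    }

    // Launch pending tasks on one executor for as long as it has free cores.
    def makeOffers(executorId: String): Unit =
      while (pending.nonEmpty && freeCores(executorId) >= CpusPerTask) {
        freeCores(executorId) -= CpusPerTask
        launched += ((pending.dequeue(), executorId))
      }

    // A finished task gives its cores back, then the executor is offered again.
    def statusUpdate(executorId: String, finished: Boolean): Unit =
      if (finished) freeCores.get(executorId) match {
        case Some(_) =>
          freeCores(executorId) += CpusPerTask
          makeOffers(executorId)
        case None => // unknown executor: ignore the update
      }
  }

  def main(args: Array[String]): Unit = {
    val driver = new MiniDriver(Map("exec-1" -> 1, "exec-2" -> 1))
    driver.submit(Seq(1L, 2L, 3L))
    driver.statusUpdate("exec-2", finished = true)
    println(driver.launched.mkString(", "))
  }
}
```

In this run the third task cannot start until one of the two single-core executors reports a finished task, at which point it is launched on that same executor.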
A close look at the code above shows that the Driver-side methods are those of the CoarseGrainedSchedulerBackend class.
Execution flow across the classes and methods: