Preface
Earlier we analyzed how the DAG Scheduler splits stages [if you're interested, see "How the DAG Scheduler Splits Stages"]. Now let's look at how the Task Scheduler divides the work into tasks.
Note: the Spark source version used in this article is 2.3.0 and the IDE is IDEA 2019. If you want to follow along, you can download the source package, unpack it, and open it directly in IDEA.
Main Text
1. Computing Task Parallelism
First, look at the method the DAG Scheduler uses to submit a TaskSet: submitMissingTasks(stage: Stage, jobId: Int). It takes the stage and the job id; the stage holds the stage's last RDD, from which the preceding RDDs can be reached (and later serialized) through its lineage.
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
As you can see, partitionsToCompute is created here, and the partitions to compute are determined via findMissingPartitions().
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}
// numPartitions is the number of partitions of the stage's last RDD (defined in Stage)
val numPartitions = rdd.partitions.length
As mentioned before, the parallelism of a stage's tasks is determined by the number of partitions of its last RDD, and this is exactly where that comes from.
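A minimal sketch (a hypothetical spark-shell session, not from the Spark source) that illustrates this:

// The task count of each stage follows the partition count of its last RDD.
val rdd = sc.parallelize(1 to 100, 4)                       // 4 partitions
val counts = rdd.map(x => (x % 2, 1)).reduceByKey(_ + _, 8) // 8 partitions after the shuffle
println(rdd.getNumPartitions)    // 4  -> the ShuffleMapStage runs 4 tasks
println(counts.getNumPartitions) // 8  -> the ResultStage runs 8 tasks
counts.collect()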
2. Computing the Best Locations for Task Execution
Now that the task partitions are determined, the next question is how to assign each task to a node. Depending on the stage type, a different partition index is passed to getPreferredLocs(stage.rdd, id), which returns the preferred locations for that task.
// (the try/catch wrapping this in submitMissingTasks is omitted here)
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = stage match {
  case s: ShuffleMapStage =>
    partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
  case s: ResultStage =>
    partitionsToCompute.map { id =>
      val p = s.partitions(id)
      (id, getPreferredLocs(stage.rdd, p))
    }.toMap
}
def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
  getPreferredLocsInternal(rdd, partition, new HashSet)
}
// This is a recursive method: it walks back through the parent RDDs and collects the preferred locations.
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
Inside, rdd.preferredLocations(rdd.partitions(partition)) is called to get the preferred locations:
final def preferredLocations(split: Partition): Seq[String] = {
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}
Note the getOrElse(): if the RDD has been checkpointed, the checkpoint RDD's locations are used; otherwise getPreferredLocations(split) is called. That method's default implementation in RDD returns nothing, but subclasses such as HadoopRDD and ShuffledRDD override it. Here is HadoopRDD's implementation:
override def getPreferredLocations(split: Partition): Seq[String] = {
  val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
  val locs = hsplit match {
    case lsplit: InputSplitWithLocationInfo =>
      HadoopRDD.convertSplitLocationInfo(lsplit.getLocationInfo)
    case _ => None
  }
  locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}
This is where the locations at which a task should run are obtained: for an input RDD they come from the underlying input split (e.g. the hosts of the HDFS blocks).
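A minimal sketch (hypothetical HDFS path, not from the Spark source) showing how preferred locations can be inspected from user code:

val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
lines.partitions.foreach { p =>
  // for an HDFS-backed RDD these are the hosts holding the partition's blocks
  println(lines.preferredLocations(p)) // e.g. List(host1, host2, host3)
}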
3. Creating the Tasks
Back in DAGScheduler.scala: partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap gives us taskIdToLocations, the preferred locations for every partition that needs to be computed.
Further down, taskBinary = sc.broadcast(taskBinaryBytes) is created: taskBinaryBytes serializes the stage's last RDD (which, through its dependencies, pulls in the preceding RDDs as well), and this taskBinary is then broadcast to the executors.
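For context, this is roughly how taskBinaryBytes is built in the 2.3.0 submitMissingTasks (surrounding try/catch omitted): a ShuffleMapStage serializes the (rdd, shuffleDep) pair, while a ResultStage serializes (rdd, func):

val taskBinaryBytes: Array[Byte] = stage match {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  case stage: ShuffleMapStage =>
    JavaUtils.bufferToArray(
      closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
  // For ResultTask, serialize and broadcast (rdd, func).
  case stage: ResultStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
taskBinary = sc.broadcast(taskBinaryBytes)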
Further down still, the tasks themselves are created:
val tasks: Seq[Task[_]] = try {
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage =>
      stage.pendingPartitions.clear()
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = partitions(id)
        stage.pendingPartitions += id
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }
    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }
}
Task creation starts by matching on the stage type: a ShuffleMapStage produces ShuffleMapTasks, while a ResultStage produces ResultTasks. The partition, the stage id, the broadcast task binary, the preferred locations and so on are passed into the constructor, producing one task per partition.
4. Submitting the Tasks
With the tasks created, submission begins. If the number of tasks is greater than 0, a TaskSet is created wrapping the tasks array, the stage id, the attempt number, the job id and the properties, and taskScheduler.submitTasks submits it.
if (tasks.size > 0) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
}
5. Task Scheduling via the schedulableBuilder
After the TaskSet reaches TaskSchedulerImpl, a TaskSetManager is created first, and scheduling is handled by schedulableBuilder, an instance of SchedulableBuilder. Internally, the schedulableBuilder holds a rootPool, which we can think of as a tree of schedulable entities: every TaskSetManager is inserted into this tree when it is created. Scheduling therefore starts by putting the TaskSet's manager into the rootPool; afterwards, backend.reviveOffers() is called.
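A minimal sketch (hypothetical app name, not from the Spark source) of the knob that decides which SchedulableBuilder populates the rootPool:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("scheduler-demo")        // hypothetical app name
  .set("spark.scheduler.mode", "FAIR") // "FIFO" is the default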
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets.foreach { case (_, ts) =>
      ts.isZombie = true
    }
    stageTaskSets(taskSet.stageAttemptId) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
backend.reviveOffers() sends a ReviveOffers message to the driver endpoint via driverEndpoint.send(ReviveOffers), which is handled in receive:
override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          makeOffers(executorId)
        case None =>
          // Ignoring the update since we don't know about the executor.
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }
  case ReviveOffers =>
    makeOffers()
The ReviveOffers case ends up calling makeOffers().
private def makeOffers() {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = withLock {
    // Filter out executors under killing
    val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
    val workOffers = activeExecutors.map {
      case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
    }.toIndexedSeq
    scheduler.resourceOffers(workOffers)
  }
  if (!taskDescs.isEmpty) {
    launchTasks(taskDescs)
  }
}
Before launching anything, the alive executors (activeExecutors) are collected and a WorkerOffer is built for each one, carrying the executor id, its host (executorHost) and its free cores. scheduler.resourceOffers(workOffers) then assigns tasks to these offers, and launchTasks() is called with the resulting task descriptions.
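To make the offer model concrete, here is a conceptual sketch (simplified; the real resourceOffers also honors locality levels and the rootPool's scheduling order):

// Each alive executor becomes one WorkerOffer; an executor can accept
// cores / CPUS_PER_TASK tasks at a time.
case class WorkerOffer(executorId: String, host: String, cores: Int)

def maxConcurrentTasks(offers: Seq[WorkerOffer], cpusPerTask: Int): Int =
  offers.map(_.cores / cpusPerTask).sum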
6. Sending Tasks to the Executors
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = TaskDescription.encode(task)
    if (serializedTask.limit() >= maxRpcMessageSize) {
      Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
Here each task description is serialized via TaskDescription.encode. If the serialized task exceeds the maximum RPC message size, the task set is aborted with a hint to raise spark.rpc.message.maxSize or to use broadcast variables; otherwise, the task is sent to the corresponding executorEndpoint as a LaunchTask message. From here on, it is the executor's job to run the task.
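A minimal sketch (illustrative value) of the configuration the abort message points at, should large tasks keep hitting the limit:

import org.apache.spark.SparkConf

// Value is in MiB; the default is 128. Prefer broadcast variables for large
// read-only data instead of simply raising this limit.
val conf = new SparkConf().set("spark.rpc.message.maxSize", "256")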
Summary
This post walked through how the Task Scheduler divides and dispatches tasks: from determining the task partitions and obtaining their preferred locations, to serializing the tasks and launching them against the WorkerOffers. Next, we will look at how the executor runs a task.