Spark Executor Task Execution

Task execution on an executor starts at the Executor's launchTask() method.

val executorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK

logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
  s"${executorData.executorHost}.")

executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))

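Taking standalone mode as the example, the call to launchTask() is triggered from launchTasks() in the CoarseGrainedSchedulerBackend class: as shown above, the driver sends the serialized TaskDescription over the network to the target executor in a LaunchTask message, so that the executor can run that specific task.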

def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  val tr = new TaskRunner(context, taskDescription)
  runningTasks.put(taskDescription.taskId, tr)
  threadPool.execute(tr)
}

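On the executor side, the serialized TaskDescription that was received is deserialized back into an object, wrapped in a TaskRunner, and submitted to the Executor's thread pool.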

TaskRunner implements the Runnable interface, so the logic that actually runs on a pool thread is in its run() method.
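The wrap-and-submit step can be sketched without Spark. The following is only a minimal illustration, not Spark's implementation: MiniTaskDescription, MiniTaskRunner, the finished map, and the runAll/shutdown helpers are hypothetical names introduced here, and the "task" merely records which thread ran it.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}

object LaunchSketch {
  // Hypothetical stand-in for Spark's TaskDescription; only the fields used here.
  final case class MiniTaskDescription(taskId: Long, name: String)

  // Output recorded by finished tasks, keyed by task id.
  val finished = new ConcurrentHashMap[Long, String]()
  // Tasks that have been launched and have not completed yet.
  val runningTasks = new ConcurrentHashMap[Long, MiniTaskRunner]()

  private val threadPool = Executors.newCachedThreadPool()

  // Runnable wrapper: whichever pool thread picks it up executes run().
  final class MiniTaskRunner(desc: MiniTaskDescription, done: CountDownLatch) extends Runnable {
    override def run(): Unit = {
      try {
        finished.put(desc.taskId, s"ran ${desc.name} on ${Thread.currentThread.getName}")
      } finally {
        runningTasks.remove(desc.taskId)
        done.countDown()
      }
    }
  }

  // Same shape as the launchTask() shown earlier: wrap, register, submit.
  def launchTask(desc: MiniTaskDescription, done: CountDownLatch): Unit = {
    val tr = new MiniTaskRunner(desc, done)
    runningTasks.put(desc.taskId, tr)
    threadPool.execute(tr)
  }

  // Launches n tasks, waits for them, and returns how many finished.
  def runAll(n: Int): Int = {
    val done = new CountDownLatch(n)
    (0 until n).foreach(i => launchTask(MiniTaskDescription(i.toLong, s"task-$i"), done))
    done.await(10, TimeUnit.SECONDS)
    finished.size()
  }

  def shutdown(): Unit = threadPool.shutdown()
}
```

Calling LaunchSketch.runAll(4) followed by LaunchSketch.shutdown() launches four runners; registering each runner in the map before submitting it keeps it visible while it is still queued or running.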

Executor.taskDeserializationProps.set(taskDescription.properties)

updateDependencies(taskDescription.addedFiles, taskDescription.addedJars)
task = ser.deserialize[Task[Any]](
  taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
task.localProperties = taskDescription.properties
task.setTaskMemoryManager(taskMemoryManager)

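The code above carries out the two important steps that precede running the task on the executor: fetching the task's dependencies and deserializing the task itself.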

updateDependencies() loads the files and the jar dependencies that the task needs.

for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
  logInfo("Fetching " + name + " with timestamp " + timestamp)
  // Fetch file with useCache mode, close cache for local mode.
  Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
    env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
  currentFiles(name) = timestamp
}

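Taking files as the example, the method iterates over the required files. If the executor does not have a file yet, or only has a copy with an older timestamp, it fetches the file from the path recorded for it in the task.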

Likewise, any required jars that are missing are fetched to the local machine and made loadable through the URLClassLoader.
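The timestamp comparison is the core of this step, and it can be tried out in isolation. The sketch below is a simplified illustration under assumptions: the fetch callback is a hypothetical replacement for Utils.fetchFile, and the returned list of fetched names exists only to make the behavior observable.

```scala
import scala.collection.mutable

object DependencySketch {
  // name -> timestamp of the copy this executor already holds
  val currentFiles = mutable.HashMap[String, Long]()

  // Fetches every entry that is missing locally or newer than the local copy.
  // `fetch` is a hypothetical callback standing in for the real download.
  def updateDependencies(newFiles: Map[String, Long])(fetch: String => Unit): List[String] = {
    val fetched = mutable.ListBuffer[String]()
    for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
      fetch(name)
      currentFiles(name) = timestamp
      fetched += name
    }
    fetched.toList.sorted
  }
}
```

A missing file is treated as having timestamp -1, so it is always older than any real timestamp and gets fetched on first sight; a later call with the same timestamps fetches nothing.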

task = ser.deserialize[Task[Any]](
  taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)
task.localProperties = taskDescription.properties
task.setTaskMemoryManager(taskMemoryManager)

// If this task has been killed before we deserialized it, let's quit now. Otherwise,
// continue executing the task.
val killReason = reasonIfKilled
if (killReason.isDefined) {
  throw new TaskKilledException(killReason.get)
}

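Next, the task object is deserialized from the TaskDescription, and the runner checks whether the task has already been killed. If it has not, the task is ready to run.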
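The order of operations here (deserialize first, then check the kill flag, then run) can be illustrated with a small sketch. MiniRunner, TaskKilled, and the function parameters are hypothetical names used for illustration only; the real TaskRunner does considerably more.

```scala
object KillCheckSketch {
  // Hypothetical stand-in for Spark's TaskKilledException.
  final class TaskKilled(val reason: String) extends RuntimeException(reason)

  final class MiniRunner {
    // Set from another thread when the task is cancelled.
    @volatile private var reasonIfKilled: Option[String] = None

    def kill(reason: String): Unit = { reasonIfKilled = Some(reason) }

    // Deserializes the task, then refuses to run it if a kill was requested.
    def run[T](deserialize: () => T)(body: T => String): String = {
      val task = deserialize()
      val killReason = reasonIfKilled
      if (killReason.isDefined) {
        throw new TaskKilled(killReason.get)
      }
      body(task)
    }
  }
}
```

Reading the flag into a local value once means the check and the exception message see the same reason even if another thread updates the flag in between.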

val res = task.run(
  taskAttemptId = taskId,
  attemptNumber = taskDescription.attemptNumber,
  metricsSystem = env.metricsSystem)
threwException = false
res

Task's run() method delegates to the runTask() method of the concrete task. The concrete types are ResultTask and ShuffleMapTask.
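This is a template-method arrangement: the base class owns run(), and each subclass supplies runTask(). A minimal sketch follows, assuming hypothetical MiniContext, MiniTask, MiniResultTask, and MiniShuffleMapTask types that only mimic the shape of the Spark classes.

```scala
object TaskSketch {
  // Hypothetical, much-reduced stand-in for Spark's TaskContext.
  final case class MiniContext(taskAttemptId: Long, attemptNumber: Int)

  abstract class MiniTask[T](val partitionId: Int) {
    // Common entry point: builds the context, then delegates to the subclass.
    final def run(taskAttemptId: Long, attemptNumber: Int): T = {
      val context = MiniContext(taskAttemptId, attemptNumber)
      runTask(context)
    }

    def runTask(context: MiniContext): T
  }

  // Returns actual data, in the way a ResultTask does.
  final class MiniResultTask(pid: Int, data: Seq[Int]) extends MiniTask[Int](pid) {
    override def runTask(context: MiniContext): Int = data.sum
  }

  // Returns a description of what was written rather than the data itself.
  final class MiniShuffleMapTask(pid: Int, data: Seq[Int]) extends MiniTask[String](pid) {
    override def runTask(context: MiniContext): String =
      s"partition $partitionId wrote ${data.size} records (attempt ${context.attemptNumber})"
  }
}
```

For example, new TaskSketch.MiniResultTask(0, Seq(1, 2, 3)).run(7L, 0) goes through the shared run() and ends up in the subclass's runTask().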

Take ResultTask's runTask() method as the example.

val threadMXBean = ManagementFactory.getThreadMXBean
val deserializeStartTime = System.currentTimeMillis()
val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime
} else 0L
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
_executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
  threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
} else 0L

func(context, rdd.iterator(partition, context))

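Finally, the function func to be executed is deserialized together with the RDD, and it is applied to the data of the corresponding partition, which is obtained through the RDD's iterator() method. This is where the user-defined processing takes place.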

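For example, the KafkaRDD discussed in an earlier post creates a Kafka partition iterator at this point. Each call to next() reads data from the corresponding Kafka topic partition according to the offset, and that data is then processed.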
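The lazy, offset-driven iteration described above can be sketched without Kafka. Everything in this example is an assumption made for illustration: FakeLog is a hypothetical in-memory substitute for a topic partition, OffsetRangeIterator is not the real KafkaRDD iterator, and runTask only imitates the func(context, rdd.iterator(partition, context)) call by dropping the context argument.

```scala
object PartitionIteratorSketch {
  // Hypothetical in-memory "topic partition": the vector index is the offset.
  final class FakeLog(records: Vector[String]) {
    var fetches = 0
    def fetch(offset: Long): String = { fetches += 1; records(offset.toInt) }
  }

  // Iterates over [fromOffset, untilOffset); a record is fetched only when next() is called.
  final class OffsetRangeIterator(log: FakeLog, fromOffset: Long, untilOffset: Long)
      extends Iterator[String] {
    private var nextOffset = fromOffset

    override def hasNext: Boolean = nextOffset < untilOffset

    override def next(): String = {
      if (!hasNext) throw new NoSuchElementException(s"offset $nextOffset is past $untilOffset")
      val record = log.fetch(nextOffset)
      nextOffset += 1
      record
    }
  }

  // Plays the role of applying the deserialized func to the partition's iterator.
  def runTask[U](partitionData: Iterator[String])(func: Iterator[String] => U): U =
    func(partitionData)
}
```

Constructing the iterator fetches nothing; records are pulled one at a time only as func consumes them, which is why the fetch counter stays at zero until the task body runs.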

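ShuffleMapTask differs in that, at the end of its runTask() method, it writes the processed output through a ShuffleWriter into storage managed by the BlockManager, so that the tasks of the next stage can use this data as their input.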

var writer: ShuffleWriter[Any, Any] = null
try {
  val manager = SparkEnv.get.shuffleManager
  writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  writer.stop(success = true).get
} catch {
  case e: Exception =>
    try {
      if (writer != null) {
        writer.stop(success = false)
      }
    } catch {
      case e: Exception =>
        log.debug("Could not stop writer", e)
    }
    throw e
}
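The error handling in the snippet above follows a common pattern: on failure, try to stop the writer, swallow any secondary failure from the cleanup, and rethrow the original exception. A minimal sketch of that pattern, using a hypothetical MiniWriter in place of a real ShuffleWriter:

```scala
object WriterCleanupSketch {
  // Hypothetical writer that remembers what it wrote and how it was stopped.
  final class MiniWriter {
    var written = Vector.empty[Int]
    var stoppedWith: Option[Boolean] = None

    def write(records: Iterator[Int]): Unit = records.foreach(r => written :+= r)

    // Returns a status only when the write succeeded.
    def stop(success: Boolean): Option[String] = {
      stoppedWith = Some(success)
      if (success) Some(s"${written.size} records written") else None
    }
  }

  def writeAll(writer: MiniWriter, records: Iterator[Int]): String = {
    try {
      writer.write(records)
      writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          writer.stop(success = false)
        } catch {
          // A failure during cleanup must not mask the original exception.
          case _: Exception => ()
        }
        throw e
    }
  }
}
```

The caller always sees the exception that broke the write, while the writer still gets a chance to release whatever it holds.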

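The final output of the task is wrapped as a result, serialized, and returned to the driver for handling. For a ResultTask this output is the actual data; for a ShuffleMapTask it is the information describing where the data is stored in the BlockManager.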

Reposted from blog.csdn.net/weixin_40318210/article/details/104078839