Analysis of the Spark Scheduling Process Source Code

Spark, as an excellent distributed in-memory computing framework, provides a simple interface and a rich set of RDD operators for developers. Part of the reason Spark runs so fast is that it is memory-based; the other part is how it divides work into jobs, stages and tasks: based on where operators introduce a shuffle, consecutive narrow operators are pipelined and executed back to back on the same executor, which avoids storing unnecessary intermediate results. The official documentation describes the Spark scheduling process with the following figure:

Spark cluster components

The picture is concise. Roughly: the Driver process communicates with the cluster manager (standalone master, YARN, Mesos, etc.) and registers the application with it. Based on the application description it receives, the cluster manager contacts the worker nodes in the cluster and has each worker start executor processes on its node; each executor hosts multiple task threads. The concrete resource-allocation settings, such as executor cores and task cores, are specified when the application is submitted or fall back to defaults. Once an executor starts, it registers itself back with the Driver process. The Driver divides the job into tasks and finally dispatches those tasks to the executors, and each executor (which maintains a thread pool internally) runs the tasks on its threads.
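As a side note, here is a minimal sketch of how these resource hints are usually supplied from the application side; the master URL is a hypothetical placeholder and the values are arbitrary examples, not recommendations from the article:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("resource-demo")
  .setMaster("spark://master-host:7077")  // hypothetical standalone master URL
  .set("spark.executor.cores", "2")       // cores per executor process
  .set("spark.executor.memory", "2g")     // heap memory per executor process
  .set("spark.cores.max", "8")            // total cores the application may use (standalone mode)
  .set("spark.task.cpus", "1")            // cores reserved for each task

The same settings can also be passed as spark-submit options such as --executor-cores and --executor-memory.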

This is only a high-level description. In reality the Driver process performs much more detailed work, such as stage division and task scheduling, which the figure does not show directly. Below I analyze this based on what I have learned and the related source code. Let's start with the following code:

val conf = new SparkConf().setAppName("map").setMaster("local[2]")
val sc = new SparkContext(conf)
This is a simple piece of Scala code whose main job is to initialize a SparkContext. Looking at the SparkContext source code, we find two fields:

private var _taskScheduler: TaskScheduler = _

@volatile private var _dagScheduler: DAGScheduler = _

Continuing down, we see:

// Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

The main point of this code is that creating the SparkContext also creates the TaskScheduler and DAGScheduler objects. The TaskScheduler is created with a different initialization strategy depending on the specified master parameter, mainly by pattern matching on the master string. This makes sense: the TaskScheduler is the component actually responsible for distributing tasks, so different deployment modes naturally need different strategies. The DAGScheduler is different: its main responsibility is to divide a job into stages and then divide each stage into a TaskSet, and this work does not change across deployment modes.
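The real createTaskScheduler is much longer, but the shape of its pattern match on the master string is roughly the following self-contained sketch; the regexes and backend descriptions are simplified illustrations of the idea, not the actual Spark classes:

// Simplified sketch of matching on the master string to pick a scheduling backend
object MasterMatchSketch {
  private val LocalN   = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local"         => "single-threaded local backend"
    case LocalN(threads) => s"local backend with $threads worker threads"
    case SparkUrl(url)   => s"standalone backend talking to master(s) at $url"
    case other           => s"external cluster manager (e.g. yarn/mesos): $other"
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[2]", "spark://host:7077", "yarn").foreach { m =>
      println(s"$m -> ${describe(m)}")
    }
  }
}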

When _taskScheduler.start() runs, the TaskScheduler begins to communicate with the cluster resource manager (standalone master, YARN, Mesos), registers the application, and the resource manager starts executor processes on the worker nodes according to the application information it received. So we can see that Spark allocates the resources for execution first and only then runs tasks on them, which is a little different from Hadoop MapReduce, where containers are typically requested as tasks need them.
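To see that executors are indeed acquired up front, here is a small sketch, assuming a Spark 2.x build where SparkContext.statusTracker and getExecutorInfos are available (in local mode the only "executor" listed is the driver itself):

import org.apache.spark.{SparkConf, SparkContext}

object ExecutorCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("executor-check").setMaster("local[2]"))

    // Lists the executors already registered with the driver, before any job has run;
    // on a real cluster executors register asynchronously, so this list may grow shortly after startup
    sc.statusTracker.getExecutorInfos.foreach { e =>
      println(s"executor ${e.host()}:${e.port()} running ${e.numRunningTasks()} task(s)")
    }
    sc.stop()
  }
}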

This is just the process of initializing SparkContext. Next, we execute an action operator. The code is as follows:

val sc = new SparkContext(conf)
val rdd = sc.parallelize(Array(1,2,3,4))
rdd.map(x=>x*10).foreach(x=>print(x+"\n"))
When the code reaches the foreach operator, Spark wraps the preceding code into a job and starts executing it. Looking at the foreach source:

def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
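As an aside, runJob can also be called directly. A minimal spark-shell sketch (sc is the SparkContext the shell provides) shows that it runs the given function once per partition and returns one result per partition:

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// One result per partition: here, the sum of the elements in each of the 4 partitions
val perPartitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
println(perPartitionSums.mkString(", "))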
The key is this sc.runJob method. Stepping through its overloads, we eventually find the following line inside:

dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
The dagScheduler here is the one created inside SparkContext during initialization. Tracing into this function, we find the following code inside:

val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
Inside submitJob, a series of checks is performed, for example ensuring that tasks are not launched on partitions that do not exist. At the end of the method body there is this piece of code:

eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
The post method simply calls eventQueue.put(event), where eventQueue is an instance of LinkedBlockingDeque. Through this call, the submitted job has been placed on the blocking event queue, waiting to be processed. eventProcessLoop itself is an instance of DAGSchedulerEventProcessLoop; checking that class's source, we find:

override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
 }
This method accepts a DAGSchedulerEvent parameter. DAGSchedulerEvent is actually a trait, and the JobSubmitted object posted in the code above mixes in this trait, so this method is essentially receiving events taken off the blocking event queue.
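Put together, the mechanism looks roughly like the following stripped-down sketch; the names are simplified, and the real DAGSchedulerEventProcessLoop inherits this behaviour from Spark's internal EventLoop class:

import java.util.concurrent.LinkedBlockingDeque

sealed trait SchedulerEvent
case class JobSubmittedEvent(jobId: Int) extends SchedulerEvent

class EventLoopSketch {
  private val eventQueue = new LinkedBlockingDeque[SchedulerEvent]()

  // A single daemon thread takes events off the blocking queue and dispatches them
  private val eventThread = new Thread("event-loop-sketch") {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (true) onReceive(eventQueue.take()) // take() blocks until an event is posted
      } catch {
        case _: InterruptedException => // stop() was called
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = eventThread.interrupt()

  // post() only enqueues; the event thread picks the event up later
  def post(event: SchedulerEvent): Unit = eventQueue.put(event)

  private def onReceive(event: SchedulerEvent): Unit = event match {
    case JobSubmittedEvent(jobId) => println(s"handling submitted job $jobId")
  }
}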

Continuing into the doOnReceive(event) method called here, the internal code is as follows:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
    ........
}
A pattern match on the received event is used here. In our case the event is a JobSubmitted(.....), so the first case is executed. Checking the source of dagScheduler.handleJobSubmitted, we find that the method contains the following two lines of code:

finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
submitStage(finalStage)
The first line finds the final stage after the job has been divided along wide (shuffle) dependencies, so we can also see that the work of dividing a job into stages is actually done by the DAGScheduler.
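The rule the DAGScheduler applies is that a wide (shuffle) dependency in the lineage is exactly where a new stage begins. A small spark-shell sketch (sc provided by the shell) makes the dependency types visible:

import org.apache.spark.ShuffleDependency

val mapped  = sc.parallelize(1 to 10).map(x => (x % 2, x)) // narrow dependency on its parent
val reduced = mapped.reduceByKey(_ + _)                    // wide (shuffle) dependency on mapped

println(mapped.dependencies.map(_.getClass.getSimpleName).mkString(", "))  // a narrow OneToOneDependency
println(reduced.dependencies.map(_.getClass.getSimpleName).mkString(", ")) // ShuffleDependency

// The shuffle dependency is where the DAGScheduler cuts this job into two stages
println(reduced.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])) // true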

Continuing into submitStage(finalStage), we find the following code:

private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id) // Find the missing parent stages of this stage
        logDebug("missing: " + missing)
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          submitMissingTasks(stage, jobId.get) // If this stage has no missing parent stages, submit it
        } else {
          for (parent <- missing) {
            submitStage(parent) // Recurse upward until a stage with no missing parents is reached
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }
From the code above we can see that stage submission works backwards from the final stage: submitStage recurses through parent stages until it reaches a stage that has no missing parents, and submits that one first. Continuing into the submitMissingTasks(stage, jobId.get) method, we find the following code inside it:

val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId)
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
          }
      }
    } catch {............}
Here a pattern match determines the type of the current stage, and tasks are created accordingly: one ShuffleMapTask per partition for a ShuffleMapStage, and one ResultTask per partition for a ResultStage.
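Note that in both branches one task is created per entry in partitionsToCompute, so the number of tasks in a stage matches the number of partitions it has to compute. From the user side this is visible with getNumPartitions (spark-shell sketch, sc provided by the shell):

val rdd = sc.parallelize(1 to 100, numSlices = 8)
println(rdd.getNumPartitions) // 8 -> an action on this RDD gives its final ResultStage 8 tasks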

A little further down in the same method, we find the following code:

if (tasks.size > 0) {
      logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")  //打印信息
      ...........
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
} else ............
The main purpose of this code is to have the taskScheduler object submit the TaskSet so that the tasks actually run. Different master settings correspond to different taskScheduler/backend implementations, and therefore to different task submission strategies.

If the master specified here is 'local[2]', the taskScheduler submits the tasks directly to the local executor, which assigns them to specific threads for execution. If the mode is not local, the tasks are serialized and sent over the network to the remote executors.
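A minimal sketch that makes the local[2] case visible: the executor lives inside the driver JVM and runs tasks on a small thread pool, which shows up in the thread names printed from inside a task (the exact names may differ between Spark versions):

import org.apache.spark.{SparkConf, SparkContext}

object LocalThreadsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("local-threads").setMaster("local[2]"))

    sc.parallelize(1 to 4, numSlices = 4).foreach { x =>
      // Typically prints names like "Executor task launch worker ..." from the local executor's pool
      println(s"value $x computed on thread ${Thread.currentThread().getName}")
    }
    sc.stop()
  }
}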


Finally, we can conclude:

When a SparkContext is initialized, DAGScheduler and TaskScheduler objects are also created internally. After the TaskScheduler starts, it communicates with the cluster resource manager to register the current application; the resource manager, using its own resource-allocation strategy and the application information it received, tells the worker nodes to start the corresponding executors, and once an executor starts it registers itself back with the TaskScheduler. When an action operator is executed, the DAGScheduler divides the job into stages; the division works recursively from the final stage backwards, and a stage is submitted once it has no missing parent stages. Tasks are then created according to the stage type (ShuffleMapTask or ResultTask), collected into the tasks variable, wrapped into a TaskSet, and handed to the TaskScheduler, which distributes them to the executors.

