Spark job execution process overview

The job and task scheduling system is the core of Spark. Dividing work into a DAG of tasks and providing fault tolerance are the fundamental reasons Spark can schedule work effectively, and they make the calls between the various modules and processes, from the bottom layer up, easy to follow.

Related Terms

Job: a unit of work consisting of one or more scheduling stages, generated when an action operation is called on an RDD.

Scheduling stage (Stage): each job is split, according to the dependencies between its RDDs, into one or more sets of tasks; each such set is called a scheduling stage, also known as a task set (TaskSet). Stages are divided by DAGScheduler and come in two kinds: Shuffle Map Stage and Result Stage (illustrated in the sketch following these definitions).

Task: the unit of work distributed to an Executor; it is the smallest unit of execution in a Spark application.

DAGScheduler: the stage-oriented scheduler. It receives the jobs submitted by a Spark application, divides them into scheduling stages according to RDD dependencies, and submits those stages to TaskScheduler.

TaskScheduler: the task-oriented scheduler. It accepts the scheduling stages submitted by DAGScheduler and distributes their tasks to Worker nodes, where the Executors run them.
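
A minimal sketch of how these terms map to code (names and values are illustrative, assuming a local deployment): the `reduceByKey` step introduces a shuffle, so the single job triggered by `collect()` is split into two stages, each of which runs one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageTaskExample {
  def main(args: Array[String]): Unit = {
    // Local two-core setup, for illustration only.
    val conf = new SparkConf().setAppName("job-stage-task-example").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2) // RDD with 2 partitions
    val pairs  = words.map(w => (w, 1))                                 // narrow dependency, no job yet
    val counts = pairs.reduceByKey(_ + _)                               // wide dependency (shuffle)

    // The action triggers exactly one Job. DAGScheduler splits it into a
    // Shuffle Map Stage (the map side) and a Result Stage (the reduce side);
    // each stage runs one Task per partition.
    val result = counts.collect()
    println(result.mkString(", "))

    sc.stop()
  }
}
```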

Process Overview

1. The various transformation operations of a Spark application build the dependency (DAG) graph between RDDs, and an action operation triggers the job run. Once the job is submitted, the DAG is handed to DAGScheduler for analysis.
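
As a small illustration of this step (assuming the SparkContext `sc` from the sketch above and a hypothetical input path), the transformations below only record lineage; nothing runs until the action is called.

```scala
// Transformations are lazy: these lines only extend the RDD dependency graph (DAG).
val lines   = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
val lengths = lines.map(_.length)

// The action submits the job: only now is the DAG handed to DAGScheduler.
val totalChars = lengths.reduce(_ + _)
```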

2. DAGScheduler is the stage-oriented high-level scheduler. It splits the DAG into interdependent scheduling stages; the split is based on whether a dependency between RDDs is wide, and whenever a wide dependency is encountered a new stage is created. Each stage contains one or more tasks, which form a task set that is submitted to the underlying scheduler, TaskScheduler, for execution. DAGScheduler also records which RDDs are materialized (for example, saved to disk), tries to optimize task scheduling (for example, for data locality), and monitors the running stages; if a scheduling stage fails, it resubmits that stage.
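
One way to see where DAGScheduler will place the stage boundary is to print an RDD's lineage (a sketch reusing the `counts` RDD from the first example); the `ShuffledRDD` entry and the change in indentation in the output mark the wide dependency at which a new stage begins.

```scala
// Prints the lineage of `counts`. The indentation groups operations into
// stages: the ParallelCollectionRDD and MapPartitionsRDD feed the
// Shuffle Map Stage, while the ShuffledRDD belongs to the final Result Stage.
println(counts.toDebugString)
```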

3. Each TaskScheduler serves only one SparkContext instance. TaskScheduler receives the task sets sent by DAGScheduler and is responsible for distributing the tasks in each set to Executors on the cluster's Worker nodes to run. If a task fails, TaskScheduler is responsible for retrying it. In addition, if TaskScheduler finds that a task is still running long after the others have finished, it may launch a speculative copy of the same task and take the result of whichever copy finishes first.
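
The retry and speculative-execution behaviour described in this step is controlled by standard configuration properties; a sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

val schedulingConf = new SparkConf()
  .setAppName("scheduling-config-example")
  // How many times TaskScheduler retries a failed task before giving up on the job.
  .set("spark.task.maxFailures", "4")
  // Enable speculative execution: relaunch slow tasks and keep whichever copy finishes first.
  .set("spark.speculation", "true")
  // A task is a candidate for speculation if it runs 1.5x longer than the median task.
  .set("spark.speculation.multiplier", "1.5")
  // Start checking for stragglers once 75% of the tasks in a stage have finished.
  .set("spark.speculation.quantile", "0.75")
```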

4. The Executor on a Worker node receives the tasks sent by TaskScheduler and runs them in multiple threads, with each thread responsible for one task. When a task finishes, its result is returned to TaskScheduler; different types of tasks return results in different ways. A ShuffleMapTask returns a MapStatus object rather than the result itself, while a ResultTask returns its result in one of two ways depending on the size of the result: small results are sent back directly, and larger ones are stored through the block manager and fetched indirectly.
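
The size of an Executor's task thread pool comes from its core count; a sketch of the standard properties involved, with illustrative values and set programmatically here rather than through spark-submit:

```scala
import org.apache.spark.SparkConf

val executorConf = new SparkConf()
  .setAppName("executor-threads-example")
  // Each Executor runs tasks in a thread pool with one slot per core:
  // with 4 cores it can run up to 4 tasks concurrently, one thread per task.
  .set("spark.executor.cores", "4")
  // Memory shared by the tasks running in that Executor (illustrative value).
  .set("spark.executor.memory", "2g")
```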
